CN107682348A - Method and device for fast discrimination of DGA domain names based on machine learning - Google Patents

Method and device for fast discrimination of DGA domain names based on machine learning

Info

Publication number
CN107682348A
Authority
CN
China
Prior art keywords
domain name
domain
feature
training set
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710976231.XA
Other languages
Chinese (zh)
Inventor
莫凡
范渊
刘博�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DBAPPSecurity Co Ltd
Original Assignee
DBAPPSecurity Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DBAPPSecurity Co Ltd filed Critical DBAPPSecurity Co Ltd
Priority to CN201710976231.XA priority Critical patent/CN107682348A/en
Publication of CN107682348A publication Critical patent/CN107682348A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and device for fast discrimination of DGA domain names based on machine learning, and relates to the technical field of network security. The method includes: building a training set containing multiple DGA domain names and normal domain names; extracting the domain name features of each domain name in the training set; normalizing the domain name features to obtain a feature data set; and building a domain name classifier model based on the feature data set. By studying domain names, the method and device extract richer and more representative domain name features; normalizing the feature data speeds up training and testing, which improves computational efficiency; finally, a machine learning algorithm is trained on the feature data set to obtain the domain name classifier model, which improves generalization ability while improving discrimination accuracy.

Description

Method and device for fast discrimination of DGA domain names based on machine learning
Technical field
The present invention relates to the technical field of network security, and in particular to a method and device for fast discrimination of DGA domain names based on machine learning.
Background
DGA domain names are series of random domain names generated by a domain generation algorithm (Domain Generation Algorithm, DGA). The technique is common in botnets such as conficker and zeus: using a private random-string generation algorithm seeded with the date or another random seed, they generate a number of random-string domain names every day and then register some of them, in order to commit malicious acts such as fraud, spreading malware, and distributing pornographic content.
As techniques such as Domain-Flux and Fast-Flux are used more and more widely by attackers, network attacks that rely on DGA domain names become more covert and harder to trace. An infected machine in the botnet only needs to generate these random domain names with the same algorithm and hit one that has been registered; it can then be controlled by the attacker and used to launch network attacks such as distributed denial of service and spam.
The traditional approach relies mainly on the experience of white-hat security researchers for detection. It consumes a great deal of manpower and can hardly cope with today's enormous workload. Another class of methods is based on feature construction and starts from similarity measurement: a threshold is obtained by computing over samples, and the threshold determines whether a domain name under test is a DGA domain name. These methods use a relatively simple similarity measure and consider few features, so their generalization is poor and their accuracy is not high.
Summary of the invention
The object of the present invention is to provide a method and device for fast discrimination of DGA domain names based on machine learning that can effectively mitigate the problems described above.
The embodiments of the present invention are realized as follows:
In a first aspect, an embodiment of the present invention provides a method for fast discrimination of DGA domain names based on machine learning. The method includes: building a training set containing multiple DGA domain names and normal domain names; extracting the domain name features of each domain name in the training set; normalizing the domain name features to obtain a feature data set; and building a domain name classifier model based on the feature data set.
In a second aspect, an embodiment of the present invention further provides a device for fast discrimination of DGA domain names based on machine learning. The device includes: a training set building module, for building a training set containing multiple DGA domain names and normal domain names; a feature extraction module, for extracting the domain name features of each domain name in the training set; a normalization module, for normalizing the domain name features to obtain a feature data set; and a model building module, for building a domain name classifier model based on the feature data set.
The method and device provided by the embodiments of the present invention first build a training set containing multiple DGA domain names and normal domain names, which supplies enough samples for the subsequent domain name classifier model. They then extract the domain name features of each domain name in the training set, so that representative domain name features serve as the criterion for judging whether a domain name is a DGA domain name. Next, they normalize the domain name features to obtain a feature data set, which puts the features on a common scale and improves computational efficiency. Finally, they build a domain name classifier model based on the feature data set, so that the model obtained by machine learning training can be used to examine unknown domain names and judge quickly and accurately whether a domain name under test is a DGA domain name. Compared with the prior art, the method and device extract richer and more representative domain name features by studying domain names; normalizing the feature data speeds up training and testing, which improves computational efficiency; and training a machine learning algorithm on the feature data set yields a domain name classifier model that improves generalization ability while improving discrimination accuracy.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present invention and should not be regarded as limiting its scope; a person of ordinary skill in the art can derive other related drawings from them without creative effort.
Fig. 1 is a structural block diagram of an electronic device to which the embodiments of the present invention can be applied;
Fig. 2 is a flow block diagram of the method for fast discrimination of DGA domain names based on machine learning provided by the first embodiment of the present invention;
Fig. 3 is a flow block diagram of the sub-steps of step S210 in the first embodiment of the present invention;
Fig. 4 is a flow block diagram of the sub-steps of step S300 in the first embodiment of the present invention;
Fig. 5 is a flow block diagram of the sub-steps of step S310 in the first embodiment of the present invention;
Fig. 6 is a flow block diagram of the sub-steps of step S320 in the first embodiment of the present invention;
Fig. 7 is a flow block diagram of the sub-steps of step S330 in the first embodiment of the present invention;
Fig. 8 is a flow block diagram of the sub-steps of step S230 in the first embodiment of the present invention;
Fig. 9 is a flow block diagram of steps S500 and S510 provided by the first embodiment of the present invention;
Fig. 10 is a structural block diagram of the device for fast discrimination of DGA domain names based on machine learning provided by the second embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. The components of the embodiments described and illustrated in the drawings can generally be arranged and designed in a variety of different configurations. The detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by those skilled in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
It should be noted that similar reference numerals and letters denote similar items in the following drawings, so once an item has been defined in one drawing it need not be defined or explained again in later drawings. In the description of the present invention, the terms "first", "second" and so on are used only to distinguish descriptions and are not to be understood as indicating or implying relative importance.
Fig. 1 shows a structural block diagram of an electronic device 100 to which the embodiments of the present application can be applied. As shown in Fig. 1, the electronic device 100 can include a memory 110, a storage controller 120, a processor 130, a display screen 140 and the device for fast discrimination of DGA domain names based on machine learning. For example, the electronic device 100 can be a personal computer (PC), a tablet computer, a smartphone, a personal digital assistant (PDA), and so on.
The memory 110, storage controller 120, processor 130 and display screen 140 are electrically connected to one another, directly or indirectly, to transmit or exchange data. For example, these elements can be electrically connected through one or more communication buses or signal buses. The device for fast discrimination of DGA domain names based on machine learning includes at least one software function module that can be stored in the memory 110 in the form of software or firmware, for example the software function modules or computer programs that the device comprises.
The memory 110 can store various software programs and modules, such as the program instructions/modules corresponding to the method and device provided by the embodiments of the present application. By running the software programs and modules stored in the memory 110, the processor 130 performs various functional applications and data processing, that is, it implements the method for fast discrimination of DGA domain names based on machine learning in the embodiments of the present application. The memory 110 can include, but is not limited to, random access memory (Random Access Memory, RAM), read-only memory (Read Only Memory, ROM), programmable read-only memory (Programmable Read-Only Memory, PROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), electrically erasable programmable read-only memory (Electric Erasable Programmable Read-Only Memory, EEPROM), and so on.
The processor 130 can be an integrated circuit chip with signal processing capability. It can be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU) or a network processor (Network Processor, NP); it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor can be a microprocessor or any conventional processor.
In addition to implementing the fast discrimination of DGA domain names based on machine learning, the electronic device 100 used in the embodiments of the present invention can also have a display function. Its display screen 140 can provide an interactive interface (for example a user interface) between the electronic device 100 and the user, or display image data for the user's reference. For example, it can display data such as the domain name training set built by the device and the extracted domain name features.
Before the specific embodiments are introduced, it should first be explained that the present invention is an application of computer technology in the field of information security. Implementing the invention can involve several software function modules. The applicant believes that, after carefully reading the application documents and accurately understanding the implementation principle and purpose of the invention, and in combination with existing well-known techniques, those skilled in the art can fully implement the invention with the software programming skills they have. This applies to all software function modules mentioned in the application documents, and the applicant does not enumerate them one by one.
First embodiment
Referring to Fig. 2, this embodiment provides a method for fast discrimination of DGA domain names based on machine learning, applied to a device for fast discrimination of DGA domain names based on machine learning. The method includes:
Step S200: Build a training set containing multiple DGA domain names and normal domain names;
In this embodiment, the DGA domain names can also be called positive examples. They can include DGA domain names generated by common DGA algorithms as well as malicious domain names obtained from open-source channels. The normal domain names can also be called negative examples. They can include domain names currently recognized as benign, for example the top-ranked domain names in the Alexa website ranking.
For example, the domain name "www.google.com" is a normal domain name.
Step S210: Extract the domain name features of each domain name in the training set;
In this embodiment, before the domain name features are extracted, each domain name in the training set can first be preprocessed to extract its representative principal parts, for example the main domain of each domain name and its TLD suffix (Top-Level Domain), which is the last part of the domain name.
For example, for the domain name "www.google.com", the main domain is google and the TLD suffix is com.
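The split into main domain and TLD suffix can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `split_domain` is hypothetical, and it assumes that the TLD is the last dot-separated label and that the main domain is the label just before it. That matches the "www.google.com" example but would mishandle multi-label suffixes such as "co.uk".

```python
def split_domain(domain):
    """Return (main_domain, tld) for a dotted domain name.

    Assumption: the TLD is the last label and the main domain is the
    label immediately before it; multi-label suffixes are not handled.
    """
    labels = domain.strip().strip(".").lower().split(".")
    if len(labels) < 2 or not all(labels):
        raise ValueError(f"not a dotted domain name: {domain!r}")
    return labels[-2], labels[-1]


main_domain, tld = split_domain("www.google.com")
print(main_domain, tld, len(main_domain))  # prints: google com 6
```

The length printed at the end is the length feature that step S300 extracts from the main domain.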
It should be understood that in this embodiment a single domain name feature can be extracted, for example discriminating DGA domain names by the main domain alone. Multiple features can also be extracted: starting from the main domain and TLD suffix of each domain name, further features are derived from them to refine the judgment rules and improve the accuracy of DGA domain name discrimination. For example, the length of the main domain, the linguistic characteristics of the main domain, the character transition probability within the main domain and the TLD suffix of the domain name can be extracted together as domain name features.
Step S220: Normalize the domain name features to obtain a feature data set;
In this embodiment, normalizing the domain name features extracted in the previous step puts all features on a common scale, improves computational efficiency, and facilitates the subsequent machine learning training and the building of the discrimination model.
Step S230: Build a domain name classifier model based on the feature data set.
In this embodiment, a machine learning algorithm can be trained on the feature data set to build the domain name classifier model. The domain name classifier model obtained by machine learning can identify DGA domain names quickly and accurately from their domain name features, and can be used to predict unknown domain names.
Referring to Fig. 3, in this embodiment, step S210 can further include the following sub-steps:
Step S300: Extract the length feature of each domain name in the training set;
In this embodiment, the length feature of a domain name can be the length of its main domain.
For example, for the domain name "www.google.com", the main domain google has length 6.
Step S310: Extract the n-gram features of each domain name in the training set;
In this embodiment, n-gram refers to the n-gram language model. An n-gram is a run of n consecutive characters, and the frequency with which it occurs reflects the characteristics of a language.
For example, when n takes the values 1, 2 and 3, the n-character substrings of the main domain "google" of the domain name "www.google.com" are as shown in Table 1:
Table 1
n    n-character substrings of "google"
1    g, o, o, g, l, e
2    go, oo, og, gl, le
3    goo, oog, ogl, gle
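The n-character substrings in this example can be enumerated with a short Python sketch. The helper name `char_ngrams` is an illustrative assumption and does not come from the patent.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters in text, in order."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


for n in (1, 2, 3):
    print(n, char_ngrams("google", n))
```

A string of length L yields L - n + 1 substrings, so "google" gives 6, 5 and 4 substrings for n = 1, 2 and 3.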
Step S320: Extract the transition probability feature of each domain name in the training set;
Transition probability is a key concept in Markov chains. If a Markov chain consists of m states, the historical data is converted into a sequence made up of these m states. Starting from any state, any single transition must lead to one of the states 1, 2, ..., m, and the probability of such a move between states is called the transition probability. Each character in the main domain can be regarded as a state in a Markov chain (Markov Chain), where the value of each state depends on a finite number of preceding states, usually taken to be one state.
For example, for the main domain "google" of the domain name "www.google.com", the transition probability is:
P = p(g) × p(g→o) × p(o→o) × p(o→g) × p(g→l) × p(l→e)
Step S330: Extract the TLD suffix feature of each domain name in the training set.
DGA domain names usually choose TLD suffixes that are cheap and loosely vetted. In this embodiment, the extracted TLD suffix feature of each domain name can therefore serve as a basis for discriminating DGA domain names.
Referring to Fig. 4, in this embodiment, step S300 can further include the following sub-step:
Step S301: Extract the main domain length of each domain name in the training set, and use the main domain length of a specific main domain as the length feature of that main domain.
Because many short domain names have already been registered, short domain names are becoming scarce, so the main domain length of DGA domain names tends to grow. In this embodiment, the main domain length of each domain name is extracted as the length feature and can be used to discriminate DGA domain names. For example, when the main domain length of a domain name under test exceeds a certain threshold, the domain name can be considered to have a high probability of being a DGA domain name.
Referring to Fig. 5, in this embodiment, step S310 can further include the following sub-steps:
Step S311: Count the frequency with which each run of n consecutive characters occurs in all main domains of the training set, and rank these frequencies from high to low;
Specifically, when n=1, count the frequency with which each single character occurs in all main domains of the training set and rank the frequencies from high to low, giving P1(x1);
When n=2, count the frequency with which each pair of consecutive characters occurs in all main domains of the training set and rank the frequencies from high to low, giving P2(x1x2);
By analogy, count the frequency with which each run of n consecutive characters occurs in all main domains of the training set and rank the frequencies from high to low, giving Pn(x1x2...xn);
In particular, as n grows, the association between characters gradually weakens and the resulting frequencies become less informative, so the suggested value of n is an integer with n≤3.
Step S312: Based on the frequency ranking of n consecutive characters in all main domains, compute the mean and variance of the frequency ranks of the n-character substrings that occur in a specific main domain, and use that mean and variance as the n-gram feature of the specific main domain.
Specifically, for a specific main domain A="a1a2…an-2an-1an", the mean and variance of the frequency ranks of its n-character substrings can be computed as follows.
When n=1, the mean and variance of the ranks of the single characters in main domain A are computed from the ranking of single-character frequencies over all main domains of the training set obtained in the previous step.
When n=2, the mean and variance of the ranks of the two-character substrings in main domain A are computed from the ranking of two-character frequencies over all main domains of the training set obtained in the previous step.
When n=3, the mean and variance of the ranks of the three-character substrings in main domain A are computed from the ranking of three-character frequencies over all main domains of the training set obtained in the previous step.
In this embodiment, the mean and variance of the frequency ranks of the n-character substrings in a main domain serve as the n-gram feature of that main domain. The smaller the mean rank, the more frequently the n-grams of the main domain occur, and the lower the probability that the domain name is a DGA domain name.
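Steps S311 and S312 can be sketched in Python as below. The sketch rests on several assumptions that the text does not fix: rank 1 is the most frequent n-gram, ties keep first-seen order, an n-gram never seen in training receives the rank just past the end of the table, and the variance is the population variance. The function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance


def char_ngrams(text, n):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_rank_table(main_domains, n):
    """Step S311: map each n-gram to its frequency rank (1 = most frequent)."""
    counts = Counter(g for d in main_domains for g in char_ngrams(d, n))
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(), start=1)}


def rank_stats(main_domain, ranks, n):
    """Step S312: mean and population variance of the n-gram ranks."""
    unseen_rank = len(ranks) + 1  # assumption: unseen n-grams rank last
    observed = [ranks.get(g, unseen_rank) for g in char_ngrams(main_domain, n)]
    if not observed:  # main domain shorter than n
        return 0.0, 0.0
    return mean(observed), pvariance(observed)


training_main_domains = ["google", "goo"]  # toy data for illustration only
unigram_ranks = build_rank_table(training_main_domains, 1)
print(rank_stats("go", unigram_ranks, 1))
print(rank_stats("zz", unigram_ranks, 1))
```

In this toy run the familiar string "go" gets a lower mean rank than the unseen string "zz", which is the direction the embodiment relies on.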
Referring to Fig. 6, in this embodiment, step S320 can further include the following sub-steps:
Step S321: Compute the Markov chain transition matrix from all main domains in the training set;
Specifically, the rows and columns of the resulting Markov chain transition matrix are indexed by the characters a1, a2, …, aj, and each entry gives the probability of moving from one character to the next,
where aj is a character that occurs in the main domains of the training set;
Step S322: Based on the Markov chain transition matrix, compute the transition probability of a specific main domain, and use it as the transition probability feature of that main domain.
Specifically, for a specific main domain A="a1a2…an-2an-1an", the transition probability can be computed as:
P = p(a1) × p(a1→a2) × … × p(an-2→an-1) × p(an-1→an)
where each p(*) can be read directly from the transition matrix.
Research shows that normal domain names are common, readable and easy to remember, and their transition probability values are relatively large, whereas DGA domain names are the opposite and their transition probability values are relatively small. In this embodiment, the transition probability P is used as a feature of the main domain and can be used to discriminate DGA domain names.
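The transition probability feature can be sketched in Python as follows. The text does not say how p(a1) is estimated or how unseen transitions are handled, so this sketch assumes that p(a1) is the relative frequency of a1 as the first character of a training main domain and that any unseen character or transition gets a small floor probability. The names `fit_markov`, `transition_probability` and `floor` are hypothetical.

```python
from collections import Counter, defaultdict


def fit_markov(main_domains):
    """Step S321: estimate first-character and transition probabilities."""
    first_counts = Counter(d[0] for d in main_domains if d)
    pair_counts = defaultdict(Counter)
    for d in main_domains:
        for a, b in zip(d, d[1:]):
            pair_counts[a][b] += 1
    total_first = sum(first_counts.values())
    p_first = {c: k / total_first for c, k in first_counts.items()}
    p_trans = {
        a: {b: k / sum(nxt.values()) for b, k in nxt.items()}
        for a, nxt in pair_counts.items()
    }
    return p_first, p_trans


def transition_probability(main_domain, p_first, p_trans, floor=1e-6):
    """Step S322: P = p(a1) * p(a1->a2) * ... * p(a_{n-1}->a_n)."""
    if not main_domain:
        return 0.0
    p = p_first.get(main_domain[0], floor)
    for a, b in zip(main_domain, main_domain[1:]):
        p *= p_trans.get(a, {}).get(b, floor)  # floor for unseen transitions
    return p


p_first, p_trans = fit_markov(["ab", "ac"])  # toy data for illustration only
print(transition_probability("ab", p_first, p_trans))
print(transition_probability("ba", p_first, p_trans))
```

For long main domains the product becomes very small, so a practical variant might sum logarithms of the probabilities instead; that is a design choice outside what the text describes.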
Referring to Fig. 7, in this embodiment, step S330 can further include the following sub-steps:
Step S331: Extract all distinct TLD suffixes of the domain names in the training set and construct the TLD vector;
In this embodiment, the TLD suffixes can be encoded with the OneHotEncoder encoding scheme.
Specifically, construct the TLD vector (TLD1 TLD2 … TLDN).
Step S332: For each sample TLD in the TLD vector, set the value to 1 in its corresponding dimension and to 0 in the remaining dimensions, obtaining the TLD matrix;
In this embodiment, each sample TLD extracted from the training set takes the value 1 in its own dimension and 0 in all other dimensions, and these rows together form the TLD matrix.
Step S333: Based on the TLD matrix, obtain the TLD suffix feature of a specific domain name.
In this embodiment, if the domain name under test uses a TLD suffix that is obscure, non-mainstream, cheap and easy to get approved, the probability that it is a DGA domain name can be considered high.
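The one-hot encoding of steps S331 to S333 can be sketched in plain Python, without depending on a particular OneHotEncoder implementation. The function names are hypothetical, the vocabulary is sorted only to make the column order deterministic, and mapping an unseen TLD to all zeros is an assumption.

```python
def build_tld_vocabulary(tlds):
    """Step S331: the distinct TLD suffixes, in sorted order (the TLD vector)."""
    return sorted(set(tlds))


def one_hot_tld(tld, vocabulary):
    """Steps S332/S333: 1 in the TLD's own dimension, 0 elsewhere.

    Assumption: a TLD that never occurred in training maps to all zeros.
    """
    return [1 if tld == known else 0 for known in vocabulary]


vocabulary = build_tld_vocabulary(["com", "net", "com", "xyz"])
tld_matrix = [one_hot_tld(t, vocabulary) for t in vocabulary]
print(vocabulary)
print(tld_matrix)
print(one_hot_tld("net", vocabulary))
```

Encoding the vocabulary entries themselves gives a matrix with a single 1 per row, and the feature of a specific domain name is the row that matches its TLD suffix.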
In this embodiment, after steps S300, S310, S320 and S330 have been carried out, step S220 can normalize each domain name feature.
Specifically, the length feature, the n-gram features (for n=1, n=2 and n=3) and the transition probability feature are each normalized.
When the TLD suffix feature is normalized, the largest element in the TLD matrix obtained in step S332 is 1 and the smallest is 0, so normalizing its elements leaves their values unchanged. The TLD suffix feature is therefore the same after normalization as before.
In particular, if the maximum and minimum of some column of the TLD matrix, such as TLD1, are equal, that column can be dropped, because it provides no useful feature information.
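The normalization step can be sketched in Python as below. The normalization formulas themselves are not reproduced in this text, so the sketch assumes min-max scaling, x' = (x - min) / (max - min), which agrees with the remarks that 0/1 one-hot values are left unchanged and that a column whose maximum equals its minimum can be dropped. The function name is hypothetical.

```python
def min_max_normalize(rows):
    """Scale each column of rows to [0, 1] and drop constant columns.

    Returns (normalized_rows, kept_column_indexes). Assumes min-max scaling.
    """
    if not rows:
        return [], []
    columns = list(zip(*rows))
    # A column with max == min carries no information and would divide by zero.
    kept = [i for i, col in enumerate(columns) if max(col) != min(col)]
    lows = {i: min(columns[i]) for i in kept}
    spans = {i: max(columns[i]) - min(columns[i]) for i in kept}
    normalized = [[(row[i] - lows[i]) / spans[i] for i in kept] for row in rows]
    return normalized, kept


# Toy feature rows: [main domain length, a constant one-hot column]
features = [[6, 1], [10, 1], [8, 1]]
normalized, kept = min_max_normalize(features)
print(normalized, kept)
```

When a domain name under test is normalized later, the minimum and maximum learned from the training set would have to be reused, which this sketch leaves out for brevity.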
Referring to Fig. 8, in this embodiment, step S230 can further include the following sub-steps:
Step S400: Perform feature dimensionality reduction on the feature data set to obtain the reduced feature data set;
In this embodiment, steps S210 and S220 convert the sample data into a feature data set in a high-dimensional space. Reducing its dimensionality greatly lowers the computational complexity, reduces recognition errors caused by redundant information, and improves recognition accuracy.
In particular, the present invention applies the PCA dimensionality reduction method to the feature data set; it greatly reduces the feature dimension while retaining most of the information.
Step S410: Train on the reduced feature data set with the GBDT classifier algorithm to build the domain name classifier model.
In this embodiment, the reduced feature data set obtained in step S400 is trained with the GBDT (Gradient Boost Decision Tree) classifier algorithm, and the domain name classifier model is established once training ends. The domain name classifier model can recognize the features of a domain name under test, so that DGA domain names are discriminated quickly and effectively and unknown domain names can be predicted.
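Steps S400 and S410 can be sketched with scikit-learn, which is an assumption of this example since the text names PCA and GBDT but no library. The feature matrix is synthetic stand-in data, the label convention (1 = DGA, 0 = normal) follows the positive/negative example wording of step S200, and the choices of 8 input features and 3 principal components are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for normalized feature rows (values in [0, 1]).
rng = np.random.default_rng(0)
X_normal = rng.normal(0.7, 0.1, size=(50, 8))
X_dga = rng.normal(0.3, 0.1, size=(50, 8))
X = np.clip(np.vstack([X_normal, X_dga]), 0.0, 1.0)
y = np.array([0] * 50 + [1] * 50)  # 1 = DGA (positive), 0 = normal

# Step S400 (PCA) followed by step S410 (GBDT) as one pipeline.
model = make_pipeline(
    PCA(n_components=3, random_state=0),
    GradientBoostingClassifier(random_state=0),
)
model.fit(X, y)

# Scoring new rows applies the same PCA projection before classifying.
predictions = model.predict(X[:5])
print(predictions)
```

Putting both stages in one pipeline means a domain name under test passes through the same projection that was fitted on the training set.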
Referring to Fig. 9, in this embodiment, the following steps can further be performed after step S230:
Step S500: Perform feature extraction, feature normalization and feature dimensionality reduction in turn on the domain name to be detected, obtaining the reduced feature data to be detected;
In this embodiment, feature extraction, feature normalization and feature dimensionality reduction can be applied in turn to the domain name to be detected using methods similar to those of steps S210, S220 and S400.
Step S510: Load the feature data to be detected into the domain name classifier model and judge whether the domain name to be detected is a DGA domain name.
The method provided by this embodiment first extracts richer and more representative features by studying domain names, then reduces the feature dimension with the principal component analysis (PCA) algorithm, which speeds up training and testing and thus improves computational efficiency, and finally uses a machine learning algorithm to improve generalization ability while improving the accuracy of domain name discrimination.
Second embodiment
Referring to Fig. 10, this embodiment provides a device 600 for fast discrimination of DGA domain names based on machine learning, which includes:
a training set building module 610, for building a training set containing multiple DGA domain names and normal domain names;
a feature extraction module 620, for extracting the domain name features of each domain name in the training set;
a normalization module 630, for normalizing the domain name features to obtain a feature data set;
a model building module 640, for building a domain name classifier model based on the feature data set.
In summary, the method and device provided by the embodiments of the present invention first build a training set containing multiple DGA domain names and normal domain names, which supplies enough samples for the subsequent domain name classifier model. They then extract the domain name features of each domain name in the training set, so that representative domain name features serve as the criterion for judging whether a domain name is a DGA domain name. Next, they normalize the domain name features to obtain a feature data set, which puts the features on a common scale and improves computational efficiency. Finally, they build a domain name classifier model based on the feature data set, so that the model obtained by machine learning training can be used to examine unknown domain names and judge quickly and accurately whether a domain name under test is a DGA domain name. Compared with the prior art, the method and device extract richer and more representative domain name features by studying domain names; normalizing the feature data speeds up training and testing, which improves computational efficiency; and training a machine learning algorithm on the feature data set yields a domain name classifier model that improves generalization ability while improving discrimination accuracy. The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention can have various modifications and variations. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

  1. A rapid DGA domain name discrimination method based on machine learning, characterized in that the method comprises:
    constructing a training set containing a plurality of DGA domain names and normal domain names;
    extracting domain name features of each domain name in the training set;
    normalizing the domain name features to obtain a feature set;
    building a domain name classifier model based on the feature set.
  2. The method according to claim 1, characterized in that extracting the domain name features of each domain name in the training set comprises:
    extracting a length feature of each domain name in the training set;
    extracting an n-gram feature of each domain name in the training set;
    extracting a transition probability feature of each domain name in the training set;
    extracting a top-level domain (TLD) suffix feature of each domain name in the training set.
  3. The method according to claim 2, characterized in that extracting the length feature of each domain name in the training set comprises:
    extracting the main domain length of each domain name in the training set, and taking the main domain length of a given main domain as the length feature of that main domain.
  4. The method according to claim 2, characterized in that extracting the n-gram feature of each domain name in the training set comprises:
    counting the frequency with which each sequence of n consecutive characters occurs across all main domains in the training set, and ranking these frequencies from high to low;
    based on the frequency ranking of the n-character sequences across all main domains, calculating the mean and variance of the frequency ranks of the n-character sequences occurring in a given main domain, and taking that mean and variance as the n-gram feature of the given main domain.
  5. The method according to claim 2, characterized in that extracting the transition probability feature of each domain name in the training set comprises:
    computing a Markov chain transition matrix from statistics over all main domains in the training set;
    based on the Markov chain transition matrix, calculating the transition probability of a given main domain, and taking that transition probability as the transition probability feature of the given main domain.
  6. The method according to claim 2, characterized in that extracting the TLD suffix feature of each domain name in the training set comprises:
    extracting all distinct TLD suffixes of the domain names in the training set to construct a TLD vector;
    for each sample TLD in the TLD vector, setting the value to 1 in the corresponding dimension and to 0 in the remaining dimensions, to obtain a TLD matrix;
    obtaining the TLD suffix feature of a given domain name based on the TLD matrix.
  7. The method according to claim 1, characterized in that building the domain name classifier model based on the feature set comprises:
    performing feature dimensionality reduction on the feature set to obtain a dimension-reduced feature set;
    training on the dimension-reduced feature set using a GBDT classifier algorithm to build the domain name classifier model.
  8. The method according to claim 7, characterized in that performing feature dimensionality reduction on the feature set to obtain the dimension-reduced feature set comprises:
    performing feature dimensionality reduction on the feature set using the PCA dimensionality reduction method to obtain the dimension-reduced feature set.
  9. The method according to claim 1, characterized in that the method further comprises:
    performing feature extraction, feature normalization, and feature dimensionality reduction in sequence on a domain name to be detected, to obtain dimension-reduced feature data to be detected;
    loading the feature data to be detected into the domain name classifier model, and determining whether the domain name to be detected is a DGA domain name.
  10. A rapid DGA domain name discrimination device based on machine learning, characterized by comprising:
    a training set construction module, configured to construct a training set containing a plurality of DGA domain names and normal domain names;
    a feature extraction module, configured to extract domain name features of each domain name in the training set;
    a normalization module, configured to normalize the domain name features to obtain a feature set;
    a model building module, configured to build a domain name classifier model based on the feature set.
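The four feature types recited in claims 3 to 6 can be sketched as follows. This is only an illustration under stated assumptions, not the applicant's implementation: n is fixed at 2, the main domain is taken to be the second-to-last label, unseen n-grams receive the worst rank, the per-domain transition probability is aggregated as a mean log-probability with a small floor, and every function name is hypothetical. The claims do not specify any of these details.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pvariance


def split_domain(domain):
    """Assumption: main domain = second-to-last label, TLD = last label."""
    labels = domain.lower().rstrip(".").split(".")
    return (labels[-2], labels[-1]) if len(labels) > 1 else (labels[0], "")


def ngrams(text, n=2):
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_rank_table(main_domains, n=2):
    """Rank n-grams by training-set frequency; rank 1 is the most frequent."""
    counts = Counter(g for d in main_domains for g in ngrams(d, n))
    ordered = sorted(counts, key=lambda g: (-counts[g], g))  # ties: alphabetical
    return {g: rank for rank, g in enumerate(ordered, start=1)}


def ngram_rank_feature(main_domain, ranks, n=2):
    """Mean and variance of the frequency ranks of the domain's n-grams."""
    unseen = len(ranks) + 1  # assumption: unseen n-grams get the worst rank
    rs = [ranks.get(g, unseen) for g in ngrams(main_domain, n)]
    if not rs:
        return float(unseen), 0.0
    return float(mean(rs)), float(pvariance(rs))


def build_transition_matrix(main_domains):
    """First-order Markov chain: P(next character | current character)."""
    counts = defaultdict(Counter)
    for d in main_domains:
        for a, b in zip(d, d[1:]):
            counts[a][b] += 1
    matrix = {}
    for a, nxt in counts.items():
        total = sum(nxt.values())
        matrix[a] = {b: c / total for b, c in nxt.items()}
    return matrix


def transition_feature(main_domain, matrix, floor=1e-6):
    """Assumption: aggregate as mean log-probability; unseen pairs use floor."""
    logs = [math.log(matrix.get(a, {}).get(b, floor))
            for a, b in zip(main_domain, main_domain[1:])]
    return sum(logs) / len(logs) if logs else math.log(floor)


def tld_one_hot(tld, tld_vocab):
    return [1 if tld == t else 0 for t in tld_vocab]


def extract_features(domain, ranks, matrix, tld_vocab):
    main, tld = split_domain(domain)
    rank_mean, rank_var = ngram_rank_feature(main, ranks)
    return [float(len(main)), rank_mean, rank_var,
            transition_feature(main, matrix)] + tld_one_hot(tld, tld_vocab)


# Toy training set (hypothetical), used only to make the statistics concrete.
training = ["abab.com", "abc.net"]
mains = [split_domain(d)[0] for d in training]
ranks = build_rank_table(mains)
matrix = build_transition_matrix(mains)
tld_vocab = sorted({split_domain(d)[1] for d in training})
vector = extract_features("abc.net", ranks, matrix, tld_vocab)
print(vector)
```

The resulting vectors would then be normalized, reduced with PCA, and passed to a GBDT classifier as recited in claims 7 to 9; those steps are omitted here because they are normally delegated to a machine learning library.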
CN201710976231.XA 2017-10-19 2017-10-19 DGA domain name Quick method and devices based on machine learning Pending CN107682348A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710976231.XA CN107682348A (en) 2017-10-19 2017-10-19 DGA domain name Quick method and devices based on machine learning

Publications (1)

Publication Number Publication Date
CN107682348A (en) 2018-02-09

Family

ID=61141747

Country Status (1)

Country Link
CN (1) CN107682348A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104486461A (en) * 2014-12-29 2015-04-01 北京奇虎科技有限公司 Domain name classification method and device and domain name recognition method and system
CN105072214A (en) * 2015-08-28 2015-11-18 携程计算机技术(上海)有限公司 C&C domain name identification method based on domain name feature
CN105610830A (en) * 2015-12-30 2016-05-25 山石网科通信技术有限公司 Method and device for detecting domain name
CN105871619A (en) * 2016-04-18 2016-08-17 中国科学院信息工程研究所 Method for n-gram-based multi-feature flow load type detection
CN105897714A (en) * 2016-04-11 2016-08-24 天津大学 Botnet detection method based on DNS (Domain Name System) flow characteristics
CN106713312A (en) * 2016-12-21 2017-05-24 深圳市深信服电子科技有限公司 Method and device for detecting illegal domain name

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
佚名: "使用深度学习检测DGA(域名生成算法)", 《HTTPS://WWW.FREEBUF.COM/ARTICLES/NETWORK/139697.HTML》 *
周敏: "《制造业信息化工程学》", 31 January 2017 *
赵越: "基于DNS流量特征的僵尸网络检测方法研究", 《万方数据库》 *
陈敏: "《认知计算导论》", 31 May 2017 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110324273A (en) * 2018-03-28 2019-10-11 蓝盾信息安全技术有限公司 A kind of Botnet detection method combined based on DNS request behavior with domain name constitutive characteristic
WO2019223587A1 (en) * 2018-05-21 2019-11-28 新华三信息安全技术有限公司 Domain name identification
CN112771523A (en) * 2018-08-14 2021-05-07 北京嘀嘀无限科技发展有限公司 System and method for detecting a generated domain
CN109450842A (en) * 2018-09-06 2019-03-08 南京聚铭网络科技有限公司 A kind of network malicious act recognition methods neural network based
CN109450842B (en) * 2018-09-06 2023-06-13 南京聚铭网络科技有限公司 Network malicious behavior recognition method based on neural network
CN111200576A (en) * 2018-11-16 2020-05-26 慧盾信息安全科技(苏州)股份有限公司 Method for realizing malicious domain name recognition based on machine learning
CN109714356A (en) * 2019-01-08 2019-05-03 北京奇艺世纪科技有限公司 A kind of recognition methods of abnormal domain name, device and electronic equipment
CN110012122A (en) * 2019-03-21 2019-07-12 东南大学 A kind of domain name similarity analysis method of word-based embedded technology
CN110266647A (en) * 2019-05-22 2019-09-20 北京金睛云华科技有限公司 It is a kind of to order and control communication check method and system
CN110798481A (en) * 2019-11-08 2020-02-14 杭州安恒信息技术股份有限公司 Malicious domain name detection method and device based on deep learning
CN110830607A (en) * 2019-11-08 2020-02-21 杭州安恒信息技术股份有限公司 Domain name analysis method and device and electronic equipment
CN110830607B (en) * 2019-11-08 2022-07-08 杭州安恒信息技术股份有限公司 Domain name analysis method and device and electronic equipment
CN112839012A (en) * 2019-11-22 2021-05-25 中国移动通信有限公司研究院 Zombie program domain name identification method, device, equipment and storage medium
CN112839012B (en) * 2019-11-22 2023-05-09 中国移动通信有限公司研究院 Bot domain name identification method, device, equipment and storage medium
CN111147459A (en) * 2019-12-12 2020-05-12 北京网思科平科技有限公司 C & C domain name detection method and device based on DNS request data
CN111147459B (en) * 2019-12-12 2021-11-30 北京网思科平科技有限公司 C & C domain name detection method and device based on DNS request data
CN113542202B (en) * 2020-04-21 2022-09-30 深信服科技股份有限公司 Domain name identification method, device, equipment and computer readable storage medium
CN113542202A (en) * 2020-04-21 2021-10-22 深信服科技股份有限公司 Domain name identification method, device, equipment and computer readable storage medium
CN113645173A (en) * 2020-04-27 2021-11-12 北京观成科技有限公司 Malicious domain name identification method, system and equipment
CN113691489A (en) * 2020-05-19 2021-11-23 北京观成科技有限公司 Malicious domain name detection feature processing method and device and electronic equipment
US20220417261A1 (en) * 2021-06-23 2022-12-29 Comcast Cable Communications, Llc Methods, systems, and apparatuses for query analysis and classification
CN115065567A (en) * 2022-08-19 2022-09-16 北京金睛云华科技有限公司 Plug-in execution method for DGA domain name studying and judging inference machine
CN115065567B (en) * 2022-08-19 2022-11-11 北京金睛云华科技有限公司 Plug-in execution method for DGA domain name study and judgment inference machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180209