CN114422193A

CN114422193A - Botnet risk assessment method and device

Info

Publication number: CN114422193A
Application number: CN202111592272.1A
Authority: CN
Inventors: 李佚蝶
Original assignee: China Pacific Insurance Group Co Ltd CPIC
Current assignee: China Pacific Insurance Group Co Ltd CPIC
Priority date: 2021-12-23
Filing date: 2021-12-23
Publication date: 2022-04-29

Abstract

The invention provides a botnet risk assessment method comprising the following steps: a. after data cleaning, client data are fed separately into a C & C model, a deep learning DGA model and a relationship map fast flux model, which output the associated domain name risk suspicion degree Pc, the domain name threat probability Pd and the threat degree probability Pf, respectively; b. at least Pc, Pd, Pf and a labeled sample y value are input into a traditional machine learning model LR to determine a comprehensive threat score, wherein the labeled sample y is determined at least from public data samples and black samples of the enterprise internal basic security data. Before the client data are fed into the C & C model, at least feature engineering is performed on them to determine a plurality of derived variables, the feature engineering comprising at least the calculation of the beaconing score and of the transverse and longitudinal UA popularity. The invention has a simple flow, is convenient to use and has extremely high commercial value.

Description

Botnet risk assessment method and device

Technical Field

The invention belongs to the technical field of network security, and particularly relates to a botnet risk assessment method and device.

Background

A botnet consists of a set of malware-infected hosts that are remotely controlled by a so-called botmaster. Botnets may be used to perform a range of malicious activities, such as distributed denial-of-service attacks, sending spam, stealing personal information and performing distributed computing tasks. Botnet communication is mainly of three types: centralized, P2P and hybrid. In the centralized type, a zombie host (bot) communicates through a legitimate communication channel, generally IRC (Internet Relay Chat). When a host is infected, it establishes a connection with the IRC server. The botmaster sets up an IRC command and control (C & C) channel to control the zombie hosts, through which it issues intrusion commands, updates the behavior of the zombie hosts and sends other instructions. The centralized type carries the risk of a single point of failure: once the IRC server is discovered, the whole botnet goes out of control, and because the traffic from the hosts to the server can be monitored, the zombie hosts are easily detected. Newer botnets communicate in a P2P manner, which avoids the single point of failure and therefore makes the botnet more robust.

However, such botnets are costly to build and have high communication latency. Some botnets also carry their C & C signaling over the HTTP protocol, so that the traffic can bypass traditional firewalls and, when encrypted, can also pass through DPI-enabled devices. It is therefore necessary to detect botnets by their communication behavior. The life cycle of a botnet comprises four stages: formation, C & C, attack and post-attack. In the formation stage, an attacker compromises a vulnerable host and executes a malicious program on it, turning it into a zombie host. Once the host has become a zombie host, the botmaster communicates with it in various ways. The zombie host then carries out attacks according to the botmaster's instructions. The post-attack stage refers to the botmaster upgrading and updating the botnet. Conventional approaches to botnet detection typically rely on behavioral characteristics that the zombie host exhibits during the formation and attack stages, such as the content of the communicated data. Methods based on network traffic behavior analysis detect botnets mainly from the perspective of traffic characteristics, such as the communication period, and can also detect some encrypted zombie-host traffic and novel botnets.

In the prior art, there are two main detection techniques for C & C identification. One is feature-matching detection based on a rule engine, which inspects, for example, the header information and attack signatures of HTTP traffic; rule-based matching, however, easily produces a high false alarm rate and has difficulty detecting novel attacks. The other is manual analysis of abnormal traffic based on network packet capture, which is inefficient, has great difficulty with encrypted traffic in particular, and also has a high false alarm rate.

Moreover, the prior art mainly suffers from simple detection scenarios, a single means of detection, low efficiency, a high false alarm rate and insufficient ability to anticipate novel threats. It also lacks an automated platform that puts the detection technology into practice and thereby provides a closed-loop system for decision support, threat early warning and intervention.

At present, no technical scheme solves the above technical problems; in particular, no such botnet risk assessment method and device exist.

Disclosure of Invention

In view of the technical defects in the prior art, an object of the invention is to provide a botnet risk assessment method and device. According to one aspect of the invention, a botnet risk assessment method is provided which establishes a fused multi-scene, multi-dimensional model based on enterprise internal basic security data and public data samples and, in combination with a big data security platform, assesses security risk effectively and in real time. The method comprises the following steps:

a. after data cleaning, the client data are fed separately into a C & C model, a deep learning DGA model and a relationship map fast flux model, which output the associated domain name risk suspicion degree Pc, the domain name threat probability Pd and the threat degree probability Pf, respectively;

b. inputting at least the associated domain name risk suspicion degree Pc, the domain name threat probability Pd, the threat degree probability Pf and a labeled sample y value into a traditional machine learning model LR to determine a comprehensive threat score, wherein the labeled sample y is determined at least from public data samples and black samples of the enterprise internal basic security data, and wherein,

before the client data are fed into the C & C model, at least feature engineering is performed on the client data to determine a plurality of derived variables, the feature engineering comprising at least the calculation of the beaconing score and of the transverse and longitudinal UA popularity.

Preferably, in the step a, the establishment of the C & C model includes the steps of:

i: determining black and white HTTP proxy data samples inside the enterprise and external intelligence data;

ii: extracting, by means of regular expressions, data on the communication interaction between hosts and domain names as raw data, and deriving a plurality of complex variables after performing feature engineering on the raw data;

iii: performing variable selection with a random forest over the black and white samples of the enterprise internal basic security data, the public data samples and the complex variables, extracting the 18 variables with the highest correlation, feeding these 18 variables and the labeled sample y into a CatBoost model for training, and determining the C & C model after model tuning with 8-fold cross-validation.

Preferably, in the deep learning DGA model of step a, the domain name threat probability Pd is determined as follows: the FQDN of each DNS access record is extracted from the enterprise's internal DNS data; each FQDN undergoes text preprocessing and dictionary-based vector mapping to generate a 1 x 78 matrix, which is input into the pre-trained semantic model bert-base-uncased; training and tuning run for 10 epochs with the NLLLoss loss function and the AdamW_HF optimizer; the optimal model parameters are selected through early stopping; and the tuned model outputs the domain name threat probability Pd.

Preferably, in the relationship map fast flux model of step a, each IP and domain name pair is extracted from the network traffic and a directed graph of IPs and domain names is established by constructing a dynamic knowledge graph; high-ranking domain names are screened out by computing connected subgraphs and ranking by one-degree association; and the threat degree probability Pf is determined by a logit function.

Preferably, after step b, the method further comprises: checking and feeding back the comprehensive threat score against the result of querying the domain name on VirusTotal.

Preferably, the beaconing score is determined by the following formula:

B_SCORE = 100*(1-(1-math.exp(-diff*diff/2sigma))*(1-math.exp(-diff*diff/2sigma))),

wherein B_SCORE is the beaconing score, math.exp(-diff*diff/2sigma) is the abnormal offset, sigma is the normal offset, and diff is the interval-

The interval is the Time Delta of continuous communication between the Source and Destination pairs.
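As a rough illustration of the B_SCORE formula, the following Python sketch scores each interval of one Source and Destination pair. It rests on two assumptions that the text does not confirm: "2sigma" is read as 2*sigma, and, because the definition of diff is cut off in the text, diff is taken to be the interval minus the mean interval of the pair. The function name beaconing_score is hypothetical.

```python
import math
from statistics import mean


def beaconing_score(intervals, sigma):
    """Return one B_SCORE per interval for a single Source/Destination pair.

    B_SCORE = 100 * (1 - (1 - e) * (1 - e)), with e = exp(-diff*diff / (2*sigma)).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # ASSUMPTION: diff is the deviation of an interval from the mean interval.
    center = mean(intervals)
    scores = []
    for interval in intervals:
        diff = interval - center
        # ASSUMPTION: "2sigma" in the formula is read as 2 * sigma.
        e = math.exp(-diff * diff / (2 * sigma))
        scores.append(100 * (1 - (1 - e) * (1 - e)))
    return scores


# Perfectly periodic traffic versus irregular traffic (time deltas in seconds).
regular = beaconing_score([60.0, 60.0, 60.0, 60.0], sigma=4.0)
irregular = beaconing_score([5.0, 300.0, 42.0, 900.0], sigma=4.0)
```

Under these assumptions, strictly periodic communication scores close to 100, while intervals far from the typical interval score close to 0, which matches the statement that a higher value indicates a higher degree of threat.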

Preferably, the UA popularity is determined by the following formula:

UA popularity = TFuij*IDFuij;

wherein TFuij is the frequency of the UA within a specific HOST and IDFuij is the inverse UA frequency; the larger the value of IDFuij, the more rarely the UA appears across all HOSTs.
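The UA popularity formula can be sketched in Python along the lines of the TF-IDF analogy used in the detailed description, with TF taken as the share of a host's requests that carry the UA and IDF as log(H / (1 + number of hosts containing the UA)). The input layout (one (host, user_agent) pair per request), the use of the natural logarithm and the function name ua_popularity are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def ua_popularity(records):
    """Map (host, user_agent) -> TF * IDF.

    records: iterable of (host, user_agent) pairs, one per observed request.
    """
    per_host = defaultdict(Counter)
    for host, ua in records:
        per_host[host][ua] += 1

    total_hosts = len(per_host)  # H: number of all hosts
    hosts_with_ua = Counter()    # number of hosts containing each UA
    for counts in per_host.values():
        hosts_with_ua.update(counts.keys())

    scores = {}
    for host, counts in per_host.items():
        host_total = sum(counts.values())
        for ua, n in counts.items():
            tf = n / host_total
            # ASSUMPTION: natural logarithm; the text only says "log".
            idf = math.log(total_hosts / (1 + hosts_with_ua[ua]))
            scores[(host, ua)] = tf * idf
    return scores


records = []
for host in ("h1", "h2", "h3", "h4"):
    records += [(host, "Mozilla/5.0")] * 3
records.append(("h1", "evil-bot/1.0"))
scores = ua_popularity(records)
```

In this toy data the UA seen on only one host receives a higher score on h1 than the UA shared by every host, in line with the statement that a larger IDF value means the UA is rarer across all hosts.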

According to another aspect of the present invention, there is provided a botnet risk assessment apparatus comprising:

a first processing device, which, after data cleaning, feeds the client data separately into a C & C model, a deep learning DGA model and a relationship map fast flux model and obtains the output associated domain name risk suspicion degree Pc, domain name threat probability Pd and threat degree probability Pf;

a second processing device, which inputs at least the associated domain name risk suspicion degree Pc, the domain name threat probability Pd, the threat degree probability Pf and a labeled sample y value into a traditional machine learning model LR to determine a comprehensive threat score.

Preferably, the device further comprises:

a first determination device, which determines black and white HTTP proxy data samples inside the enterprise and external intelligence data;

a third processing device, which extracts, by means of regular expressions, data on the communication interaction between hosts and domain names as raw data, and derives a plurality of complex variables after performing feature engineering on the raw data;

a fourth processing device, which performs variable selection with a random forest over the black and white samples of the enterprise internal basic security data, the public data samples and the complex variables, extracts the 18 variables with the highest correlation, feeds these 18 variables and the labeled sample y into a CatBoost model for training, and determines the C & C model after model tuning with 8-fold cross-validation.

The method combines the strength of deep learning in identifying hidden patterns, the strength of the relationship map in correlation analysis and in identifying the unique value of associations, and the precision of the numerical model, and improves the overall benefit of existing botnet detection schemes through an ensemble learning scheme. In addition, a closed-loop system is deployed on a big data analysis platform to support decision making, threat early warning and intervention. By fusing the enterprise's internal and external security data and combining the correlation analysis of multiple scenes and multiple models, the risk assessment model can assess the enterprise's internal botnet risk more comprehensively and accurately, and can further profile the enterprise's network security comprehensively and objectively, thereby reducing the losses caused by network attacks, particularly for small and medium-sized enterprises. The invention has a simple flow, is convenient to use and has extremely high commercial value.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

Fig. 1 is a schematic flow chart of a botnet risk assessment method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the specific flow of establishing the C & C model according to the first embodiment of the present invention;

Fig. 3 is a block diagram of a botnet risk assessment device according to another embodiment of the present invention; and

Fig. 4 is a system framework diagram of a botnet risk assessment device according to a second embodiment of the present invention.

Detailed Description

In order to better and clearly show the technical scheme of the invention, the invention is further described with reference to the attached drawings.

Fig. 1 shows a detailed flow diagram of a botnet risk assessment method according to a specific embodiment of the present invention. The invention discloses a botnet risk assessment method that establishes a fused multi-scene, multi-dimensional model based on enterprise internal basic security data and public data samples and, in combination with a big data security platform, assesses security risk effectively and in real time. The invention is a C & C botnet risk assessment method and system based on a big data AI platform, characterized in that a multi-scene, multi-dimensional model is established based on the enterprise's internal basic security data and public data samples, and a big data security platform is then used for real-time, effective assessment, visual feedback and later-stage response to security risk.

Existing botnet detection is based on single-scene modeling of DGA, C & C or fast flux. The advantage and distinguishing feature of the present system is that, on a big data security platform, it builds a multi-scene fused combination model over multi-source heterogeneous data to strengthen the detection capability of existing single-scene models, and provides a multi-level fusion model framework to improve detection accuracy. In addition, a number of high-quality variables are derived using the data acquisition capability of the big data platform to improve modeling efficiency and accuracy.

In terms of variable acquisition, besides extracting multidimensional feature indicators such as enterprise-internal and external data, network transmission, static indicators (UA, URL, Referer), WHOIS, ANS and threat intelligence, a number of targeted complex feature indicators, such as the beaconing score and the UA popularity, are generated through variable engineering.

In terms of modeling, the strengths of different models in specific scenes are fully exploited: the adaptive learning of a neural network is used to derive the DGA domain name risk score, a knowledge graph is used to derive fast flux risk detection indicators, and a tree model is used to score C & C risk. Finally, a traditional statistical model performs a multi-level aggregation to generate a threat scorecard, which raises the detection rate and lowers the false alarm rate compared with existing single-scene botnet prediction models. In addition, the system provides, on the big data security platform, a closed-loop deployed system for decision support, threat early warning and intervention, overcoming the defect that existing detection models only raise alarms and do not intervene.

Specifically, the method comprises the following steps:

First, in step S101, the client data are cleaned and then fed separately into the C & C model, the deep learning DGA model and the relationship map fast flux model, and the output associated domain name risk suspicion degree Pc, domain name threat probability Pd and threat degree probability Pf are obtained. Fig. 4 shows a system framework diagram of a botnet risk assessment device according to a second embodiment of the present invention. After the client data are accessed, data cleaning and feature engineering are preferably performed before the data are fed into the C & C model, and data processing is preferably performed before the data are fed into the deep learning DGA model and before they are fed into the relationship map fast flux model. Further, as shown in Fig. 4, the client data first pass through the client data access module; the system then automatically cleans the data and performs feature engineering on the cleaned data, including variable screening, calculation, processing and querying third-party services for related information. Data modeling and analysis follow, in which base learning is performed by combining different models for multiple scenes; the output results are imported into the risk analysis module, which scores them and outputs the result to the user interface.

The core of the present invention is to associate and combine the three models to complete the assessment of botnet risk. The client data therefore need to be fed into the C & C model, the deep learning DGA model and the relationship map fast flux model, which preferably need to be established beforehand; their establishment is described further in the detailed description below and is not repeated here.

Then, in step S102, at least the associated domain name risk suspicion degree Pc, the domain name threat probability Pd, the threat degree probability Pf and the labeled sample y value are input into the traditional machine learning model LR, which is trained according to the LR probability formula to determine a comprehensive threat score, wherein the labeled sample y is determined at least from public data samples and black samples of the enterprise internal basic security data. The traditional machine learning model LR can be regarded as a single-layer, single-node "DNN": a wide rather than deep structure in which all features act directly on the final output. The model has the advantages of simplicity and good controllability, but its effectiveness depends directly on the quality of the feature engineering, and it requires very careful feature processing and feature combination for continuous, discrete, time-type and other features. Overfitting is typically controlled by regularization or similar means.
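The second-level step can be illustrated with a minimal logistic regression written in plain Python. The sketch assumes the standard LR form p = 1 / (1 + exp(-(w·x + b))) with x = (Pc, Pd, Pf), batch gradient descent and a small L2 penalty as the regularization; the text does not state the exact formula, optimizer or hyperparameters, and the training rows below are invented toy values, not data from the invention.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_lr(rows, labels, lr=0.5, epochs=2000, l2=0.001):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y
            for k, xi in enumerate(x):
                grad_w[k] += err * xi
            grad_b += err
        # L2 penalty on the weights (ASSUMPTION: the kind of regularization).
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def threat_score(weights, bias, pc, pd, pf):
    """Comprehensive threat score in (0, 1) from the three first-level outputs."""
    return sigmoid(weights[0] * pc + weights[1] * pd + weights[2] * pf + bias)


# Toy rows of (Pc, Pd, Pf) with labels y (1 = black sample, 0 = white sample).
rows = [(0.9, 0.8, 0.7), (0.8, 0.9, 0.6), (0.7, 0.6, 0.9),
        (0.1, 0.2, 0.1), (0.2, 0.1, 0.3), (0.3, 0.2, 0.2)]
labels = [1, 1, 1, 0, 0, 0]
weights, bias = train_lr(rows, labels)
high = threat_score(weights, bias, 0.85, 0.9, 0.8)
low = threat_score(weights, bias, 0.1, 0.1, 0.1)
```

Because all three inputs act directly on a single output node, the learned weights can be read as the relative contribution of each first-level model, which is the controllability the paragraph above refers to.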

As another core technical point of the present invention, before the client data are fed into the C & C model, at least feature engineering is performed on the client data to determine a plurality of derived variables, the feature engineering comprising at least the calculation of the beaconing score and of the transverse and longitudinal UA popularity. For variable acquisition in the tree model, besides the multi-dimensional data acquisition described above, data engineering is used to derive a number of related variables, such as the beaconing score and the transverse and longitudinal UA popularity; the 18 variables that finally enter the model are selected by variable correlation.

Further, the beaconing score is determined by the B_SCORE formula given below, where the interval underlying the score is the Time Delta of continuous communication between the Source and Destination pairs. The UA popularity is determined by the following formula:

UA popularity = TFuij*IDFuij;

wherein TFuij is the frequency of the UA within a specific HOST and IDFuij is the inverse UA frequency; a larger IDFuij value indicates that the UA appears more rarely across all HOSTs.

Considering the particular behavior of botnet UAs in the interaction between hosts and domain names, we treat the UA by analogy with TF-IDF, a weighting technique from text information retrieval, and generate the derived variable UA popularity by calculating the TF-IDF value of the UA, i.e. TFuij = Nuij / Σk Nukj, where Nuij denotes the number of times the UA appears in a particular host and TFuij denotes the frequency with which the UA appears in that host; and IDFuij = log(H / (1 + uai)), where H denotes the number of all HOSTs and uai denotes the number of HOSTs containing the particular UA. Finally, the UA popularity score is calculated as TFuij*IDFuij. For the beaconing score, we assume that the time delta (interval) of continuous communication between a Source and Destination pair follows a normal distribution and set an interval of two sigma as the normal interval; the value of the beaconing score is obtained by the following formula, a higher value indicating a higher degree of threat:

B_SCORE = 100*(1-(1-math.exp(-diff*diff/2sigma))*(1-math.exp(-diff*diff/2sigma)))

wherein diff represents

Further, in the deep learning DGA model of step S101, the domain name threat probability Pd is determined as follows: the FQDN of each DNS access record is extracted from the enterprise's internal DNS data; each FQDN undergoes text preprocessing and dictionary-based vector mapping to generate a 1 x 78 matrix, which is input into the pre-trained semantic model bert-base-uncased; training and tuning run for 10 epochs with the NLLLoss loss function and the AdamW_HF optimizer; the optimal model parameters are selected through early stopping; and the tuned model outputs the domain name threat probability Pd.

Further, in the relationship map fast flux model of step S101, each IP and domain name pair is extracted from the network traffic and a directed graph of IPs and domain names is established by constructing a dynamic knowledge graph; high-ranking domain names are screened out by computing connected subgraphs and ranking by one-degree association; and the threat degree probability Pf is determined by a logit function. In the fast flux detection scenario, the vertices V of the graph represent the IPs and domain names, and the edges E represent the relationships between IPs and domain names.
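A small Python sketch of the graph step is given below. It builds domain-to-IP edges, labels connected subgraphs, ranks domains by the number of directly associated IPs (taken here as the "one-degree association") and squashes that degree into a probability. The logistic mapping with a fixed offset stands in for the "logit function" and, like the offset value and the function name fast_flux_scores, is an assumption rather than the method specified in the text.

```python
import math
from collections import defaultdict


def fast_flux_scores(pairs, offset=3.0):
    """Return [(domain, component_id, pf)] sorted by one-degree association.

    pairs: iterable of (domain, ip) observations taken from network traffic.
    """
    domain_ips = defaultdict(set)  # directed edges: domain -> ip
    neighbours = defaultdict(set)  # undirected view, used for connectivity
    for domain, ip in pairs:
        domain_ips[domain].add(ip)
        neighbours[("domain", domain)].add(("ip", ip))
        neighbours[("ip", ip)].add(("domain", domain))

    # Label connected subgraphs with an iterative depth-first search.
    component = {}
    next_id = 0
    for start in neighbours:
        if start in component:
            continue
        component[start] = next_id
        stack = [start]
        while stack:
            node = stack.pop()
            for other in neighbours[node]:
                if other not in component:
                    component[other] = next_id
                    stack.append(other)
        next_id += 1

    ranked = sorted(domain_ips, key=lambda d: len(domain_ips[d]), reverse=True)
    results = []
    for domain in ranked:
        degree = len(domain_ips[domain])
        # ASSUMPTION: logistic squashing of (degree - offset) as the Pf mapping.
        pf = 1.0 / (1.0 + math.exp(-(degree - offset)))
        results.append((domain, component[("domain", domain)], pf))
    return results


pairs = [("flux.example", "10.0.0.%d" % i) for i in range(1, 7)]
pairs += [("shared.example", "10.0.0.1"), ("static.example", "192.0.2.10")]
results = fast_flux_scores(pairs)
```

In the toy data, the domain that resolves to six addresses ranks first, and the domain sharing one of those addresses falls into the same connected subgraph, which is the kind of association the graph view is meant to expose.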

Further, after step S102, the method further includes checking and feeding back the comprehensive threat score against the result of querying the domain name on VirusTotal. In this embodiment, a second-level traditional machine learning model LR is constructed from the three threat values Pc, Pd and Pf output by the first level, combined with the labeled sample y value. The final comprehensive threat value is determined from the risk probability given by the LR model using 0.5 as the threshold. In addition, the domain name is queried on VirusTotal, taking a query result greater than 3 as the criterion, and the model output is checked and fed back against the VirusTotal result.
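The decision and check logic of this step might look like the sketch below. It assumes that the VirusTotal "query result" is a count of engines flagging the domain, that the 0.5 threshold is applied with a strict comparison, and that a disagreement between model and VirusTotal is what gets fed back; none of these details is stated in the text. The count is passed in as a plain integer, so no VirusTotal API call is shown, and the function name final_verdict is hypothetical.

```python
def final_verdict(lr_probability, vt_detections, threshold=0.5, vt_min=3):
    """Combine the LR risk probability with a VirusTotal lookup result.

    lr_probability: risk probability output by the second-level LR model.
    vt_detections:  ASSUMPTION, number of engines flagging the domain.
    """
    model_flag = lr_probability > threshold
    vt_flag = vt_detections > vt_min
    return {
        "malicious": model_flag,                 # model decision at the threshold
        "confirmed": model_flag and vt_flag,     # model and VirusTotal agree
        "needs_feedback": model_flag != vt_flag, # disagreement to review
    }


agree = final_verdict(0.91, vt_detections=7)
disagree = final_verdict(0.91, vt_detections=0)
benign = final_verdict(0.12, vt_detections=0)
```

Cases marked needs_feedback are candidates for relabeling or manual review, which is one way the check could feed back into later training rounds.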

Furthermore, the DGA network suspicion degree is scored by a deep learning model and the C & C network suspicion degree is scored by a tree model over multiple dimensions; finally, following an ensemble learning approach, an outer traditional statistical model is nested on top to produce the comprehensive score. The outputs of the three different models are fused and passed as variables into the outer traditional statistical model, which is trained again, so that overall accuracy can be effectively improved. By integrating the related botnet scenes and combining statistical, machine learning and deep learning models, the strengths of each model are exploited to the fullest, and the overall benefit is improved by applying multi-layer ensemble learning. Through automated deployment on a big data security platform, this application assesses and monitors security risk effectively and in real time, with visual feedback and later-stage response, forming a closed-loop platform for assessment and response.

The key points to be protected in the invention are the modeling data, the modeling method, the established multi-level comprehensive model, the variables used by the model and the overall flow of the big data security platform system. The invention combines the strength of deep learning in identifying hidden patterns, the strength of the relationship map in correlation analysis and in identifying the unique value of associations, and the precision of the tree model, and establishes a multi-scene, multi-model, multi-level ensemble system spanning the transverse and longitudinal directions, thereby improving the overall benefit.

Fig. 2 shows a specific flowchart of the establishment of the C & C model according to the first embodiment of the present invention. In step S101, the establishment of the C & C model includes the following steps:

First, in step S201, black and white HTTP proxy data samples inside the enterprise and external intelligence data are determined. The invention is an enterprise C & C botnet risk assessment method and system established on the enterprise's internal security data and external public data. The system builds a C & C botnet detection model from the existing black and white samples inside the enterprise combined with external public threat intelligence; the enterprise data to be assessed are then input into the system and fed into the established model. The threat score output by the model lies between 0 and 1, and the C & C domain name risk score inside the enterprise is obtained based on a threshold (0.5 by default). The invention is a C & C botnet risk assessment method and system based on a big data security AI platform architecture. The enterprise's own internal basic security data and sample data collected by deploying a dedicated closed environment are combined with external public intelligence samples to form a dedicated data sample library, which enables a modeling approach that fuses multiple associated scenes, multiple dimensions and multiple levels. In terms of modeling, a multi-scene, multi-level modeling approach is adopted. In the construction of the first-level model, three different models are used in parallel to model the three botnet scenes DGA, C & C and fast flux respectively; in the C & C scene, data cleaning is performed on the black and white HTTP proxy data samples inside the enterprise and the external intelligence data.

Then, in step S202, data on the communication interaction between hosts and domain names are extracted by means of regular expressions as raw data, and a plurality of complex variables are derived after performing feature engineering on the raw data. Specifically, feature engineering is applied to clean, normalize and reduce the dimensionality of the raw data, and a plurality of complex variables (e.g., beaconing score, UA popularity score, referrer score, etc.) are derived by the methods described above.

Finally, in step S203, starting from the initial variables of steps S201 to S202, variable selection is performed with a random forest over the black and white samples of the enterprise internal basic security data, the public data samples and the complex variables, and the 18 variables with the highest correlation scores are extracted. The y variable is generated from the known black sample data. The 18 variables and the labeled sample y are then fed into a CatBoost model for training; model tuning with 8-fold cross-validation is used to prevent overfitting and to select the optimal parameters, which determines the C & C model. The risk probability output by the model gives the associated domain name risk suspicion degree of each access, denoted Pc.
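To make the 8-fold cross-validation concrete, the sketch below shows only how samples could be partitioned into eight train/validation splits; the random forest selection and the CatBoost training themselves are not reproduced. Shuffling with a fixed seed and the function name k_fold_indices are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=8, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # ASSUMPTION: samples are shuffled once with a fixed seed before splitting.
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


splits = list(k_fold_indices(40, k=8))
```

Each candidate parameter setting would be trained on the training part and scored on the held-out part of every split, and the setting with the best average validation score kept; every sample is used for validation exactly once, which is what guards against tuning to a single lucky split.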

Fig. 3 is a schematic diagram of the module connections of a botnet risk assessment device according to another embodiment of the present invention. The invention provides a botnet risk assessment device that uses the above risk assessment method and includes a first processing device 1, which, after data cleaning, feeds the client data separately into the C & C model, the deep learning DGA model and the relationship map fast flux model and obtains the output associated domain name risk suspicion degree Pc, domain name threat probability Pd and threat degree probability Pf. For the working principle of the first processing device 1, refer to step S101 above; it is not repeated here.

Further, the device also includes a second processing device 2, which inputs at least the associated domain name risk suspicion degree Pc, the domain name threat probability Pd, the threat degree probability Pf and the labeled sample y value into a traditional machine learning model LR to determine a comprehensive threat score. For the working principle of the second processing device 2, refer to step S102 above; it is not repeated here.

Further, the device also includes a first determination device 3, which determines black and white HTTP proxy data samples inside the enterprise and external intelligence data. For the working principle of the first determination device 3, refer to step S201 above; it is not repeated here.

Further, the device also includes a third processing device 4, which extracts, by means of regular expressions, data on the communication interaction between hosts and domain names as raw data and derives a plurality of complex variables after performing feature engineering on the raw data. For the working principle of the third processing device 4, refer to step S202 above; it is not repeated here.

Further, the device also includes a fourth processing device 5, which performs variable selection with a random forest over the black and white samples of the enterprise internal basic security data, the public data samples and the complex variables, extracts the 18 variables with the highest correlation scores, feeds these 18 variables and the labeled sample y into a CatBoost model for training, and determines the C & C model after model tuning with 8-fold cross-validation. For the working principle of the fourth processing device 5, refer to step S203 above; it is not repeated here.

It should be noted that the specific implementation of each of the above device embodiments is the same as the specific implementation of the corresponding method embodiment, and is not described herein again.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some embodiments, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Furthermore, those of skill in the art will understand that although some embodiments described herein include some features included in other embodiments, not others, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components in an apparatus according to an embodiment of the invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering; these words may be interpreted as names.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims

1. A botnet risk assessment method, characterized in that a fused multi-scenario, multi-dimensional model is established based on enterprise internal basic security data and public data samples, and security risks are assessed effectively in real time in combination with a big data security platform, the method comprising the following steps:

a. after data cleaning, the client data are fed separately into a C & C model, a deep learning DGA model and a relationship graph fast flux model, which output the associated domain name risk suspicion degree Pc, the domain name threat probability Pd and the threat degree probability Pf, respectively;

b. inputting at least the associated domain name risk suspicion degree Pc, the domain name threat probability Pd, the threat degree probability Pf and the value of a marked sample y into the conventional machine learning model LR to determine a comprehensive threat score.

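Wherein the marked sample y is determined at least from the public data samples and the black samples of the enterprise internal basic security data, and wherein, before the client data are substituted into the C & C model, at least feature engineering is performed on the client data to determine a plurality of derived variables, the feature engineering comprising at least the calculation of a beaconing score and of transverse and longitudinal UA popularity.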
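Step b can be illustrated with a minimal sketch, assuming that LR denotes logistic regression and that the comprehensive threat score is the fitted model's output probability. Both points are assumptions, and the training rows, labels and hyperparameters below are invented toy values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, features):
    """Comprehensive score for one (Pc, Pd, Pf) triple; last weight is the bias."""
    z = sum(w * x for w, x in zip(weights, features)) + weights[-1]
    return sigmoid(z)

def train_lr(rows, labels, epochs=2000, lr=0.5):
    """Fit logistic regression weights by batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for features, y in zip(rows, labels):
            error = predict(weights, features) - y
            for j, value in enumerate(features):
                grads[j] += error * value
            grads[-1] += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights

# Toy (Pc, Pd, Pf) rows with marked sample y: 1 = black sample, 0 = white sample.
rows = [(0.9, 0.8, 0.7), (0.8, 0.9, 0.6), (0.7, 0.6, 0.9),
        (0.1, 0.2, 0.1), (0.2, 0.1, 0.3), (0.3, 0.2, 0.2)]
labels = [1, 1, 1, 0, 0, 0]

weights = train_lr(rows, labels)
high = predict(weights, (0.85, 0.9, 0.8))  # all three models raise suspicion
low = predict(weights, (0.1, 0.1, 0.1))    # all three models report low risk
```

Fusing the three model outputs through a linear model keeps the contribution of each sub-model readable from its weight, which is one plausible reason for choosing a simple model at this stage.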

2. The risk assessment method according to claim 1, wherein in the step a, the establishment of the C & C model comprises the steps of:

3. The risk assessment method according to claim 1, wherein in the deep learning DGA model of step a, the domain name threat probability Pd is determined as follows: the FQDN of each DNS access record is extracted from the enterprise's internal DNS data; each FQDN undergoes text preprocessing and dictionary-based vector mapping to generate a 1 × 78 matrix, which is input into the pre-trained semantic model bert-base-uncased; the model is trained and tuned for 10 epochs with the NLLLoss loss function and the AdamW-HF optimizer; the optimal model parameters are selected by early stopping; and the optimized model outputs the domain name threat probability Pd.
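The preprocessing and dictionary-based mapping to a 1 × 78 matrix can be sketched as follows. The character-level dictionary `VOCAB`, the padding id 0 and the unknown-character id 1 are assumptions; the text does not say whether the dictionary is character-level or the WordPiece vocabulary that bert-base-uncased ships with.

```python
import string

MAX_LEN = 78  # width of the 1 x 78 matrix named in the text

# Assumed character dictionary: id 0 is padding, id 1 is "unknown character".
VOCAB = {
    ch: idx + 2
    for idx, ch in enumerate(string.ascii_lowercase + string.digits + "-._")
}

def preprocess(fqdn):
    """Lower-case the name, trim whitespace and drop a trailing root dot."""
    return fqdn.strip().lower().rstrip(".")

def encode(fqdn):
    """Map an FQDN to a fixed-length list of MAX_LEN integer ids."""
    ids = [VOCAB.get(ch, 1) for ch in preprocess(fqdn)][:MAX_LEN]
    return ids + [0] * (MAX_LEN - len(ids))

vector = encode("WWW.Example.COM.")
long_vector = encode("a" * 100)  # longer names are truncated to MAX_LEN
```

A fixed width lets every domain name be batched into the same tensor shape; names longer than 78 characters lose their tail under this scheme.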

4. The risk assessment method according to claim 1, wherein in the relationship graph fast flux model of step a, a dynamic knowledge graph is constructed by extracting each IP and domain name pair from the network traffic, a directed graph of IPs and domain names is established, high-ranking domain names are screened out by computing connected subgraphs and ranking first-degree associations, and the threat degree probability Pf is determined by a logit function.
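The graph steps of this claim can be sketched under several assumptions: edges point from domain name to IP, connected subgraphs are taken as weakly connected components, first-degree association is measured as the number of distinct IPs per domain name, and the final mapping is a logistic curve with hypothetical `midpoint` and `steepness` parameters. None of these choices is stated in the text.

```python
import math
from collections import defaultdict

def build_graph(pairs):
    """Directed edges domain -> IP, plus an undirected view for components."""
    resolves_to = defaultdict(set)
    neighbours = defaultdict(set)
    for domain, ip in pairs:
        resolves_to[domain].add(ip)
        neighbours[("domain", domain)].add(("ip", ip))
        neighbours[("ip", ip)].add(("domain", domain))
    return resolves_to, neighbours

def connected_components(neighbours):
    """Weakly connected components found by iterative depth-first search."""
    seen, components = set(), []
    for start in list(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components

def rank_domains(resolves_to):
    """Rank domain names by first-degree association (distinct IP count)."""
    return sorted(resolves_to, key=lambda d: (-len(resolves_to[d]), d))

def threat_probability(degree, midpoint=5.0, steepness=1.0):
    """Squash a degree into (0, 1); both parameters are hypothetical."""
    return 1.0 / (1.0 + math.exp(-steepness * (degree - midpoint)))

pairs = [("flux.test", f"198.51.100.{i}") for i in range(1, 11)]
pairs += [("static.test", "203.0.113.7"), ("shared.test", "198.51.100.1")]

resolves_to, neighbours = build_graph(pairs)
components = connected_components(neighbours)
ranking = rank_domains(resolves_to)
pf = {d: threat_probability(len(resolves_to[d])) for d in ranking}
```

A domain name that resolves to many addresses, such as `flux.test` here, rises to the top of the ranking, while `shared.test` is pulled into the same component only because it shares one address with it.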

5. The risk assessment method according to claim 1, further comprising, after step b: verifying the comprehensive threat score against the result of querying the domain name on VirusTotal, and feeding the result back.

6. The risk assessment method of claim 1, wherein said beaconing score is determined by the following formula:

wherein B_SCORE is the beaconing score, math.exp(-diff/2σ) is the abnormal offset, σ is the normal offset, and diff is the normal offset.
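Only the abnormal offset term is spelled out in the text, so the sketch below evaluates that term alone and does not attempt the complete beaconing score. It reads "-diff/2σ" as -diff/(2σ), which is an assumption, and the function name `abnormal_offset` is illustrative.

```python
import math

def abnormal_offset(diff, sigma):
    """Evaluate math.exp(-diff / (2 * sigma)), treating 2*sigma as the denominator."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.exp(-diff / (2.0 * sigma))

no_deviation = abnormal_offset(0.0, 10.0)      # diff of zero gives the maximum, 1.0
some_deviation = abnormal_offset(20.0, 10.0)   # equals exp(-1)
large_deviation = abnormal_offset(200.0, 10.0)
```

Under this reading the term decays from 1 towards 0 as diff grows relative to σ, so traffic that stays close to a regular interval keeps a value near 1.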

7. The risk assessment method according to claim 1, wherein said UA popularity is determined by the following formula:

UA popularity = TF_uij × IDF_uij;
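A TF × IDF product of this kind can be sketched as follows. The definitions used here are assumptions, since the text does not define the two factors or the subscripts: TF is taken as the share of a host's requests that carry a given user agent, and IDF as the logarithm of the number of hosts divided by the number of hosts that use that user agent.

```python
import math
from collections import Counter, defaultdict

def ua_popularity(requests):
    """Return {(host, ua): TF * IDF} for an iterable of (host, user_agent) pairs."""
    per_host = defaultdict(Counter)
    for host, ua in requests:
        per_host[host][ua] += 1
    # Number of distinct hosts on which each user agent appears.
    hosts_using = Counter(ua for counts in per_host.values() for ua in counts)
    n_hosts = len(per_host)
    scores = {}
    for host, counts in per_host.items():
        total = sum(counts.values())
        for ua, count in counts.items():
            tf = count / total
            idf = math.log(n_hosts / hosts_using[ua])
            scores[(host, ua)] = tf * idf
    return scores

requests = [
    ("h1", "browser"), ("h1", "browser"),
    ("h2", "browser"),
    ("h3", "browser"), ("h3", "rare-agent"),
]
scores = ua_popularity(requests)
```

With these definitions a user agent seen on every host scores 0, while one seen on a single host scores highest, so the value singles out user agents that are unusual across the enterprise.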

8. A botnet risk assessment device employing the risk assessment method according to any one of claims 1 to 7, comprising:

a first processing device (1), which, after data cleaning, feeds the client data separately into a C & C model, a deep learning DGA model and a relationship graph fast flux model, which output the associated domain name risk suspicion degree Pc, the domain name threat probability Pd and the threat degree probability Pf, respectively;

a second processing device (2), which inputs at least the associated domain name risk suspicion degree Pc, the domain name threat probability Pd, the threat degree probability Pf and the value of a marked sample y into the conventional machine learning model LR to determine a comprehensive threat score.

9. The risk assessment device of claim 8, further comprising:

a first determining device (3), which determines black and white HTTP proxy data samples inside the enterprise and external intelligence data;

a third processing device (4), which uses regular expressions to extract the communication interactions between hosts and domain names as raw data, and derives a plurality of complex variables by performing feature engineering on the raw data;

a fourth processing device (5), which performs variable selection by random forest over the black and white samples of the enterprise internal basic security data, the public data samples and the complex variables, extracts the 18 variables with the highest relevance, feeds these 18 variables together with the marked sample y into a CatBoost model for training, and determines the C & C model after model tuning with 8-fold cross-validation.