CN116827595A - Network attack detection method and device, storage medium and electronic equipment - Google Patents

Network attack detection method and device, storage medium and electronic equipment

Info

Publication number
CN116827595A
Authority
CN
China
Prior art keywords
recognition
model
information
target
target system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310520617.5A
Other languages
Chinese (zh)
Inventor
孙锐
郭煚
丁振涛
邱偲逸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310520617.5A
Publication of CN116827595A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/30 Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information
    • H04L63/308 Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information retaining data, e.g. retaining successful, unsuccessful communication attempts, internet access, or e-mail, internet telephony, intercept related information or call content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/40 Network security protocols

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Technology Law (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application discloses a network attack detection method and device, a storage medium and electronic equipment, and relates to the technical field of artificial intelligence. The method comprises the following steps: acquiring a target information set, wherein the target information set comprises at least N pieces of target information; inputting the target information in the target information set into N recognition models for recognition processing to obtain a recognition result set; determining the weight corresponding to each recognition model; and determining a detection result of the target system according to the recognition result output by each recognition model and the weight corresponding to each recognition model, wherein the detection result indicates whether a network attack exists in the target system. The application addresses the problem in the related art that detecting whether a network attack exists in a system on the basis of the traffic protocol performs poorly and thereby affects the security of the system.

Description

Network attack detection method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and apparatus for detecting a network attack, a storage medium, and an electronic device.
Background
An advanced persistent threat (APT) attack is a persistent, targeted and highly covert attack against a specific institution. An APT attack may aim to steal data or to destroy an information system; it is a new kind of complex, multi-step attack that uses advanced techniques and exploits unknown vulnerabilities. The economic loss caused by a successful APT attack can also be considerable, which makes APT attacks one of the most serious threats that enterprises currently face.
In the related art, network attack detection is generally based on the traffic protocol: the network five-tuple and the detailed fields of each message are matched in order to analyze and assess whether a network attack has occurred. However, this detection approach has three disadvantages:
(1) Signature-based detection is efficient, but the detection means is relatively limited, and an attack can be detected only while the host is being compromised or after it has been compromised;
(2) The miss rate (false negatives) is high for variant attacks, dispersed attacks disguised as ordinary access behavior, and similar situations;
(3) The threat detection rate and the false alarm rate are often strongly influenced by the size of the rule base: more rules produce more false alarms, while fewer rules produce more missed detections.
For the problem in the related art that detecting whether a network attack exists in a system on the basis of the traffic protocol performs poorly and thereby affects the security of the system, no effective solution has yet been proposed.
Disclosure of Invention
The main purpose of the application is to provide a network attack detection method and device, a storage medium and electronic equipment, so as to solve the problem in the related art that detecting whether a network attack exists in a system on the basis of the traffic protocol performs poorly and thereby affects the security of the system.
In order to achieve the above object, according to one aspect of the present application, there is provided a method for detecting a network attack. The method comprises the following steps: obtaining a target information set, wherein the target information set at least comprises N pieces of target information, and the target information at least comprises: mail information in a target system, user behavior information in the target system, program information in the target system, network information for accessing the target system and flow characteristic information for accessing the target system, wherein the target system is a system to be detected, and N is a positive integer; inputting target information in the target information set into N recognition models for recognition processing to obtain a recognition result set, wherein the recognition result set at least comprises a recognition result output by each recognition model; determining the weight corresponding to each recognition model; and determining a detection result of the target system according to the identification result output by each identification model and the weight corresponding to each identification model, wherein the detection result is used for indicating whether network attack exists in the target system.
Further, determining the detection result of the target system according to the identification result output by each identification model and the weight corresponding to each identification model includes: determining a first numerical value set according to the identification result output by each identification model, wherein the first numerical value set at least comprises N first numerical values, and the first numerical values are numerical values corresponding to the identification result output by the identification model; calculating to obtain a target value according to N first values in the first value set and the weight corresponding to each recognition model; judging whether the target value is larger than a first preset threshold value or not; if the target value is larger than the first preset threshold value, determining that the detection result is that network attack exists in the target system; and if the target value is not greater than the first preset threshold value, determining that the detection result is that no network attack exists in the target system.
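The threshold decision described above can be sketched as follows. This is only a minimal illustration: the text does not state how the target value is calculated from the first values and the weights, so a weighted sum is assumed here, and the function name `detect_attack` and its parameters are hypothetical.

```python
from typing import Sequence


def detect_attack(first_values: Sequence[float],
                  weights: Sequence[float],
                  first_threshold: float) -> bool:
    """Return True if a network attack is considered present.

    Assumption: the target value is the weighted sum of the N first values.
    """
    if len(first_values) != len(weights):
        raise ValueError("one weight is required per recognition model")
    # Combine each model's numeric result with that model's weight.
    target_value = sum(v * w for v, w in zip(first_values, weights))
    # Strictly greater than the first preset threshold means "attack";
    # a value equal to the threshold counts as "no attack".
    return target_value > first_threshold
```

For example, with five models that each carry a weight of 0.2 and a threshold of 0.5, three positive results give a target value of about 0.6 and the function reports an attack, while a single positive result does not.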
Further, the N recognition models at least include: a first recognition model, a second recognition model, a third recognition model, a fourth recognition model and a fifth recognition model; and inputting the target information in the target information set into the N recognition models for recognition processing to obtain the recognition result set comprises the following steps: inputting mail information in the target system into the first recognition model for recognition processing, and outputting a first recognition result, wherein the first recognition result is used for indicating whether junk mail exists in the target system; inputting the user behavior information in the target system into the second recognition model for recognition processing, and outputting a second recognition result, wherein the second recognition result is used for representing whether the user behavior in the target system is malicious or not; inputting the program information in the target system into the third recognition model for recognition processing, and outputting a third recognition result, wherein the third recognition result is used for indicating whether a malicious program exists in the target system; inputting the network information accessing the target system into the fourth recognition model for recognition processing, and outputting a fourth recognition result, wherein the fourth recognition result is used for indicating whether the network accessing the target system is a botnet or not; inputting the flow characteristic information of the access target system into the fifth recognition model for recognition processing, and outputting a fifth recognition result, wherein the fifth recognition result is used for indicating whether the flow of the access target system is malicious or not; and summarizing the first recognition result, the second recognition result, the third recognition result, the fourth recognition result and the fifth recognition result to obtain the recognition result set.
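The routing of the five kinds of target information to the five recognition models can be sketched as below. The key names in `INFO_KEYS`, the function `build_result_set` and the representation of each model as a plain callable returning a boolean are assumptions made for illustration; the text does not prescribe any particular interface.

```python
from typing import Any, Callable, Dict

# Hypothetical keys, one per kind of target information named in the text.
INFO_KEYS = ("mail", "user_behavior", "program", "network", "flow")


def build_result_set(target_info: Dict[str, Any],
                     models: Dict[str, Callable[[Any], bool]]) -> Dict[str, bool]:
    """Feed each piece of target information to its own recognition model
    and summarize the five outputs into one recognition result set."""
    results: Dict[str, bool] = {}
    for key in INFO_KEYS:
        # Each model sees only the information type it is responsible for.
        results[key] = bool(models[key](target_info[key]))
    return results
```

Keeping the models behind a common callable interface means that any one of them can be retrained or replaced without touching the summarizing step.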
Further, the first recognition model is obtained by: obtaining M sample mails, wherein the M sample mails at least comprise junk mails and non-junk mails, and M is a positive integer; performing feature extraction processing on the M sample mails to obtain a first data set, wherein the first data set at least comprises S pieces of first feature data, and S is a positive integer; dividing the first dataset into a first training set for training a model and a first test set for testing the model; and training the first neural network model by adopting the first training set to obtain the first recognition model.
Further, after training the first neural network model with the first training set to obtain the first recognition model, the method further includes: acquiring the training duration for training the first neural network model; calculating the accuracy of the first recognition model using the first test set; and determining a first test result for the first recognition model according to the training duration and the accuracy.
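The split, train and test steps for the first recognition model can be sketched as follows. This is a sketch under stated assumptions: the 80/20 split ratio, the names `split_dataset` and `train_and_test`, and the idea of passing the training routine in as a callable are all hypothetical, and the text does not say how training duration and accuracy are combined, so both are simply reported together.

```python
import random
import time
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]        # (feature data, label: 1 = junk mail)
Predictor = Callable[[Sequence[float]], int]


def split_dataset(data: List[Sample], test_ratio: float = 0.2,
                  seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle the first data set and divide it into training and test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def train_and_test(train_fn: Callable[[List[Sample]], Predictor],
                   train_set: List[Sample],
                   test_set: List[Sample]) -> Dict[str, float]:
    """Train a model, then report training duration and test-set accuracy."""
    start = time.perf_counter()
    predict = train_fn(train_set)           # stands in for neural network training
    training_seconds = time.perf_counter() - start
    correct = sum(1 for features, label in test_set if predict(features) == label)
    return {"training_seconds": training_seconds,
            "accuracy": correct / len(test_set)}
```

Any training routine that accepts the training set and returns a prediction function can be plugged in as `train_fn`, including one that wraps a neural network library.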
Further, the second recognition model is obtained by: acquiring T pieces of user behavior data, wherein T is a positive integer; obtaining a first word set according to the T pieces of user behavior data, wherein the first word set at least comprises P first words, and P is a positive integer; combining the first words in the first word set to obtain Y first sentences, wherein Y is a positive integer; obtaining a second data set based on the Y first sentences; obtaining a second training set for training a model from the second data set; and training a second neural network model by adopting the second training set to obtain the second recognition model.
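One possible way to combine first words into first sentences is a sliding window over the word sequence (an n-gram). The text above does not name the combination scheme, so the sliding window, the window size `n` and the function name `combine_words` are assumptions made for illustration only.

```python
from typing import List, Sequence


def combine_words(words: Sequence[str], n: int = 3) -> List[str]:
    """Combine consecutive words into overlapping 'sentences' of n words each."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A window of n words slides over the sequence one word at a time;
    # fewer than n words yields an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

With the behavior words `login`, `download`, `delete`, `logout` and `n = 2`, this yields the three sentences `login download`, `download delete` and `delete logout`, which preserve the local order of user actions.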
Further, the third recognition model is obtained by: obtaining K program files, wherein K is a positive integer; obtaining a third data set according to the K program files; obtaining a third training set for training a model from the third data set; and training a third neural network model by adopting the third training set to obtain the third recognition model.
Further, obtaining a third data set according to the K program files includes: obtaining a second word set according to the K program files, wherein the second word set at least comprises R second words, and R is a positive integer; combining the second words in the second word set to obtain Z second sentences, wherein Z is a positive integer; determining the importance degree of each second sentence; determining a statement set based on the importance degree of each second statement, wherein the statement set at least comprises W second statements, W is a positive integer, and W is smaller than or equal to Z; and summarizing the W second sentences to obtain the third data set.
Further, inputting the network information accessing the target system into the fourth recognition model for recognition processing, and outputting a fourth recognition result includes: inputting network information accessing the target system into the fourth identification model for identification processing, and outputting V target IP addresses, wherein the target IP addresses are IP addresses corresponding to target networks, the target networks are networks accessing the target system, and V is a positive integer; determining U first IP addresses from the V target IP addresses, wherein the first IP addresses are IP addresses corresponding to a first network, the number of domain names of the first network attack is larger than a second preset threshold, and U is a positive integer; determining a second IP address from the U first IP addresses, wherein the second IP address is an IP address corresponding to a second network, and the similarity of domain names of the second network attack is greater than a third preset threshold; judging whether the number of the second IP addresses is smaller than a fourth preset threshold value or not; if the number of the second IP addresses is smaller than the fourth preset threshold value, determining that the fourth identification result is that the network accessing the target system is a botnet; and if the number of the second IP addresses is not smaller than the fourth preset threshold value, determining that the network accessing the target system is not a botnet as the fourth identification result.
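The filtering chain over IP addresses can be sketched as below, following the comparisons exactly as they are worded above. Several points are assumptions: that each target IP address comes with a list of associated domain names, that similarity is the mean pairwise string similarity computed with Python's `difflib.SequenceMatcher`, and all function and parameter names.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List


def mean_similarity(domains: List[str]) -> float:
    """Mean pairwise similarity of domain names (assumed similarity measure)."""
    pairs = list(combinations(domains, 2))
    if not pairs:
        return 0.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def is_botnet(domains_by_ip: Dict[str, List[str]],
              second_threshold: int,
              third_threshold: float,
              fourth_threshold: int) -> bool:
    """Apply the three filters to the V target IP addresses."""
    # First IP addresses: number of domain names above the second threshold.
    first_ips = [ip for ip, names in domains_by_ip.items()
                 if len(names) > second_threshold]
    # Second IP addresses: domain-name similarity above the third threshold.
    second_ips = [ip for ip in first_ips
                  if mean_similarity(domains_by_ip[ip]) > third_threshold]
    # As worded in the text: fewer second IP addresses than the fourth
    # threshold means the accessing network is judged to be a botnet.
    return len(second_ips) < fourth_threshold
```

The three thresholds are left as parameters because the text only calls them preset values.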
Further, the fifth recognition model is obtained by: acquiring threat information data, wherein the threat information data at least comprises F flow characteristic data, and F is a positive integer; obtaining a third word set according to F flow characteristic data in the threat information data, wherein the third word set at least comprises H third words, and H is a positive integer; determining the importance degree of each third word; determining X third words from the H third words based on the importance degree of each third word, wherein X is a positive integer and X is smaller than H; summarizing the X third words to obtain a fourth data set; obtaining a fourth training set for training a model from the fourth data set; and training the fourth neural network model by adopting the fourth training set to obtain the fifth recognition model.
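Selecting the X most important third words can be sketched with a TF-IDF style score. The text above does not say how the importance degree is computed, so the use of TF-IDF with a smoothed inverse document frequency, the treatment of each piece of flow characteristic data as a list of words, and the name `select_important_words` are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def select_important_words(documents: Sequence[Sequence[str]], x: int) -> List[str]:
    """Return the x highest-scoring words, where x must be smaller than the
    number of distinct words H."""
    term_freq: Counter = Counter()
    doc_freq: Counter = Counter()
    for doc in documents:
        term_freq.update(doc)        # total occurrences of each word
        doc_freq.update(set(doc))    # number of documents containing the word
    if not 0 < x < len(term_freq):
        raise ValueError("x must be a positive integer smaller than H")
    total_terms = sum(term_freq.values())
    n_docs = len(documents)
    scores = {
        word: (term_freq[word] / total_terms)
        * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word in term_freq
    }
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:x]
```

The selected words would then be summarized into the fourth data set, from which the fourth training set is taken.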
In order to achieve the above object, according to another aspect of the present application, there is provided a detection apparatus for network attack. The device comprises: the first acquisition unit is configured to acquire a target information set, where the target information set includes at least N pieces of target information, and the target information at least includes: mail information in a target system, user behavior information in the target system, program information in the target system, network information for accessing the target system and flow characteristic information for accessing the target system, wherein the target system is a system to be detected, and N is a positive integer; the first processing unit is used for inputting target information in the target information set into N recognition models to perform recognition processing to obtain a recognition result set, wherein the recognition result set at least comprises a recognition result output by each recognition model; the first determining unit is used for determining the weight corresponding to each recognition model; and the second determining unit is used for determining a detection result of the target system according to the identification result output by each identification model and the weight corresponding to each identification model, wherein the detection result is used for indicating whether the network attack exists in the target system.
Further, the second determining unit includes: the first determining module is used for determining a first value set according to the identification result output by each identification model, wherein the first value set at least comprises N first values, and the first values are values corresponding to the identification result output by the identification model; the first calculation module is used for calculating a target value according to N first values in the first value set and the weight corresponding to each recognition model; the first judging module is used for judging whether the target value is larger than a first preset threshold value or not; the second determining module is configured to determine that the detection result is that a network attack exists in the target system if the target value is greater than the first preset threshold; and the third determining module is used for determining that the network attack does not exist in the target system if the target value is not greater than the first preset threshold value.
Further, the N recognition models at least include: a first recognition model, a second recognition model, a third recognition model, a fourth recognition model, and a fifth recognition model, the first processing unit comprising: the first processing module is used for inputting mail information in the target system into the first recognition model for recognition processing and outputting a first recognition result, wherein the first recognition result is used for indicating whether junk mail exists in the target system; the second processing module is used for inputting the user behavior information in the target system into the second recognition model for recognition processing and outputting a second recognition result, wherein the second recognition result is used for representing whether the user behavior in the target system is malicious or not; the third processing module is used for inputting the program information in the target system into the third recognition model for recognition processing and outputting a third recognition result, wherein the third recognition result is used for indicating whether a malicious program exists in the target system; the fourth processing module is used for inputting the network information accessing the target system into the fourth recognition model for recognition processing and outputting a fourth recognition result, wherein the fourth recognition result is used for indicating whether the network accessing the target system is a botnet or not; the fifth processing module is used for inputting the flow characteristic information of the access target system into the fifth recognition model for recognition processing and outputting a fifth recognition result, wherein the fifth recognition result is used for indicating whether the flow of the access target system is malicious or not; and the sixth processing module is used for summarizing the first recognition result, the second recognition result, the third recognition result, the fourth recognition result and the fifth recognition result to obtain the recognition result set.
Further, the first recognition model is obtained by: the second acquisition unit is used for acquiring M sample mails, wherein the M sample mails at least comprise junk mails and non-junk mails, and M is a positive integer; the second processing unit is used for carrying out feature extraction processing on the M sample mails to obtain a first data set, wherein the first data set at least comprises S pieces of first feature data, and S is a positive integer; a first dividing unit for dividing the first data set into a first training set for training a model and a first test set for testing the model; and the first training unit is used for training the first neural network model by adopting the first training set to obtain the first identification model.
Further, the apparatus further comprises: a third obtaining unit, configured to obtain the training duration for training the first neural network model after the first neural network model is trained with the first training set to obtain the first recognition model; a first computing unit, configured to calculate the accuracy of the first recognition model using the first test set; and a third determining unit, configured to determine a first test result for the first recognition model according to the training duration and the accuracy.
Further, the second recognition model is obtained by: a fourth obtaining unit, configured to obtain T pieces of user behavior data, where T is a positive integer; a fourth determining unit, configured to obtain a first word set according to the T user behavior data, where the first word set includes at least P first words, and P is a positive integer; the third processing unit is used for carrying out combination processing on the first words in the first word set to obtain Y first sentences, wherein Y is a positive integer; a fifth determining unit, configured to obtain a second data set based on the Y first statements; a fifth acquisition unit configured to acquire a second training set for training a model from the second data set; and the second training unit is used for training a second neural network model by adopting the second training set to obtain the second recognition model.
Further, the third recognition model is obtained by: a sixth obtaining unit, configured to obtain K program files, where K is a positive integer; a sixth determining unit, configured to obtain a third data set according to the K program files; a seventh acquisition unit configured to acquire a third training set for training a model from the third data set; and the third training unit is used for training a third neural network model by adopting the third training set to obtain the third recognition model.
Further, the sixth determination unit includes: a fourth determining module, configured to obtain a second word set according to the K program files, where the second word set includes at least R second words, and R is a positive integer; a seventh processing module, configured to perform a combination process on the second words in the second word set to obtain Z second sentences, where Z is a positive integer; a fifth determining module, configured to determine an importance level of each second sentence; a sixth determining module, configured to determine a sentence set based on an importance level of each second sentence, where the sentence set includes at least W second sentences, W is a positive integer, and W is less than or equal to Z; and an eighth processing module, configured to perform summarization processing on the W second sentences to obtain the third data set.
Further, the fourth processing module includes: the first processing sub-module is used for inputting the network information accessing the target system into the fourth identification model for identification processing and outputting V target IP addresses, wherein the target IP addresses are IP addresses corresponding to a target network, the target network is a network accessing the target system, and V is a positive integer; a first determining submodule, configured to determine U first IP addresses from the V target IP addresses, where the first IP addresses are IP addresses corresponding to a first network, the number of domain names of the first network attack is greater than a second preset threshold, and U is a positive integer; a second determining submodule, configured to determine a second IP address from the U first IP addresses, where the second IP address is an IP address corresponding to a second network, and a similarity degree of domain names of the second network attack is greater than a third preset threshold; the first judging submodule is used for judging whether the number of the second IP addresses is smaller than a fourth preset threshold value or not; a third determining submodule, configured to determine that the network accessing the target system is a botnet if the number of the second IP addresses is less than the fourth preset threshold; and a fourth determining submodule, configured to determine that the network accessing the target system is not a botnet if the number of the second IP addresses is not less than the fourth preset threshold.
Further, the fifth recognition model is obtained by: an eighth obtaining unit, configured to obtain threat information data, where the threat information data includes at least F flow feature data, where F is a positive integer; a seventh determining unit, configured to obtain a third word set according to F flow characteristic data in the threat intelligence data, where the third word set includes at least H third words, and H is a positive integer; an eighth determining unit configured to determine a degree of importance of each third word; a ninth determining unit configured to determine X third words from the H third words based on the importance level of each third word, where X is a positive integer, and X is smaller than H; the fourth processing unit is used for summarizing the X third words to obtain a fourth data set; a ninth acquisition unit configured to acquire a fourth training set for training a model from the fourth data set; and the fourth training unit is used for training a fourth neural network model by adopting the fourth training set to obtain the fifth recognition model.
In order to achieve the above object, according to another aspect of the present application, there is provided a computer-readable storage medium storing a program, wherein the program performs the network attack detection method of any one of the above.
In order to achieve the above object, according to another aspect of the present application, there is provided an electronic device including one or more processors and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for detecting a network attack as set forth in any one of the above.
According to the application, the following steps are adopted: obtaining a target information set, wherein the target information set at least comprises N pieces of target information, and the target information at least comprises: mail information in a target system, user behavior information in the target system, program information in the target system, network information for accessing the target system and flow characteristic information for accessing the target system, wherein the target system is a system to be detected, and N is a positive integer; inputting target information in the target information set into N recognition models for recognition processing to obtain a recognition result set, wherein the recognition result set at least comprises a recognition result output by each recognition model; determining the weight corresponding to each recognition model; and determining a detection result of the target system according to the recognition result output by each recognition model and the weight corresponding to each recognition model, wherein the detection result is used for indicating whether a network attack exists in the target system. This solves the problem in the related art that detecting whether a network attack exists in a system on the basis of the traffic protocol performs poorly and thereby affects the security of the system. By obtaining a target information set that at least comprises mail information in the target system, user behavior information in the target system, program information in the target system, network information for accessing the target system and flow characteristic information for accessing the target system, inputting the target information in the target information set into N recognition models for recognition processing to obtain a recognition result set, and detecting whether a network attack exists in the target system according to the recognition result output by each recognition model and the weight corresponding to each recognition model, the effect of detecting whether a network attack exists in the system can be improved, and the security of the system is thereby better protected.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
Fig. 1 is a flowchart of a method for detecting a network attack according to an embodiment of the present application;
Fig. 2 is a flowchart of a method for detecting a network attack according to an embodiment of the present application;
Fig. 3 is a second flowchart of a method for detecting a network attack according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a network attack detection device according to an embodiment of the present application;
Fig. 5 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
In order that those skilled in the art may better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the application herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, user behavior information, etc.) and the data (including, but not limited to, data for analysis, stored data, displayed data, user behavior data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
For convenience of description, the following will describe some terms or terminology involved in the embodiments of the present application:
Machine learning involves multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. Its purpose is to have a computer simulate or realize human learning behavior so as to acquire new knowledge or skills and reorganize its existing knowledge structure, thereby continuously improving its own performance. In short, machine learning is the training of machines by providing large amounts of relevant data.
A data set is a collection of data used to train and test models, and is typically collected from the real world. The data in a data set may be structured (e.g., a table) or unstructured (e.g., images or text). In machine learning, a data set is used to train a model, evaluate the performance of the model, and provide predictions for unknown data.
Feature extraction converts data (e.g., text or images) into numerical features that can be used for machine learning. Feature engineering is the first step in machine learning and directly affects its results. It can be said that the data and features determine the upper limit of machine learning, and that models and algorithms only approach this upper limit. Feature engineering comprises feature extraction, feature preprocessing, feature dimension reduction, and the like.
Model training means using known data to obtain, through a finite number of computations by a computer algorithm, the required expected termination state or data, so that newly input data can then be accurately classified and predicted.
APT (Advanced Persistent Threat) refers to a covert and persistent computer intrusion process, usually carefully planned by certain persons against a particular target. APT attacks target specific institutions and must remain highly concealed over a long period. An APT consists of three elements: advanced, persistent, and threat. "Advanced" emphasizes the use of sophisticated malicious programs and techniques and the exploitation of vulnerabilities in the system; "persistent" implies that some external force continuously monitors a specific target and acquires data from it; "threat" refers to an attack planned with human involvement.
The bag of words model is a text representation method that treats all words in a document as equally weighted features, ignoring the order and grammatical relations between them. In this model, each word can be represented independently by a vector and forms a complete vector space with the other words. Thus, in text data analysis tasks (e.g., classification, clustering, etc.), methods based on vector similarity or probability statistics, etc., are typically used to process these abstract features.
TF-IDF (Term Frequency-Inverse Document Frequency) is a basic algorithm commonly used in information retrieval and text mining. The core idea is to evaluate how important a word is to a document by weighing how often the word appears in that document (term frequency) against how common the word is across the whole corpus (inverse document frequency).
A botnet is a large-scale network of computers that have been infected and are controlled by a malicious program.
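As a point of reference for the TF-IDF term defined above, one common textbook formulation is given below. This is an assumption for illustration only: the application does not state which variant, normalization, or smoothing it uses. Here tf(t, d) is the frequency of term t in document d, and D is the corpus.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{|D|}{\left|\{\, d' \in D : t \in d' \,\}\right|}
```

Under this formulation, a term that appears in every document of the corpus has an inverse document frequency of zero and therefore contributes nothing, whereas a term concentrated in a few documents receives a high value.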
The present application is described below with reference to preferred implementation steps. Fig. 1 is a flowchart of a method for detecting a network attack according to an embodiment of the present application. As shown in Fig. 1, the method includes the following steps:
Step S101, acquiring a target information set, wherein the target information set comprises at least N pieces of target information, and the target information comprises at least: mail information in a target system, user behavior information in the target system, program information in the target system, network information of networks accessing the target system, and flow characteristic information of traffic accessing the target system, wherein the target system is the system to be detected, and N is a positive integer.
For example, the N pieces of target information may be a plurality of mails collected in the system, a plurality of pieces of data about user behavior, a plurality of program files, a plurality of networks accessing the system, and a plurality of pieces of traffic information about access to the system. Here, the system is the system in which the presence of a network attack is to be detected.
Step S102, inputting target information in the target information set into N recognition models for recognition processing to obtain a recognition result set, wherein the recognition result set at least comprises a recognition result output by each recognition model.
For example, the N recognition models may be a spam recognition model, a user behavior analysis and malicious behavior monitoring model, a malicious program classification recognition model, a botnet recognition model, and a traffic feature recognition model constructed using APT threat intelligence. The collected mails in the system may then be input into the spam recognition model to obtain information on whether spam exists among the input mails; the collected data on user behavior in the system may be input into the user behavior analysis and malicious behavior monitoring model to obtain information on whether malicious behavior exists among the input user behaviors; the collected program files in the system may be input into the malicious program classification recognition model to obtain information on whether a malicious program exists among the input programs; the collected networks accessing the system may be input into the botnet recognition model to obtain information on whether a botnet exists among those networks; and the collected traffic information about access to the system may be input into the traffic feature recognition model constructed using APT threat intelligence to obtain information on whether malicious traffic exists in that traffic information.
Step S103, determining the weight corresponding to each recognition model.
For example, flexible weight setting may be performed for each model (the above-described recognition model), that is, weight information corresponding to each recognition model may be determined.
Step S104, determining a detection result of the target system according to the identification result output by each identification model and the weight corresponding to each identification model, wherein the detection result is used for indicating whether the network attack exists in the target system.
For example, whether or not a network attack exists in the system to be detected may be detected based on the recognition result output by each model and the weight set for each model.
It should be noted that the method for detecting a network attack provided by the embodiment of the present application can be applied to financial scenarios.
Through steps S101 to S104 above, a target information set comprising at least the mail information in the target system, the user behavior information in the target system, the program information in the target system, the network information of networks accessing the target system, and the flow characteristic information of traffic accessing the target system is acquired; the target information in the target information set is input into N recognition models for recognition processing to obtain a recognition result set; and whether a network attack exists in the target system is detected according to the recognition result output by each recognition model and the weight corresponding to each recognition model. This improves the effectiveness of detecting whether a network attack exists in the system and thereby better ensures the security of the system.
Fig. 2 is a flowchart of a method for detecting a network attack according to an embodiment of the present application. As shown in Fig. 2, in the method for detecting a network attack according to the embodiment of the present application, the N recognition models include at least a first recognition model, a second recognition model, a third recognition model, a fourth recognition model, and a fifth recognition model, and inputting the target information in the target information set into the N recognition models for recognition processing to obtain the recognition result set includes:
Step S201, inputting the mail information in the target system into the first recognition model for recognition processing, and outputting a first recognition result, wherein the first recognition result indicates whether spam exists in the target system;
Step S202, inputting the user behavior information in the target system into the second recognition model for recognition processing, and outputting a second recognition result, wherein the second recognition result indicates whether the user behavior in the target system is malicious;
Step S203, inputting the program information in the target system into the third recognition model for recognition processing, and outputting a third recognition result, wherein the third recognition result indicates whether a malicious program exists in the target system;
Step S204, inputting the network information of networks accessing the target system into the fourth recognition model for recognition processing, and outputting a fourth recognition result, wherein the fourth recognition result indicates whether a network accessing the target system is a botnet;
Step S205, inputting the flow characteristic information of traffic accessing the target system into the fifth recognition model for recognition processing, and outputting a fifth recognition result, wherein the fifth recognition result indicates whether the traffic accessing the target system is malicious;
Step S206, summarizing the first recognition result, the second recognition result, the third recognition result, the fourth recognition result, and the fifth recognition result to obtain the recognition result set.
For example, the first recognition model may be a spam recognition model, the second recognition model may be a user behavior analysis and malicious behavior monitoring model, the third recognition model may be a malicious program classification recognition model, the fourth recognition model may be a botnet recognition model, and the fifth recognition model may be a traffic feature recognition model constructed using APT threat intelligence. The collected mails in the system may then be input into the spam recognition model to obtain information on whether spam exists among the input mails (the first recognition result above); the collected data on user behavior in the system may be input into the user behavior analysis and malicious behavior monitoring model to obtain information on whether malicious behavior exists among the input user behaviors (the second recognition result above); the collected program files in the system may be input into the malicious program classification recognition model to obtain information on whether a malicious program exists among the input programs (the third recognition result above); the collected networks accessing the system may be input into the botnet recognition model to obtain information on whether a botnet exists among those networks (the fourth recognition result above); and the collected traffic information about access to the system may be input into the traffic feature recognition model constructed using APT threat intelligence to obtain information on whether malicious traffic exists in that traffic information (the fifth recognition result above).
The five pieces of information thus obtained, namely whether spam exists among the input mails, whether malicious behavior exists among the input user behaviors, whether a malicious program exists among the input programs, whether a botnet exists among the networks accessing the system, and whether malicious traffic exists in the traffic information about access to the system, are then summarized to form the recognition result set.
In summary, by using multiple recognition models, it is possible to quickly and accurately obtain whether there is spam, malicious behavior, malicious program, botnet, malicious traffic, etc. in the system to be detected.
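Steps S201 to S206 can be sketched roughly as follows. This is a minimal illustration only: the keyword-based recognizers, the dictionary keys, and the sample inputs are all hypothetical placeholders standing in for the five trained recognition models, which the application builds separately.

```python
from typing import Callable, Dict, List

# A recognizer takes the collected items of one information type and
# returns True if it flags something malicious.
Recognizer = Callable[[List[str]], bool]


def keyword_recognizer(keyword: str) -> Recognizer:
    """Build a toy recognizer that flags any item containing `keyword`.

    Placeholder for a trained model; not a real detector.
    """
    def recognize(items: List[str]) -> bool:
        return any(keyword in item.lower() for item in items)
    return recognize


# One (hypothetical) recognizer per information type, mirroring S201 to S205.
MODELS: Dict[str, Recognizer] = {
    "mail": keyword_recognizer("lottery"),
    "user_behavior": keyword_recognizer("rm -rf"),
    "program": keyword_recognizer("keylogger"),
    "network": keyword_recognizer("c2"),
    "flow": keyword_recognizer("beacon"),
}


def recognize_all(target_info: Dict[str, List[str]]) -> Dict[str, bool]:
    """Route each information type to its model and collect the outputs
    into a single recognition result set (S206)."""
    return {name: model(target_info.get(name, []))
            for name, model in MODELS.items()}


target_info = {
    "mail": ["You won the LOTTERY, click here"],
    "user_behavior": ["ls", "cd /tmp"],
    "program": ["notepad.exe"],
    "network": ["10.0.0.5"],
    "flow": ["GET /index.html"],
}
result_set = recognize_all(target_info)
```

Keeping the models in a name-to-callable mapping means a sixth information type could be added without changing the routing logic.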
Fig. 3 is a second flowchart of a method for detecting a network attack according to an embodiment of the present application. As shown in Fig. 3, in the method for detecting a network attack according to the embodiment of the present application, determining the detection result of the target system according to the recognition result output by each recognition model and the weight corresponding to each recognition model includes:
Step S301, determining a first value set according to the recognition result output by each recognition model, wherein the first value set comprises at least N first values, and a first value is the value corresponding to the recognition result output by a recognition model;
Step S302, calculating a target value according to the N first values in the first value set and the weight corresponding to each recognition model;
Step S303, judging whether the target value is greater than a first preset threshold;
Step S304, if the target value is greater than the first preset threshold, determining that the detection result is that a network attack exists in the target system;
Step S305, if the target value is not greater than the first preset threshold, determining that the detection result is that no network attack exists in the target system.
For example, a value may be assigned to the recognition result output by each model (each recognition model above) according to a rule. The rule may be that if the spam recognition model reports that spam exists in the system to be detected, its recognition result is set to 1, and if it reports that no spam exists, its recognition result is set to 0; the recognition results output by the other models may be assigned values in the same way. The value corresponding to each recognition result (the N first values above) is then multiplied by the weight of the corresponding recognition model, and the products are added to obtain the final target value. For instance, consider the spam recognition model and the malicious program classification recognition model, each of whose recognition results is set to 1 when the corresponding threat is found in the system to be detected and to 0 otherwise; the weight of the spam recognition model may be set to 3, and the weight of the malicious program classification recognition model may be set to 5.
If the spam recognition model reports that no spam exists in the system to be detected and the malicious program classification recognition model reports that a malicious program exists, the target value is calculated as 0×3+1×5=5. The first preset threshold may be set to 4, and the target value is compared against it: if the target value is greater than the threshold, a network attack exists in the system to be detected; otherwise, no network attack exists. In this example the target value 5 is greater than the first preset threshold 4, so it is determined that a network attack exists in the system to be detected.
By the scheme, whether the network attack exists in the system to be detected can be rapidly and accurately judged according to the identification result output by each model and the preset weight for each model.
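The weighted decision of steps S301 to S305 can be sketched as below, reusing the numbers from the example above (weights 3 and 5, threshold 4). The function and key names are illustrative assumptions, not names used by the application.

```python
from typing import Dict


def target_value(results: Dict[str, bool], weights: Dict[str, float]) -> float:
    """S301/S302: map each recognition result to 1 or 0 and sum the
    values multiplied by the weight of the corresponding model."""
    return sum((1 if flagged else 0) * weights[name]
               for name, flagged in results.items())


def attack_detected(results: Dict[str, bool],
                    weights: Dict[str, float],
                    threshold: float) -> bool:
    """S303 to S305: an attack is reported only when the target value is
    strictly greater than the first preset threshold."""
    return target_value(results, weights) > threshold


# Example from the text: no spam found, malicious program found.
results = {"spam": False, "malicious_program": True}
weights = {"spam": 3, "malicious_program": 5}

value = target_value(results, weights)          # 0*3 + 1*5
detected = attack_detected(results, weights, threshold=4)
```

Note that with these weights a spam finding alone (target value 3) would stay below the threshold of 4, which illustrates how the weights let some models matter more than others.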
Optionally, in the method for detecting a network attack provided by the embodiment of the present application, the first recognition model is obtained by: acquiring M sample mails, wherein the M sample mails include at least spam and non-spam mails, and M is a positive integer; performing feature extraction on the M sample mails to obtain a first data set, wherein the first data set comprises at least S pieces of first feature data, and S is a positive integer; dividing the first data set into a first training set for training the model and a first test set for testing the model; and training a first neural network model using the first training set to obtain the first recognition model.
For example, the first recognition model may be a spam recognition model. The spam recognition model may be trained and constructed as follows: model training and verification testing are performed using the Enron-Spam data set (a data set for spam recognition), features are extracted from the data set using a bag-of-words model and a TF-IDF model, and the spam recognition model (which may also be called a spam detection model) is trained based on the MLP (multi-layer perceptron) deep learning algorithm (the first neural network model above). Specifically, the training and construction process of the spam recognition model may include the following steps:
S1.1, extracting a vocabulary from the files of the Enron-Spam data set, and extracting features of the data set using the bag-of-words model and the TF-IDF model.
S1.2, randomly dividing the training set and the testing set.
S1.3, an MLP algorithm is instantiated, and training is carried out on a training set, so that model data are obtained.
Through the scheme, the junk mail recognition model can be conveniently learned and trained, and the trained junk mail recognition model is obtained.
Optionally, in the method for detecting a network attack provided by the embodiment of the present application, after training the first neural network model by using the first training set to obtain the first recognition model, the method further includes: acquiring training time length for training a first neural network model; calculating the accuracy of the first recognition model by using the test set; and determining a first test result for testing the first recognition model according to the training time length and the accuracy.
For example, after the spam recognition model has been trained, it may be tested using the test set divided from the data set. The time taken to train the model (the training duration above) may be obtained; the model data may be used to predict on the test set, and the accuracy of the model (the first recognition model above) may be determined from the prediction results; the prediction effect of the MLP algorithm may then be verified according to the training duration and the accuracy of the model.
Through the scheme, the trained junk mail recognition model can be rapidly and accurately tested.
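The split, train, and test procedure described above can be sketched as follows. To stay self-contained, the sketch substitutes a trivial keyword-memorizing trainer for the MLP, so it illustrates only the workflow (random split, timing the training, accuracy on the test set). The sample mails, the 25% test ratio, and all function names are assumptions.

```python
import random
import time
from typing import Callable, List, Tuple

Sample = Tuple[str, int]  # (mail text, label) where 1 = spam, 0 = non-spam


def split_dataset(samples: List[Sample], test_ratio: float = 0.25,
                  seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Randomly divide the data set into a training set and a test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def train_keyword_model(train_set: List[Sample]) -> Callable[[str], int]:
    """Toy stand-in for MLP training: remember words seen only in spam."""
    spam_words, ham_words = set(), set()
    for text, label in train_set:
        (spam_words if label == 1 else ham_words).update(text.lower().split())
    spam_only = spam_words - ham_words
    return lambda text: int(any(w in spam_only for w in text.lower().split()))


def evaluate(samples: List[Sample]) -> Tuple[float, float]:
    """Return (training duration in seconds, accuracy on the test set)."""
    train_set, test_set = split_dataset(samples)
    start = time.perf_counter()
    model = train_keyword_model(train_set)
    training_seconds = time.perf_counter() - start
    correct = sum(model(text) == label for text, label in test_set)
    return training_seconds, correct / len(test_set)


SAMPLES: List[Sample] = [
    ("win a free prize now", 1), ("cheap pills free offer", 1),
    ("claim your free prize", 1), ("free offer win cash", 1),
    ("meeting agenda for monday", 0), ("please review the report", 0),
    ("lunch on monday", 0), ("report for the meeting", 0),
]
training_seconds, accuracy = evaluate(SAMPLES)
```

How the training duration and the accuracy are combined into a single test result is not specified in the text, so the sketch simply returns both.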
Optionally, in the method for detecting a network attack provided by the embodiment of the present application, the second recognition model is obtained by: acquiring T pieces of user behavior data, wherein T is a positive integer; obtaining a first word set according to the T pieces of user behavior data, wherein the first word set at least comprises P first words, and P is a positive integer; combining the first words in the first word set to obtain Y first sentences, wherein Y is a positive integer; obtaining a second data set based on the Y first sentences; obtaining a second training set for training the model from the second data set; and training the second neural network model by adopting a second training set to obtain a second recognition model.
For example, the second recognition model may be a user behavior analysis and malicious behavior monitoring model. The model may be trained and constructed as follows: using the SEA data set (Masquerading User Data, a data set for training and detecting attacks), features are extracted from user behavior through a bag-of-words model and an N-Gram model (an N-element language model, a natural language processing algorithm), and the user behavior analysis and malicious behavior monitoring model is trained and verified based on the XGBoost algorithm (an efficient gradient boosting decision tree algorithm). Specifically, the training and construction process of the user behavior analysis and malicious behavior monitoring model may include the following steps:
S2.1, reading SEA data set data.
S2.2, bag-of-words extraction. That is, bag-of-words features are extracted from the SEA data set content that was read: the text content is processed into a set of words, and the number of occurrences of each word is counted.
S2.3, N-Gram processing. That is, using the N-Gram model algorithm, ngram_range (the length range of phrase segmentation) may be set to (2, 4), so that the words in the bag of words from S2.2 are combined in groups of two, three, and four, and the rationality of each combined sentence is evaluated.
S2.4, the training set and the testing set can be manually divided.
S2.5, training on the training set using the XGBoost algorithm.
S2.6, predicting on the test set using the trained XGBoost model.
S2.7, verifying the prediction effect of the XGBoost algorithm.
Through the scheme, the user behavior analysis and malicious behavior monitoring model can be conveniently subjected to learning training, and the trained user behavior analysis and malicious behavior monitoring model is obtained.
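The feature extraction of S2.2 and S2.3 can be sketched as below. The sketch assumes that the "combinations" are runs of consecutive words, which is the usual meaning of an n-gram, and that each user record is a sequence of shell commands; the command names are illustrative only. The `ngram_range` argument follows the (minimum length, maximum length) convention named in the text.

```python
from collections import Counter
from typing import List, Tuple


def ngram_counts(tokens: List[str],
                 ngram_range: Tuple[int, int] = (2, 4)) -> Counter:
    """Count every run of consecutive tokens whose length lies within
    ngram_range (inclusive), giving a bag of n-grams."""
    low, high = ngram_range
    counts: Counter = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical command sequence for one user session.
commands = ["cpp", "sh", "xrdb", "cpp", "sh"]
features = ngram_counts(commands)
```

The resulting counts would then be turned into numeric feature vectors for the classifier; that step, and the XGBoost training itself, are outside this sketch.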
Optionally, in the method for detecting a network attack provided by the embodiment of the present application, obtaining the third data set according to K program files includes: obtaining a second word set according to the K program files, wherein the second word set at least comprises R second words, and R is a positive integer; combining the second words in the second word set to obtain Z second sentences, wherein Z is a positive integer; determining the importance degree of each second sentence; determining a sentence set based on the importance degree of each second sentence, wherein the sentence set at least comprises W second sentences, W is a positive integer, and W is smaller than or equal to Z; and summarizing the W second sentences to obtain a third data set.
For example, when training and constructing the malicious program classification recognition model, the acquired program files (the K program files above) may first be processed. Specifically, 2-Gram extraction may be performed on the files of the MIST data set (Malware Instruction Set for Behaviour Analysis, a data set for malware behavior analysis). 2-Gram is a two-element language model, a natural language processing algorithm, in which ngram_range (the length range of phrase segmentation) is set to (2, 2); that is, features are extracted from the files of the MIST data set by combining words in pairs, and the rationality of each combined sentence is evaluated. TF-IDF processing may then be applied: the TF-IDF model is used to evaluate the importance of words in the text of the MIST data set, where the TF-IDF value of a word is obtained by multiplying its TF (term frequency) by its IDF (inverse document frequency). In general, the larger the TF-IDF value of a word in an article, the more important the word is in that article, so after calculating the TF-IDF value of every word in the article and sorting the values in descending order, the top-ranked words are the keywords of the article. The sentences obtained by processing the program files in this way (the W second sentences above) are then used as the third data set.
In summary, the data in the data set can be quickly and accurately determined by processing the acquired plurality of program files.
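A minimal sketch of this processing is given below: words are combined in pairs, each pair is scored by TF-IDF, and the W highest-scoring pairs are kept. Several choices are assumptions not found in the text: the unsmoothed logarithmic IDF, taking the best score of a pair across files as its importance, the alphabetical tie-break, and the token names, which do not reflect the real MIST instruction format.

```python
import math
from collections import Counter
from typing import Dict, List


def bigrams(tokens: List[str]) -> List[str]:
    """Combine consecutive words in pairs (ngram_range of (2, 2))."""
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


def top_bigrams_by_tfidf(documents: List[List[str]], w: int) -> List[str]:
    """Return the w bigrams with the highest TF-IDF score."""
    doc_bigrams = [bigrams(doc) for doc in documents]
    n_docs = len(doc_bigrams)

    # Document frequency: in how many files each bigram occurs.
    doc_freq: Counter = Counter()
    for grams in doc_bigrams:
        doc_freq.update(set(grams))

    # Importance of a bigram = its best TF-IDF score over all files.
    best: Dict[str, float] = {}
    for grams in doc_bigrams:
        for gram, count in Counter(grams).items():
            tf = count / len(grams)
            idf = math.log(n_docs / doc_freq[gram])
            best[gram] = max(best.get(gram, 0.0), tf * idf)

    # Sort from largest to smallest score; ties broken alphabetically.
    ranked = sorted(best, key=lambda g: (-best[g], g))
    return ranked[:w]


# Hypothetical behavior traces of three program files.
docs = [
    ["load", "dll", "open", "key", "load", "dll"],
    ["load", "dll", "read", "file"],
    ["connect", "socket", "send", "data", "connect", "socket"],
]
selected = top_bigrams_by_tfidf(docs, w=3)
```

In this example "load dll" occurs in two of the three files, so its inverse document frequency is low and it ranks last, even though it is the most frequent pair in the first file.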
Optionally, in the method for detecting a network attack provided by the embodiment of the present application, the third recognition model is obtained by: obtaining K program files, wherein K is a positive integer; obtaining a third data set according to the K program files; obtaining a third training set for training the model from the third data set; and training the third neural network model by adopting a third training set to obtain a third recognition model.
For example, the third recognition model may be a malicious program classification recognition model. The model may be trained and constructed as follows: malicious programs are classified and identified using the MIST data set, static file features and dynamic program behavior features are extracted based on a 2-Gram model and a TF-IDF model, and the malicious program classification recognition model is trained and verified using a support vector machine algorithm. Specifically, the training and construction process of the malicious program classification recognition model may include the following steps:
S3.1, performing 2-Gram extraction on the files of the MIST data set; that is, features are extracted from the files of the MIST data set using the 2-Gram model algorithm, combining words in pairs and evaluating the rationality of each combined sentence.
S3.2, applying TF-IDF processing; that is, the TF-IDF model is used to evaluate the importance of words in the text of the MIST data set, where the TF-IDF value of a word is obtained by multiplying its TF (term frequency) by its IDF (inverse document frequency). In general, the larger the TF-IDF value of a word in an article, the more important the word is in that article, so after calculating the TF-IDF value of every word in the article and sorting the values in descending order, the top-ranked words are the keywords of the article.
S3.3, randomly dividing the training set and the test set.
S3.4, training on the training set using a support vector machine algorithm to obtain model data.
S3.5, predicting on the test set by using the model data.
S3.6, verifying the prediction effect of the support vector machine algorithm.
Through the scheme, the malicious program classification recognition model can be conveniently subjected to learning and training, and the trained malicious program classification recognition model is obtained.
Optionally, in the method for detecting a network attack provided by the embodiment of the present application, inputting the network information of networks accessing the target system into the fourth recognition model for recognition processing and outputting the fourth recognition result includes: inputting the network information of networks accessing the target system into the fourth recognition model for recognition processing, and outputting V target IP addresses, wherein a target IP address is the IP address corresponding to a target network, a target network is a network accessing the target system, and V is a positive integer; determining U first IP addresses from the V target IP addresses, wherein a first IP address is the IP address corresponding to a first network, the number of domain names attacked by the first network is greater than a second preset threshold, and U is a positive integer; determining second IP addresses from the U first IP addresses, wherein a second IP address is the IP address corresponding to a second network, and the similarity of the domain names attacked by the second network is greater than a third preset threshold; judging whether the number of second IP addresses is smaller than a fourth preset threshold; if the number of second IP addresses is smaller than the fourth preset threshold, determining that the fourth recognition result is that the network accessing the target system is a botnet; and if the number of second IP addresses is not smaller than the fourth preset threshold, determining that the fourth recognition result is that the network accessing the target system is not a botnet.
For example, the fourth recognition model may be a botnet recognition model. The specific process of identifying botnets can be divided into the following steps:
S4.1, acquiring data from an open source attack data website and sorting the data into a data set.
S4.2, reading the attack data row by row, and establishing a hash table keyed by attack source IP, wherein the value stored for each key is the set of domain names attacked by that IP.
S4.3, defining a threshold R, wherein an IP is listed in the statistical range only if the number of domain names it has attacked exceeds R.
S4.4, defining a function for calculating the Jaccard coefficient (a coefficient for comparing the difference and the similarity between two samples) as a way of measuring the similarity of the domain name sets attacked by two IPs. A threshold N is defined, and two IPs are included in the statistical range only when the Jaccard coefficient of the domain names they attack is greater than or equal to N. The IPs satisfying the threshold are imported into a graph database.
S4.5, clustering the IPs by using a directed graph connected branch clustering algorithm, defining a threshold M, and displaying a result when the number of IPs in the same cluster is greater than or equal to M.
S4.6, forming a visual clustering result, and displaying the IP directed relation, namely the botnet.
By the scheme, whether the botnet exists in the network accessing the system to be detected can be rapidly and accurately identified.
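Steps S4.2 to S4.5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names `jaccard` and `find_botnets`, the sample records, and the threshold values are hypothetical; an in-memory adjacency map stands in for the graph database; and links between IPs are treated as undirected, because the Jaccard coefficient is symmetric.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_botnets(records, r, n, m):
    """Group attack-source IPs that target similar sets of domain names.

    records: iterable of (source_ip, attacked_domain) pairs (S4.2)
    r: an IP is kept only if it attacked more than r domains (S4.3)
    n: two IPs are linked if their Jaccard coefficient is >= n (S4.4)
    m: a cluster is reported only if it holds at least m IPs (S4.5)
    """
    # S4.2: hash table keyed by source IP, value = set of attacked domains.
    domains_by_ip = defaultdict(set)
    for ip, domain in records:
        domains_by_ip[ip].add(domain)

    # S4.3: keep only IPs whose attacked-domain count exceeds r.
    kept = {ip: d for ip, d in domains_by_ip.items() if len(d) > r}

    # S4.4: link every pair of kept IPs whose similarity reaches n.
    neighbours = {ip: set() for ip in kept}
    for a, b in combinations(kept, 2):
        if jaccard(kept[a], kept[b]) >= n:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # S4.5: connected components via depth-first search.
    clusters, seen = [], set()
    for start in kept:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            ip = stack.pop()
            if ip in component:
                continue
            component.add(ip)
            stack.extend(neighbours[ip] - component)
        seen |= component
        if len(component) >= m:
            clusters.append(sorted(component))
    return sorted(clusters)


# Hypothetical sample data: three IPs attack overlapping domains.
records = [
    ("10.0.0.1", "a.example"), ("10.0.0.1", "b.example"), ("10.0.0.1", "c.example"),
    ("10.0.0.2", "a.example"), ("10.0.0.2", "b.example"), ("10.0.0.2", "c.example"),
    ("10.0.0.3", "b.example"), ("10.0.0.3", "c.example"), ("10.0.0.3", "d.example"),
    ("10.0.0.4", "x.example"), ("10.0.0.4", "y.example"), ("10.0.0.4", "z.example"),
    ("10.0.0.5", "a.example"),
]
botnets = find_botnets(records, r=2, n=0.5, m=3)
```

With these sample values, 10.0.0.5 is dropped at S4.3, 10.0.0.4 shares no domain with the others, and the remaining three IPs form one cluster that reaches the size threshold.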
Optionally, in the method for detecting a network attack provided by the embodiment of the present application, the fifth identification model is obtained by: acquiring threat information data, wherein the threat information data at least comprises F flow characteristic data, and F is a positive integer; obtaining a third word set according to F flow characteristic data in threat information data, wherein the third word set at least comprises H third words, and H is a positive integer; determining the importance degree of each third word; determining X third words from the H third words based on the importance degree of each third word, wherein X is a positive integer and X is smaller than H; summarizing the X third words to obtain a fourth data set; obtaining a fourth training set for training the model from the fourth data set; and training the fourth neural network model by adopting a fourth training set to obtain a fifth recognition model.
For example, the fifth identification model may be a traffic feature identification model constructed using APT threat intelligence. And the training and constructing process of the flow characteristic recognition model can be to arrange the report content of the APT threat into a data set, extract characteristics of the information content by combining a word bag model with a TF-IDF model, and match threat information based on an MLP deep learning algorithm. In addition, whether the traffic of the access system is malicious or not can be judged by setting a self-defined key feature, namely, a malicious traffic feature library can be set according to the report content of the APT threat information, the traffic feature of the access system is matched with the malicious traffic feature library, and if the matching is successful, the APT attack exists in the system directly; if the matching is unsuccessful, judging that the APT attack does not exist in the system. Moreover, the training and constructing process of the flow characteristic recognition model may specifically include the following steps:
S5.1, acquiring threat information reports or data from related threat information manufacturers, and arranging the threat information reports or data to form a data set.
S5.2, carrying out word bag feature extraction on the data set file formed by the arrangement of the S5.1, namely processing text contents into word sets, and counting the occurrence times of each word.
S5.3, processing by using a TF-IDF model algorithm, evaluating the importance of each word in the text, and analyzing text keywords.
S5.4, randomly dividing the data set into a training data set and a test data set.
S5.5, defining an MLP algorithm model, instantiating, and performing multiple rounds of data training and verification.
Through the scheme, the flow characteristic recognition model can be conveniently learned and trained, the trained flow characteristic recognition model is obtained, and whether malicious flow exists in the flow accessing the system to be detected can be rapidly and accurately recognized.
According to the method provided by the embodiment of the application, for example, through training a data model, the APT attack is identified and judged from 5 aspects, including spam identification, user behavior analysis and malicious behavior monitoring, malicious program classification identification, botnet identification, APT threat information utilization, and the APT attack behavior is accurately detected through multi-dimensional comprehensive judgment.
For example, the method for detecting the APT attack based on the machine learning algorithm is characterized by comprising the following steps:
S1, identifying junk mail: APT attacks often use e-mail as an entry point, and this process provides a judgment basis for comprehensive judgment of the APT attacks by detecting junk mails. In the embodiment, the Enron-Spam data set can be used for model training and verification testing, a word bag model and a TF-IDF model are used for extracting features of the data set, and a Spam detection model is trained based on an MLP deep learning algorithm.
S2, user behavior analysis and malicious behavior monitoring: in order to acquire more valuable data or cause larger system damage, the APT attacker necessarily performs actions such as lateral movement, information collection and malicious command execution in the attacked target system, and this process provides a judgment basis for comprehensive judgment of the APT attack through analysis of the user actions. In the embodiment, the SEA data set can be used, features of the user behaviors are extracted through the word bag and N-Gram models, and the user behavior analysis and malicious behavior monitoring model is trained and verified based on the XGBoost algorithm.
S3, classifying and identifying malicious programs: the most remarkable characteristic of the APT attack is that an attacker builds a customized remote access malicious program aimed at one or more vulnerabilities of the target information system, and this process provides a judgment basis for comprehensive judgment of the APT attack through analysis of the malicious program. In the embodiment, the MIST data set can be used to classify and identify malicious programs, static file characteristics and dynamic program behavior characteristics are extracted based on the 2-Gram and TF-IDF models, and the malicious program classification and identification model is trained and verified by using a support vector machine algorithm.
S4, botnet identification: some APT attacks hide themselves by using a botnet, which makes them more difficult to track, and this process provides a judgment basis for comprehensive judgment of the APT attacks through recognition of the botnet. In this embodiment, data from an open source attack data website can be sorted into a data set, and the botnet can be identified in a directed graph manner.
S5, utilizing APT threat information: the security company can issue threat information reports according to the APT attack situation every year, the reports generally contain strategic information of the APT attack, and the process can provide a judgment basis for comprehensive judgment of the APT attack by matching threat information features. In the embodiment, the APT threat information report contents can be arranged into a data set, the characteristic extraction is carried out on the information contents by combining a word bag model with a TF-IDF model, and threat information matching is carried out based on an MLP deep learning algorithm. In addition, whether the traffic of the access system is malicious or not can be judged by setting a self-defined key feature, namely, a malicious traffic feature library can be set according to the report content of the APT threat information, the traffic feature of the access system is matched with the malicious traffic feature library, and if the matching is successful, the APT attack exists in the system directly; if the matching is unsuccessful, judging that the APT attack does not exist in the system.
S6, providing comprehensive analysis methods for the results of the 5 recognition modules, wherein the comprehensive analysis methods comprise: 1. flexibly setting the weight of each module, so that the system automatically and comprehensively judges whether an APT attack exists and obtains a conclusion; 2. setting up a group of related personnel to carry out comprehensive manual judgment on the judgment result; 3. setting custom key features whose successful matching is directly judged as an APT attack, namely a first list, and setting custom key features whose successful matching is directly judged as normal service, namely a second list.
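The weighted judgment and the two lists of S6 can be sketched as follows. This is an illustration under assumptions that the text does not specify: the function name `detect_apt`, the module names, the weights, and the threshold are hypothetical; each module result is encoded as 1 (suspicious) or 0 (clean); and a match in the first list is checked before a match in the second list.

```python
def detect_apt(results, weights, threshold, observed=frozenset(),
               first_list=frozenset(), second_list=frozenset()):
    """Combine per-module results into a single verdict.

    results: module name -> 1 (suspicious) or 0 (clean)
    weights: module name -> non-negative weight
    threshold: the verdict is True when the weighted score exceeds it
    observed: key features seen in the target system
    first_list: key features that directly indicate an APT attack
    second_list: key features that directly indicate normal service
    """
    # Custom key features short-circuit the weighted judgment.
    if observed & first_list:
        return True
    if observed & second_list:
        return False
    # Weighted sum of the module results, compared with the preset threshold.
    score = sum(weights[name] * value for name, value in results.items())
    return score > threshold


# Hypothetical weights for the five recognition modules.
weights = {"spam": 0.10, "behavior": 0.25, "malware": 0.30, "botnet": 0.15, "intel": 0.20}
results = {"spam": 1, "behavior": 1, "malware": 0, "botnet": 0, "intel": 1}
verdict = detect_apt(results, weights, threshold=0.5)
```

In this example three modules report a hit and their weights sum to about 0.55, which exceeds the threshold of 0.5, so the verdict is that an attack exists. Manual review (method 2 of S6) is outside the scope of the sketch.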
The step S1 specifically comprises the following steps:
S1.1, extracting a vocabulary from the files of the Enron-Spam data set, and extracting features of the data set by using a word bag model and a TF-IDF model.
S1.2, randomly dividing the training set and the testing set.
S1.3, an MLP algorithm is instantiated, and training is carried out on a training set, so that model data are obtained.
S1.4, predicting on a test set by using model data.
S1.5, verifying the MLP algorithm prediction effect.
The step S2 further comprises the steps of:
S2.1, reading the SEA data set.
S2.2, extracting word bags, namely extracting word bag model features of the read SEA data set content, namely processing text content into word sets, and counting the occurrence times of each word.
S2.3, N-Gram processing, namely, using an N-Gram model algorithm with ngram_range (the length range of word group segmentation) set to (2, 4), that is, combining the words in the S2.2 word bag into groups of two, three and four, and evaluating the rationality of the combined sentences.
S2.4, the training set and the testing set can be manually divided.
S2.5, training on the training set by using an XGBoost algorithm to obtain a trained model.
S2.6, predicting on the test set with the trained model by using the XGBoost algorithm.
S2.7, verifying the prediction effect of the XGBoost algorithm.
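The N-Gram extraction of S2.3 can be sketched as follows. The parameter name `ngram_range` matches the one used by scikit-learn's CountVectorizer, but this sketch uses only the standard library; the function name `ngram_counts` and the sample command sequence are hypothetical. It counts contiguous word groups only and does not evaluate the rationality of the combinations.

```python
from collections import Counter


def ngram_counts(tokens, ngram_range=(2, 4)):
    """Count every contiguous n-gram with min_n <= n <= max_n in a token list."""
    min_n, max_n = ngram_range
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical user command sequence, as it might appear in a behavior log.
commands = "cd ls cat cd ls".split()
features = ngram_counts(commands)
```

For the five commands above, the 2-gram "cd ls" occurs twice, and the result holds 8 distinct word groups of length two to four. The resulting counts could serve as the feature vector passed to the classifier in S2.5.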
The step S3 further comprises the steps of:
S3.1, extracting 2-Grams from the files of the MIST data set, namely extracting the features of the files with a 2-Gram model algorithm, that is, combining words in pairs and evaluating the rationality of the combined sentences.
S3.2, applying TF-IDF processing. The TF-IDF model is mainly used for evaluating the importance of words in the text of the MIST data set: once TF (word frequency) and IDF (inverse document frequency) have been calculated, the TF-IDF value of a word is obtained by multiplying the two values. In general, the larger the TF-IDF value of a word in an article, the more important the word is in that article, so after calculating the TF-IDF value of each word in the article and sorting from largest to smallest, the top few words are the keywords of the article.
S3.3, randomly dividing the data set into a training set and a test set.
S3.4, training on the training set by using a support vector machine algorithm to obtain a data model.
S3.5, predicting on the test set by using the model data.
S3.6, verifying the prediction effect of the support vector machine algorithm.
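The TF-IDF calculation described in S3.2 can be sketched as follows. This is a minimal standard-library illustration, assuming the plain variant in which TF is the count of a term divided by the document length and IDF is log(N / df); libraries often apply smoothing, so their values differ. The function names `tf_idf` and `top_keywords` and the sample documents are hypothetical, and the tokens could equally be the 2-Grams produced in S3.1.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: tf-idf} dict per document.

    documents: list of token lists. TF is the term count divided by the
    document length; IDF is log(N / df), where df is the number of
    documents containing the term.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(score, k):
    """First k terms by descending TF-IDF; ties are broken alphabetically."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical behavior traces of three programs.
docs = [
    ["open", "read", "close"],
    ["open", "write", "write", "close"],
    ["open", "connect", "send", "close"],
]
scores = tf_idf(docs)
keywords = top_keywords(scores[1], 1)
```

Terms that occur in every document, such as "open" and "close" here, receive a TF-IDF value of zero, while "write" is ranked as the keyword of the second document.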
The step S4 further comprises the steps of:
S4.1, acquiring data from an open source attack data website and sorting the data into a data set.
S4.2, reading the attack data row by row, and establishing a hash table keyed by attack source IP, wherein the value stored for each key is the set of domain names attacked by that IP.
S4.3, defining a threshold R, and listing an IP in the statistical range only if the number of domain names it has attacked exceeds R.
S4.4, defining a function for calculating the Jaccard coefficient as a way of measuring the similarity of the domain name sets attacked by two IPs. A threshold N is defined, and two IPs are included in the statistical range only when the Jaccard coefficient of the domain names they attack is greater than or equal to N. The IPs satisfying the threshold are imported into a graph database.
S4.5, clustering the IPs by using a directed graph connected branch clustering algorithm, defining a threshold M, and displaying a result when the number of IPs in the same cluster is greater than or equal to M.
S4.6, forming a visual clustering result, and displaying the IP directed relation, namely the botnet.
The step S5 further comprises the steps of:
S5.1, acquiring threat information reports or data from related threat information manufacturers, and arranging the threat information reports or data to form a data set.
S5.2, carrying out word bag feature extraction on the data set file formed by the arrangement of the S5.1, namely processing text contents into word sets, and counting the occurrence times of each word.
S5.3, processing by using a TF-IDF model algorithm, evaluating the importance of each word in the text, and analyzing text keywords.
S5.4, randomly dividing the data set into a training data set and a test data set.
S5.5, defining an MLP algorithm model, instantiating, and performing multiple rounds of data training and verification.
In addition, the embodiment provides an APT attack detection method based on machine learning algorithm models. Aiming at the low accuracy of APT attack detection by the detection means in the related technology, a multi-dimensional and multi-model comprehensive detection method is provided, which improves the detection accuracy and, to a certain extent, solves the problem that the detection means based on the flow protocol in the related technology has difficulty finding APT attacks with extremely strong concealment.
In addition, the method provided by the embodiment of the application has the following advantages:
1. The accuracy of identifying APT attacks is improved. In this embodiment, machine learning algorithms are used to train, over multiple rounds, detection models in 5 aspects: spam recognition, user behavior analysis and malicious behavior monitoring, malicious program classification recognition, botnet recognition and threat information utilization. The multidimensional analysis covers the attack behaviors of multiple stages of the APT attack process, such as installation and implantation and communication control, and the comprehensive analysis and discrimination greatly improve the accuracy of APT attack recognition.
2. The present embodiment provides a monitoring model whose capability can be continuously improved. The embodiment can support interfaces for importing data sets and adding detection modules, which provides good expandability for perfecting and enriching subsequent functions. The identification model is continuously optimized by introducing the latest data sets and novel attack detection modules, so that the detection function and accuracy in the embodiment can be continuously improved. Meanwhile, an API (interface) can be opened to other systems, so that the training results achieved in the embodiment can be shared.
3. The detection analysis result in the embodiment can provide important support for a system administrator to analyze an attacker attack chain, and can also indicate the direction for improving the security protection capability of the information system.
In summary, in the method for detecting a network attack according to the embodiment of the present application, a target information set is obtained, where the target information set includes at least N pieces of target information, and the target information includes at least: mail information in a target system, user behavior information in the target system, program information in the target system, network information for accessing the target system and flow characteristic information for accessing the target system, wherein the target system is a system to be detected, and N is a positive integer; target information in the target information set is input into N recognition models for recognition processing to obtain a recognition result set, wherein the recognition result set at least comprises a recognition result output by each recognition model; the weight corresponding to each recognition model is determined; and a detection result of the target system is determined according to the recognition result output by each recognition model and the weight corresponding to each recognition model, wherein the detection result is used for indicating whether a network attack exists in the target system. This solves the problem in the related technology that detecting whether a network attack exists in a system based on a flow protocol has a poor effect, which in turn affects the security of the system.
The method comprises the steps of obtaining a target information set at least comprising mail information in a target system, user behavior information in the target system, program information in the target system, network information of an access target system and flow characteristic information of the access target system, inputting the target information in the target information set into N recognition models for recognition processing to obtain a recognition result set, and detecting whether network attack exists in the target system according to a recognition result output by each recognition model and weight corresponding to each recognition model, so that the effect of detecting whether the network attack exists in the system can be improved, and the safety of the system is further ensured.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
The embodiment of the application also provides a network attack detection device, and it should be noted that the network attack detection device of the embodiment of the application can be used for executing the network attack detection method provided by the embodiment of the application. The following describes a network attack detection device provided by the embodiment of the present application.
Fig. 4 is a schematic diagram of a network attack detection apparatus according to an embodiment of the present application. As shown in fig. 4, the apparatus includes: a first acquisition unit 401, a first processing unit 402, a first determination unit 403, and a second determination unit 404.
Specifically, the first obtaining unit 401 is configured to obtain a set of target information, where the set of target information includes at least N pieces of target information, and the target information includes at least: mail information in a target system, user behavior information in the target system, program information in the target system, network information for accessing the target system and flow characteristic information for accessing the target system, wherein the target system is a system to be detected, and N is a positive integer;
The first processing unit 402 is configured to input target information in the target information set into N recognition models to perform recognition processing, so as to obtain a recognition result set, where the recognition result set at least includes a recognition result output by each recognition model;
a first determining unit 403, configured to determine a weight corresponding to each recognition model;
and a second determining unit 404, configured to determine a detection result of the target system according to the identification result output by each identification model and the weight corresponding to each identification model, where the detection result is used to indicate whether a network attack exists in the target system.
In summary, in the network attack detection device provided by the embodiment of the present application, a target information set is acquired by the first acquiring unit 401, where the target information set includes at least N pieces of target information, and the target information includes at least: mail information in a target system, user behavior information in the target system, program information in the target system, network information for accessing the target system and flow characteristic information for accessing the target system, wherein the target system is a system to be detected, and N is a positive integer; the first processing unit 402 inputs target information in the target information set into the N recognition models to perform recognition processing, so as to obtain a recognition result set, wherein the recognition result set at least comprises a recognition result output by each recognition model; the first determining unit 403 determines a weight corresponding to each recognition model; the second determining unit 404 determines a detection result of the target system according to the recognition result output by each recognition model and the weight corresponding to each recognition model, where the detection result is used to indicate whether a network attack exists in the target system. This solves the problem in the related technology that detecting whether a network attack exists in a system based on a flow protocol has a poor effect, which in turn affects the security of the system.
The method comprises the steps of obtaining a target information set at least comprising mail information in a target system, user behavior information in the target system, program information in the target system, network information of an access target system and flow characteristic information of the access target system, inputting the target information in the target information set into N recognition models for recognition processing to obtain a recognition result set, and detecting whether network attack exists in the target system according to a recognition result output by each recognition model and weight corresponding to each recognition model, so that the effect of detecting whether the network attack exists in the system can be improved, and the safety of the system is further ensured.
Optionally, in the network attack detection apparatus provided in the embodiment of the present application, the second determining unit includes: the first determining module is used for determining a first value set according to the identification result output by each identification model, wherein the first value set at least comprises N first values, and the first values are values corresponding to the identification result output by the identification model; the first calculation module is used for calculating to obtain a target value according to N first values in the first value set and the weight corresponding to each recognition model; the first judging module is used for judging whether the target value is larger than a first preset threshold value or not; the second determining module is used for determining that the network attack exists in the target system as a detection result if the target value is larger than a first preset threshold value; and the third determining module is used for determining that the network attack does not exist in the target system as a detection result if the target value is not greater than the first preset threshold value.
Optionally, in the network attack detection apparatus provided by the embodiment of the present application, the N recognition models include at least: the first recognition model, the second recognition model, the third recognition model, the fourth recognition model, and the fifth recognition model, the first processing unit includes: the first processing module is used for inputting mail information in the target system into the first recognition model for recognition processing and outputting a first recognition result, wherein the first recognition result is used for indicating whether junk mail exists in the target system; the second processing module is used for inputting the user behavior information in the target system into the second recognition model for recognition processing and outputting a second recognition result, wherein the second recognition result is used for indicating whether the user behavior in the target system is malicious or not; the third processing module is used for inputting the program information in the target system into a third recognition model for recognition processing and outputting a third recognition result, wherein the third recognition result is used for indicating whether a malicious program exists in the target system; the fourth processing module is used for inputting the network information of the access target system into the fourth recognition model for recognition processing and outputting a fourth recognition result, wherein the fourth recognition result is used for indicating whether the network of the access target system is a botnet or not; the fifth processing module is used for inputting the flow characteristic information of the access target system into a fifth recognition model for recognition processing and outputting a fifth recognition result, wherein the fifth recognition result is used for indicating whether the flow of the access target system is malicious or not; and the sixth processing module is used for summarizing the first recognition result, the second recognition result, the third recognition result, the fourth recognition result and the fifth recognition result to obtain a recognition result set.
Optionally, in the network attack detection apparatus provided by the embodiment of the present application, the first recognition model is obtained by: the second acquisition unit is used for acquiring M sample mails, wherein the M sample mails at least comprise junk mails and non-junk mails, and M is a positive integer; the second processing unit is used for carrying out feature extraction processing on the M sample mails to obtain a first data set, wherein the first data set at least comprises S pieces of first feature data, and S is a positive integer; a first dividing unit for dividing the first data set into a first training set for training the model and a first test set for testing the model; the first training unit is used for training the first neural network model by adopting a first training set to obtain a first identification model.
Optionally, in the network attack detection device provided by the embodiment of the present application, the device further includes: the third obtaining unit is used for obtaining the training time length for training the first neural network model after training the first neural network model by adopting the first training set to obtain the first identification model; the first computing unit is used for computing the accuracy degree of the first identification model by using the test set; and the third determining unit is used for determining a first test result for testing the first recognition model according to the training time length and the accuracy.
Optionally, in the network attack detection apparatus provided by the embodiment of the present application, the second recognition model is obtained by: a fourth obtaining unit, configured to obtain T pieces of user behavior data, where T is a positive integer; a fourth determining unit, configured to obtain a first word set according to T user behavior data, where the first word set includes at least P first words, and P is a positive integer; the third processing unit is used for carrying out combination processing on the first words in the first word set to obtain Y first sentences, wherein Y is a positive integer; a fifth determining unit, configured to obtain a second data set based on the Y first sentences; a fifth acquisition unit configured to acquire a second training set for training the model from the second data set; and the second training unit is used for training the second neural network model by adopting a second training set to obtain a second recognition model.
Optionally, in the network attack detection apparatus provided by the embodiment of the present application, the third recognition model is obtained by: a sixth obtaining unit, configured to obtain K program files, where K is a positive integer; a sixth determining unit, configured to obtain a third data set according to the K program files; a seventh acquisition unit configured to acquire a third training set for training the model from the third data set; and the third training unit is used for training the third neural network model by adopting a third training set to obtain a third recognition model.
Optionally, in the network attack detection apparatus provided in the embodiment of the present application, the sixth determining unit includes: a fourth determining module, configured to obtain a second word set according to the K program files, where the second word set includes at least R second words, and R is a positive integer; a seventh processing module, configured to perform a combination process on the second words in the second word set to obtain Z second sentences, where Z is a positive integer; a fifth determining module, configured to determine an importance level of each second sentence; a sixth determining module, configured to determine a sentence set based on an importance level of each second sentence, where the sentence set includes at least W second sentences, W is a positive integer, and W is less than or equal to Z; and the eighth processing module is used for summarizing the W second sentences to obtain a third data set.
Optionally, in the network attack detection apparatus provided in the embodiment of the present application, the fourth processing module includes: the first processing sub-module is used for inputting the network information of the access target system into the fourth identification model for identification processing and outputting V target IP addresses, wherein the target IP addresses are IP addresses corresponding to target networks, the target networks are networks of the access target system, and V is a positive integer; the first determining submodule is used for determining U first IP addresses from the V target IP addresses, wherein the first IP addresses are corresponding to the first network, the number of domain names of the first network attack is larger than a second preset threshold, and U is a positive integer; a second determining submodule, configured to determine a second IP address from the U first IP addresses, where the second IP address is an IP address corresponding to a second network, and a similarity degree of domain names of the second network attack is greater than a third preset threshold; the first judging submodule is used for judging whether the number of the second IP addresses is smaller than a fourth preset threshold value or not; a third determining submodule, configured to determine that the network of the access target system is a botnet if the number of the second IP addresses is smaller than a fourth preset threshold; and the fourth determining submodule is used for determining that the network of the access target system is not a botnet if the number of the second IP addresses is not smaller than a fourth preset threshold value.
Optionally, in the network attack detection apparatus provided by the embodiment of the present application, the fifth recognition model is obtained by: an eighth obtaining unit, configured to obtain threat information data, where the threat information data at least includes F flow feature data, where F is a positive integer; a seventh determining unit, configured to obtain a third word set according to F flow feature data in threat intelligence data, where the third word set includes at least H third words, and H is a positive integer; an eighth determining unit configured to determine a degree of importance of each third word; a ninth determining unit configured to determine X third words from the H third words based on the importance level of each third word, where X is a positive integer, and X is smaller than H; the fourth processing unit is used for summarizing the X third words to obtain a fourth data set; a ninth acquisition unit configured to acquire a fourth training set for training a model from the fourth data set; and the fourth training unit is used for training the fourth neural network model by adopting a fourth training set to obtain a fifth recognition model.
The network attack detection device includes a processor and a memory, where the first acquisition unit 401, the first processing unit 402, the first determination unit 403, the second determination unit 404, and the like are stored as program units, and the processor executes the program units stored in the memory to implement corresponding functions.
The processor includes a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels may be provided, and the security of the system is ensured by adjusting the kernel parameters.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or nonvolatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The embodiment of the invention provides a computer-readable storage medium on which a program is stored, where the program, when executed by a processor, implements the network attack detection method.
The embodiment of the invention provides a processor configured to run a program, where the program, when run, executes the network attack detection method.
As shown in fig. 5, an embodiment of the present invention provides an electronic device, where the device includes a processor, a memory, and a program stored in the memory and executable on the processor, and when the processor executes the program, the following steps are implemented: obtaining a target information set, wherein the target information set at least comprises N pieces of target information, and the target information at least comprises: mail information in a target system, user behavior information in the target system, program information in the target system, network information for accessing the target system and flow characteristic information for accessing the target system, wherein the target system is a system to be detected, and N is a positive integer; inputting target information in the target information set into N recognition models for recognition processing to obtain a recognition result set, wherein the recognition result set at least comprises a recognition result output by each recognition model; determining the weight corresponding to each recognition model; and determining a detection result of the target system according to the identification result output by each identification model and the weight corresponding to each identification model, wherein the detection result is used for indicating whether network attack exists in the target system.
The processor also realizes the following steps when executing the program: according to the recognition result output by each recognition model and the weight corresponding to each recognition model, determining the detection result of the target system comprises the following steps: determining a first numerical value set according to the identification result output by each identification model, wherein the first numerical value set at least comprises N first numerical values, and the first numerical values are numerical values corresponding to the identification result output by the identification model; calculating to obtain a target value according to N first values in the first value set and the weight corresponding to each recognition model; judging whether the target value is larger than a first preset threshold value or not; if the target value is larger than the first preset threshold value, determining that the detection result is that network attack exists in the target system; and if the target value is not greater than the first preset threshold value, determining that the detection result is that no network attack exists in the target system.
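As an illustration of the weighting step above, the following minimal sketch assumes that each recognition result is mapped to a first value of 1.0 (positive) or 0.0 (negative) and that the target value is a weighted sum of those values. The text fixes neither choice, and the function name `detect_attack` and the sample weights are hypothetical.

```python
from typing import Sequence


def detect_attack(first_values: Sequence[float],
                  weights: Sequence[float],
                  first_threshold: float) -> bool:
    """Return True if the weighted target value exceeds the first preset threshold."""
    if len(first_values) != len(weights):
        raise ValueError("each recognition model needs exactly one weight")
    # Assumption: the target value is the weighted sum of the N first values.
    target_value = sum(v * w for v, w in zip(first_values, weights))
    # "Greater than" means an attack; "not greater than" means no attack.
    return target_value > first_threshold


if __name__ == "__main__":
    # Hypothetical outputs of five recognition models and their weights.
    values = [1.0, 0.0, 1.0, 0.0, 0.0]
    weights = [0.3, 0.2, 0.2, 0.2, 0.1]
    print(detect_attack(values, weights, first_threshold=0.4))
```

A target value exactly equal to the threshold is reported as no attack, matching the "not greater than" branch in the text.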
The processor also realizes the following steps when executing the program: the N recognition models at least comprise: the first recognition model, the second recognition model, the third recognition model, the fourth recognition model and the fifth recognition model, the target information in the target information set is input into the N recognition models for recognition processing, and the obtaining of the recognition result set comprises the following steps: inputting mail information in the target system into the first recognition model for recognition processing, and outputting a first recognition result, wherein the first recognition result is used for indicating whether junk mail exists in the target system; inputting the user behavior information in the target system into the second recognition model for recognition processing, and outputting a second recognition result, wherein the second recognition result is used for representing whether the user behavior in the target system is malicious or not; inputting the program information in the target system into the third recognition model for recognition processing, and outputting a third recognition result, wherein the third recognition result is used for indicating whether a malicious program exists in the target system; inputting the network information accessing the target system into the fourth recognition model for recognition processing, and outputting a fourth recognition result, wherein the fourth recognition result is used for indicating whether the network accessing the target system is a botnet or not; inputting the flow characteristic information of the access target system into the fifth recognition model for recognition processing, and outputting a fifth recognition result, wherein the fifth recognition result is used for indicating whether the flow of the access target system is malicious or not; and summarizing the first recognition result, the second recognition result, the third recognition result, the fourth recognition result and the fifth recognition result to obtain the recognition result set.
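The routing of each kind of target information to its own recognition model can be sketched as follows. This is only an illustrative arrangement: the dictionary keys, the `Recognizer` type and the stub models are hypothetical stand-ins for the five trained recognition models, and each model is assumed to return a boolean result.

```python
from typing import Any, Callable, Dict, Mapping

# Assumption: every recognition model maps its input to True (malicious) or False.
Recognizer = Callable[[Any], bool]


def run_recognition(target_info: Mapping[str, Any],
                    models: Mapping[str, Recognizer]) -> Dict[str, bool]:
    """Feed each kind of target information to its model and collect the results."""
    missing = set(models) - set(target_info)
    if missing:
        raise KeyError(f"missing target information for: {sorted(missing)}")
    # The returned mapping plays the role of the recognition result set.
    return {name: model(target_info[name]) for name, model in models.items()}


if __name__ == "__main__":
    # Stub models standing in for trained recognition models (hypothetical rules).
    stub_models = {
        "mail": lambda mails: any("lottery" in m for m in mails),
        "user_behavior": lambda events: "mass_download" in events,
    }
    info = {"mail": ["you won the lottery"], "user_behavior": ["login", "logout"]}
    print(run_recognition(info, stub_models))
```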
The processor also realizes the following steps when executing the program: the first recognition model is obtained by: obtaining M sample mails, wherein the M sample mails at least comprise junk mails and non-junk mails, and M is a positive integer; performing feature extraction processing on the M sample mails to obtain a first data set, wherein the first data set at least comprises S pieces of first feature data, and S is a positive integer; dividing the first dataset into a first training set for training a model and a first test set for testing the model; and training the first neural network model by adopting the first training set to obtain the first recognition model.
The processor also realizes the following steps when executing the program: after training the first neural network model by using the first training set to obtain the first recognition model, the method further includes: acquiring the training duration for training the first neural network model; calculating the accuracy of the first recognition model by using the first test set; and determining a first test result for testing the first recognition model according to the training duration and the accuracy.
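The split, training and testing steps described for the first recognition model can be sketched as below. The sketch is not the neural network of the embodiment: a majority-class classifier (`train_majority`) is swapped in as a hypothetical placeholder so that the example stays self-contained, and the 80/20 split ratio and the shape of the returned test result are assumptions.

```python
import time
from typing import Any, Callable, Dict, List, Sequence, Tuple

Sample = Tuple[Any, int]              # (feature data, label)
Predictor = Callable[[Any], int]


def split_dataset(data: Sequence[Sample],
                  train_ratio: float = 0.8) -> Tuple[List[Sample], List[Sample]]:
    """Divide the data set into a training set and a test set (assumed 80/20)."""
    cut = int(len(data) * train_ratio)
    return list(data[:cut]), list(data[cut:])


def train_majority(train_set: Sequence[Sample]) -> Predictor:
    """Placeholder trainer: always predicts the most frequent training label."""
    labels = [label for _, label in train_set]
    majority = max(set(labels), key=labels.count)
    return lambda _features: majority


def train_and_test(train_fn: Callable[[Sequence[Sample]], Predictor],
                   train_set: Sequence[Sample],
                   test_set: Sequence[Sample]) -> Dict[str, float]:
    """Record the training duration and the accuracy on the test set."""
    start = time.perf_counter()
    predict = train_fn(train_set)
    training_seconds = time.perf_counter() - start
    correct = sum(1 for features, label in test_set if predict(features) == label)
    accuracy = correct / len(test_set) if test_set else 0.0
    return {"training_seconds": training_seconds, "accuracy": accuracy}
```

How the training duration and the accuracy are combined into a single first test result is not stated in the text, so the sketch simply reports both.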
The processor also realizes the following steps when executing the program: the second recognition model is obtained by: acquiring T pieces of user behavior data, wherein T is a positive integer; obtaining a first word set according to the T pieces of user behavior data, wherein the first word set at least comprises P first words, and P is a positive integer; combining the first words in the first word set to obtain Y first sentences, wherein Y is a positive integer; obtaining a second data set based on the Y first sentences; obtaining a second training set for training a model from the second data set; and training a second neural network model by adopting the second training set to obtain the second recognition model.
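One possible reading of the word-combination step is to join adjacent words into fixed-length sequences (n-grams). The sketch below follows that reading, which is an assumption; the whitespace tokenizer, the window size `n` and the function names are hypothetical.

```python
from typing import List, Sequence


def to_words(behavior_record: str) -> List[str]:
    """Split one user behavior record into words (assumed whitespace-separated)."""
    return behavior_record.lower().split()


def combine_words(words: Sequence[str], n: int = 2) -> List[str]:
    """Combine every run of n adjacent words into one 'sentence'."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_sentences(records: Sequence[str], n: int = 2) -> List[List[str]]:
    """Turn T behavior records into per-record lists of sentences."""
    return [combine_words(to_words(record), n) for record in records]
```

For instance, the record `"login download logout"` yields the two sentences `"login download"` and `"download logout"` when `n` is 2.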
The processor also realizes the following steps when executing the program: the third recognition model is obtained by: obtaining K program files, wherein K is a positive integer; obtaining a third data set according to the K program files; obtaining a third training set for training a model from the third data set; and training a third neural network model by adopting the third training set to obtain the third recognition model.
The processor also realizes the following steps when executing the program: obtaining a third data set according to the K program files comprises: obtaining a second word set according to the K program files, wherein the second word set at least comprises R second words, and R is a positive integer; combining the second words in the second word set to obtain Z second sentences, wherein Z is a positive integer; determining the importance degree of each second sentence; determining a statement set based on the importance degree of each second statement, wherein the statement set at least comprises W second statements, W is a positive integer, and W is smaller than or equal to Z; and summarizing the W second sentences to obtain the third data set.
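The text does not say how the importance degree of a second sentence is measured. As a hedged illustration, the sketch below scores each sentence with a smoothed TF-IDF value summed over the K program files and keeps the W highest-scoring sentences; the scoring formula, the tie-breaking rule and the name `select_top_sentences` are all assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def select_top_sentences(files: Sequence[Sequence[str]], top_w: int) -> List[str]:
    """Rank sentences by summed TF-IDF over all files and keep the top W."""
    n_files = len(files)
    # Number of files in which each sentence occurs at least once.
    file_freq: Counter = Counter()
    for sentences in files:
        file_freq.update(set(sentences))

    scores: Counter = Counter()
    for sentences in files:
        for sentence, count in Counter(sentences).items():
            tf = count / len(sentences)
            # Smoothed inverse document frequency (an assumed choice).
            idf = math.log((1 + n_files) / (1 + file_freq[sentence])) + 1.0
            scores[sentence] += tf * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return ranked[:top_w]
```

Because the slice never returns more items than exist, the number of selected sentences W stays less than or equal to the number of distinct sentences Z.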
The processor also realizes the following steps when executing the program: inputting the network information accessing the target system into the fourth recognition model for recognition processing, and outputting a fourth recognition result comprises: inputting network information accessing the target system into the fourth identification model for identification processing, and outputting V target IP addresses, wherein the target IP addresses are IP addresses corresponding to target networks, the target networks are networks accessing the target system, and V is a positive integer; determining U first IP addresses from the V target IP addresses, wherein the first IP addresses are IP addresses corresponding to a first network, the number of domain names of the first network attack is larger than a second preset threshold, and U is a positive integer; determining a second IP address from the U first IP addresses, wherein the second IP address is an IP address corresponding to a second network, and the similarity of domain names of the second network attack is greater than a third preset threshold; judging whether the number of the second IP addresses is smaller than a fourth preset threshold value or not; if the number of the second IP addresses is smaller than the fourth preset threshold value, determining that the fourth identification result is that the network accessing the target system is a botnet; and if the number of the second IP addresses is not smaller than the fourth preset threshold value, determining that the network accessing the target system is not a botnet as the fourth identification result.
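The threshold cascade described above can be sketched as follows. The sketch assumes that the fourth recognition model has already produced a mapping from each target IP address to the domain names associated with it, and that domain similarity is the mean pairwise string similarity computed with `difflib.SequenceMatcher`; both are assumptions, as are the function names. The final comparison follows the text as written: a count of second IP addresses smaller than the fourth preset threshold is reported as a botnet.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, List, Mapping, Sequence

Similarity = Callable[[str, str], float]


def string_similarity(a: str, b: str) -> float:
    """Assumed similarity measure: difflib ratio in the range [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def mean_pairwise_similarity(domains: Sequence[str], similarity: Similarity) -> float:
    """Average similarity over all unordered pairs of domain names."""
    pairs = list(combinations(domains, 2))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def find_second_ips(domains_by_ip: Mapping[str, Sequence[str]],
                    second_threshold: int,
                    third_threshold: float,
                    similarity: Similarity = string_similarity) -> List[str]:
    """Apply the domain-count filter, then the domain-similarity filter."""
    # First IP addresses: number of domain names above the second threshold.
    first_ips = [ip for ip, domains in domains_by_ip.items()
                 if len(domains) > second_threshold]
    # Second IP addresses: domain similarity above the third threshold.
    return [ip for ip in first_ips
            if mean_pairwise_similarity(domains_by_ip[ip], similarity) > third_threshold]


def is_botnet(domains_by_ip: Mapping[str, Sequence[str]],
              second_threshold: int,
              third_threshold: float,
              fourth_threshold: int,
              similarity: Similarity = string_similarity) -> bool:
    """Follow the text: fewer second IPs than the fourth threshold means botnet."""
    second_ips = find_second_ips(domains_by_ip, second_threshold,
                                 third_threshold, similarity)
    return len(second_ips) < fourth_threshold
```

The `similarity` parameter is injectable so that a different measure, for example an edit distance, can be substituted without changing the cascade.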
The processor also realizes the following steps when executing the program: the fifth recognition model is obtained by: acquiring threat information data, wherein the threat information data at least comprises F flow characteristic data, and F is a positive integer; obtaining a third word set according to F flow characteristic data in the threat information data, wherein the third word set at least comprises H third words, and H is a positive integer; determining the importance degree of each third word; determining X third words from the H third words based on the importance degree of each third word, wherein X is a positive integer and X is smaller than H; summarizing the X third words to obtain a fourth data set; obtaining a fourth training set for training a model from the fourth data set; and training the fourth neural network model by adopting the fourth training set to obtain the fifth recognition model.
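Once the X third words have been selected, each piece of flow characteristic data has to be turned into numeric input before a neural network can be trained on it. The text does not describe this encoding; the sketch below assumes a simple bag-of-words count vector over the selected words, and the names `vectorize` and `build_fourth_dataset` are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def vectorize(words: Sequence[str], selected_words: Sequence[str]) -> List[int]:
    """Count how often each selected word occurs; other words are ignored."""
    counts = Counter(words)
    return [counts[word] for word in selected_words]


def build_fourth_dataset(flows: Sequence[Tuple[Sequence[str], int]],
                         selected_words: Sequence[str]) -> List[Tuple[List[int], int]]:
    """Encode labelled flow records as (count vector, label) pairs."""
    return [(vectorize(words, selected_words), label) for words, label in flows]
```

Every vector has length X, so the resulting data set can be divided into training and test portions in the usual way.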
The device herein may be a server, a PC, a tablet (PAD), a mobile phone, etc.
The application also provides a computer program product adapted to perform, when executed on a data processing device, a program initialized with the method steps of: obtaining a target information set, wherein the target information set at least comprises N pieces of target information, and the target information at least comprises: mail information in a target system, user behavior information in the target system, program information in the target system, network information for accessing the target system and flow characteristic information for accessing the target system, wherein the target system is a system to be detected, and N is a positive integer; inputting target information in the target information set into N recognition models for recognition processing to obtain a recognition result set, wherein the recognition result set at least comprises a recognition result output by each recognition model; determining the weight corresponding to each recognition model; and determining a detection result of the target system according to the identification result output by each identification model and the weight corresponding to each identification model, wherein the detection result is used for indicating whether network attack exists in the target system.
When executed on a data processing device, the computer program product is further adapted to carry out a program initialized with the method steps of: according to the recognition result output by each recognition model and the weight corresponding to each recognition model, determining the detection result of the target system comprises the following steps: determining a first numerical value set according to the identification result output by each identification model, wherein the first numerical value set at least comprises N first numerical values, and the first numerical values are numerical values corresponding to the identification result output by the identification model; calculating to obtain a target value according to N first values in the first value set and the weight corresponding to each recognition model; judging whether the target value is larger than a first preset threshold value or not; if the target value is larger than the first preset threshold value, determining that the detection result is that network attack exists in the target system; and if the target value is not greater than the first preset threshold value, determining that the detection result is that no network attack exists in the target system.
When executed on a data processing device, the computer program product is further adapted to carry out a program initialized with the method steps of: the N recognition models at least comprise: the first recognition model, the second recognition model, the third recognition model, the fourth recognition model and the fifth recognition model, the target information in the target information set is input into the N recognition models for recognition processing, and the obtaining of the recognition result set comprises the following steps: inputting mail information in the target system into the first recognition model for recognition processing, and outputting a first recognition result, wherein the first recognition result is used for indicating whether junk mail exists in the target system; inputting the user behavior information in the target system into the second recognition model for recognition processing, and outputting a second recognition result, wherein the second recognition result is used for representing whether the user behavior in the target system is malicious or not; inputting the program information in the target system into the third recognition model for recognition processing, and outputting a third recognition result, wherein the third recognition result is used for indicating whether a malicious program exists in the target system; inputting the network information accessing the target system into the fourth recognition model for recognition processing, and outputting a fourth recognition result, wherein the fourth recognition result is used for indicating whether the network accessing the target system is a botnet or not; inputting the flow characteristic information of the access target system into the fifth recognition model for recognition processing, and outputting a fifth recognition result, wherein the fifth recognition result is used for indicating whether the flow of the access target system is malicious or not; and summarizing the first recognition result, the second recognition result, the third recognition result, the fourth recognition result and the fifth recognition result to obtain the recognition result set.
When executed on a data processing device, the computer program product is further adapted to carry out a program initialized with the method steps of: the first recognition model is obtained by: obtaining M sample mails, wherein the M sample mails at least comprise junk mails and non-junk mails, and M is a positive integer; performing feature extraction processing on the M sample mails to obtain a first data set, wherein the first data set at least comprises S pieces of first feature data, and S is a positive integer; dividing the first dataset into a first training set for training a model and a first test set for testing the model; and training the first neural network model by adopting the first training set to obtain the first recognition model.
When executed on a data processing device, the computer program product is further adapted to carry out a program initialized with the method steps of: after training the first neural network model by using the first training set to obtain the first recognition model, the method further includes: acquiring the training duration for training the first neural network model; calculating the accuracy of the first recognition model by using the first test set; and determining a first test result for testing the first recognition model according to the training duration and the accuracy.
When executed on a data processing device, the computer program product is further adapted to carry out a program initialized with the method steps of: the second recognition model is obtained by: acquiring T pieces of user behavior data, wherein T is a positive integer; obtaining a first word set according to the T pieces of user behavior data, wherein the first word set at least comprises P first words, and P is a positive integer; combining the first words in the first word set to obtain Y first sentences, wherein Y is a positive integer; obtaining a second data set based on the Y first sentences; obtaining a second training set for training a model from the second data set; and training a second neural network model by adopting the second training set to obtain the second recognition model.
When executed on a data processing device, the computer program product is further adapted to carry out a program initialized with the method steps of: the third recognition model is obtained by: obtaining K program files, wherein K is a positive integer; obtaining a third data set according to the K program files; obtaining a third training set for training a model from the third data set; and training a third neural network model by adopting the third training set to obtain the third recognition model.
When executed on a data processing device, the computer program product is further adapted to carry out a program initialized with the method steps of: obtaining a third data set according to the K program files comprises: obtaining a second word set according to the K program files, wherein the second word set at least comprises R second words, and R is a positive integer; combining the second words in the second word set to obtain Z second sentences, wherein Z is a positive integer; determining the importance degree of each second sentence; determining a statement set based on the importance degree of each second statement, wherein the statement set at least comprises W second statements, W is a positive integer, and W is smaller than or equal to Z; and summarizing the W second sentences to obtain the third data set.
When executed on a data processing device, the computer program product is further adapted to carry out a program initialized with the method steps of: inputting the network information accessing the target system into the fourth recognition model for recognition processing, and outputting a fourth recognition result comprises: inputting network information accessing the target system into the fourth identification model for identification processing, and outputting V target IP addresses, wherein the target IP addresses are IP addresses corresponding to target networks, the target networks are networks accessing the target system, and V is a positive integer; determining U first IP addresses from the V target IP addresses, wherein the first IP addresses are IP addresses corresponding to a first network, the number of domain names of the first network attack is larger than a second preset threshold, and U is a positive integer; determining a second IP address from the U first IP addresses, wherein the second IP address is an IP address corresponding to a second network, and the similarity of domain names of the second network attack is greater than a third preset threshold; judging whether the number of the second IP addresses is smaller than a fourth preset threshold value or not; if the number of the second IP addresses is smaller than the fourth preset threshold value, determining that the fourth identification result is that the network accessing the target system is a botnet; and if the number of the second IP addresses is not smaller than the fourth preset threshold value, determining that the network accessing the target system is not a botnet as the fourth identification result.
When executed on a data processing device, the computer program product is further adapted to carry out a program initialized with the method steps of: the fifth recognition model is obtained by: acquiring threat information data, wherein the threat information data at least comprises F flow characteristic data, and F is a positive integer; obtaining a third word set according to F flow characteristic data in the threat information data, wherein the third word set at least comprises H third words, and H is a positive integer; determining the importance degree of each third word; determining X third words from the H third words based on the importance degree of each third word, wherein X is a positive integer and X is smaller than H; summarizing the X third words to obtain a fourth data set; obtaining a fourth training set for training a model from the fourth data set; and training a fourth neural network model by adopting the fourth training set to obtain the fifth recognition model.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or nonvolatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (13)

1. A method for detecting a network attack, comprising:
obtaining a target information set, wherein the target information set at least comprises N pieces of target information, and the target information at least comprises: mail information in a target system, user behavior information in the target system, program information in the target system, network information for accessing the target system and flow characteristic information for accessing the target system, wherein the target system is a system to be detected, and N is a positive integer;
inputting target information in the target information set into N recognition models for recognition processing to obtain a recognition result set, wherein the recognition result set at least comprises a recognition result output by each recognition model;
determining the weight corresponding to each recognition model;
and determining a detection result of the target system according to the identification result output by each identification model and the weight corresponding to each identification model, wherein the detection result is used for indicating whether network attack exists in the target system.
2. The method of claim 1, wherein determining the detection result of the target system based on the recognition result output by each recognition model and the weight corresponding to each recognition model comprises:
determining a first numerical value set according to the identification result output by each identification model, wherein the first numerical value set at least comprises N first numerical values, and the first numerical values are numerical values corresponding to the identification result output by the identification model;
calculating to obtain a target value according to N first values in the first value set and the weight corresponding to each recognition model;
judging whether the target value is larger than a first preset threshold value or not;
if the target value is larger than the first preset threshold value, determining that the detection result is that network attack exists in the target system;
and if the target value is not greater than the first preset threshold value, determining that the detection result is that no network attack exists in the target system.
3. The method of claim 1, wherein the N recognition models include at least: the first recognition model, the second recognition model, the third recognition model, the fourth recognition model and the fifth recognition model, the target information in the target information set is input into the N recognition models for recognition processing, and the obtaining of the recognition result set comprises the following steps:
inputting mail information in the target system into the first recognition model for recognition processing, and outputting a first recognition result, wherein the first recognition result is used for indicating whether junk mail exists in the target system;
inputting the user behavior information in the target system into the second recognition model for recognition processing, and outputting a second recognition result, wherein the second recognition result is used for indicating whether the user behavior in the target system is malicious;
inputting the program information in the target system into the third recognition model for recognition processing, and outputting a third recognition result, wherein the third recognition result is used for indicating whether a malicious program exists in the target system;
inputting the network information for accessing the target system into the fourth recognition model for recognition processing, and outputting a fourth recognition result, wherein the fourth recognition result is used for indicating whether the network accessing the target system is a botnet;
inputting the flow characteristic information for accessing the target system into the fifth recognition model for recognition processing, and outputting a fifth recognition result, wherein the fifth recognition result is used for indicating whether the flow accessing the target system is malicious;
and summarizing the first recognition result, the second recognition result, the third recognition result, the fourth recognition result and the fifth recognition result to obtain the recognition result set.
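The routing described in claim 3 can be sketched as below: each of the five kinds of target information is passed to its own recognition model and the outputs are gathered into one result set. The key names and the callable-based model interface are assumptions made for illustration and are not taken from the claims.

```python
from typing import Any, Callable, Dict

# Hypothetical keys for the five kinds of target information.
INFO_KEYS = ("mail", "user_behavior", "program", "network", "traffic")


def collect_recognition_results(
    target_info: Dict[str, Any],
    models: Dict[str, Callable[[Any], bool]],
) -> Dict[str, bool]:
    """Run each recognition model on its own input and gather the results."""
    results: Dict[str, bool] = {}
    for key in INFO_KEYS:
        # Each model sees only the information type it was trained for.
        results[key] = bool(models[key](target_info[key]))
    return results
```

Keeping the result set keyed by information type makes it straightforward to attach a per-model weight later.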
4. A method according to claim 3, wherein the first recognition model is obtained by:
obtaining M sample mails, wherein the M sample mails at least comprise junk mails and non-junk mails, and M is a positive integer;
performing feature extraction processing on the M sample mails to obtain a first data set, wherein the first data set at least comprises S pieces of first feature data, and S is a positive integer;
dividing the first dataset into a first training set for training a model and a first test set for testing the model;
and training a first neural network model by adopting the first training set to obtain the first recognition model.
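The split step of claim 4 can be sketched as follows. Feature extraction and the neural network itself are outside the scope of this sketch; the split ratio, the seed and the function name are assumptions, since the claim does not specify how the first data set is divided.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    dataset: Sequence[T], test_ratio: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the feature data and split it into (training set, test set)."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    if len(dataset) < 2:
        raise ValueError("at least two samples are needed to split")
    items = list(dataset)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(items)
    n_test = max(1, int(len(items) * test_ratio))
    return items[n_test:], items[:n_test]
```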
5. The method of claim 4, wherein after training a first neural network model using the first training set to obtain the first recognition model, the method further comprises:
acquiring the training duration for training the first neural network model;
calculating the accuracy of the first recognition model by using the first test set;
and determining a first test result for testing the first recognition model according to the training duration and the accuracy.
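A minimal sketch of the evaluation in claim 5 is given below. The claim does not say how the training duration and the accuracy are combined into the first test result, so this sketch simply records both; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ModelTestResult:
    training_seconds: float
    accuracy: float


def accuracy_score(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of test samples whose predicted label matches the true label."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and aligned")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def build_test_result(
    training_seconds: float,
    predictions: Sequence[int],
    labels: Sequence[int],
) -> ModelTestResult:
    # Assumption: the test result is the pair (duration, accuracy).
    return ModelTestResult(training_seconds, accuracy_score(predictions, labels))
```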
6. A method according to claim 3, characterized in that the second recognition model is obtained by:
acquiring T pieces of user behavior data, wherein T is a positive integer;
obtaining a first word set according to the T pieces of user behavior data, wherein the first word set at least comprises P first words, and P is a positive integer;
combining the first words in the first word set to obtain Y first sentences, wherein Y is a positive integer;
obtaining a second data set based on the Y first sentences;
obtaining a second training set for training a model from the second data set;
and training a second neural network model by adopting the second training set to obtain the second recognition model.
7. A method according to claim 3, characterized in that the third recognition model is obtained by:
obtaining K program files, wherein K is a positive integer;
obtaining a third data set according to the K program files;
obtaining a third training set for training a model from the third data set;
and training a third neural network model by adopting the third training set to obtain the third recognition model.
8. The method of claim 7, wherein deriving a third data set from the K program files comprises:
obtaining a second word set according to the K program files, wherein the second word set at least comprises R second words, and R is a positive integer;
combining the second words in the second word set to obtain Z second sentences, wherein Z is a positive integer;
determining the importance degree of each second sentence;
determining a sentence set based on the importance degree of each second sentence, wherein the sentence set comprises at least W second sentences, W is a positive integer, and W is smaller than or equal to Z;
and summarizing the W second sentences to obtain the third data set.
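The steps of claim 8 can be sketched as below. Two assumptions are made that the claim does not state: a "sentence" is modeled as a sliding window of n consecutive words, and the importance degree of a sentence is its total occurrence count across the program files. All names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Sentence = Tuple[str, ...]


def build_sentences(words: Sequence[str], n: int = 2) -> List[Sentence]:
    """Combine consecutive words into n-word sentences (n-grams)."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def select_top_sentences(
    files_words: Sequence[Sequence[str]], w: int, n: int = 2
) -> List[Sentence]:
    """Keep the W most important sentences found in the program files."""
    counts: Counter = Counter()
    for words in files_words:
        counts.update(build_sentences(words, n))
    # Rank by importance (count), breaking ties alphabetically for stability.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [sentence for sentence, _ in ranked[:w]]
```

For program files, the words could for instance be API call names or opcodes, in which case the selected sentences are the most frequent short call sequences.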
9. The method of claim 3, wherein inputting network information for accessing the target system into the fourth recognition model for recognition processing, and outputting a fourth recognition result comprises:
inputting the network information for accessing the target system into the fourth recognition model for recognition processing, and outputting V target IP addresses, wherein each target IP address is the IP address of a target network, the target network is a network accessing the target system, and V is a positive integer;
determining U first IP addresses from the V target IP addresses, wherein each first IP address is the IP address of a first network, the number of domain names attacked by the first network is greater than a second preset threshold, and U is a positive integer;
determining second IP addresses from the U first IP addresses, wherein each second IP address is the IP address of a second network, and the similarity of the domain names attacked by the second network is greater than a third preset threshold;
judging whether the number of the second IP addresses is smaller than a fourth preset threshold;
if the number of the second IP addresses is smaller than the fourth preset threshold, determining that the fourth recognition result is that the network accessing the target system is a botnet;
and if the number of the second IP addresses is not smaller than the fourth preset threshold, determining that the fourth recognition result is that the network accessing the target system is not a botnet.
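The two-stage filtering and final comparison of claim 9 can be sketched as follows. The per-address statistics (domain count and domain-name similarity) are assumed to be available already; how they are computed is not described in the claim, and the record type and field names are hypothetical. The comparison direction follows the claim as written.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class IpStats:
    ip: str
    domain_count: int         # number of domain names attacked from this address
    domain_similarity: float  # similarity score of those domain names


def is_botnet(
    stats: Iterable[IpStats],
    second_threshold: int,
    third_threshold: float,
    fourth_threshold: int,
) -> bool:
    """Apply the thresholds of claim 9 to the V target IP addresses."""
    # Stage 1: keep addresses with more attacked domains than the second threshold.
    first_ips: List[IpStats] = [s for s in stats if s.domain_count > second_threshold]
    # Stage 2: keep addresses whose domain-name similarity exceeds the third threshold.
    second_ips = [s for s in first_ips if s.domain_similarity > third_threshold]
    # As claimed: fewer second IP addresses than the fourth threshold means botnet.
    return len(second_ips) < fourth_threshold
```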
10. A method according to claim 3, wherein the fifth recognition model is obtained by:
acquiring threat information data, wherein the threat information data at least comprises F flow characteristic data, and F is a positive integer;
obtaining a third word set according to F flow characteristic data in the threat information data, wherein the third word set at least comprises H third words, and H is a positive integer;
determining the importance degree of each third word;
determining X third words from the H third words based on the importance degree of each third word, wherein X is a positive integer and X is smaller than H;
summarizing the X third words to obtain a fourth data set;
obtaining a fourth training set for training a model from the fourth data set;
and training a fourth neural network model by adopting the fourth training set to obtain the fifth recognition model.
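The word selection of claim 10 can be sketched as below. The claim does not name the importance measure; TF-IDF is assumed here purely for illustration, with each flow characteristic record treated as one document. The function name and the tokenized input format are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def top_words_by_tfidf(records: Sequence[Sequence[str]], x: int) -> List[str]:
    """Return the X third words with the highest summed TF-IDF score."""
    n_records = len(records)
    doc_freq: Counter = Counter()
    for words in records:
        doc_freq.update(set(words))
    # The claim requires X to be a positive integer smaller than H.
    if not 0 < x < len(doc_freq):
        raise ValueError("x must satisfy 0 < x < H")
    scores: Counter = Counter()
    for words in records:
        if not words:
            continue
        for word, count in Counter(words).items():
            tf = count / len(words)
            idf = math.log(n_records / doc_freq[word])
            scores[word] += tf * idf
    # Highest score first, ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda w: (-scores[w], w))[:x]
```

Under this measure a word that appears in every record scores zero, so very common tokens are dropped before training.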
11. A network attack detection apparatus, comprising:
the first acquisition unit is configured to acquire a target information set, where the target information set includes at least N pieces of target information, and the target information at least includes: mail information in a target system, user behavior information in the target system, program information in the target system, network information for accessing the target system and flow characteristic information for accessing the target system, wherein the target system is a system to be detected, and N is a positive integer;
the first processing unit is used for inputting target information in the target information set into N recognition models to perform recognition processing to obtain a recognition result set, wherein the recognition result set at least comprises a recognition result output by each recognition model;
the first determining unit is used for determining the weight corresponding to each recognition model;
and the second determining unit is used for determining a detection result of the target system according to the recognition result output by each recognition model and the weight corresponding to each recognition model, wherein the detection result is used for indicating whether a network attack exists in the target system.
12. A computer-readable storage medium storing a program, wherein the program performs the network attack detection method according to any one of claims 1 to 10.
13. An electronic device comprising one or more processors and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of detecting a network attack of any of claims 1-10.
CN202310520617.5A 2023-05-09 2023-05-09 Network attack detection method and device, storage medium and electronic equipment Pending CN116827595A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310520617.5A CN116827595A (en) 2023-05-09 2023-05-09 Network attack detection method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116827595A true CN116827595A (en) 2023-09-29

Family

ID=88123014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310520617.5A Pending CN116827595A (en) 2023-05-09 2023-05-09 Network attack detection method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116827595A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination