CN116432184A - Malicious software detection method based on semantic analysis and bidirectional coding characterization

Malicious software detection method based on semantic analysis and bidirectional coding characterization

Info

Publication number
CN116432184A
Authority
CN
China
Prior art keywords
sequence
matrix
function
attention
api
Prior art date
Legal status
Pending
Application number
CN202310588930.2A
Other languages
Chinese (zh)
Inventor
赵运弢
冯永新
刘峻名
Current Assignee
Shenyang Ligong University
Original Assignee
Shenyang Ligong University
Priority date
Filing date
Publication date
Application filed by Shenyang Ligong University filed Critical Shenyang Ligong University
Priority to CN202310588930.2A priority Critical patent/CN116432184A/en
Publication of CN116432184A publication Critical patent/CN116432184A/en
Pending legal-status Critical Current

Classifications

    • G06F21/563 Static detection by source code analysis (computer malware detection or handling)
    • G06F16/35 Clustering; Classification (information retrieval of unstructured textual data)
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F40/30 Semantic analysis (handling natural language data)
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

Aiming at the problems that traditional models suffer from polysemous word representations and lack contextual semantics when detecting malicious code, the invention provides a malware detection method based on semantic analysis and bidirectional coding characterization. The method combines BERT with a convolutional-recurrent network built on an external attention mechanism, uses malware API function call sequences as the features the model learns, and statically analyzes those sequences to detect malware. Because API call sequences are correlated in context and semantics, BERT is used for the word representation task and receives semantic information from the sequences. A convolutional neural network and a long short-term memory network respectively perform secondary feature extraction and mine the chained relations between API functions. An attention mechanism added after the long short-term memory network better focuses on key information in the text, reduces the influence of noise, and improves accuracy in the text classification task. The method is unaffected by mutation and deformation of malicious code, and its accuracy reaches 98.81%.

Description

Malicious software detection method based on semantic analysis and bidirectional coding characterization
Technical Field
The invention belongs to the field of malware detection within computer security technology, and in particular relates to a malware detection method based on semantic analysis and bidirectional coding characterization.
Background
With the rapid development and popularization of information technology, computers have become an integral part of modern society. However, as computer application scenarios grow more complex, security problems become increasingly prominent: the variety and quantity of malware are expanding rapidly, and its propagation methods are constantly changing. Many problems, including intrusion detection, virus classification, spam analysis, and phishing prevention, have made network security a pressing concern.
In recent years, advanced viruses and advanced persistent threat attacks against industrial control systems have become more frequent; detecting such viruses, which spawn large numbers of variants, has become ever more laborious for methods based on fixed features, and the information security problems of industrial control systems have grown more prominent. As network attacks become more complex, a variety of new malware, including Trojan horses, botnets, adware, and spyware, becomes more damaging and challenging. Virus species are also produced and updated rapidly, posing a greater threat to the Internet. The Atlas VPN team estimated 1.9 million Linux malware samples in 2022, a 50% increase over the previous year. In the third quarter of that year there were 75,841 malware samples targeting Linux, a year-on-year increase of 91%; in the fourth quarter there were 164,697 samples, a year-on-year increase of 117%. Unfortunately, classical security technologies such as antivirus software cannot cope with the rapidly growing diversity of malware, leaving people in doubt about the effectiveness and trustworthiness of the methods currently in use.
In today's globalized world, anyone's computer can become a victim. Moreover, the development of the Internet of Things has enabled everything to be connected and to exchange information over a network, but this also allows malware to spread widely across the many platforms of interconnected devices, and the Internet of Things ecosystem is extremely vulnerable to the large volume of malware attacks seen on traditional computers and smartphones. In addition, the rapid adoption of the Android platform on mobile devices has made detecting malware attacks a challenging task. To fundamentally resolve the crisis caused by malware, the sustainability and safety of Internet of Things development can only be ensured by continuously searching for new solutions and strengthening security measures. It is therefore necessary to address the shortcomings of conventional malware analysis methods and to develop a more effective solution: an intelligent analysis method that is efficient, practical, and able to cope with malware changes.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a malware detection method based on semantic analysis and bidirectional coding characterization. Starting from the detection efficiency of malicious code, it combines semantic analysis with bidirectional coding characterization, improving the accuracy of detecting deformed malicious code and ensuring the robustness of the model when it runs in different environments, platforms, and operating systems. At the same time, the process of manually labeling data is eliminated; by using detection based on the semantic relations and context information of the data, the intrusion behavior of malicious code can be accurately identified, ensuring the security and stability of the computer system.
The malicious software detection method based on semantic analysis and bidirectional coding characterization comprises the following steps:
step 1: acquiring a malicious software data set, storing the malicious software data set in a CSV file form, and extracting an API function call sequence in the data set;
first, a malware dataset is downloaded; the dataset contains basic information for multiple malware samples, each record comprising the following features: sha256 hash value, label, header information, import function library, export function library, section information, string information, sliding-window entropy, linker version, submission size, system version, and subsystem version; the import function library contains the malware API functions;
after the dataset is obtained, for the basic information of each malware sample, the API functions are extracted from the import function library using a Python third-party library while preserving their order in the library, yielding a sequence composed of API functions; that is, each malware sample corresponds to one API function call sequence; finally, two fields, the malware family name and the corresponding API function call sequence, are saved to a CSV file;
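As an illustrative sketch of the extraction step above (the record layout and the field names `family` and `apis` are hypothetical, and the API names are examples; a real pipeline would read them from each sample's import function library with a PE-parsing library):

```python
import csv
import io

def write_api_sequences(records, fh):
    """Write one (family, api_sequence) CSV row per malware sample.

    Each record is assumed to carry the family name and the API functions
    in the order they appear in the sample's import function library.
    """
    writer = csv.writer(fh)
    writer.writerow(["family", "api_sequence"])
    for rec in records:
        # Preserve the original ordering of the API calls.
        writer.writerow([rec["family"], " ".join(rec["apis"])])

records = [{"family": "Trojan.Agent",
            "apis": ["CreateFileA", "WriteFile", "CloseHandle"]}]
buf = io.StringIO()
write_api_sequences(records, buf)
```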
step 2: performing word vectorization on the API function call sequence obtained in the step 1 by adopting a BERT model, so as to generate the feature of the word embedding type;
The BERT model consists of multiple Transformer layers; based on bidirectional coding characterization, the BERT model can treat the API function call sequence of each malicious sample as a text sentence with contextual semantics;
step 2.1: first, Unicode normalization is applied to the malware API call sequences in the CSV file; the API call sequences are then tokenized, splitting each sequence into single characters or character combinations, after which a word segmentation algorithm segments the tokenized text;
step 2.2: constructing the input sequence of the BERT model; special tags are added to the malware API call sequence, including [CLS] and [SEP], where the [CLS] tag marks the beginning of the sequence and the [SEP] tag separates different sentences or paragraphs; the tagged sequence is then converted into 768-dimensional embedding vectors, and a position encoding is added to each embedding vector to represent the position of each API function in the call sequence; a special [MASK] tag is then used to randomly replace the vector values corresponding to particular API functions in the position-encoded embedding vectors; finally, all the embedding vectors are grouped into batches and fed into the Transformer model for further processing;
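A toy sketch of this input construction (the 8-dimensional sinusoidal position encoding stands in for the 768 dimensions of the real model, and the API names are examples; the real model would use learned embeddings via a BERT tokenizer):

```python
import math

def build_bert_input(api_calls, dim=8, mask_index=None):
    """Build a toy BERT-style input: special tags plus position encodings.

    mask_index optionally replaces one API token with [MASK], mirroring
    the random masking described in step 2.2.
    """
    tokens = ["[CLS]"] + list(api_calls) + ["[SEP]"]
    if mask_index is not None:
        tokens[1 + mask_index] = "[MASK]"  # mask one API function
    # One dim-length sinusoidal position vector per token position.
    pos_enc = [
        [math.sin(p / 10000 ** (i / dim)) if i % 2 == 0
         else math.cos(p / 10000 ** ((i - 1) / dim))
         for i in range(dim)]
        for p in range(len(tokens))
    ]
    return tokens, pos_enc

tokens, pe = build_bert_input(
    ["LoadLibraryA", "GetProcAddress", "VirtualAlloc"], mask_index=1)
```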
Step 2.3: adding a multi-head attention mechanism to all the embedded vectors obtained in the step 2.2;
step 2.3.1: generating a transformation matrix based on the embedded vector;
BERT divides the tagged embedding vectors into h parts, i.e., h attention heads; each attention head computes an attention weight matrix used to characterize the correlation between different words in the input; specifically, each attention head generates a query matrix Q, a key matrix K, and a value matrix V through three linear transformations (for the query, key, and value vectors), specifically:
Q = E·W_Q, K = E·W_K, V = E·W_V
where W_Q, W_K, and W_V are the linear transformation matrices for the respective vectors, and E is the vector matrix formed by concatenating all input vectors;
step 2.3.2: computing one part of the embedding vectors, i.e., the single attention mechanism corresponding to one attention head;
the vector matrix E is used in the computation together with the query matrix Q, key matrix K, and value matrix V, yielding the single self-attention output, specifically:
Attention(Q, K, V) = Softmax(Q·K^T / √d_k)·V
where Q·K^T is the similarity score between the query matrix Q and all key vectors; the scaling factor √d_k prevents the Q·K^T products from becoming too large; and the Softmax function normalizes each row vector after the operation so as to compute the importance of each word to the other words;
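A minimal NumPy sketch of this single-head scaled dot-product attention (toy sizes; the projection matrices are random stand-ins for learned weights):

```python
import numpy as np

def scaled_dot_attention(E, W_q, W_k, W_v):
    """Single-head self-attention: Softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = E @ W_q, E @ W_k, E @ W_v      # the three linear projections
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # scaled similarity scores
    # Row-wise softmax turns each row of scores into attention weights.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 8))                  # 4 API tokens, 8-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = scaled_dot_attention(E, W_q, W_k, W_v)
```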
Step 2.3.3: calculating a multi-head attention mechanism of all embedded vectors;
after calculating the output of a single self-attention mechanism, the output of a multi-head attention mechanism is obtained as follows:
Attention_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
that is, a weighted average Attention_i is computed via the attention mechanism; here the i-th element represents the scores obtained from the inner products of the i-th row of the query matrix Q·W_i^Q with all rows of the key matrix K·W_i^K; these scores are normalized by a Softmax function to obtain weights, and the weighted sum finally yields Attention_i
Step 2.3.4: splicing the output of the multi-head attention mechanism;
all attention matrices are concatenated column-wise into one larger matrix, and the final output is obtained by a linear transformation with the weight matrix, as follows:
Y = Multi(Q, K, V) = concat(Attention_1, Attention_2, ..., Attention_8)·W
where Multi(Q, K, V) denotes the multi-head attention computation over the input query matrix Q, key matrix K, and value matrix V; concat(Attention_1, ..., Attention_8) denotes concatenating the attention matrices produced by the heads column-wise into one larger matrix; W denotes the weight matrix; multiplying the concatenated multi-head attention matrix by the weight matrix yields the weighted and summed query vectors;
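The multi-head computation and concatenation of steps 2.3.3-2.3.4 can be sketched as follows (toy sizes: eight heads of width 2 over a 16-dimensional input, with random stand-ins for the learned per-head projections and the weight matrix W):

```python
import numpy as np

def multi_head_attention(E, head_projs, W_o):
    """Concatenate per-head attention outputs column-wise, then apply W_o."""
    outs = []
    for W_q, W_k, W_v in head_projs:
        Q, K, V = E @ W_q, E @ W_k, E @ W_v
        s = Q @ K.T / np.sqrt(K.shape[-1])
        a = np.exp(s - s.max(axis=-1, keepdims=True))
        a /= a.sum(axis=-1, keepdims=True)   # row-wise softmax per head
        outs.append(a @ V)
    # concat(Attention_1, ..., Attention_h) . W
    return np.concatenate(outs, axis=-1) @ W_o

rng = np.random.default_rng(1)
E = rng.normal(size=(4, 16))
# 8 heads, each projecting the 16-dim input down to 2 dims (8 * 2 = 16).
heads = [tuple(rng.normal(size=(16, 2)) for _ in range(3)) for _ in range(8)]
W_o = rng.normal(size=(16, 16))
Y = multi_head_attention(E, heads, W_o)
```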
step 2.3.5: applying layer normalization to the weighted and summed query vectors; this comprises two steps: first, a translation operation, i.e., adding a bias term to the linearly transformed result; second, a scaling operation, i.e., dividing by a standard deviation;
Finally multiplying the normalized vector with a value matrix to obtain an attention score, and multiplying the attention score with the value matrix to obtain a final output vector;
step 2.3.6: residual connection is carried out on the output vector and the input embedded vector, and the output vector after residual connection is used as the input of the next step;
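Under the standard Transformer formulation, the layer normalization of step 2.3.5 and the residual connection of step 2.3.6 amount to an add-and-norm block; a minimal sketch (the gain gamma and bias beta default to 1 and 0 here, whereas the real model learns them):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each row to zero mean / unit variance, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sd + eps) + beta

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer_out)

rng = np.random.default_rng(2)
x = rng.normal(size=(4, 8))                 # input embedding vectors
y = add_and_norm(x, rng.normal(size=(4, 8)))  # sub-layer output added back
```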
step 3: constructing a ConvLSTM neural network architecture;
a convolutional neural network (CNN), a long short-term memory network (LSTM), and an external attention mechanism are combined to construct the ConvLSTM neural network architecture; the residual-connected embedding vectors from step 2 serve as the input to the ConvLSTM architecture, which is then trained;
step 3.1: establishing a plurality of one-dimensional convolutional neural networks CNN for extracting local features;
the CNN local feature extraction formula is as follows:
p(i) = f( Σ_{x,y} w_{x,y}^{(i)} · c_{x,y} + b_i )
where p(i) is the value of the i-th node in the output unit matrix; w_{x,y}^{(i)} is the weight connecting filter input node (x, y) to the i-th node of the output unit matrix; b_i is the bias term corresponding to the i-th output node; c_{x,y} is the value of node (x, y) in the filter window; and f is the activation function; the vector of all p(i) values is the feature map obtained from the convolutional layer, denoted p, which is the input to the LSTM network architecture;
Step 3.2: after the CNN network is built, an LSTM network architecture is further introduced to process the long sequences; the LSTM uses gating units to maintain an up-to-date memory of the stored API call sequence, comprising forget gate, input gate, and output gate components; the gating unit decides which gate to use according to the sequence extracted by the CNN and adjusts the triggering of each gate with a sigmoid activation function;
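The gating unit of step 3.2 follows the standard LSTM cell equations; a minimal sketch of one time step (the weight matrices are random stand-ins for learned parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step with forget (f), input (i) and output (o) gates."""
    n = h.shape[0]
    z = W @ x + U @ h + b                 # all four gate pre-activations stacked
    f = sigmoid(z[0:n])                   # forget gate: what to drop from c
    i = sigmoid(z[n:2 * n])               # input gate: what to write to c
    o = sigmoid(z[2 * n:3 * n])           # output gate: what to expose as h
    g = np.tanh(z[3 * n:4 * n])           # candidate cell update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(3)
n_in, n_hid = 6, 4
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
```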
step 3.3: adding an external attention mechanism that weights the output value of each time step, reinforcing the weight of useful information; the CNN, LSTM, and external attention mechanism are fused to construct the final ConvLSTM neural network architecture, and a softmax function is applied to complete the detection and classification of malware;
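The per-time-step weighting of step 3.3 can be sketched as a score vector softmaxed over the LSTM's time steps (the score vector here is a random stand-in for a learned parameter):

```python
import numpy as np

def attention_pool(H, w):
    """Weight each time step (row of H) by score vector w, then pool."""
    scores = H @ w
    a = np.exp(scores - scores.max())
    a /= a.sum()                          # softmax over time steps
    return a @ H, a                       # weighted sum of hidden states

rng = np.random.default_rng(4)
H = rng.normal(size=(10, 4))              # 10 time steps of 4-dim LSTM output
pooled, a = attention_pool(H, rng.normal(size=4))
```

The pooled vector would then feed the final softmax classification layer.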
step 4: training and optimizing ConvLSTM neural network architecture;
step 4.1: adding cross-validation to optimize the ConvLSTM neural network architecture; first, the word-vectorized API function call dataset is randomly divided into a training set and a test set, and a cross-validation method is added; in the cross-validation process, the API function call sequence dataset is first randomly divided into several equally sized subsets; in each round, one subset serves as the validation set while the remaining subsets are used to train the model, which is then tested on the held-out subset; this process is repeated several times, selecting a different subset as the test set each time, finally yielding model performance indices measured on the different subsets;
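The fold construction described in step 4.1 can be sketched as follows (a simple index-interleaving scheme; a real pipeline might shuffle the indices first):

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    # Fold i takes every k-th index starting at i, so folds are disjoint
    # and together cover all n_samples indices.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits = list(k_fold_indices(10, 5))
```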
Step 4.2: EDA data enhancement method is added for the data set; EDA generates new training data by randomly transforming the original API sequence data; multiple transformations were performed on each sample, specifically using the following three transformations:
step 4.2.1: random insertion; a position is randomly selected in an API call sequence, and an automatically generated API function is inserted at that position;
step 4.2.2: random deletion; a function is randomly selected from the API call sequence and deleted from the sequence;
step 4.2.3: random swap; two adjacent API functions are randomly selected and their positions exchanged;
step 4.2.4: generating and storing a new data set; mixing the transformed sample with the original sample to form a new data set to be stored in a file;
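The three EDA transforms of steps 4.2.1-4.2.3 can be sketched as follows (the placeholder name `GeneratedApi` and the sample API names are hypothetical stand-ins):

```python
import random

def eda_transform(seq, rng, generated_api="GeneratedApi"):
    """Apply one random EDA transform: insert, delete, or adjacent swap."""
    s = list(seq)                         # leave the original sample intact
    op = rng.choice(["insert", "delete", "swap"])
    if op == "insert":
        # Insert an automatically generated API function at a random position.
        s.insert(rng.randrange(len(s) + 1), generated_api)
    elif op == "delete" and len(s) > 1:
        del s[rng.randrange(len(s))]      # drop one randomly chosen function
    elif op == "swap" and len(s) > 1:
        i = rng.randrange(len(s) - 1)     # swap two adjacent functions
        s[i], s[i + 1] = s[i + 1], s[i]
    return s

rng = random.Random(42)
base = ["CreateFileA", "WriteFile", "RegSetValueA", "CloseHandle"]
augmented = [eda_transform(base, rng) for _ in range(5)]
```

The transformed samples would then be mixed with the originals to form the new dataset of step 4.2.4.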
step 5: evaluating the ConvLSTM neural network architecture optimized in the step 4, and testing the efficiency and accuracy of malware detection;
three evaluation indexes for classifying problems are selected to evaluate the ConvLSTM neural network architecture, specifically, accuracy, F1 score and loss value;
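The accuracy and F1 indices of step 5 reduce to their usual definitions; a self-contained sketch for binary labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive and t != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```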
step 6: constructing a malicious software API call sequence detection system, and visualizing a detection result; specifically, a related library used for constructing a visual interface in Python is used for constructing a visual system platform;
Step 6.1: designing a system interface; detecting system requirements and functions according to the API call sequence, and constructing a user interface meeting the system requirements; then, performing interface layout design by using a visualization technology in Python;
step 6.2: adding system functions; the functions include selecting and uploading a file, selecting a model, displaying a prediction result image, and displaying a single-sequence prediction window;
step 6.2.1: implementing the upload-file function; first, a file dialog is created with a Python visualization component and the allowed upload file types are set; an action button is then set up to open the file dialog and obtain the path and name of the selected file; finally, the file's data is visualized with a display function;
step 6.2.2: realizing a model selection function; firstly, placing a trained model under a designated folder, displaying the name of the model by using a drop-down frame component, and completing model selection by mouse selection;
step 6.2.3: implementing the prediction result display window; after the dataset and model are selected, a prediction result display window and a display button are created with the visualization component, and the model loss value and accuracy are obtained so that clicking the button displays them in the window;
Step 6.2.4: the function of a prediction result image display window is realized; creating a predicted result image display window by using a visualization component, creating a display result image button, and acquiring a loss value and an accuracy image, so that clicking the result image button displays the loss value and the accuracy image;
step 6.2.5: implementing the single-sequence prediction window; a single-sequence prediction display window and a display button are created with the visualization component, and the single-sequence prediction is obtained so that clicking the button displays the sequence's corresponding label and malicious family category.
The invention has the beneficial technical effects that:
the method analyzes the importance degree and actual requirement of the context-based semantic analysis on the detection of the malicious code API call sequence, and knows that the API call sequence is irrelevant to a specific virus form and an execution environment and has great universality; secondly, according to the related technology based on semantic analysis models and malicious code detection in the past, comparing the advantages and disadvantages of different semantic analysis models, and providing a detection model based on context semantic analysis and bidirectional coding characterization; on the basis, constructing a ConvLSTM model to finish the detection of the API call sequence of the malicious code; finally, a malicious code API call sequence detection system is built by using a PyQt technology, and a more visual malicious code detection effect is provided.
Drawings
FIG. 1 is a flowchart of a method for malware detection based on semantic analysis and bi-directional coding characterization in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of an input sequence for BERT construction according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an API call sequence for vectorization of BERT generation words in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of a multi-head attention mechanism added by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a ConvLSTM neural network architecture according to an embodiment of the present invention;
FIG. 6 is a diagram of a malware API call sequence detection system interface in accordance with an embodiment of the present invention;
FIG. 7 is a single sequence prediction window interface diagram of a malware API call sequence detection system in accordance with an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples;
the overall flow of the malware detection method based on semantic analysis and bidirectional coding characterization is shown in fig. 1; the method comprises the following steps:
step 1: acquiring a malicious software data set, and extracting an API function call sequence in the data set;
first, a malware dataset relating to API function call sequences is downloaded from a malware repository; the dataset contains basic information for multiple malware samples, each record including features such as the sha256 hash value, label, header information, import function library, export function library, section information, string information, sliding-window entropy, linker version, submission size, system version, and subsystem version; the import function library contains a large number of malware API functions;
After the data set is downloaded and acquired, extracting the API function of each malicious software from the import function library by using a Python third party library, and simultaneously reserving the sequence of the API function in the import function library to obtain a sequence composed of the API functions, namely, one malicious software corresponds to one API function call sequence, and finally saving two fields of the family name of the malicious software and the corresponding API function call sequence into a CSV file;
step 2: and carrying out word vectorization on the API function call sequence by adopting the BERT model to generate the feature of the word embedding type.
As shown in fig. 3;
the BERT model consists of multiple Transformer layers; based on bidirectional coding characterization, the BERT model can treat the API function call sequence of each malicious sample as a text statement with contextual semantics;
step 2.1: firstly, carrying out Unicode standardization on a malicious software API call sequence in a CSV file, then carrying out token operation on the API call sequence, dividing the sequence into single characters or some combined characters, and then adopting a word segmentation algorithm to segment the text after token;
step 2.2: constructing the input sequence of the BERT model, as shown in figure 2; for the BERT model to understand the incoming malicious-code API call sequence, special tags must be added to the sequence, including [CLS] and [SEP], where the [CLS] tag indicates the beginning of the sequence and the [SEP] tag separates different sentences or paragraphs; the tagged sequence is then converted into 768-dimensional embedding vectors, and a position encoding is added to each embedding vector to represent the position of each API function in the call sequence; adding position encodings gives the subsequent Transformer model the ability to learn word order, i.e., to capture the relative positions of the API functions in the input sequence; a special [MASK] tag is then used to randomly replace the vector values corresponding to particular API functions in the position-encoded embedding vectors; finally, all the embedding vectors are grouped into batches and fed into the Transformer model for further processing;
Step 2.3: adding a multi-headed attention mechanism to the embedded vector as shown in fig. 4;
step 2.3.1: generating a transformation matrix based on the embedded vector;
BERT divides the tagged embedding vectors into h parts, i.e., h attention heads, each of which computes an attention weight matrix that characterizes the correlation between different words in the input. Specifically, each attention head uses three linear transformations (for the query vector, key vector, and value vector) to generate a query matrix Q, a key matrix K, and a value matrix V, as follows:
Q = E·W_Q, K = E·W_K, V = E·W_V
where W_Q, W_K, and W_V are the linear transformation matrices for the respective vectors, and E is the vector matrix formed by concatenating all input vectors.
Step 2.3.2: a single attention mechanism is calculated.
Then, the vector matrix E is used in the computation with the query, key, and value matrices, yielding the self-attention output, with the formula:
Attention(Q, K, V) = Softmax(Q·K^T / √d_k)·V
where Q·K^T is the similarity score between the query matrix Q and all key vectors; the scaling factor √d_k prevents the Q·K^T products from becoming too large; and the Softmax function normalizes each row vector after the operation so as to compute the importance of each word to the other words.
Step 2.3.3: a multi-headed attention mechanism is calculated.
After calculating the output of a single self-attention mechanism, the output of a multi-head attention mechanism is obtained as follows:
Attention_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
Specifically, the attention mechanism computes a weighted average, namely Attention_i. The i-th element represents the score obtained from the inner product of the i-th row of the query matrix QW_i^Q with all rows of the key matrix KW_i^K; these scores are normalized by a Softmax function to obtain weights, and a weighted sum with these weights finally yields Attention_i.
Step 2.3.4: output of spliced multi-head attention mechanism
Finally, all the Attention matrixes are connected together according to columns to form a larger matrix, and the final output is obtained by linear transformation of the weight matrix, wherein the formula is as follows:
Y = Multi(Q, K, V) = concat(Attention_1, Attention_2, ..., Attention_8)·W
wherein Multi(Q, K, V) represents performing the multi-head attention calculation on the input query matrix Q, key matrix K and value matrix V; concat(·) represents connecting the Attention matrices obtained from the multiple heads by columns to form a larger matrix; W represents the weight matrix, and the final output matrix Y is obtained by applying this linear transformation to the concatenated matrix.
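The multi-head computation and column-wise concatenation of steps 2.3.3 and 2.3.4 can be sketched as follows. The head count, dimensions and random projection matrices are toy assumptions for illustration only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(E, heads=8, d_model=64, seed=0):
    """Y = concat(Attention_1, ..., Attention_h) . W, per the formula above."""
    rng = np.random.default_rng(seed)
    d_k = d_model // heads
    outputs = []
    for _ in range(heads):
        W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        Q, K, V = E @ W_Q, E @ W_K, E @ W_V
        scores = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        outputs.append(scores @ V)                  # one Attention_i per head
    W = rng.standard_normal((heads * d_k, d_model))
    return np.concatenate(outputs, axis=-1) @ W     # column concat, then linear map

E = np.random.default_rng(1).standard_normal((5, 64))   # 5 API tokens
Y = multi_head_attention(E)
```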
Step 2.3.5: layer standardization treatment;
Layer normalization is then applied to the weighted and summed query vectors; this process is divided into two steps: the first step is a translation operation, i.e. adding a bias term to the linearly transformed result, to strengthen the expressive capacity of the model; the second step is a scaling operation, i.e. dividing by a standard deviation, so that the outputs have the same variance; this process improves the convergence speed and stability of the model, allowing it to better cope with different input data;
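The scaling and shifting described in step 2.3.5 correspond to standard layer normalization: each vector is normalized to zero mean and unit variance and then rescaled and shifted by learnable parameters. A minimal numpy sketch (with scalar gamma/beta standing in for the learned parameters):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each row to zero mean / unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x)
```

The small `eps` term guards against division by zero when a row has near-zero variance.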
the normalized vector is finally multiplied by the value matrix to obtain an attention score, and the attention score is multiplied by the value matrix to obtain the final output vector;
step 2.3.6: residual connection;
The output vector is residual-connected with the input embedding, so that the model can better learn information at different levels and aspects of the input. Normalization operations, including batch normalization and residual normalization, are performed after the residual connection to improve the stability and training effect of the BERT model.
Step 3: constructing ConvLSTM neural network architecture.
In the present design, the BERT model is combined with the ConvLSTM architecture to realize incremental fine-tuning; fig. 5 shows a schematic diagram of the ConvLSTM neural network architecture.
Step 3.1: a plurality of one-dimensional convolutional neural networks (CNNs) are established for extracting local features. On the one hand, multiple convolution kernels fully extract features with stronger discrimination; on the other hand, the parameters of the convolution layer are reduced and the running time is shortened.
The API function call sequence of malware may be regarded as a sequence of operational instructions, which can be processed using network models from text classification and sentiment analysis. A CNN can summarize local feature predictors in a given structure and combine them to generate a feature matrix representing that structure; it extracts local features of different sizes by setting different filter kernel sizes. The output vector matrix T of the BERT layer is used as the input to the CNN, and the convolution kernels slide over the sentence word-vector matrix. Each convolution kernel is multiplied element-wise with the corresponding window of the sentence word-vector matrix and the products are summed; these values are used as the eigenvalues of the final eigenvector matrix, and all eigenvalues form the feature map.
The present design uses a multi-kernel approach with three filters of sizes 2, 3 and 4 to fuse the convolution layers, and the number of convolution kernels for each size is set to 128 in order to extract different text features. The filter converts the 3 x 1 node matrix into a unit node matrix. The formula of the CNN for local feature extraction is as follows:
p(i) = f( Σ_{x,y} w_{x,y}^i · c_{x,y} + b_i )
wherein p(i) represents the value of the i-th node in the unit matrix; w_{x,y}^i represents the weight of the filter input node (x, y) for the i-th node in the output unit node matrix, and b_i represents the bias term parameter corresponding to the i-th output node; c_{x,y} is the value of the node (x, y) in the filter; f is the activation function. The unit vector of all p(i) is the feature map obtained from the convolutional layer, denoted as p, which is the input of the next layer.
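The multi-kernel sliding-window extraction described above can be sketched with numpy. The filter counts, kernel sizes and random weights are toy stand-ins (the design itself uses 128 filters per size over 768-dimensional BERT vectors); ReLU and max-over-time pooling are common choices assumed here for the activation f and the pooling step:

```python
import numpy as np

def conv1d_features(T, kernel_sizes=(2, 3, 4), n_filters=4, seed=0):
    """Slide 1-D filters of several widths over the token matrix T (len, dim)."""
    rng = np.random.default_rng(seed)
    seq_len, dim = T.shape
    pooled = []
    for k in kernel_sizes:
        W = rng.standard_normal((n_filters, k, dim))
        b = rng.standard_normal(n_filters)
        # p(i): sum of element-wise products of window and filter, plus bias
        feat = np.array([[(T[i:i + k] * W[f]).sum() + b[f]
                          for i in range(seq_len - k + 1)]
                         for f in range(n_filters)])
        feat = np.maximum(feat, 0.0)          # ReLU as the activation f
        pooled.append(feat.max(axis=1))       # max-over-time pooling per filter
    return np.concatenate(pooled)             # fused multi-kernel feature vector

T = np.random.default_rng(1).standard_normal((10, 8))   # 10 tokens, toy dim 8
p = conv1d_features(T)
```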
Step 3.2: after the CNN network is built, an LSTM network architecture is further introduced to process long sequences, alleviating the gradient vanishing and gradient explosion problems of traditional recurrent neural networks. Unlike ordinary neurons, LSTM uses gating units, comprising a forget gate, an input gate and an output gate, to store the latest memory of the API call sequence. The gating unit decides which gate to use according to the sequence extracted by the CNN, and adjusts the triggering of each gate with a sigmoid activation function, thereby realizing precise control of the data and information flow inside the unit.
In the present design, after the LSTM network is added, the weights and biases of the LSTM are initialized randomly. For each time step, the LSTM network first computes the input gate to determine which information should be passed into the LSTM cell. Next, the forget gate determines which historical information should be retained. Then, by combining the forget gate with the previous cell state and adding the gated input of the current time step, a new memory cell state is obtained; this state is preserved and passed on to the next time step. Finally, the LSTM computes the output gate to determine the output of the time step: the memory cell state is passed through a tanh activation function and combined with the output gate to obtain the new output. During parameter tuning, the difference between the predicted output and the actual output is first computed, and the weights and biases of the LSTM are updated using the back-propagation algorithm. In this way, the LSTM model gradually learns the patterns of the API call sequence data.
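One LSTM time step, with the three gates described above, can be sketched in numpy. Dimensions and random weights are illustrative; the stacked weight layout (input, forget, output, candidate) is a common convention assumed here, not taken from the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: input, forget and output gates plus candidate cell."""
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # stacked pre-activations, shape (4d,)
    i = sigmoid(z[:d])                  # input gate
    f = sigmoid(z[d:2 * d])             # forget gate
    o = sigmoid(z[2 * d:3 * d])         # output gate
    g = np.tanh(z[3 * d:])              # candidate memory
    c = f * c_prev + i * g              # new memory cell state
    h = o * np.tanh(c)                  # new hidden output for this time step
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 6, 4
W = rng.standard_normal((4 * d_h, d_in))
U = rng.standard_normal((4 * d_h, d_h))
b = np.zeros(4 * d_h)
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```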
Step 3.3: an external attention mechanism is added after the LSTM architecture, weighting the output value of each time step and reinforcing the weight of useful information. Finally, the softmax function is applied to complete the detection and classification of the malware.
Step 4: the ConvLSTM neural network architecture is optimized.
Step 4.1: adding cross-validation to optimize the ConvLSTM neural network architecture. First, the word-vectorized API function call data set is randomly divided into a training set and a test set in an 8:2 ratio, and a ten-fold cross-validation method is added to improve the performance of the ConvLSTM neural network architecture. In the cross-validation process, the API function call sequence data set is first randomly divided into 10 subsets of equal size; in each round, one subset serves as the validation set and the remaining subsets serve as the training set for the ConvLSTM neural network architecture, which is then tested on the held-out subset. This process is repeated 10 times, selecting a different subset as the test set each time, resulting in 10 performance measurements of the ConvLSTM neural network architecture.
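The ten-fold splitting procedure can be sketched in pure Python (the shuffle seed and sample count are arbitrary for the example):

```python
import random

def k_fold_indices(n_samples, k=10, seed=42):
    """Randomly partition sample indices into k equal-sized folds and yield
    (train, validation) index lists, one round per fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]                                       # held-out fold
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits = list(k_fold_indices(100))
```

Each round trains the model on `train` and evaluates on `val`; averaging the 10 scores gives the cross-validated performance estimate.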
Step 4.2: adding the EDA data enhancement method. EDA generates new training data by randomly transforming the original API sequence data. Each sample is transformed multiple times; the present design uses the following three transforms.
Step 4.2.1: random insertion. An API function is randomly selected from an API call sequence, and an automatically generated API function is inserted at that location.
Step 4.2.2: random deletion. A function is randomly selected from the API call sequence and deleted from it.
Step 4.2.3: random exchange. Two adjacent API functions are randomly selected and their positions are swapped.
Step 4.2.4: a new data set is generated and stored. And mixing the transformed samples with the original samples to form a new data set, and storing the new data set in a file.
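The three EDA transforms of steps 4.2.1-4.2.3 and the mixing of step 4.2.4 can be sketched as follows; the API names and the `"GeneratedApi"` placeholder for an automatically generated function are hypothetical:

```python
import random

def eda_augment(seq, op, rng):
    """Apply one of the three EDA transforms to an API call sequence."""
    seq = list(seq)
    if op == "insert":                        # random insertion
        pos = rng.randrange(len(seq) + 1)
        seq.insert(pos, "GeneratedApi")       # stand-in for an auto-generated API
    elif op == "delete" and len(seq) > 1:     # random deletion
        seq.pop(rng.randrange(len(seq)))
    elif op == "swap" and len(seq) > 1:       # swap two adjacent functions
        i = rng.randrange(len(seq) - 1)
        seq[i], seq[i + 1] = seq[i + 1], seq[i]
    return seq

rng = random.Random(0)
base = ["LoadLibraryA", "GetProcAddress", "VirtualAlloc", "WriteProcessMemory"]
augmented = [eda_augment(base, op, rng) for op in ("insert", "delete", "swap")]
dataset = [base] + augmented        # mix original and transformed samples
```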
Step 5: and evaluating the ConvLSTM neural network architecture, and testing the efficiency and accuracy of malware detection.
The present design selects the following three evaluation indexes for classification problems: accuracy, F1 score and loss value. The following quantities are defined: TP (true positive), FN (false negative), FP (false positive) and TN (true negative).
Accuracy is an evaluation index in classification tasks and is used for measuring the Accuracy degree of model prediction. It represents the ratio of the number of samples correctly predicted by the model to the total number of samples.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The F1 score is a comprehensive evaluation index for classification tasks, combining the Precision and Recall of the model. Precision refers to how many of the samples predicted as positive by the model are truly positive, and Recall refers to how many of the actually positive samples are correctly predicted by the model. The F1 score is the harmonic mean of these two indexes and is calculated as follows:
F1 = 2 · Precision · Recall / (Precision + Recall)
For the loss function, the cross-entropy loss function is used. It measures the performance of the classification model and represents the difference between the true sample label and the predicted probability; the smaller the difference, the better the prediction. The cross entropy H(p, q) of the probability distribution p relative to the probability distribution q is calculated as follows:
H(p, q) = -Σ_x p(x)·log q(x)
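The three evaluation indexes can be computed directly from the TP/TN/FP/FN counts and the predicted distribution; the counts used in the example below are arbitrary illustrative numbers, not the experimental results reported in Table 1:

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of samples predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x); eps guards against log(0)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

acc = accuracy(tp=90, tn=85, fp=10, fn=15)       # 175 / 200 = 0.875
f1 = f1_score(tp=90, fp=10, fn=15)
loss = cross_entropy([1.0, 0.0], [0.9, 0.1])     # -log 0.9 for a one-hot label
```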
in order to verify the effectiveness of the design method of the invention, different classification models are compared in the same experimental environment, and the final results are shown in table 1.
Table 1 model evaluation result diagram
Model 1 is the result of connecting the output of BERT directly to a fully connected layer, which is then input to a Softmax classifier. Comparison with model 2 shows that after 10-fold cross-validation the accuracy of the model improves considerably, indicating that cross-validation is necessary. Under the same detection classifier framework, the BERT model can resolve the representation of ambiguous words, compared with static embedding models such as Word2Vec and Embedding; it uses a Transformer model with strong text-fusion capability as the substructure of the pre-trained model, greatly improving analysis capability. Compared with the single feature extraction of BERT alone, the hybrid model used in this method extracts text features better, achieves higher classification precision and improves all evaluation indexes. The final experimental results show that the proposed BERT-based ConvLSTM model is the best in terms of accuracy and loss function value, with an accuracy of 98.81% and a loss value of 0.03.
Step 6: and constructing a malicious software API call sequence detection system, and visualizing a detection result. The invention designs a system construction by using a PyQt5 library, and an interface of a malicious software API call sequence detection system is shown in FIG. 6.
Step 6.1: designing the system interface and selecting suitable layouts and controls. According to the requirements and functions of the API call sequence detection system, suitable PyQt controls such as QComboBox, QPushButton and QLabel are selected to help build a user interface meeting the system requirements. The PyQt framework is then used for interface layout design: QWidget is used to create the main window, and the other controls are placed inside it. The window is divided into different areas according to the system requirements and functions, and the corresponding controls are placed in each area. Event handlers are then added to the PyQt controls so that they can respond to user operations and realize the corresponding interactive functions. Finally, after the interface design is completed, initial testing and layout optimization of the visual interface are carried out.
Step 6.2: system functions are added. Mainly adding a selected uploading file, model selection, a predicted result display window, a predicted result image display window, a single-sequence predicted window and the like.
Step 6.2.1: realizing the file upload function. First, the QFileDialog module is imported into the Python code; QFileDialog is then used to create a file dialog in which the CSV file to be uploaded is selected; the getOpenFileName() method opens the file dialog and obtains the path and name of the selected file; finally, the API function call sequence of the file is visualized using a display function.
Step 6.2.2: the model selection function is implemented. Firstly, putting the trained models under a designated folder, then displaying the names of the models by using a drop-down frame in PyQt, and completing model selection through mouse selection.
Step 6.2.3: realizing the prediction result display window function. After the data set and model are selected, QWidget is used to create a prediction result display window; signals and slots connect the 'prediction result display' button with the corresponding result display function, and the prediction result is displayed. The name of the current model, the data size, the loss value on the test set and the accuracy can be obtained.
Step 6.2.4: realizing the prediction result image display window function. QWidget is likewise used to create a prediction result image display window, and a response event is set so that the test set accuracy curve and loss value curve of the current model are obtained after clicking the 'image display' button.
Step 6.2.5: realizing the single-sequence prediction window function; fig. 7 shows the single-sequence prediction window interface. The same file upload function as in step 6.2.1 is created; signals and slots connect the 'predict this line' event with the corresponding slot function to select a single sequence and display the API functions it contains; finally, the model selected in step 6.2.2 detects the single API sequence and gives the corresponding label and malicious family category.

Claims (8)

1. The malicious software detection method based on semantic analysis and bidirectional coding characterization is characterized by comprising the following steps of:
step 1: acquiring a malicious software data set, storing the malicious software data set in a CSV file form, and extracting an API function call sequence in the data set;
step 2: performing word vectorization on the API function call sequence obtained in step 1 by using a BERT model, thereby generating word-embedding features;
step 3: constructing a ConvLSTM neural network architecture;
step 4: training and optimizing ConvLSTM neural network architecture;
step 5: evaluating the ConvLSTM neural network architecture optimized in the step 4, and testing the efficiency and accuracy of malware detection;
three evaluation indexes for classifying problems are selected to evaluate the ConvLSTM neural network architecture, specifically, accuracy, F1 score and loss value;
Step 6: constructing a malicious software API call sequence detection system, and visualizing a detection result; and building a visual system platform by using a related library used for building a visual interface in Python.
2. The method for detecting malicious software based on semantic analysis and bi-directional coding characterization according to claim 1, wherein step 1 is specifically:
first, a malware dataset is downloaded, the dataset containing basic information for a plurality of malware, each basic information comprising the following features: sha256 hash value, label, header information, import function library, export function library, section information, character string information, sliding window entropy calculation, linker version, submission size, system version and subsystem version; wherein the imported function library contains a malicious software API function;
after the data set is obtained, aiming at basic information of each piece of malicious software, using a Python third party library to extract the API function of each piece of malicious software from the import function library, and simultaneously reserving the sequence of the API function in the import function library to obtain a sequence composed of the API functions, namely, one piece of malicious software corresponds to one piece of API function call sequence, and finally saving two fields of the family name of the malicious software and the corresponding API function call sequence into a CSV file.
3. The method for detecting malware based on semantic analysis and bi-directional coding characterization according to claim 1, wherein the BERT model in step 2 is composed of a plurality of Transformer layers, and the BERT model based on bi-directional coding characterization treats the API function call sequence of each malicious sample as a text sentence with contextual semantics; the step 2 is specifically as follows:
step 2.1: first performing Unicode normalization on the malware API call sequence in the CSV file, then tokenizing the API call sequence, dividing it into single characters or combined characters, and then applying a word segmentation algorithm to the tokenized text;
step 2.2: constructing an input sequence of the BERT model; adding special tags to the malware API call sequence, the tags including [ CLS ] and [ SEP ], wherein the [ CLS ] tag represents the beginning of the sequence, and the [ SEP ] tag is used for separating different sentences or paragraphs; then converting the tagged sequence into 768-dimensional embedded vectors and adding a position code to each embedded vector to represent the position of each API function in the call sequence; then using a special [ MASK ] tag to randomly replace the vector values corresponding to specific API functions in the position-encoded embedded vectors; finally, all the embedded vectors are grouped into batches and fed into the Transformer model for further processing;
Step 2.3: a multi-head attention mechanism is added to all the embedded vectors obtained in step 2.2.
4. A method for malware detection based on semantic analysis and bi-directional coding characterization according to claim 3, wherein step 2.3 is specifically:
step 2.3.1: generating a transformation matrix based on the embedded vector;
BERT divides the tagged embedded vectors into h parts, namely h attention heads; each attention head calculates an attention weight matrix characterizing the correlation between different words in the input; specifically, each attention head generates a query matrix Q, a key matrix K and a value matrix V through three linear transformations of the query vector, the key vector and the value vector, specifically:
Q = E·W_Q, K = E·W_K, V = E·W_V
wherein W_Q, W_K and W_V are the linear transformation matrices of the respective vectors, and E is the vector matrix formed by concatenating all input vectors;
step 2.3.2: calculating 1 part of embedded vector, namely a single attention mechanism corresponding to 1 attention head;
the query matrix Q, key matrix K and value matrix V obtained from the vector matrix E are used for calculation, so that a single self-attention output is obtained, specifically:
Attention(Q, K, V) = Softmax(QK^T/√d_k)·V
wherein QK^T is the similarity score between the query matrix Q and the transposed key matrix K^T; the scaling factor √d_k prevents the values of QK^T from becoming too large; the Softmax function normalizes each row vector after the operation and calculates the importance of each word to the other words;
step 2.3.3: calculating a multi-head attention mechanism of all embedded vectors;
after calculating the output of a single self-attention mechanism, the output of a multi-head attention mechanism is obtained as follows:
Attention_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
that is, a weighted average, namely Attention_i, is calculated by the attention mechanism; the i-th element represents the score obtained from the inner product of the i-th row of the query matrix QW_i^Q with all rows of the key matrix KW_i^K; the scores are normalized by a Softmax function to obtain weights, and a weighted sum with these weights finally yields Attention_i;
Step 2.3.4: splicing the output of the multi-head attention mechanism;
all the Attention matrixes are connected together according to columns to form a larger matrix, and the final output is obtained by linear transformation of the weight matrix, wherein the formula is as follows:
Y = Multi(Q, K, V) = concat(Attention_1, Attention_2, ..., Attention_8)·W
wherein Multi(Q, K, V) represents performing the multi-head attention calculation on the input query matrix Q, key matrix K and value matrix V; concat(·) represents connecting the Attention matrices obtained from the multiple heads by columns to form a larger matrix; W represents the weight matrix; multiplying the concatenated multi-head Attention matrix by the weight matrix yields the weighted and summed query vector;
Step 2.3.5: carrying out layer standardization processing on the weighted and summed query vectors; the method comprises the following two steps: the first step is to perform translation operation, namely adding a bias term bias to the result after linear transformation; the second step is to perform scaling operation, i.e. dividing by one standard deviation;
finally multiplying the normalized vector with a value matrix to obtain an attention score, and multiplying the attention score with the value matrix to obtain a final output vector;
step 2.3.6: and carrying out residual connection on the output vector and the input embedded vector, wherein the output vector after residual connection is used as the input of the next step.
5. The method for detecting malicious software based on semantic analysis and bi-directional coding characterization according to claim 1, wherein the constructing a ConvLSTM neural network architecture in step 3 specifically comprises:
combining a convolutional neural network CNN, a long and short-term memory network LSTM and an external attention mechanism to construct a ConvLSTM neural network architecture, taking the embedded vector after the residual connection in the step 2 as the input of the ConvLSTM architecture, and training the ConvLSTM architecture;
step 3.1: establishing a plurality of one-dimensional convolutional neural networks CNN for extracting local features;
the formula of CNN for local feature extraction is as follows:
p(i) = f( Σ_{x,y} w_{x,y}^i · c_{x,y} + b_i )
wherein p(i) represents the value of the i-th node in the unit matrix; w_{x,y}^i represents the weight of the filter input node (x, y) for the i-th node in the output unit node matrix, and b_i represents the bias term parameter corresponding to the i-th output node; c_{x,y} is the value of the node (x, y) in the filter; f is the activation function; the unit vector of all p(i) is the feature map obtained from the convolutional layer, denoted p, which is the input of the LSTM network architecture;
step 3.2: after the CNN network is built, an LSTM network architecture is further introduced to process the long sequence, and the LSTM uses a gating unit to realize the latest memory of the stored API call sequence, wherein the latest memory comprises a forgetting gate, an input gate and an output gate component; the gating unit decides which gate is used according to the sequence extracted by the CNN, and adjusts the triggering of the gate by utilizing an S-shaped activation function;
step 3.3: adding an external attention mechanism, weighting the output value of each time step, and reinforcing the weight of useful information; and finally constructing a ConvLSTM neural network architecture by fusing CNN, LSTM and an external attention mechanism, and completing detection and classification of the malicious software by applying a softmax function.
6. The method for detecting malicious software based on semantic analysis and bi-directional coding characterization according to claim 1, wherein step 4 is specifically:
Step 4.1: adding a cross-validation optimization model aiming at a ConvLSTM neural network architecture; firstly, randomly dividing an API function call data set after word vector into a training set and a testing set, and adding a cross verification method; in the cross validation process, firstly, randomly dividing an API function call sequence data set into a plurality of subsets with equal size, wherein one subset is used as a validation set, and the rest subset is used as a training set; the remaining subsets are then used to train the model for each subset and tests are performed on that subset; this process is repeated a plurality of times, and different subsets are selected as test sets each time, so that model performance indexes tested by using the different subsets are finally obtained;
step 4.2: EDA data enhancement method is added for the data set; EDA generates new training data by randomly transforming the original API sequence data; multiple transformations were performed on each sample, specifically using the following three transformations:
step 4.2.1: randomly inserting; randomly selecting an API function from a certain API call sequence, and inserting an automatically generated API function at the position;
step 4.2.2: randomly deleting; randomly selecting a function from the API call sequence and deleting the function from the function;
Step 4.2.3: random exchange; randomly selecting two adjacent API functions and exchanging their positions;
step 4.2.4: generating and storing a new data set; and mixing the transformed samples with the original samples to form a new data set, and storing the new data set in a file.
7. The method for detecting malicious software based on semantic analysis and bi-directional coding characterization according to claim 1, wherein step 6 is specifically:
step 6.1: designing a system interface; detecting system requirements and functions according to the API call sequence, and constructing a user interface meeting the system requirements; then, performing interface layout design by using a visualization technology in Python;
step 6.2: adding system functions; the functions comprise selecting an uploading file, selecting a model, displaying a predicted result image and displaying a single-sequence predicted window.
8. The method for detecting malicious software based on semantic analysis and bi-directional coding characterization according to claim 7, wherein step 6.2 is specifically:
step 6.2.1: selecting an uploading file function to realize; firstly, creating a dialogue box of a file by using a visualization component in python, setting the type of the uploaded file, then setting an action button for opening the dialogue box of the file, acquiring the path and the name of the selected file, and finally visualizing the data of the file by using a presentation function;
Step 6.2.2: realizing a model selection function; firstly, placing a trained model under a designated folder, displaying the name of the model by using a drop-down frame component, and completing model selection by mouse selection;
step 6.2.3: realizing a prediction result display window function; after the data set and the model are selected, a visual component is used for creating a predicted result display window, a predicted result display button is created, and the model loss value and the accuracy are obtained, so that the predicted result button is clicked to obtain the model loss value and the model accuracy and display the model loss value and the model accuracy in the window;
step 6.2.4: the function of a prediction result image display window is realized; creating a predicted result image display window by using a visualization component, creating a display result image button, and acquiring a loss value and an accuracy image, so that clicking the result image button displays the loss value and the accuracy image;
step 6.2.5: realizing a single sequence prediction window function; a visualization component is used to create a single sequence prediction display window and create a display single sequence prediction result display button, and single sequence predictions are obtained so that the single sequence prediction button displays the corresponding tag and malicious family category of the sequence.
CN202310588930.2A 2023-05-24 2023-05-24 Malicious software detection method based on semantic analysis and bidirectional coding characterization Pending CN116432184A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310588930.2A CN116432184A (en) 2023-05-24 2023-05-24 Malicious software detection method based on semantic analysis and bidirectional coding characterization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310588930.2A CN116432184A (en) 2023-05-24 2023-05-24 Malicious software detection method based on semantic analysis and bidirectional coding characterization

Publications (1)

Publication Number Publication Date
CN116432184A true CN116432184A (en) 2023-07-14

Family

ID=87087521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310588930.2A Pending CN116432184A (en) 2023-05-24 2023-05-24 Malicious software detection method based on semantic analysis and bidirectional coding characterization

Country Status (1)

Country Link
CN (1) CN116432184A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117354067A (en) * 2023-12-06 2024-01-05 南京先维信息技术有限公司 Malicious code detection method and system
CN117354067B (en) * 2023-12-06 2024-02-23 南京先维信息技术有限公司 Malicious code detection method and system
CN117807603A (en) * 2024-02-29 2024-04-02 浙江鹏信信息科技股份有限公司 Software supply chain auditing method, system and computer readable storage medium
CN117807603B (en) * 2024-02-29 2024-04-30 浙江鹏信信息科技股份有限公司 Software supply chain auditing method, system and computer readable storage medium
CN118171273A (en) * 2024-03-11 2024-06-11 北京中科网芯科技有限公司 Malicious code detection method and system
CN118171273B (en) * 2024-03-11 2024-08-09 北京中科网芯科技有限公司 Malicious code detection method and system

Similar Documents

Publication Publication Date Title
CN112487807B Text relation extraction method based on dilated gated convolutional neural network
CN107506414A Code recommendation method based on long short-term memory network
CN110598206A (en) Text semantic recognition method and device, computer equipment and storage medium
CN116432184A (en) Malicious software detection method based on semantic analysis and bidirectional coding characterization
CN110532353 Text entity matching method, system and device based on deep learning
CN113010683B (en) Entity relationship identification method and system based on improved graph attention network
CN111062036A (en) Malicious software identification model construction method, malicious software identification medium and malicious software identification equipment
CN111931935B (en) Network security knowledge extraction method and device based on One-shot learning
CN113742733B Method and device for reading-comprehension-based vulnerability event trigger word extraction and vulnerability type identification
Nowotny Two challenges of correct validation in pattern recognition
CN114297079A XSS fuzz test case generation method based on temporal convolutional network
CN116361788A (en) Binary software vulnerability prediction method based on machine learning
CN115776401A Method and device for tracing network attack events based on few-shot learning
CN112613032B (en) Host intrusion detection method and device based on system call sequence
CN112132269B (en) Model processing method, device, equipment and storage medium
US20210326664A1 (en) System and Method for Improving Classification in Adversarial Machine Learning
Sha et al. Rationalizing predictions by adversarial information calibration
CN117879934A (en) SQL injection attack detection method based on network data packet context
CN117171746A (en) Malicious code homology analysis method and device, electronic equipment and storage medium
CN113822018B (en) Entity relation joint extraction method
US11755570B2 (en) Memory-based neural network for question answering
CN115018627A (en) Credit risk evaluation method and device, storage medium and electronic equipment
Zhang et al. MTSCANet: Multi temporal resolution temporal semantic context aggregation network
Dai et al. [Retracted] Anticoncept Drift Method for Malware Detector Based on Generative Adversarial Network
CN112860573A (en) Smartphone malicious software detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination