Disclosure of Invention
The invention provides a malicious software detection method, a malicious software detection device, an electronic device, a storage medium and a computer program product, which address the shortcoming of the prior art that different risk levels of the same API function in different operating environments cannot be detected.
The invention provides a malicious software detection method, which comprises the following steps: preprocessing a behavior log of the software to be tested to obtain static attribute information of the software to be tested and dynamic behavior API sequences of all APIs in the software to be tested;
performing static characteristic extraction on the static attribute information of the software to be tested to obtain the static characteristics of the software to be tested;
extracting dynamic characteristics of the dynamic behavior API sequence of each API in the software to be tested to obtain the dynamic characteristics of each API in the software to be tested;
inputting the static characteristics of the software to be tested and the dynamic characteristics of each API in the software to be tested into a pre-trained malicious software detection model to obtain a detection result corresponding to the software to be tested;
the malicious software detection model is obtained by training on the static characteristics of sample software, the dynamic characteristics of each API in the sample software, and the label information of the sample software.
According to the malware detection method provided by the invention, the dynamic features comprise an API category vector and an API semantic vector, and the API semantic vector at least comprises one or more of a file path information vector, a registry information vector and a network behavior information vector.
According to the malware detection method provided by the invention, the dynamic feature extraction is performed on the dynamic behavior API sequence of each API in the software to be tested to obtain the dynamic feature of each API in the software to be tested, and the method comprises the following steps:
obtaining API category vectors of all APIs in the software to be tested according to API name type information in the dynamic behavior API sequence of all APIs in the software to be tested;
performing semantic analysis on parameters in the dynamic behavior API sequence of each API in the software to be tested to obtain an API semantic vector of each API in the software to be tested; wherein the parameters comprise at least one of file path information, registry modification location information, and IP address information and/or domain name information;
and splicing the API category vector of each API in the software to be tested with the API semantic vector of each API in the software to be tested to obtain the dynamic characteristics of each API in the software to be tested.
According to the malware detection method provided by the invention, semantic analysis is performed on parameters in a dynamic behavior API sequence of each API in the software to be tested to obtain an API semantic vector of each API in the software to be tested, and the method comprises the following steps:
under the condition that the API has a file modification function, acquiring file path information from the dynamic behavior API sequence;
performing character level segmentation on the file path information;
inputting the segmented file path information into a pre-trained file path extraction model to obtain a file path information vector;
the file path extraction model is obtained by training based on file path information after character level segmentation of the sample file and label information of the sample file;
and/or,
under the condition that the API has a registry modification function, acquiring registry modification location information from the dynamic behavior API sequence;
performing character level segmentation on the registry modification location information;
inputting the segmented registry modification location information into a pre-trained registry extraction model to obtain the registry information vector;
the registry extraction model is obtained by training based on the registry modification location information after character level segmentation of a sample registry and the label information of the sample registry;
and/or,
under the condition that the API has a network access behavior, acquiring IP address information and/or domain name information from the dynamic behavior API sequence;
performing character level segmentation on the IP address information and/or the domain name information;
inputting the segmented IP address information and/or domain name information into a pre-trained network behavior extraction model to obtain the network behavior information vector;
the network behavior extraction model is obtained by training based on character level segmentation results of sample IP address information and/or sample domain name information and label information of the sample IP address information and/or the sample domain name information.
According to the malware detection method provided by the invention, the obtaining of the API category vector of each API in the software to be tested according to the API name type information in the dynamic behavior API sequence of each API in the software to be tested comprises the following steps:
and converting the API name type information in the dynamic behavior API sequence of each API in the software to be tested into the API category vector of each API in the software to be tested by using a word vector model.
According to the malware detection method provided by the invention, before the preprocessing of the behavior log of the software to be tested, the method further comprises the following steps:
and running the software to be tested in the sandbox environment to obtain a behavior log of the software to be tested.
The present invention also provides a malware detection apparatus, including: the log preprocessing module is used for preprocessing the behavior log of the software to be tested to obtain the static attribute information of the software to be tested and the dynamic behavior API sequence of each API in the software to be tested;
the static feature extraction module is used for carrying out static feature extraction on the static attribute information of the software to be tested to obtain the static features of the software to be tested;
the dynamic characteristic extraction module is used for extracting dynamic characteristics of dynamic behavior API sequences of all APIs in the software to be tested to obtain the dynamic characteristics of all APIs in the software to be tested;
the software detection module is used for inputting the static characteristics of the software to be tested and the dynamic characteristics of each API in the software to be tested into a pre-trained malicious software detection model to obtain a detection result corresponding to the software to be tested;
the malicious software detection model is obtained by training on the static characteristics of sample software, the dynamic characteristics of each API in the sample software, and the label information of the sample software.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of any one of the above malware detection methods.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the malware detection methods described above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the malware detection method as described in any one of the above.
The invention provides a malicious software detection method, a device, an electronic device, a medium and a product. The malicious software detection method preprocesses a behavior log of the software to be tested to obtain static attribute information and dynamic behavior API sequences, extracts features from each to obtain the static characteristics and the dynamic characteristics of each API, and inputs these into a malicious software detection model to obtain a detection result indicating whether the software to be tested is malicious software. The method uses not only the time sequence information of the API sequences but also the parameter information contained in the API functions. For example, the same API function can have different actual meanings when it operates on different files; in that case, characteristics representing different risk levels can be extracted from the parameter information. Malicious software can therefore be detected more accurately, which reduces the false alarm rate of the malicious software detection model and improves the detection rate.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the prior art, the API call features extracted by deep-learning-based dynamic behavior analysis are all based on information such as the number of API calls and the API call sequence. Under this condition, dynamic detection can detect the risk levels corresponding to different API functions, but it cannot detect different risk levels of the same API function in different application environments. For example, NtReadFile reads data from an opened file. When the file being read is an ordinary file, the risk level is low; when the file path points to a user privacy file, the call is likely to be stealing user privacy data, and the risk level is high. Ignoring this difference leads to inaccurate detection results.
In view of the above problem, an embodiment of the present invention provides a method for detecting malware, which is specifically described below with reference to fig. 1 of the accompanying drawings.
Fig. 1 is a flowchart illustrating a malware detection method according to an embodiment of the present invention; as shown in fig. 1, a malware detection method provided in an embodiment of the present invention includes the following steps:
Step 101, preprocessing a behavior log of the software to be tested to obtain static attribute information of the software to be tested and dynamic behavior API sequences of all APIs in the software to be tested.
In this step, the behavior log of the software to be tested is obtained by running the software to be tested in a simulation environment; the behavior log is then preprocessed, and the static attribute information and the dynamic behavior API sequences are extracted from it.
The behavior log is JSON-structured data that records the API information called by the software to be tested while it runs, including the type of each API and the related parameters of each call.
The static attribute information refers to static attribute information of the PE file, and specifically includes PE file section table information, PE file resource information, PE file import/export table information, PE file PDB information, Office file macro code information, mail content information, PDF file information, picture information, release file information, and the like. The static attribute information may be obtained through a malware analysis tool or through sandbox automated analysis, which is not limited in this embodiment.
The dynamic behavior API sequence refers to the API call sequence obtained by recording the API calls of the software to be tested while it runs. It may be obtained directly from an API monitoring tool of the system on which the software runs, or through other related technologies (for example, API hooking), which is not limited in this embodiment.
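As an illustration of the preprocessing in step 101, the following is a minimal sketch in Python. The log layout is an assumption made for the example: the keys `static`, `calls`, `api` and `args`, and the argument names `filepath`, `regkey` and `ip`, are hypothetical and are not defined by this embodiment.

```python
import json

# Hypothetical behavior log; the key names are assumptions for illustration only.
SAMPLE_LOG = r"""
{
  "static": {"pe_sections": [".text", ".data"], "imports": ["kernel32.dll"]},
  "calls": [
    {"api": "NtReadFile", "args": {"filepath": "C:\\Users\\a\\notes.txt"}},
    {"api": "RegSetValueExA", "args": {"regkey": "HKEY_CURRENT_USER\\Software\\Run"}},
    {"api": "connect", "args": {"ip": "203.0.113.7"}}
  ]
}
"""

def preprocess(log_text):
    """Split a JSON behavior log into static attribute information and
    an ordered list of (API name, parameters) pairs."""
    log = json.loads(log_text)
    static_info = log.get("static", {})
    api_sequence = [(call["api"], call.get("args", {}))
                    for call in log.get("calls", [])]
    return static_info, api_sequence

static_info, api_sequence = preprocess(SAMPLE_LOG)
```

The call order is preserved in `api_sequence`, so the time sequence information of the API calls remains available to later steps.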
Step 102, performing static characteristic extraction on the static attribute information of the software to be tested to obtain the static characteristics of the software to be tested.
In this step, the PE file features are extracted as static features from static attribute information such as PE file section table information, PE file resource information, PE file import/export table information, PE file PDB information, Office file macro code information, mail content information, PDF file information, picture information, release file information, and the like.
Step 103, performing dynamic feature extraction on the dynamic behavior API sequence of each API in the software to be tested to obtain the dynamic features of each API in the software to be tested.
In this step, as described above, the behavior log includes not only the called API information but also the type of each API and the related parameters of each call. During dynamic feature extraction, the API name type information can therefore be obtained directly from the dynamic behavior API sequence (obtained by preprocessing the behavior log), and the API category vector can be extracted from the API name type information.
When extracting features from the related parameters of a call, each dynamic behavior API sequence is examined to detect whether the API involves a file operation behavior, a network connection behavior or a registry operation behavior. Once it is determined whether the API has a file modification function, a network access function or a registry modification function, the dynamic behavior API sequence is processed according to that function, and dynamic features carrying semantic information are extracted from it.
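The behavior check described above can be sketched as a small dispatch on the parameters of each call. This is only an illustration: the argument keys `filepath`, `regkey`, `ip` and `domain` are hypothetical names, and a real sandbox log would use its own field names.

```python
# Hypothetical argument keys mapped to the behavior they indicate.
PARAM_KINDS = {
    "filepath": "file",
    "regkey": "registry",
    "ip": "network",
    "domain": "network",
}

def semantic_parameters(args):
    """Group the arguments of one API call by the behavior they describe.
    Arguments that match none of the three behaviors are ignored here."""
    grouped = {}
    for key, value in args.items():
        kind = PARAM_KINDS.get(key)
        if kind is not None:
            grouped.setdefault(kind, []).append(value)
    return grouped
```

A call whose arguments fall into none of the three groups yields an empty dictionary, which corresponds to an API with no file, registry or network behavior.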
Step 104, inputting the static characteristics of the software to be tested and the dynamic characteristics of each API in the software to be tested into a pre-trained malicious software detection model to obtain a detection result corresponding to the software to be tested.
In this step, the malware detection model is obtained by training on the static features of the sample software, the dynamic features of each API in the sample software, and the label information of the sample software.
In this embodiment, the malware detection model is a simple classifier that detects whether the software to be tested is malware, and the detection result is either malware or not malware. Correspondingly, in the training process of the malware detection model, the label of each piece of sample software indicates whether it is normal software or malware. The specific training process is as follows: the sample software and its corresponding labels are obtained in advance, and a malware detection network is constructed; features are extracted from the sample software as in steps 101 to 103 to obtain the static features and dynamic features corresponding to the sample software; these features are input into the malware detection network for training until the network converges, thereby obtaining the trained malware detection model.
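The training loop described above can be illustrated with a deliberately small stand-in. The sketch below uses logistic regression trained by gradient descent in place of the malware detection network, and the four feature vectors and labels are made-up toy data; neither the model choice nor the data comes from this embodiment.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(features, labels, lr=0.5, epochs=500):
    """Fit a logistic-regression classifier by batch gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(x, w, b):
    """Return 1 (malware) or 0 (normal software)."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

# Toy software features and labels (1 = malware), invented for the example.
X = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
y = [0, 0, 1, 1]
w, b = train(X, y)
```

In practice the feature vectors would be the integrated static and dynamic features of the sample software, and training would stop on a convergence criterion rather than after a fixed number of epochs.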
The malicious software detection method provided by the embodiment of the invention preprocesses the behavior log of the software to be tested to obtain static attribute information and dynamic behavior API sequences, extracts features from each to obtain the static characteristics and the dynamic characteristics of each API, and inputs these into the malicious software detection model to obtain a detection result indicating whether the software to be tested is malicious software. The method uses not only the time sequence information of the API sequences but also the parameter information contained in the API functions. For example, the same API function can have different actual meanings when it operates on different files; in that case, characteristics representing different risk levels can be extracted from the parameter information. Malicious software is thereby detected more accurately, the false alarm rate of the malicious software detection model is reduced, and the detection rate is improved.
In this embodiment, as shown in fig. 2, the extracted static features of the software to be tested and the dynamic features of each API in the software to be tested may also be applied to a malicious family classification task, so as to accurately determine the risk levels corresponding to different behaviors of the same API. Specifically, the static features and the dynamic features are input into a pre-trained malicious family classification model to determine which malicious family the software to be tested belongs to. The malicious family classification model is trained on the static features and dynamic features extracted from sample software and on the malicious family type label corresponding to the sample software. The malicious family types may include, among others, the macro virus family, the CIH virus family, the worm family, the Trojan horse family, and the like. The malicious family classification model is a neural network model, such as a convolutional neural network or a recurrent neural network.
Further, the dynamic features include an API category vector and an API semantic vector including at least one or more of a file path information vector, a registry information vector, and a network behavior information vector.
The API category vector is obtained by extracting features from the API types in the behavior log, and the API semantic vector is obtained by extracting features from the related parameters generated during the calls. The content of the API semantic vector is determined by the functions of the APIs called while the software to be tested runs. If only an API with a network access function is called, the related parameters contain only network access parameters, and the extracted API semantic vector contains only the network behavior information vector. If the software to be tested calls both an API with a network access function and an API with a file modification function, the related parameters include both network access parameters and file modification parameters, and the API semantic vector then includes the file path information vector and the network behavior information vector.
According to the malicious software detection method provided by the embodiment of the invention, an API category vector and an API semantic vector are obtained, and the API semantic vector comprises one or more of the file path information vector, the registry information vector and the network behavior information vector, so that both the API category and the semantic information of an API of that category in different environments can be determined. Combining the category and the semantic information allows more accurate detection of whether the software to be tested is malicious software.
Further, the dynamic feature extraction of the dynamic behavior API sequence of each API in the software to be tested to obtain the dynamic feature of each API in the software to be tested includes:
obtaining API category vectors of all APIs in the software to be tested according to API name type information in the dynamic behavior API sequence of all APIs in the software to be tested;
performing semantic analysis on parameters in the dynamic behavior API sequence of each API in the software to be tested to obtain an API semantic vector of each API in the software to be tested; wherein the parameters comprise at least one of file path information, registry modification location information, and IP address information and/or domain name information;
and splicing the API category vector of each API in the software to be tested with the API semantic vector of each API in the software to be tested to obtain the dynamic characteristics of each API in the software to be tested.
Specifically, the behavior log includes the API name category, which is directly converted into an API category vector.
While running, the software to be tested calls different APIs to realize different functions, so the parameters generated differ. For example, if the software to be tested only performs network access while running, the parameters in the dynamic behavior API sequence only include network connection behavior parameters (IP address information only, domain name information only, or both), and semantic analysis of these parameters yields a network behavior information vector. When the software to be tested calls APIs with different functions, the parameters include the information corresponding to each of those APIs, so that semantic information can be obtained for the same API when it is used for different purposes.
In this embodiment, as shown in fig. 3, the API category vector and the API semantic vector are spliced to obtain a dynamic feature carrying both API category information and API semantic information. Two calls of the same category therefore yield different dynamic features when their operating environments differ.
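A minimal sketch of the splicing step follows. Two details are assumptions made for the example rather than part of this embodiment: each semantic sub-vector is given a fixed width `SEMANTIC_DIM`, and a sub-vector that is absent for a given call is filled with zeros so that every dynamic feature has the same length.

```python
SEMANTIC_DIM = 4  # assumed width of each semantic sub-vector

def splice(category_vec, file_vec=None, registry_vec=None, network_vec=None):
    """Concatenate the API category vector with the three semantic
    sub-vectors; missing sub-vectors are zero-filled (an assumption)."""
    zeros = [0.0] * SEMANTIC_DIM
    feature = list(category_vec)
    for part in (file_vec, registry_vec, network_vec):
        feature.extend(zeros if part is None else part)
    return feature

# Same API category, two different file paths (hypothetical vectors).
category = [0.1, 0.2]
ordinary_file = splice(category, file_vec=[1.0, 0.0, 0.0, 0.0])
private_file = splice(category, file_vec=[0.0, 1.0, 0.0, 0.0])
```

The two results share the same leading category components but differ in the file path components, which is what lets a downstream model assign different risk levels to the same API.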
The malicious software detection method provided by the invention can accurately judge different risk levels of the same API under different operating environments based on the dynamic characteristics, thereby improving the detection accuracy and reducing the false detection rate.
Further, performing semantic analysis on parameters in the dynamic behavior API sequence of each API in the software to be tested to obtain an API semantic vector of each API in the software to be tested, including:
under the condition that the API has a file modification function, acquiring file path information from the dynamic behavior API sequence;
performing character level segmentation on the file path information;
inputting the segmented file path information into a pre-trained file path extraction model to obtain a file path information vector;
the file path extraction model is obtained by training based on file path information after character level segmentation of the sample file and label information of the sample file;
and/or,
under the condition that the API has a registry modification function, acquiring registry modification location information from the dynamic behavior API sequence;
performing character level segmentation on the registry modification location information;
inputting the segmented registry modification location information into a pre-trained registry extraction model to obtain the registry information vector;
the registry extraction model is obtained by training based on the registry modification location information after character level segmentation of a sample registry and the label information of the sample registry;
and/or,
under the condition that the API has a network access behavior, acquiring IP address information and/or domain name information from the dynamic behavior API sequence;
performing character level segmentation on the IP address information and/or the domain name information;
inputting the segmented IP address information and/or domain name information into a pre-trained network behavior extraction model to obtain the network behavior information vector;
the network behavior extraction model is obtained by training based on character level segmentation results of sample IP address information and/or sample domain name information and label information of the sample IP address information and/or the sample domain name information.
Specifically, if the API relates to file modification, segmenting the file operation path information into character levels, and inputting the segmented file path information into a pre-trained file path extraction model to obtain the file path information vector.
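Character level segmentation can be sketched as follows. The integer encoding step is an assumption added for illustration, since a neural extraction model typically consumes character ids rather than raw characters; the helper names `build_vocab` and `encode` are hypothetical.

```python
def char_segment(text):
    """Split a parameter string (file path, registry key, IP address or
    domain name) into single characters."""
    return list(text)

def build_vocab(samples):
    """Assign an integer id to every character seen in the samples.
    Id 0 is reserved for characters not seen during training."""
    vocab = {}
    for sample in samples:
        for ch in char_segment(sample):
            vocab.setdefault(ch, len(vocab) + 1)
    return vocab

def encode(text, vocab):
    """Convert a string into the id sequence fed to an extraction model."""
    return [vocab.get(ch, 0) for ch in char_segment(text)]
```

The same three helpers apply unchanged to registry modification location information, IP addresses and domain names.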
The file path extraction model may be a recurrent neural network (RNN) model, a long short-term memory (LSTM) network model, a gated recurrent unit (GRU) network model (a variant of LSTM), or another LSTM variant, which is not limited in this embodiment. The file path extraction model is obtained by training based on file path information after character level segmentation of the sample file and label information of the sample file.
If the API relates to registry modification, the registry modification location information is segmented at the character level, and the segmented registry modification location information is input into a pre-trained registry extraction model to obtain the registry information vector.
The registry extraction model is similar to the file path extraction model, and may be an RNN model, an LSTM network model, a GRU network model, or another LSTM variant, which is not limited in this embodiment. The registry extraction model is obtained by training based on the registry modification location information after character level segmentation of the sample registry and the label information of the sample registry.
If the API relates to network access, character-level segmentation is performed on the IP address and/or the domain name: if the call involves only an IP address, only the IP address is segmented; if it involves only a domain name, only the domain name is segmented; if it involves both, both are segmented. The segmented IP address information and/or domain name information is then input into a pre-trained network behavior extraction model to obtain the network behavior information vector.
The network behavior extraction model is similar to the file path extraction model and the registry extraction model, and may be an RNN model, an LSTM network model, a GRU network model, or another LSTM variant, which is not limited in this embodiment. The network behavior extraction model is obtained by training based on the character level segmentation results of sample IP address information and/or sample domain name information and the label information of the sample IP address information and/or the sample domain name information.
The three models of the file path extraction model, the registry extraction model and the network behavior extraction model may be the same neural network or different neural networks, which is not limited in this embodiment.
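To show how a recurrent extraction model turns a character id sequence into a fixed-length information vector, the sketch below implements the forward pass of a plain (Elman-style) RNN. It is not a trained model: the weights are random values drawn with a fixed seed, and the parameter layout (`embed`, `w_h`) is an assumption made for the example.

```python
import math
import random

def make_rnn(vocab_size, hidden, seed=0):
    """Create untrained RNN parameters with random weights (illustration only)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"embed": matrix(vocab_size, hidden),
            "w_h": matrix(hidden, hidden),
            "hidden": hidden}

def rnn_encode(ids, params):
    """Run the id sequence through the RNN and return the final hidden
    state, which serves as the information vector."""
    size = params["hidden"]
    h = [0.0] * size
    for i in ids:
        x = params["embed"][i]  # embedding of the current character
        h = [math.tanh(x[j] + sum(params["w_h"][j][k] * h[k] for k in range(size)))
             for j in range(size)]
    return h

params = make_rnn(vocab_size=5, hidden=3)
vector = rnn_encode([1, 2, 3], params)
```

The output length equals the hidden size regardless of how long the input path is, which is what allows paths of different lengths to be spliced into features of the same size. An LSTM or GRU would replace the update line with gated updates but keep the same interface.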
In this embodiment, as shown in fig. 4, the parameters recorded in the behavior log are generated by calls to APIs with a registry modification function, a file modification function, or a network access function. Character-level segmentation is performed on the registry key name, the file path, the IP address, and the URL domain name, and the segmented data is input into the intelligent semantic analysis models (i.e., the file path extraction model, the registry extraction model, and the network behavior extraction model) to obtain the behavior information vector (i.e., the API semantic vector) of each API.
The malicious software detection method provided by the invention can comprehensively cover all conditions influencing the risk level of the API by extracting the characteristics of the API with different functions, and obtains the semantic information of the API under different operating environments from the parameters generated in the calling process of the API, thereby improving the accuracy of malicious software detection.
Further, the obtaining of the API category vector of each API in the software to be tested according to the API name type information in the dynamic behavior API sequence of each API in the software to be tested includes:
and converting the API name type information in the dynamic behavior API sequence of each API in the software to be tested into the API category vector of each API in the software to be tested by using a word vector model.
Wherein, the word vector model can be any one of word2vec, GloVe, ELMo or BERT, and is used for converting words in natural language into dense vectors. In this embodiment, API name type information is converted into an API category vector by the word2vec method.
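Once a static word vector model such as word2vec has been trained, applying it to an API name amounts to a table lookup. The sketch below shows only that lookup; the three-dimensional vectors are made-up values standing in for a trained model, and returning a zero vector for unknown API names is an assumption.

```python
# Hypothetical embeddings; real values would come from a trained word vector model.
EMBEDDINGS = {
    "NtReadFile": [0.12, -0.40, 0.33],
    "RegSetValueExA": [-0.25, 0.18, 0.07],
    "connect": [0.51, 0.02, -0.19],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumed fallback for API names not in the table

def api_category_vector(api_name):
    """Look up the dense vector for an API name."""
    return EMBEDDINGS.get(api_name, UNKNOWN)
```

Contextual models such as ELMo or BERT do not reduce to a fixed table, so this lookup view applies only to static embeddings like word2vec or GloVe.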
By the malicious software detection method provided by the invention, path parameter information contained in the API sequence can be extracted, dynamic behavior semantic vectors are output, different behaviors of the same API are represented by dynamic characteristics, and the accuracy of malicious software detection is improved.
Further, before the preprocessing the behavior log of the software to be tested, the method further comprises: and running the software to be tested in the sandbox environment to obtain a behavior log of the software to be tested.
In this embodiment, a simulation environment in which software to be tested runs is constructed by a sandbox. The sandbox is a virtual system program used for testing the behavior of an untrusted file or an application program and the like.
According to the malicious software detection method, the behavior log is obtained in the sandbox environment, and static and dynamic feature extraction is performed on the basis of the behavior log. The software to be tested is thus run in isolation while it is being analyzed, which prevents it from affecting the host system and improves the safety of the system.
In this embodiment, as shown in fig. 5, before the static features of the software to be tested and the dynamic features of each API in the software to be tested are input into the pre-trained malware detection model, the static features further need to be preprocessed, which specifically includes: converting discrete-valued features into numerical features through a mapping dictionary; normalizing continuous numerical features (namely mapping the values to the interval 0-1); and replacing missing values with the mode or the mean.
Missing data is data that should have been extracted but was not, because the software to be tested behaved abnormally while running; it can be replaced by the mean computed over a large amount of data.
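The three preprocessing operations can be sketched as follows. The choice of min-max scaling for the mapping to the interval 0-1, and the use of `None` to mark a missing value, are assumptions made for the example.

```python
from statistics import mean, mode

def map_discrete(values):
    """Convert discrete values to integers through a mapping dictionary."""
    mapping = {}
    encoded = [mapping.setdefault(v, len(mapping)) for v in values]
    return encoded, mapping

def fill_missing(values, strategy="mean"):
    """Replace missing entries (None) with the mean or the mode of the rest."""
    present = [v for v in values if v is not None]
    fill = mean(present) if strategy == "mean" else mode(present)
    return [fill if v is None else v for v in values]

def min_max(values):
    """Map continuous values to the interval 0-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Missing values should be filled before scaling, since `min` and `max` cannot be computed over a list that still contains `None`.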
After the preprocessed static features and the spliced dynamic features are obtained, the preprocessed static features and the spliced dynamic features are integrated to obtain software features of the software to be tested, and the software features are input into a malicious software detection model to obtain a detection result corresponding to the software to be tested.
In this embodiment, as shown in fig. 2, if the parameters in the dynamic behavior API sequence include other information in addition to the file path information, the location information of the registry modification, the IP address information, and the domain name information, the features of the other information are also spliced with the API category vector and the API semantic vector to obtain the dynamic features.
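As a rough illustration of the splicing step, the following Python sketch treats splicing as plain vector concatenation, which is an assumption. The function name `splice_dynamic_feature` and the example values are hypothetical.

```python
def splice_dynamic_feature(category_vec, semantic_vecs, other_vec=None):
    """Concatenate the API category vector, the semantic sub-vectors and,
    if present, the features of any other parameter information."""
    feature = list(category_vec)
    for vec in semantic_vecs:
        feature.extend(vec)
    if other_vec:
        feature.extend(other_vec)  # only appended when other information exists
    return feature
```

Calling `splice_dynamic_feature([1, 0], [[0.2, 0.8], [0.5]], [3.0])` would yield a single flat vector in which the category entries come first, followed by the semantic entries and then the extra features.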
In the following, the malware detection apparatus provided by the present invention is described, and the malware detection apparatus described below and the malware detection method described above may be referred to in correspondence with each other.
Fig. 6 is a schematic structural diagram of a malware detection apparatus according to an embodiment of the present invention; as shown in fig. 6, a malware detection apparatus includes:
the log preprocessing module 610 is configured to preprocess the behavior log of the software to be tested, and obtain static attribute information of the software to be tested and a dynamic behavior API sequence of each API in the software to be tested.
Specifically, the log preprocessing module 610 runs the software to be tested in the simulation environment to obtain a behavior log of the software to be tested, preprocesses the behavior log, and extracts static attribute information and a dynamic behavior API sequence from the behavior log.
The static attribute information refers to the static attributes of the PE file and specifically includes PE file section table information, PE file resource information, PE file import/export table information, PE file PDB information, Office file macro code information, mail content information, PDF file information, picture information, release file information, and the like. The static attribute information may be obtained through a malware analysis tool or through automated sandbox analysis, which is not limited in this embodiment.
The dynamic behavior API sequence refers to the API call sequence obtained by recording the API calls made by the software to be tested while it runs. The dynamic behavior API sequence may be obtained directly from an API monitoring tool of the system on which the software runs, or may be obtained by other related technologies (for example, an API hooking technology), which is not limited in this embodiment.
The static feature extraction module 620 is configured to perform static feature extraction on the static attribute information of the software to be tested, so as to obtain the static features of the software to be tested.
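The idea of recording calls in order can be shown with a small in-process analogy in Python. This is not an operating-system API hook. The decorator `record_calls`, the list `call_log` and the two wrapped functions are hypothetical stand-ins used only to show how a call sequence with its parameters might be captured.

```python
import functools

call_log = []  # ordered record of calls, i.e. a toy "API sequence"


def record_calls(func):
    """Wrap a function so that every call is appended to call_log."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_log.append({"api": func.__name__, "args": args, "kwargs": kwargs})
        return func(*args, **kwargs)
    return wrapper


@record_calls
def open_file(path, mode):
    # Stand-in for a file operation; returns a fake handle.
    return f"handle:{path}"


@record_calls
def connect_host(ip, port):
    # Stand-in for a network connection; always "succeeds".
    return True
```

After `open_file("a.txt", "w")` and `connect_host("10.0.0.1", 80)` are called, `call_log` holds two entries in call order, each with the function name and the arguments that were passed.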
Specifically, the static feature extraction module 620 extracts the PE file features as static features from static attribute information such as PE file section table information, PE file resource information, PE file import/export table information, PE file PDB information, Office file macro code information, mail content information, PDF file information, picture information, release file information, and the like.
The dynamic feature extraction module 630 is configured to perform dynamic feature extraction on the dynamic behavior API sequence of each API in the software to be tested, so as to obtain the dynamic feature of each API in the software to be tested.
Specifically, the dynamic feature extraction module 630 directly obtains API name type information from the dynamic behavior API sequence (obtained by preprocessing the behavior log), and extracts an API category vector from the API name type information.
When extracting features from the relevant parameters of the calling process, each dynamic behavior API sequence needs to be examined to detect whether the API involves a file operation behavior, a network connection behavior or a registry operation behavior. After it is determined whether the API has a file modification function, a network access function or a registry modification function, the dynamic behavior API sequences corresponding to APIs with different functions are processed accordingly, and dynamic features carrying semantic information are extracted from them.
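The detection and dispatch described above might look roughly like the following Python sketch. The grouping of API names into sets, the record layout (`{"api": ..., "args": {...}}`) and the parameter keys (`path`, `key`, `ip`, `domain`) are all assumptions made for illustration. The API names are examples of Windows functions, not a list taken from the embodiment.

```python
# Illustrative groupings only; a real deployment would use a fuller list.
FILE_APIS = {"CreateFileW", "WriteFile", "DeleteFileW"}
REGISTRY_APIS = {"RegSetValueExW", "RegCreateKeyExW", "RegDeleteKeyW"}
NETWORK_APIS = {"connect", "send", "InternetOpenUrlW"}

# Which logged parameters carry semantic information for each behavior type.
PARAM_KEYS = {
    "file": ("path",),
    "registry": ("key",),
    "network": ("ip", "domain"),
}


def classify_call(api_name):
    """Return the behavior type of an API call based on its name."""
    if api_name in FILE_APIS:
        return "file"
    if api_name in REGISTRY_APIS:
        return "registry"
    if api_name in NETWORK_APIS:
        return "network"
    return "other"


def extract_semantic_params(call):
    """Pick out the parameters relevant to the call's behavior type."""
    kind = classify_call(call["api"])
    keys = PARAM_KEYS.get(kind, ())
    params = {k: call["args"][k] for k in keys if k in call["args"]}
    return kind, params
```

With this layout, a logged `CreateFileW` call would be routed to the file branch and only its path would be kept for later vectorization, while a `connect` call would keep its IP address or domain name.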
The software detection module 640 is configured to input the static features of the software to be tested and the dynamic features of each API in the software to be tested into a pre-trained malware detection model, so as to obtain a detection result corresponding to the software to be tested.
The malicious software detection model is obtained by training label information of sample software based on static characteristics of the sample software and dynamic characteristics of each API in the sample software.
Specifically, the malware detection model is a simple classifier used to detect whether the software to be tested is malware, and the detection result is either that the software to be tested is malware or that it is not. Correspondingly, in the training process of the malware detection model, the label of each piece of sample software indicates whether it is normal software or malware. The specific training process is as follows: the sample software and its corresponding labels are obtained in advance, and a malware detection network is constructed; features of the sample software are extracted through the log preprocessing module 610, the static feature extraction module 620 and the dynamic feature extraction module 630; and the static features and dynamic features corresponding to the sample software are input into the malware detection network for training until the network converges, thereby obtaining the trained malware detection model.
The malware detection device provided by the embodiment of the invention preprocesses the behavior log of the software to be tested to obtain static attribute information and dynamic behavior API sequences, extracts features from each to obtain the static features and the dynamic features of each API, and inputs them into the malware detection model to obtain a detection result indicating whether the software to be tested is malware. The device not only uses the time sequence information of the API sequences but also analyzes and extracts the parameter information contained in the API functions. For example, the same API function may have different actual meanings when it operates on different files; in that case, features representing different risk levels can be extracted from the parameter information. Malware is thereby detected more accurately, the false alarm rate of the malware detection model is reduced, and the detection rate is improved.
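To make "training until convergence" concrete, the sketch below trains a logistic-regression classifier by gradient descent and stops when the loss no longer changes by more than a tolerance. Logistic regression is used here only as a stand-in for the malware detection network, whose architecture is not given in this passage. The function names, the learning rate and the tolerance are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_classifier(features, labels, lr=0.5, tol=1e-6, max_epochs=5000):
    """Fit weights by batch gradient descent until the loss stops changing."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    prev_loss = float("inf")
    for _ in range(max_epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        loss = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() finite
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
            for i, xi in enumerate(x):
                grad_w[i] += (p - y) * xi
            grad_b += p - y
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        loss /= n
        if abs(prev_loss - loss) < tol:  # convergence criterion
            break
        prev_loss = loss
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (malware) or 0 (normal software)."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0
```

In this sketch each row of `features` would be the combined static and dynamic feature vector of one piece of sample software, and each entry of `labels` would be 0 for normal software or 1 for malware.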
In this embodiment, the dynamic features include an API category vector and an API semantic vector, where the API semantic vector includes at least one or more of a file path information vector, a registry information vector, and a network behavior information vector.
The API category vector is obtained by extracting features from the API types in the behavior log, and the API semantic vector is obtained by extracting features from the related parameters generated during the calls. The API semantic vector is determined by the functions of the APIs called while the software to be tested runs. If only APIs with a network access function are called, the related parameters contain only network access parameters, and the extracted API semantic vector contains only the network behavior information vector. If the software to be tested calls both an API with a network access function and an API with a file modification function while running, the related parameters necessarily include network access parameters and file modification parameters, and the API semantic vector then includes the file path information vector and the network behavior information vector.
According to the malware detection device provided by the embodiment of the invention, the API category vector and the API semantic vector are obtained, and the API semantic vector includes at least one or more of the file path information vector, the registry information vector and the network behavior information vector, so that the API category, and the semantic information corresponding to an API of that category in different environments, can be determined. Combining the category and the semantic information makes it possible to detect more accurately whether the software to be tested is malware.
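The dependence of the semantic vector on the called functions can be sketched as follows. The names `SEMANTIC_ORDER` and `build_semantic_vector` and the fixed file, registry, network ordering are assumptions. The sub-vectors are taken as already computed, since their encoding is described elsewhere.

```python
# Assumed fixed ordering of the optional sub-vectors.
SEMANTIC_ORDER = ("file", "registry", "network")


def build_semantic_vector(param_vectors):
    """Concatenate only the sub-vectors whose behavior type was observed.

    param_vectors maps a behavior type ("file", "registry", "network")
    to the vector extracted from the corresponding call parameters.
    """
    parts = [kind for kind in SEMANTIC_ORDER if kind in param_vectors]
    vector = []
    for kind in parts:
        vector.extend(param_vectors[kind])
    return parts, vector
```

If only network access APIs were called, `param_vectors` has a single `"network"` entry and the result contains only the network behavior information vector. If file modification APIs were also called, the file path information vector is placed before it.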
Fig. 7 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 7, the electronic device may include: a processor 710, a communications interface 720, a memory 730 and a communication bus 740, wherein the processor 710, the communications interface 720 and the memory 730 communicate with one another via the communication bus 740. The processor 710 may invoke logic instructions in the memory 730 to perform a malware detection method, the method comprising: preprocessing a behavior log of the software to be tested to obtain static attribute information of the software to be tested and dynamic behavior API sequences of all APIs in the software to be tested;
performing static characteristic extraction on the static attribute information of the software to be tested to obtain the static characteristics of the software to be tested;
extracting dynamic characteristics of the dynamic behavior API sequence of each API in the software to be tested to obtain the dynamic characteristics of each API in the software to be tested;
inputting the static characteristics of the software to be tested and the dynamic characteristics of each API in the software to be tested into a pre-trained malicious software detection model to obtain a detection result corresponding to the software to be tested;
the malicious software detection model is obtained by training label information of sample software based on static characteristics of the sample software and dynamic characteristics of each API in the sample software.
In addition, the logic instructions in the memory 730 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium. When executed by a processor, the computer program is capable of performing the malware detection method provided by the methods above, the method comprising: preprocessing a behavior log of the software to be tested to obtain static attribute information of the software to be tested and dynamic behavior API sequences of all APIs in the software to be tested;
performing static characteristic extraction on the static attribute information of the software to be tested to obtain the static characteristics of the software to be tested;
extracting dynamic characteristics of the dynamic behavior API sequence of each API in the software to be tested to obtain the dynamic characteristics of each API in the software to be tested;
inputting the static characteristics of the software to be tested and the dynamic characteristics of each API in the software to be tested into a pre-trained malicious software detection model to obtain a detection result corresponding to the software to be tested;
the malicious software detection model is obtained by training label information of sample software based on static characteristics of the sample software and dynamic characteristics of each API in the sample software.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the malware detection method provided by the methods above, the method comprising: preprocessing a behavior log of the software to be tested to obtain static attribute information of the software to be tested and dynamic behavior API sequences of all APIs in the software to be tested;
performing static characteristic extraction on the static attribute information of the software to be tested to obtain the static characteristics of the software to be tested;
extracting dynamic characteristics of the dynamic behavior API sequence of each API in the software to be tested to obtain the dynamic characteristics of each API in the software to be tested;
inputting the static characteristics of the software to be tested and the dynamic characteristics of each API in the software to be tested into a pre-trained malicious software detection model to obtain a detection result corresponding to the software to be tested;
the malicious software detection model is obtained by training label information of sample software based on static characteristics of the sample software and dynamic characteristics of each API in the sample software.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.