CN114579965A

CN114579965A - Malicious code detection method and device and computer readable storage medium

Info

Publication number: CN114579965A
Application number: CN202111674113.6A
Authority: CN
Inventors: 姚刚; 陈奋; 陈荣有; 孙晓波; 龚利军
Original assignee: Xiamen Fuyun Information Technology Co ltd
Current assignee: Xiamen Fuyun Information Technology Co ltd
Priority date: 2021-12-31
Filing date: 2021-12-31
Publication date: 2022-06-03

Abstract

The invention provides a method for detecting malicious code, comprising the following steps: acquiring running information of a malicious code file to be detected; inputting the running information into a malicious code detection model trained in advance using features of a heterogeneous network, the model outputting the category of the malicious code file to be detected; wherein the malicious code detection model is trained by the following steps: S1, acquiring file samples as a training set; S2, extracting the running information of the file samples; S3, constructing a heterogeneous network; S4, according to the heterogeneous network paradigms, obtaining a relation adjacency matrix of the heterogeneous network for each paradigm, and obtaining random walk vectors; S5, constructing and training a corresponding word vector model and a corresponding classification model using the random walk information; and S6, applying principal angle weighting to the classification results to determine the category to which the malicious code file to be detected belongs. This technical scheme makes full use of the environmental information of malicious code and improves the accuracy of malicious code file classification.

Description

Malicious code detection method and device and computer readable storage medium

Technical Field

The present invention relates to the field of computer security, and in particular, to a method and an apparatus for detecting malicious codes, and a computer-readable storage medium.

Background

Binary malicious code is a generic term for various types of malware, including viruses, Trojan horses, backdoors, worms, and the like. Malicious code poses a significant threat to the data security and property security of internet enterprises and individual users. As development tools have matured, malicious code has become ever easier to produce and ever better at evading detection, so major antivirus and security vendors face huge challenges.

In the contest against malicious code, signature-based detection is the most common means of analysis. Signature-based detection extracts code features (signatures) from malicious code and uses those features to detect code. Almost all mainstream antivirus software, such as Kaspersky, Symantec, and McAfee, includes signature-based malicious code detection. The method is fast and accurate, but its recall depends heavily on how the signatures are extracted, so it requires both experienced technicians and a huge virus library. Symantec applies heuristic rules to malicious code detection: heuristic scanning uses a defined scanning technique and given judgment rules to detect whether suspicious functions exist in a program and thereby identify malicious code. Behavior-based detection is also an important research direction in malicious code analysis and is usually combined with artificial intelligence and data mining techniques.

Conventional binary malicious code analysis methods are classified into static analysis, dynamic analysis, and machine learning analysis. Static analysis extracts static information by directly analyzing the binary file data in order to detect software similarity. Early research often used functional and external features of the software, such as software size and software workflow. In subsequent studies, researchers have also used static instruction frequencies, string sets, control flow, and other features. Static analysis does not need to execute the software; it only reads the binary file, so it is fast and safe. However, static analysis cannot handle software processed by transformation techniques such as obfuscation and packing, and must be combined with deobfuscation and unpacking techniques. Dynamic analysis assesses software similarity based on actual data from running the software, and it can effectively resist transformation techniques such as code obfuscation and packing. Because dynamic analysis must execute the binary program, and executing programs, especially malicious ones, carries a certain risk, the analysis must be performed in a secure environment. Machine learning detection extracts feature information from malicious code and predicts with a classifier: a large number of samples and features are collected for training, and unknown samples are then predicted. Although machine learning methods are expected to offer both usability and interpretability, with the advent of a large number of efficient classifiers the interpretable part of the feature extraction and analysis scheme tends to be neglected.

Disclosure of Invention

To solve the above problems in the prior art, embodiments of the present invention provide a method and an apparatus for detecting malicious code, and a computer-readable storage medium.

In one aspect, a method for detecting malicious code is provided, which is used for detecting a category to which binary malicious code belongs, and includes:

acquiring running information of a malicious code file to be detected, wherein the running information comprises API call information, or API call information and DLL call information, of the malicious code file to be detected;

inputting the acquired running information into a malicious code detection model which is trained by using the characteristics of a heterogeneous network in advance, and outputting the category of the file of the malicious code to be detected by the malicious code detection model;

the malicious code detection model comprises a word vector model and a classification model, and is trained through the following steps:

S1, acquiring a preset number of malicious code file samples as a training set;

S2, extracting the running information of the malicious code file samples, wherein the running information of the malicious code file samples comprises API call information, or API call information and DLL call information;

S3, constructing a heterogeneous network according to the extracted running information, wherein nodes of the heterogeneous network comprise the file names and APIs of the malicious code file samples, or the file names, APIs and DLLs of the malicious code file samples;

S4, according to a plurality of preset different heterogeneous network paradigms, obtaining a relation adjacency matrix of the heterogeneous network for each heterogeneous network paradigm, and, according to each relation adjacency matrix, obtaining a random walk vector in the heterogeneous network for each of the malicious code file samples under each heterogeneous network paradigm, wherein the random walk vector shows the association between a selected node and the nodes around it, and the heterogeneous network paradigms define the relationships between nodes;

S5, using the random walk information obtained for the file samples under each heterogeneous network paradigm to construct and train a word vector model and a classification model corresponding to each heterogeneous network paradigm, wherein the input of the word vector model is the corresponding random walk vector, the output of the word vector model is a processed random walk feature vector, the input of the classification model is the random walk feature vector, and the output of the classification model is the classification result for the corresponding word vector model;

and S6, applying principal angle weighting to the classification results obtained from the plurality of classification models to determine the category to which the malicious code file to be detected belongs.

In the method, the malicious code file sample is an executable file, and in step S2, the running information is obtained by parsing the executable file through a sandbox.

In the method, the running information of the malicious code file samples further includes one or more of the following: file co-occurrence probability and network call information; the heterogeneous network further indicates one or more of the following associations between the malicious code file samples: co-occurrence association information and network association information; the nodes of the heterogeneous network further comprise one or more of: folders in which multiple malicious code file samples appear together, compressed packages in which they appear together, websites visited, and network requests generated.

In the method, the different heterogeneous network paradigms include one or more of the following paradigms MID1 through MID4:

(the paradigms MID1 through MID4 are shown in Fig. 5)

wherein F represents a malicious code file to be detected; A represents an API; D represents a DLL; I represents the inclusion relationship of an API association; and B represents the belonging relationship of a DLL association.

In the method, the relation adjacency matrix corresponding to the heterogeneous network paradigm MID3 is of order i x j, wherein i represents the total number of malicious code file samples and j represents the total number of all APIs; the value of each element of the i x j matrix indicates the number of times the API corresponding to its column appears in the malicious code file sample corresponding to its row; wherein obtaining random walk information of a malicious code file sample for MID3 comprises:

S7, starting from the malicious code file sample, walking along its row of the relation adjacency matrix until a column with a non-zero element value is encountered, thereby obtaining a first API associated with the file sample; then walking along that column until a row with a non-zero element value is encountered, thereby obtaining a second file sample associated with the first API, and obtaining a first group of vectors containing F → A → F;

S8, taking the second file sample as the starting file and executing S7 to obtain a second API associated with the second file sample and a third file sample associated with the second API, thereby obtaining a second group of vectors containing F → A → F;

continuing the walk from the third file sample, looping through S7 and S8 until a predetermined number of F → A → F vectors are obtained.
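The loop over S7 and S8 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names `next_nonzero` and `scan_walk`, the wrap-around scanning, and the choice to resume each scan just past the previous position are assumptions made so that the example terminates on a small matrix.

```python
from typing import List, Tuple


def next_nonzero(values: List[int], start: int) -> int:
    """Return the first index at or after `start` (wrapping around) whose
    value is non-zero, or -1 if every value is zero."""
    n = len(values)
    for step in range(n):
        idx = (start + step) % n
        if values[idx] != 0:
            return idx
    return -1


def scan_walk(matrix: List[List[int]], start_file: int,
              num_triples: int) -> List[Tuple[int, int, int]]:
    """Collect (file, api, file) index triples by scanning rows, then columns.

    matrix[f][a] is the number of times API `a` appears in file sample `f`.
    Assumption: each scan resumes just past the previous position and wraps.
    """
    triples = []
    f, col_start = start_file, 0
    for _ in range(num_triples):
        a = next_nonzero(matrix[f], col_start)      # S7: walk along the row
        if a < 0:                                   # file calls no API at all
            break
        column = [row[a] for row in matrix]
        f_next = next_nonzero(column, (f + 1) % len(matrix))  # walk the column
        triples.append((f, a, f_next))
        # S8: the file just reached becomes the starting file of the next walk
        f, col_start = f_next, (a + 1) % len(matrix[0])
    return triples


if __name__ == "__main__":
    toy = [[0, 10, 4, 0],
           [7, 0, 2, 1],
           [3, 5, 0, 0]]
    print(scan_walk(toy, start_file=0, num_triples=3))
```

Because the starting file itself has a non-zero entry in the chosen column, the column scan always finds a file; in the worst case it returns to the starting file.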

In the method, the step S4 includes repeating the random walk for each file sample until a predetermined number of walks is reached.

In the method, the word vector model is a word2vec word vector model, which reduces the input random walk information to a predetermined dimension.

In the method, the step S6 includes:

performing principal angle analysis on vectors output by the word vector models;

determining, using the following formula, a weight α_i (i = 1, ..., m) for the classification result corresponding to each of the plurality of word vector models, wherein m is the number of word vector models;

wherein d(Y_i, Y_j) is the geometric distance between the vector spaces corresponding to the different word vector models Y_i and Y_j.
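For readers unfamiliar with the term, principal angles between two subspaces are commonly defined as below. This is the standard textbook definition and not the weight formula of the method, which does not appear in this text; the symbols Q_i, Q_j, σ_k, r and the particular distance shown are assumptions chosen for illustration.

```latex
% Q_i, Q_j: matrices whose orthonormal columns span the vector spaces of Y_i and Y_j
% \sigma_k: k-th largest singular value; r: the smaller of the two subspace dimensions
\cos\theta_k = \sigma_k\!\left(Q_i^{\top} Q_j\right), \qquad k = 1, \dots, r,
\qquad 0 \le \theta_1 \le \dots \le \theta_r \le \frac{\pi}{2}

% One common geometric distance built from the principal angles (an assumed choice)
d(Y_i, Y_j) = \Bigl(\sum_{k=1}^{r} \theta_k^{2}\Bigr)^{1/2}
```

Under this definition, two word vector models whose vector spaces nearly coincide have principal angles close to zero and therefore a small distance.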

In another aspect, an apparatus for detecting malicious code is provided, which includes a memory and a processor, where the memory stores at least one program, and the at least one program is executed by the processor to implement the method for detecting malicious code as described above.

In yet another aspect, a computer-readable storage medium is provided, in which at least one program is stored, the at least one program being executed by a processor to implement the method for detecting malicious code as described above.

The technical scheme has the following technical effects:

According to the scheme of the embodiments of the invention, the running information of the malicious code is acquired, illustratively through a sandbox, and a heterogeneous network with multiple structures is constructed. Information under different regions of attention is obtained from the heterogeneous network according to the paradigms; the environmental information of the nodes, that is, their context information, is embedded as word vectors by means of a word vector model such as word2vec, yielding a flexible and powerful family of features. The vectors are combined by means of principal angle analysis, integrating the advantages of each paradigm to determine the category of the malicious code. The scheme thus realizes the analysis of malicious code in the heterogeneous network space and can achieve a malicious code classification accuracy of about 96%. Compared with static detection methods, the scheme does not need the support of a huge feature library. By combining dynamic analysis with machine learning, using heterogeneous network processing methods and machine learning tools such as word vectors, and applying a divide-and-conquer approach to the features, the scheme obtains features of the heterogeneous network of malicious code quickly and effectively, and it effectively solves the problem that the environmental information of malicious code is not fully utilized.

Drawings

Fig. 1 is a flowchart illustrating a malicious code detection method according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart illustrating the process of training a malicious code detection model in the malicious code detection method according to an embodiment of the present invention;

Fig. 3 is a flowchart illustrating the overall design of a malicious code detection method according to another embodiment of the present invention;

Fig. 4 is an example of a heterogeneous network used in the malicious code detection method according to an embodiment of the present invention;

Fig. 5 is a schematic diagram illustrating four exemplary paradigms used in a malicious code detection method according to an embodiment of the present invention;

Fig. 6 is an abstract diagram of a heterogeneous network used in the malicious code detection method according to an embodiment of the present invention;

Figs. 7 to 10 are schematic diagrams respectively illustrating how the paradigms MID1 to MID4 are matched in the heterogeneous network shown in Fig. 6 in the malicious code detection method according to an embodiment of the present invention;

Fig. 11 is a partial diagram of the adjacency matrix of relationship I obtained according to the paradigm MID3 in the malicious code detection method according to an embodiment of the invention;

Fig. 12 is a schematic diagram of the training structure of a word2vec word vector model used in the method for detecting malicious code according to an embodiment of the present invention;

Fig. 13 is a schematic flowchart illustrating how the final classification result of a malicious code is obtained using four word vector models in the malicious code detection method according to an embodiment of the present invention;

Fig. 14 is a schematic structural diagram of a malicious code detection apparatus according to an embodiment of the present invention.

Detailed Description

To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. With these references, one of ordinary skill in the art will appreciate other possible implementations and advantages of the present invention. Elements in the figures are not drawn to scale and like reference numerals are generally used to indicate like elements.

The invention will now be further described with reference to the accompanying drawings and detailed description.

Embodiment one:

Fig. 1 is a flowchart illustrating a malicious code detection method according to an embodiment of the present invention. Referring to Fig. 1, a method for detecting malicious code according to an embodiment of the present invention is used for detecting the category of binary malicious code, and includes: acquiring running information of a malicious code file to be detected, wherein the running information comprises API (application programming interface) call information of the malicious code file to be detected, or comprises the API call information and DLL (dynamic-link library) call information; and inputting the acquired running information into a malicious code detection model trained in advance using features of the heterogeneous network, the malicious code detection model outputting the category of the malicious code file to be detected.

Fig. 2 is a schematic flowchart illustrating the process of training a malicious code detection model in the malicious code detection method according to an embodiment of the present invention. In the embodiment of the invention, the malicious code detection model comprises different word vector models and classification models corresponding to different heterogeneous network paradigms. As shown in Fig. 2, the malicious code detection model is trained through the following steps:

S1, acquiring a preset number of malicious code file samples as a training set;

S2, extracting the running information of the malicious code file samples, wherein the running information comprises API call information, or API call information and DLL call information;

S3, constructing a heterogeneous network according to the extracted running information, wherein nodes of the heterogeneous network comprise the file names and APIs (application programming interfaces) of the malicious code file samples, or the file names, APIs and DLLs (dynamic-link libraries) of the malicious code file samples;

S4, according to a plurality of preset different heterogeneous network paradigms, obtaining a relation adjacency matrix of the heterogeneous network for each heterogeneous network paradigm, and, according to each relation adjacency matrix, obtaining a random walk vector in the heterogeneous network for each of the malicious code file samples under each heterogeneous network paradigm, wherein the random walk vector shows the association between a selected node and the nodes around it, and the heterogeneous network paradigms define the relationships between nodes;

S5, using the random walk information obtained for the file samples under each heterogeneous network paradigm to construct and train a word vector model and a classification model corresponding to each heterogeneous network paradigm, wherein the input of the word vector model is the corresponding random walk vector, the output of the word vector model is a processed random walk feature vector, the input of the classification model is the random walk feature vector, and the output of the classification model is the classification result for the corresponding word vector model;

and S6, performing principal angle weighting on a plurality of obtained classification results corresponding to the classification models to determine the category to which the malicious code file to be detected belongs.

In a specific implementation manner, the malicious code file sample is an executable file, and in step S2, the running information is obtained by parsing the executable file through a sandbox. Further, the running information of the malicious code file sample also comprises one or more of the following information: co-occurrence probability of files and network calling information; the heterogeneous network also indicates one or more of the following associations between malicious code file samples: co-occurrence associated information and network associated information; the nodes of the heterogeneous network further comprise one or more of: folders in which multiple malicious code file samples appear, compressed packages in which they appear, websites visited, and network requests generated.

Embodiment two:

Overall design

Fig. 3 is a schematic flow diagram of the overall design of a malicious code detection method according to another embodiment of the present invention. As shown in Fig. 3, in this embodiment the data to be used is obtained by analyzing malicious code test samples in the Cuckoo sandbox; the information obtained includes API information, dynamic-link library (DLL) information, relevant text information, and the like. The detection method of this embodiment mainly involves a random walk scheme, word vector training, and classifier design.

The detection method of this embodiment takes a malicious code test sample as input and outputs the category of the sample. The malicious code test sample is an executable file. As shown in Fig. 3, the detection method according to the embodiment of the present invention includes:

step 301, analyzing the test sample with the Cuckoo sandbox to obtain API call information; other sandboxes with similar functionality may also be used to analyze the test sample;

step 302, obtaining the relation adjacency matrix and the random walk information;

step 303, inputting the random walk information into the trained word2vec word vector models of the four heterogeneous network paradigm subgraphs, illustratively CBOW word vector models, to obtain the corresponding random walk feature vectors; the four heterogeneous network paradigm subgraphs are exemplary subgraphs selected according to the embodiment of the invention and are described in detail below;

step 304, inputting the feature vectors output by the four word vector models, one per paradigm subgraph, into the corresponding classifiers to generate four classification results, in this case classification probability distributions;

step 305, weighting the classification probability distributions generated for the four subgraphs by principal angle to obtain the final classification result of the sample to be detected, namely its malicious code category.
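Steps 304 and 305 amount to combining four probability distributions with per-model weights and taking the most probable family. The sketch below shows only that combination step and treats the weights as given, since the principal angle weight formula is not reproduced in this text. The function names, the example probabilities, and the example weights are hypothetical; the family names Allaple, Virut and Agent are those of the exemplary dataset.

```python
from typing import Dict, List


def fuse_distributions(distributions: List[Dict[str, float]],
                       weights: List[float]) -> Dict[str, float]:
    """Weighted average of per-paradigm class probability distributions.

    `weights` stands in for the principal-angle weights alpha_i; how they
    are computed is outside this sketch (assumption: they are supplied).
    """
    if len(distributions) != len(weights):
        raise ValueError("one weight is required per distribution")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    fused: Dict[str, float] = {}
    for dist, w in zip(distributions, weights):
        for label, p in dist.items():
            fused[label] = fused.get(label, 0.0) + (w / total) * p
    return fused


def predict_family(distributions: List[Dict[str, float]],
                   weights: List[float]) -> str:
    """Return the label with the highest fused probability."""
    fused = fuse_distributions(distributions, weights)
    return max(fused, key=fused.get)


if __name__ == "__main__":
    # Hypothetical classifier outputs for the four paradigm subgraphs.
    outputs = [
        {"Allaple": 0.7, "Virut": 0.2, "Agent": 0.1},
        {"Allaple": 0.4, "Virut": 0.5, "Agent": 0.1},
        {"Allaple": 0.6, "Virut": 0.3, "Agent": 0.1},
        {"Allaple": 0.2, "Virut": 0.6, "Agent": 0.2},
    ]
    alphas = [0.4, 0.2, 0.3, 0.1]  # hypothetical weights
    print(predict_family(outputs, alphas))
```

Normalizing by the sum of the weights keeps the fused result a probability distribution even if the supplied weights do not sum to one.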

The above is the overall design flow of the detection method according to the embodiment of the present invention. The details of the scheme are described below.

Heterogeneous network of malicious code

In the actual detection of malicious code, a large amount of information is often available, but this information does not all lie in one dimension. For example, for a malicious sample we can obtain its list of API calls, its DLL reference information, the co-occurrence probabilities of certain files, network call information, and so on. This complex and diverse information can jointly form an information-rich heterogeneous network. Analyzing this network yields an accurate and highly interpretable description of a malicious file; such information can identify malicious code well and can even explain its malicious attributes. A heterogeneous network is one in which different types of association information are organically integrated in the same network structure. Fig. 4 is an example of a heterogeneous network used in the malicious code detection method according to an embodiment of the present invention. As shown in Fig. 4, such association information includes: 1) co-occurrence association: files appear together in the same folder or the same compressed package; 2) API association: files call similar or identical APIs; 3) DLL association: files call the same dynamic-link library; 4) network association: files have accessed the same website or made similar network requests.
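As a rough illustration of how such association information might be held in one structure, the sketch below stores a heterogeneous network as an undirected adjacency map over typed nodes. The node type letters F, A, D and Z follow the abstraction used for Fig. 6; the function name, the record layout, the archive name `pack-1`, and the assignment of each API to a particular file are assumptions made only for this example.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Node = Tuple[str, str]  # (node type, name), e.g. ("F", "File-U")


def build_hetero_graph(samples: Dict[str, dict],
                       api_to_dll: Dict[str, str]) -> Dict[Node, Set[Node]]:
    """Build an undirected typed graph from per-file records.

    samples:    {file name: {"apis": [...], "archives": [...]}} (assumed layout)
    api_to_dll: {API name: DLL name}
    """
    graph: Dict[Node, Set[Node]] = defaultdict(set)

    def link(u: Node, v: Node) -> None:
        graph[u].add(v)
        graph[v].add(u)

    for file_name, info in samples.items():
        f = ("F", file_name)
        for api in info.get("apis", []):
            a = ("A", api)
            link(f, a)                      # API association (relationship I)
            dll = api_to_dll.get(api)
            if dll is not None:
                link(a, ("D", dll))         # DLL association (relationship B)
        for archive in info.get("archives", []):
            link(f, ("Z", archive))         # co-occurrence association
    return graph


if __name__ == "__main__":
    # Illustrative records only; which file calls which API is assumed.
    records = {
        "File-U": {"apis": ["SetDoubleClickTime"], "archives": ["pack-1"]},
        "File-M1": {"apis": ["SetTimer"], "archives": ["pack-1"]},
    }
    dlls = {"SetDoubleClickTime": "USER32.DLL", "SetTimer": "USER32.DLL"}
    g = build_hetero_graph(records, dlls)
    for node in sorted(g):
        print(node, "->", sorted(g[node]))
```

Keeping the node type inside the node key makes it straightforward to filter neighbors by type when matching paradigms later.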

In the heterogeneous network shown in Fig. 4, assume that the class of File-U is unknown and the classes of the other four files are known. One can then estimate which of the four known files is most similar to File-U. Comparison in Fig. 4 shows that File-U and File-M1 are the most similar. As can be seen from Fig. 4, in the compressed package information on the left, File-U and File-M1 both appear in the three compressed packages on the left, and the two APIs called by File-U and File-M1, SetDoubleClickTime and SetTimer, both belong to USER32.DLL and have similar functions. File-U also shares an API call cluster with File-B1, as it does with File-M1, but with File-M1 there is additionally an overlap in DLL information and a co-occurrence association in the same compressed file; this is the stronger association revealed by the heterogeneous network.

Heterogeneous network paradigm

The structure of a heterogeneous network is complex and varied. To process the information better, the embodiment of the present invention exemplarily uses four basic network-structure subunits as rules for acquiring information; these are referred to herein as heterogeneous network paradigms. In the method of the embodiment, four heterogeneous network paradigms are used to process a complex heterogeneous network. Fig. 5 is a schematic diagram of the four exemplary paradigms used in a method for detecting malicious code according to an embodiment of the present invention.

As shown in Fig. 5, only two relationships are considered in these four paradigms: I (Include), the containing relationship, i.e., the API association; and B (Belong), the belonging relationship, i.e., the DLL association. By combining these two associations with experience from dynamic analysis of malicious code, the four paradigm structures shown in Fig. 5 were designed; these four paradigms can be used to capture similarity relationships in the heterogeneous network.

Fig. 6 is an abstract diagram of an example of a heterogeneous network used in the malicious code detection method according to an embodiment of the present invention. As shown in Fig. 6, each node represents an information unit in the heterogeneous network, where F represents an executable file name, A represents an API, D represents a dynamic-link library, M represents an infected machine, and Z represents a compressed file or a co-occurrence folder. Even a heterogeneous network with no more than 20 elements has a fairly complex structure, and as elements are added the structural complexity of the network grows rapidly and nonlinearly. To handle this difficulty, the method of the embodiment uses two effective means: the first is to rely heavily on adjacency matrix computation to reduce computational complexity, and the second is to use heterogeneous network paradigms to obtain different useful network information from different viewpoints. Specifically, Figs. 7 to 10 illustrate the paradigms MID1 to MID4 respectively; many such MID1, MID2, MID3, and MID4 structures can be found in the heterogeneous network of Fig. 6. In Figs. 7 to 10, exemplary path information matching the paradigm structure is indicated by a horizontal line under the corresponding nodes.

Clearly, a paradigm matches only a small part of a heterogeneous network, and it can match many groups of nodes that follow the paradigm. In this small heterogeneous network, the paradigm MID1 matches 4 groups of corresponding nodes. Similarly, as can be seen in Figs. 8 to 10, the paradigms MID2, MID3, and MID4 each have matching nodes.
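Matching a paradigm against a network can be pictured as enumerating paths whose node types follow a fixed type sequence. The sketch below is one simple way to do that under stated assumptions: the sequence F, A, F reflects the F → A → F chain described for MID3, while the longer sequence F, A, D, A, F is a hypothetical example, because the exact shapes of the four paradigms are given only in Fig. 5. The function names and the toy edge list are likewise illustrative.

```python
from typing import Dict, List, Sequence, Set, Tuple

Node = Tuple[str, str]  # (node type, name)


def undirected(edges: Sequence[Tuple[Node, Node]]) -> Dict[Node, Set[Node]]:
    """Turn an edge list into an undirected adjacency map."""
    graph: Dict[Node, Set[Node]] = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def match_paradigm(graph: Dict[Node, Set[Node]], start: Node,
                   type_path: Sequence[str]) -> List[List[Node]]:
    """Enumerate simple paths from `start` whose node types follow `type_path`."""
    if not type_path or start[0] != type_path[0]:
        return []
    paths = [[start]]
    for wanted in type_path[1:]:
        paths = [
            path + [nxt]
            for path in paths
            for nxt in sorted(graph.get(path[-1], ()))
            if nxt[0] == wanted and nxt not in path  # no node is revisited
        ]
    return paths


if __name__ == "__main__":
    toy = undirected([
        (("F", "f0"), ("A", "a0")),
        (("F", "f1"), ("A", "a0")),
        (("A", "a0"), ("D", "d0")),
        (("A", "a1"), ("D", "d0")),
        (("F", "f2"), ("A", "a1")),
    ])
    print(match_paradigm(toy, ("F", "f0"), ["F", "A", "F"]))
    print(match_paradigm(toy, ("F", "f0"), ["F", "A", "D", "A", "F"]))
```

In the toy network, f0 reaches f1 through a shared API, and reaches f2 only through two different APIs that belong to the same DLL, which is the kind of weaker, indirect similarity that a longer paradigm can expose.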

Obtaining data from cuckoo sandbox

The Cuckoo report file report.json is parsed to obtain the data for the malicious code samples used in this embodiment. Through the Cuckoo sandbox, the analysis result of actually running an executable file can be obtained; the result is presented as a Python dictionary data structure. Specifically, the "APIs" entry of the dictionary contained in report.json is parsed; it records, for each process, the types of APIs called by the executable file and the number of calls. This information is taken as relationship I in the present invention. Relationship I represents which APIs an executable file contains, i.e., the Include relationship. DLL information for the APIs is obtained by parsing the hook scripts in the Cuckoo monitor. In total, 365 APIs are detectable in Cuckoo and 3948 files were analyzed. Illustratively, these 3948 executables are known to belong to three families, namely Allaple, Virut, and Agent. A [3948 x 365] matrix is thus obtained, in which the element values of each row show which of the 365 APIs the corresponding sample called and how many times each was called. In addition, the API-to-DLL correspondence can be obtained: building the lookup table only requires going through the hook configuration files of the Cuckoo monitor, extracting all the rst files, and obtaining the API list.
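The construction of the files-by-APIs count matrix can be sketched as follows. The example works on in-memory dictionaries and assumes that the per-process API statistics of one sample have the shape {process id: {API name: call count}}; that shape, the function names, and the sample and API names are assumptions for illustration and should be checked against the actual report layout.

```python
from collections import Counter
from typing import Dict, List, Tuple

ApiStats = Dict[str, Dict[str, int]]  # assumed shape: {pid: {api: count}}


def total_api_counts(api_stats: ApiStats) -> Counter:
    """Sum the API call counts of one sample over all of its processes."""
    totals: Counter = Counter()
    for per_process in api_stats.values():
        totals.update(per_process)  # Counter.update adds counts
    return totals


def build_relation_matrix(reports: Dict[str, ApiStats],
                          api_index: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return (file order, matrix) where matrix[f][a] counts calls of API a by file f."""
    files = sorted(reports)
    matrix = []
    for name in files:
        totals = total_api_counts(reports[name])
        matrix.append([totals.get(api, 0) for api in api_index])
    return files, matrix


if __name__ == "__main__":
    # Hypothetical per-process statistics for two samples.
    reports = {
        "sample_a.exe": {"1200": {"SetTimer": 3, "CreateFileW": 2},
                         "1300": {"SetTimer": 1}},
        "sample_b.exe": {"900": {"CreateFileW": 5}},
    }
    apis = ["CreateFileW", "SetDoubleClickTime", "SetTimer"]
    order, relation_i = build_relation_matrix(reports, apis)
    for name, row in zip(order, relation_i):
        print(name, row)
```

With 3948 samples and a fixed list of 365 API names, the same procedure yields the [3948 x 365] matrix described above.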

Random walk on training set samples

The method of the embodiment of the invention generates, by random walk, the surrounding environment information of the central executable file of each paradigm subgraph. The central executable file is the executable file corresponding to the selected malicious code sample. Different subgraphs produce different random walk information. In this example, the four paradigm subgraphs are those shown in Fig. 5, namely MID1, MID2, MID3, and MID4.

The random walk scheme employed in the present embodiment is described here taking MID3 as an example. Illustratively, a large amount of related but unhelpful information may be discarded during the walk so that the random walk algorithm remains sensitive when handling large amounts of data. Meanwhile, all the subgraphs adopted by the embodiment of the invention are symmetric between the two F nodes, so the surrounding environment information of F under a subgraph can be described by random walks over only half of the nodes. The following takes MID3 as an example to describe specifically how the context, that is, the environmental information of the file, is acquired by random walk.

Fig. 11 is a partial schematic diagram of the adjacency matrix of relationship I obtained according to the paradigm MID3 in a malicious code detection method according to an embodiment of the present invention. In this example, 3948 executables were analyzed in total; they are denoted F₀ to F₃₉₄₇ and correspond to the rows in Fig. 11. The 365 APIs analyzed by the sandbox are denoted A₀ to A₃₆₄ and correspond to the columns in Fig. 11. The relationship matrix I has already been obtained as described above. Assume the row of I for our sample F₀ is [0, 10, 4, 0, 0, 3, 30, …, 0, 19, 0]. This is a vector of 365 elements, meaning that the APIs numbered 0 to 364, A₀ to A₃₆₄, occur in F₀ 0, 10, 4, 0, 0, 3, 30, …, 0, 19, 0 times respectively. Here, half of the nodes of the paradigm are obtained by sampling: F->A->F. During sampling, the probabilities used are determined by the quantities in I. Illustratively, zero values are skipped and non-zero values are sampled.

Specifically, as shown in Fig. 11, the API information of F₀ is in the first row. The quantity relation of the elements of that row is obtained and used for sampling. The first value is 0 and is skipped; continuing, A₁ is obtained, that is, it is assumed here that the second API, which appears 10 times, is drawn. After A₁ is obtained, sampling is performed on the column corresponding to A₁. The physical meaning of this column is the number of times A₁ appears in each file; that is, a column quantity relation is obtained from the column read vertically, and a file is drawn according to that relation, giving F₂. Again, only non-zero elements may be drawn. In this way the three nodes visited from node F are obtained, F₀->A₁->F₂. This is the first walk performed with F₀ as the central executable file, and it yields the first group of vector information [F, A, F].
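The single F->A->F step described above can be sketched as follows. This is a minimal illustration on a toy 3 x 4 count matrix, not the patented implementation; the function names `sample_nonzero` and `walk_f_a_f` and the matrix values are hypothetical, and count-proportional sampling is assumed from the statement that the probabilities are determined by the quantitative relation in I.

```python
import random

def sample_nonzero(weights, rng):
    """Draw an index with probability proportional to its count; zero entries are skipped."""
    candidates = [i for i, w in enumerate(weights) if w > 0]
    if not candidates:
        return None
    return rng.choices(candidates, weights=[weights[i] for i in candidates], k=1)[0]

def walk_f_a_f(matrix, start_file, rng):
    """One F->A->F step on a file-by-API count matrix (rows = files, columns = APIs)."""
    api = sample_nonzero(matrix[start_file], rng)   # row sampling: F -> A
    if api is None:
        return None                                 # the file calls no API at all
    column = [row[api] for row in matrix]           # how often this API occurs in each file
    next_file = sample_nonzero(column, rng)         # column sampling: A -> F
    return (start_file, api, next_file)

# Toy stand-in for relationship I (hypothetical values): 3 files x 4 APIs.
I = [
    [0, 10, 4, 0],
    [0, 0, 2, 5],
    [3, 7, 0, 0],
]
rng = random.Random(0)
walk = walk_f_a_f(I, 0, rng)
print(walk)  # a tuple (file index, API index, file index)
```

Because the column always contains the start file's own non-zero count, the walk may return to the file it started from; whether the original method excludes that case is not stated in the text.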

Such a random walk needs to be performed multiple times to delineate the node information around a sample point. Ideally, the resulting API quantity distribution is as close as possible to the actual API distribution of the sample, and a large number of related files are obtained as auxiliary information. In the exemplary embodiment, for any given paradigm subgraph, the random walk is repeated 80 times to obtain the information around the file node F, that is, the information of the nodes around F that are related to F. Here, 80 is merely an example, and any other number of walks may be set according to the actual situation. In this example, the first walk reaches F₂. From F₂, the walk continues to the right along the row until a second associated API with a non-zero element value is encountered, and then downward along the column until a third associated file with a non-zero element value is encountered. This completes the second random walk and yields a second group of random walk vector information [F, A, F]. In the same manner, the walk continues right and down from the current file node until 80 random walks are completed, giving 80 groups of [F, A, F] information. This information can be recorded against a dictionary containing all APIs and all training set files, which in this example has a total length of 4313 (365 + 3948). After the 80 random walks, 80 vectors of dictionary length are obtained. Each vector indicates, for every element of the dictionary, whether it appears among the nodes on and around the current sample in the subgraph; that is, the vector shows the surrounding information, or context information, associated with the sample. The vectors obtained are used as the batch data for training on that subgraph.
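The repeated walks and their encoding as dictionary-length vectors can be sketched as below. This is an illustrative sketch only: the dictionary layout (API entries first, then file entries), the helper names `draw` and `context_vectors`, and the toy matrix are assumptions, and each step is drawn in proportion to the counts rather than by the literal right-then-down scan described above.

```python
import random

def draw(weights, rng):
    """Count-proportional draw over the non-zero entries of a row or column."""
    idx = [i for i, w in enumerate(weights) if w > 0]
    return rng.choices(idx, weights=[weights[i] for i in idx], k=1)[0]

def context_vectors(matrix, center, n_walks, rng):
    """Chain n_walks F->A->F steps and mark each step's nodes in a 0/1 dictionary vector."""
    n_files, n_apis = len(matrix), len(matrix[0])
    vectors = []
    current = center
    for _ in range(n_walks):
        api = draw(matrix[current], rng)
        nxt = draw([row[api] for row in matrix], rng)
        vec = [0] * (n_apis + n_files)      # the patent's example has 365 + 3948 = 4313 entries
        vec[api] = 1                        # API entries occupy positions 0 .. n_apis - 1
        vec[n_apis + current] = 1           # file entries follow the API entries
        vec[n_apis + nxt] = 1
        vectors.append(vec)
        current = nxt                       # the next walk starts from the file just reached
    return vectors

# Toy relation matrix (hypothetical values); every file calls at least one API.
I = [
    [0, 10, 4, 0],
    [0, 0, 2, 5],
    [3, 7, 0, 0],
]
vectors = context_vectors(I, center=0, n_walks=80, rng=random.Random(0))
print(len(vectors), len(vectors[0]))  # 80 7
```

Each vector marks one API and one or two files, since a step may land on the file it started from.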

This example is for MID3, in which only the information of the file F and the API is of interest. For other paradigms, such as MID2 and MID4, the information of the dynamic link library (DLL) D and of B, which belong to the corresponding relationships, may also be considered. The information of D and B is sampled in the corresponding relational adjacency matrix to obtain the corresponding associated information, which is likewise represented on the dictionary. Other paradigms may also involve further related information, such as folders in which malicious code files appear, compressed packages in which malicious code files appear, websites visited by the malicious code files, network requests generated by the malicious code files, and so on. Likewise, these associations may be embodied in a dictionary-length vector to show the sample's surrounding environment information.

Through such random walks, the corresponding context can be obtained for all four subgraphs in Fig. 5, so that a word vector model can be constructed for each subgraph, finally giving four corresponding word vector models.

Word vector training

The surrounding environment information obtained in the random walk step is taken as the context of the central executable file for word vector training. In the embodiment of the invention, the word2vec word vector training model is adopted, and a dictionary of about 4500 dimensions is embedded into 128 dimensions. In this example, vectors of 4313 dimensions are difficult to use directly, so word vector dimension reduction is performed before this information is applied. Word vectors are trained with the word2vec strategy, and each of the four subgraphs trains its own word vector model on the vector data obtained by random walk. Fig. 12 is a schematic diagram illustrating the training structure of the word2vec word vector model used in the malicious code detection method according to an embodiment of the present invention. The structure takes W(t-2) to W(t+2) as input, applies SUM processing, that is, summation, and outputs W(t), thereby achieving dimension reduction. A person skilled in the art can implement this dimension reduction with a prior-art word2vec training structure, which is not described in detail here. Because of the repeated sampling and the complexity of the computation, only 1000 samples are exemplarily selected for training, and the ratio of test set samples to training set samples is 1:9.
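The input-SUM-output structure of Fig. 12 corresponds to the continuous bag-of-words form of word2vec. The sketch below trains a tiny version of it in pure Python so that the summation step is visible. It is not the patent's implementation: the function `train_cbow`, the 7-token vocabulary, the 8-dimensional embedding (the patent uses 128), the full-softmax output layer, and the representation of each walk as a short token sequence are all assumptions made for illustration.

```python
import math
import random

def train_cbow(sequences, vocab_size, dim, window=2, epochs=50, lr=0.1, seed=0):
    """Tiny CBOW: sum the context embeddings, predict the centre token with a softmax."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
    w_out = [[0.0] * dim for _ in range(vocab_size)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for seq in sequences:
            for t, target in enumerate(seq):
                lo, hi = max(0, t - window), min(len(seq), t + window + 1)
                context = [seq[k] for k in range(lo, hi) if k != t]
                if not context:
                    continue
                # SUM step: add the input vectors of W(t-2) .. W(t+2), excluding W(t)
                hidden = [sum(w_in[c][d] for c in context) for d in range(dim)]
                scores = [sum(w_out[v][d] * hidden[d] for d in range(dim))
                          for v in range(vocab_size)]
                peak = max(scores)
                exps = [math.exp(s - peak) for s in scores]
                norm = sum(exps)
                probs = [e / norm for e in exps]
                total -= math.log(probs[target])
                # Backpropagate the cross-entropy loss through both weight matrices
                grad_hidden = [0.0] * dim
                for v in range(vocab_size):
                    g = probs[v] - (1.0 if v == target else 0.0)
                    for d in range(dim):
                        grad_hidden[d] += g * w_out[v][d]
                        w_out[v][d] -= lr * g * hidden[d]
                for c in context:
                    for d in range(dim):
                        w_in[c][d] -= lr * grad_hidden[d]
        losses.append(total)
    return w_in, losses

# Hypothetical walks as token ids: APIs are 0..3, files are 4..6.
walks = [[4, 1, 6], [6, 0, 4], [4, 2, 5], [5, 3, 5], [5, 2, 4], [4, 1, 6]]
embeddings, losses = train_cbow(walks, vocab_size=7, dim=8)
print(len(embeddings), len(embeddings[0]))  # 7 8
```

After training, each row of `embeddings` is the low-dimensional vector of one dictionary entry. A production system would normally rely on an established word2vec library rather than a hand-written loop like this one.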

Generating feature vectors

The word vector models obtained above are used to extract features from the training set of malicious code samples. Each sample obtains a different feature word vector from the word vector model of each of the four subgraphs, and each of the four subgraphs trains its own classification model. Through the work above, the word vector model corresponding to each subgraph has been obtained. Training requires the construction of training set data, which again uses the random walk approach described above to obtain the feature vectors. Each walk yields vector information for only one path around node F, and a word vector obtained from a single path is hardly representative, so multiple walks are required, and the word vector obtained from each walk is used as a training feature. That is, one sample is trained repeatedly on a number of different features, so the training process is performed multiple times under one label, and each label corresponds to a malicious code category. In an exemplary implementation, the number of repeated random walks is set to 30 per sample, and samples from three families are trained. The number of labels is therefore 3, namely Allapel, Virut and Agent. Because repeated sampling makes the computation complex, only 1000 samples are selected, with a test-to-training ratio of 1:9. With 30 walks per sample, a total of 30000 pieces of random walk data are generated. For the four subgraphs, four different groups of training and test sets are sampled, one for each of the four word vector models; each group contains 27000 pieces of training data and 3000 pieces of test data, drawn from 900 training samples and 100 test samples respectively. The four groups of training sets are used with the corresponding word vector models to train four classification models, that is, classifiers, one for each of the four subgraphs. In other examples, the number of times a sample repeats the random walk is not limited to 30 and may be any other set number.
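The data set sizes quoted above follow directly from the sample count, the split ratio and the number of walks per sample. A short check, with variable names chosen here for illustration:

```python
n_samples = 1000           # samples selected in the example
test_part, train_part = 1, 9
walks_per_sample = 30      # repeated random walks per sample

n_test = n_samples * test_part // (test_part + train_part)
n_train = n_samples - n_test
train_rows = n_train * walks_per_sample
test_rows = n_test * walks_per_sample

print(n_train, n_test)                               # 900 100
print(train_rows, test_rows, train_rows + test_rows)  # 27000 3000 30000
```

These figures apply to each of the four subgraphs separately, since each subgraph samples its own group of training and test data.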

Classifier design

The word vector models obtained above are used to extract the features of the test set, and classification is computed with the classification models obtained on the four different subgraphs, giving the classification result and classification probability of each sample on each subgraph. The weight of each subgraph is obtained through principal angle analysis and is used to combine the classification probabilities on the subgraphs into a final classification result.

In this example there are four word vector models, and each test sample has 30 groups of word vector features under each of them. A voting method is used to determine which family the sample belongs to. The voting results are expressed as percentages, so the four classifiers produce four different sets of discrimination percentages. Finally, principal angle analysis of the angular relationship between the subgraphs in the vector space determines the weight of the classification result generated by each subgraph.
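The voting step for one sample under one classifier can be sketched as follows. The 30 per-walk predictions are hypothetical values invented for the illustration, and the function name `vote_percentages` is not from the source.

```python
from collections import Counter

FAMILIES = ["Allapel", "Virut", "Agent"]  # the three labels used in the example

def vote_percentages(predictions, labels=FAMILIES):
    """Turn per-walk class predictions for one sample into a share of votes per label."""
    counts = Counter(predictions)
    total = len(predictions)
    return [counts[label] / total for label in labels]

# Hypothetical output of one classifier on the 30 walk features of one test sample.
preds = ["Virut"] * 18 + ["Agent"] * 9 + ["Allapel"] * 3
shares = vote_percentages(preds)
print(shares)  # [0.1, 0.6, 0.3]
```

Repeating this for each of the four classifiers gives the four percentage vectors that are later combined by the principal angle weights.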

From the subgraphs specified by the m paradigms, m node labelling schemes can be obtained, where m is the number of different heterogeneous network paradigms. In this embodiment, m is 4. The finally obtained classification result Y may be represented as $Y = \sum_{i=1}^{m} \alpha_i \times Y_i$, where $\alpha_i$ ($i = 1, \ldots, m$) is the weight of $Y_i$, and $Y_i$ represents the classification result output by the word vector model of the i-th paradigm. The classification result is expressed in the form of a probability distribution. In one implementation, $\alpha_i$ can be calculated from a geometric distance. More precisely, the distance here uses the principal angle between two vectors. In the scheme of the embodiment of the invention, all output vectors are processed by the word vector models, so their dimensions are the same and the principal angle between them can be calculated conveniently. The principal angle θ of two vectors is defined by:

where $y \in Y_i$ and $y' \in Y_j$.
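For reference, the standard definition of the smallest principal angle between two subspaces, which is consistent with the constraint on y and y' stated above, can be written as follows. That the formula intended at this point has exactly this form is an assumption.

```latex
\cos\theta \;=\; \max_{y \in Y_i,\; y' \in Y_j} \frac{y^{\top} y'}{\lVert y \rVert \,\lVert y' \rVert}
```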

θ = 0 if and only if $Y_i \cap Y_j \neq 0$. Let $\theta_1, \theta_2, \ldots, \theta_4$ be the principal angles of $Y_i$ and $Y_j$; the geometric distance between the vector spaces corresponding to the different word vector models $Y_i$ and $Y_j$ can then be expressed as

Thus, we can compute the weight α of $Y_i$ in each vector space:

The physical meaning of α (Alpha) is as follows. The denominator of α is the sum of all d, the numerator is the sum of the d values between the given subspace and each of the other subspaces, and d measures the angle between subspaces. That is, if one subspace deviates significantly from the other subspaces, its α is large. With the principal angle weighting method of the embodiment of the present invention, a subgraph whose feature space is relatively independent therefore receives a greater weight, which makes it more sensitive to samples lying in its feature space and so helps capture them. A subspace with a small weight does not fail to capture samples either: its weight is small because it is close to other subspaces, so the same sample is likely to be captured by several subspaces, and the target sample is not missed.
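The verbal description of α can be turned into a small calculation. The sketch below assumes that the denominator sums d over all ordered pairs of distinct subspaces, so that the weights add up to 1; the pairwise distances in `D` are hypothetical values, and `paradigm_weights` is an illustrative name.

```python
def paradigm_weights(dist):
    """alpha_i = (sum of distances from subspace i to the others) / (sum over all subspaces)."""
    m = len(dist)
    row_sums = [sum(dist[i][j] for j in range(m) if j != i) for i in range(m)]
    total = sum(row_sums)
    return [s / total for s in row_sums]

# Hypothetical symmetric distances d(Y_i, Y_j) between the four paradigm subspaces.
D = [
    [0.0, 0.9, 0.7, 0.5],
    [0.9, 0.0, 1.1, 0.8],
    [0.7, 1.1, 0.0, 0.6],
    [0.5, 0.8, 0.6, 0.0],
]
alphas = paradigm_weights(D)
print([round(a, 3) for a in alphas])
```

With these values the second subspace lies farthest from the others and receives the largest weight, while the fourth lies closest and receives the smallest, which matches the behaviour described in the text.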

Fig. 13 is a schematic flow chart illustrating the process of obtaining the final classification result for malicious code by using 4 word vector models in the malicious code detection method according to an embodiment of the present invention. As shown in Fig. 13, 1000 samples are used for training and testing, of which 900 are in the training set and 100 in the test set. Training and test data, that is, random walk vector data, are generated through the word vector model corresponding to each of the four paradigm subgraphs, and each classification model outputs a classification result and the corresponding distribution probability, both in vector form. In this example, for the malicious code file currently input for detection, the weights α (Alpha) corresponding to the paradigms are obtained through principal angle analysis and are 0.23, 0.29, 0.27 and 0.21 respectively. The classification results of the models corresponding to the paradigms are weighted by these weights, and the final classification result is [0.31, 0.35, 0.33]. That is, for the three labels Allapel, Virut and Agent used in the test, the probability of the second class is the highest, so a sample that would otherwise have been wrongly classified into the third class is correctly classified into the second class. Malicious code classification with higher accuracy can thereby be obtained.
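The weighted combination $Y = \sum_i \alpha_i \times Y_i$ can be illustrated with the four weights reported in the example. The per-paradigm probability vectors in `Y` below are hypothetical, since the source does not list them; they were chosen so that two of the four classifiers individually favour the third class.

```python
LABELS = ["Allapel", "Virut", "Agent"]
alphas = [0.23, 0.29, 0.27, 0.21]   # weights reported for the four paradigms

# Hypothetical per-paradigm class probabilities (not taken from the source).
Y = [
    [0.30, 0.30, 0.40],
    [0.30, 0.45, 0.25],
    [0.35, 0.35, 0.30],
    [0.30, 0.30, 0.40],
]

# Weighted sum over paradigms, computed separately for each class.
combined = [sum(a * y[k] for a, y in zip(alphas, Y)) for k in range(len(LABELS))]
best = LABELS[combined.index(max(combined))]
print([round(p, 4) for p in combined], best)
```

In this toy case the first and fourth classifiers would each pick the third class on their own, but the weighted result selects the second class, which is the kind of correction the paragraph above describes.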

Example three:

The present invention also provides a malicious code detection apparatus. As shown in Fig. 14, the apparatus includes a processor 1401, a memory 1402, a bus 1403, and a computer program stored in the memory 1402 and capable of running on the processor 1401. The processor 1401 includes one or more processing cores, and the memory 1402 is connected to the processor 1401 through the bus 1403 and is used for storing program instructions. The processor executes the computer program to implement the steps of the method of the first embodiment of the present invention.

Further, as an executable solution, the malicious code detection device may be a computer unit, and the computer unit may be a desktop computer, a notebook, a palm computer, a cloud server, or another computing device. The computer unit may include, but is not limited to, a processor and a memory. It will be appreciated by those skilled in the art that the above-described constituent structure is merely an example of the computer unit and does not constitute a limitation of it; the computer unit may include more or fewer components than those described above, combine some components, or use different components. For example, the computer unit may further include an input/output device, a network access device, a bus, and the like, which is not limited in this embodiment of the present invention.

Further, as an executable solution, the processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The general purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer unit and connects the various parts of the overall computer unit using various interfaces and lines.

The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the computer unit by running or executing the computer programs and/or modules stored in the memory and by invoking data stored in the memory. The memory may mainly comprise a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of the computer unit, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.

Example four:

The present invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the above-described method of an embodiment of the present invention.

The modules/units integrated in the computer unit, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the method according to the embodiments of the present invention may also be implemented by a computer program, which can be stored in a computer-readable storage medium and executed by a processor to implement the steps of the method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by the legislation and patent practice in the jurisdiction.

While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for detecting malicious code, which is used for detecting the category of binary malicious code, and is characterized by comprising the following steps:

acquiring running information of a malicious code file to be detected, wherein the running information comprises API calling information or API calling information and DLL calling information of the malicious code file to be detected;

2. The method according to claim 1, wherein the malicious code file sample is an executable file, and in step S2, the running information is obtained by sandboxing and parsing the executable file.

3. The method of claim 2,

the running information of the malicious code file sample further comprises one or more of the following information: co-occurrence probability of files and network calling information;

the heterogeneous network also indicates one or more of the following association information between the malicious code file samples: co-occurrence associated information and network associated information;

the nodes of the heterogeneous network further comprise one or more of: folders in which the multiple malicious code file samples appear, compressed packages in which they appear, websites visited, and network requests generated.

4. The method according to claim 3, characterized in that the different heterogeneous network paradigms comprise one or more of the following paradigms MID1 to MID4:

as shown in fig. 5

5. The method according to claim 4, wherein the heterogeneous network paradigm MID3 corresponds to a relational adjacency matrix of order i × j, where i represents the total number of malicious code file samples, and j represents the total number of all APIs; the value of each element in the i × j matrix indicates the number of times that the API corresponding to the column appears in the malicious code file sample corresponding to the row;

wherein obtaining random walk information of a malicious code file sample for the MID3 comprises:

6. The method according to claim 1, wherein the step S4 includes repeating the random walk for each file sample until a predetermined number of walks is reached.

7. The method of claim 1, wherein the Word vector model is a Word2vec Word vector model, and wherein the Word2vec Word vector model reduces the dimensionality of the input random walk information to a predetermined dimensionality.

8. The method according to claim 1, wherein the step S6 includes:

9. A malicious code detection apparatus, comprising a memory and a processor, wherein the memory stores at least one program, and the at least one program is executed by the processor to implement the malicious code detection method according to any one of claims 1 to 8.

10. A computer-readable storage medium, wherein at least one program is stored in the storage medium, and the at least one program is executed by the processor to implement the method for detecting malicious code according to any one of claims 1 to 8.