CN117150485A - System and method for detecting malware - Google Patents

System and method for detecting malware

Info

Publication number
CN117150485A
Authority
CN
China
Prior art keywords
file
information
processing
asm
malware
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210555448.4A
Other languages
Chinese (zh)
Inventor
冯楠坪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Enterprise Network Communication Technology Co ltd
Original Assignee
China Enterprise Network Communication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Enterprise Network Communication Technology Co ltd filed Critical China Enterprise Network Communication Technology Co ltd
Priority to CN202210555448.4A
Publication of CN117150485A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566URL specific, e.g. using aliases, detecting broken or misspelled links
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

Systems and methods for detecting malware. The method is a computer-implemented method comprising: processing an assembly source code (ASM) file corresponding to a Portable Executable (PE) file to extract information from the ASM file; processing the extracted information of the ASM file using a machine learning based processing model; and determining whether the ASM file and/or the PE file is malicious based on processing the extracted information of the ASM file.

Description

System and method for detecting malware
Technical Field
The present invention relates to systems and methods for detecting malware.
Background
In recent years, various forms of malicious software, such as ransomware, Trojan horses, viruses, and malicious mining programs or files, have continued to emerge. Such malware creates security hazards for computer systems. Some malware authors introduce polymorphism into malware components in order to evade detection. In general, programs or files belonging to the same malware classification exhibit the same or similar forms of malicious behavior. However, because authors use various strategies to continually modify and/or obfuscate their code, programs or files that in fact belong to the same class may look like many different files, which poses significant challenges to the accuracy of malware detection and/or identification.
Disclosure of Invention
In a first aspect, there is provided a computer-implemented method for detecting malware, comprising: processing an assembly source code (ASM) file corresponding to a Portable Executable (PE) file to extract information from the ASM file; processing the extracted information of the ASM file using a machine learning based processing model; and determining whether the ASM file and/or the PE file is malicious based on the processing of the extracted information of the ASM file.
Optionally, the extracted information of the ASM file includes keyword information.
Optionally, the determining comprises determining whether the PE file and/or the ASM file belongs to one or more predetermined malware classifications.
Optionally, the determining further comprises classifying the PE file and/or the ASM file as belonging to which of the one or more predetermined malware classifications.
Each of the one or more predetermined malware classifications has a corresponding malicious behavior type and/or code similarity.
In one example, the one or more predetermined malware classifications include one, more, or all of the following: viruses, worms, spyware, adware, pornware, riskware, Trojan horses, logic bombs, ransomware, backdoor programs, rootkits, keyloggers, and the like.
Optionally, processing the ASM file to extract information includes: performing keyword extraction and processing operations on the ASM file to extract keyword information (optionally including statistics); wherein the keyword information is configured to be processed by the machine-learned processing model.
The keyword extraction and processing operations may be statistical feature-based keyword extraction and processing operations, word graph model-based keyword extraction and processing operations, and/or topic model-based keyword extraction and processing operations. Keyword extraction and processing operations based on statistical features are preferred.
The keyword extraction and processing operations may include building a statistical feature matrix based on the extracted keywords.
Optionally, performing the keyword extraction and processing operations includes: processing the ASM file using a self-encoder model to extract the keyword information. The keyword information may be the value in a key-value pair, the value corresponding to the keyword. In one example, the key-value pairs are in the form "key: value".
Optionally, performing the keyword extraction and processing operations includes: performing a word frequency-inverse document frequency (TF-IDF) operation on the ASM file to extract the keyword information. The keyword information may be the value in a key-value pair, the value corresponding to the keyword. In one example, the key-value pairs are in the form "key: value".
Optionally, the keyword information includes a word frequency and/or word order of the keyword.
Optionally, the computer-implemented method further comprises performing a text cleansing operation to facilitate the keyword extraction and processing operations. The text cleansing operation may include removing punctuation, keywords with low feature weights, and/or stop words.
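For illustration only (not part of the claimed invention), the following Python sketch shows one way a text cleansing pass and TF-IDF keyword statistics could be combined for ASM text; the cleansing rules, stop-word set, token pattern, and parameter values are assumptions made for the example.

```python
# Illustrative sketch only: extract TF-IDF keyword statistics from cleansed ASM text.
# The cleansing rules, stop-word set, and parameters are assumptions, not the patented method.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

STOP_WORDS = {"db", "dd", "align"}  # hypothetical low-weight/stop tokens

def clean_asm_text(text: str) -> str:
    text = re.sub(r"[^\w\s:.]", " ", text)   # remove punctuation except ':' and '.'
    tokens = re.split(r"[ \n\t]+", text)     # segment by space, '\n', '\t'
    return " ".join(t for t in tokens if t and t.lower() not in STOP_WORDS)

def keyword_tfidf(asm_documents):
    """Return a (documents x keywords) TF-IDF statistical feature matrix."""
    cleaned = [clean_asm_text(doc) for doc in asm_documents]
    vectorizer = TfidfVectorizer(max_features=2000)
    matrix = vectorizer.fit_transform(cleaned)
    return matrix, vectorizer.get_feature_names_out()
```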
Optionally, the machine learning based processing model includes a multi-classification model configured (trained) for malware determination and/or classification.
Optionally, the machine learning based processing model includes a decision tree model configured (trained) for malware determination and/or classification.
Optionally, the machine learning based processing model includes a LightGBM model configured (trained) for malware determination and/or classification.
Optionally, the machine learning based processing model includes a plurality of models configured (trained) for malware determination and/or classification.
Optionally, the computer-implemented method further comprises: processing the PE file to extract information from the PE file; and processing the extracted information of the PE file using the machine learning based processing model; and wherein the determining comprises: based on the processing of the extracted information of the ASM file and the extracted information of the PE file, it is determined whether the ASM file and/or the PE file is malicious.
Optionally, the extracted information of the PE file includes URL and/or path information.
Optionally, processing the PE file to extract information includes performing URL and/or path information extraction and processing operations including: processing the PE file with a self-encoder model to extract URL and/or path information; wherein the URL and/or path information is configured to be processed by the machine-learned processing model.
Optionally, processing the PE file to extract information includes performing URL and/or path information extraction and processing operations including: performing binary reading operation on the PE file to extract URL and/or path information; and performing a word frequency-inverse document frequency (TF-IDF) operation based on the extracted URL and/or path information to obtain word-document information (optionally including statistics); wherein the word-document information is configured to be processed by the machine-learned processing model.
Optionally, the word-document information is in the form of a word-document matrix.
Optionally, the computer-implemented method further comprises performing a text cleansing operation to facilitate extraction of the URL and/or path information.
Optionally, the PE file does not contain PE header and/or PE header information. Optionally, the PE file contains a PE header and/or PE header information.
Optionally, the PE header and/or PE header information in the PE file is not used to determine whether the PE file and/or the ASM file is malicious. Optionally, the PE header and/or PE header information in the PE file is used to determine whether the PE file and/or the ASM file is malicious.
Optionally, the ASM file is generated from or based on the PE file. In one example, the ASM file is generated from the PE file using IDA Pro.
Optionally, the computer-implemented method further comprises generating the ASM file from the PE file.
Optionally, the computer-implemented method further comprises displaying the determination result. The result may include: whether the ASM file and/or the PE file is malicious; whether the ASM file and/or the PE file belongs to one or more predetermined malware classifications; and/or to which of the one or more predetermined malware classifications the ASM file and/or the PE file belongs.
Optionally, the computer-implemented method further comprises: if the PE file and/or ASM file is determined to be malicious, a warning is provided.
In a second aspect, a malware detection system is provided that includes one or more processors configured to: processing an assembly source code (ASM) file corresponding to a Portable Executable (PE) file to extract information from the ASM file; processing the extracted information of the ASM file using a machine learning based processing model; and determining whether the ASM file and/or the PE file is malicious based on the processing of the extracted information of the ASM file. The one or more processors may be on the same device or may be distributed across multiple devices.
Optionally, the extracted information of the ASM file includes keyword information.
Optionally, the one or more processors are further configured to: it is determined whether the PE file and/or the ASM file belongs to one or more predetermined malware classifications.
Optionally, the one or more processors are further configured to: the PE file and/or the ASM file is classified as belonging to which of the one or more predetermined malware classifications.
Each of the one or more predetermined malware classifications has a corresponding malicious behavior type and/or code similarity.
In one example, the one or more predetermined malware classifications include one, more, or all of the following: viruses, worms, spyware, adware, pornware, riskware, Trojan horses, logic bombs, ransomware, backdoor programs, rootkits, keyloggers, and the like.
Optionally, the one or more processors are configured to process the ASM file to extract information by at least: performing keyword extraction and processing operations on the ASM file to extract keyword information (optionally including statistics); wherein the keyword information is configured to be processed by the machine-learned processing model.
The keyword extraction and processing operations may be statistical feature-based keyword extraction and processing operations, word graph model-based keyword extraction and processing operations, and/or topic model-based keyword extraction and processing operations. Keyword extraction and processing operations based on statistical features are preferred.
The keyword extraction and processing operations may include building a statistical feature matrix based on the extracted keywords.
Optionally, the one or more processors are configured to perform the keyword extraction and processing operations by at least: processing the ASM file using a self-encoder model to extract the keyword information. The keyword information may be the value in a key-value pair, the value corresponding to the keyword. In one example, the key-value pairs are in the form "key: value".
Optionally, the keyword information includes a word frequency and/or word order of the keyword.
Optionally, the one or more processors are further configured to perform text cleansing operations to facilitate the keyword extraction and processing operations. The text cleansing operation may include removing punctuation, keywords with low feature weights, and/or stop words.
Optionally, the machine learning based processing model includes a multi-classification model configured (trained) for malware determination and/or classification.
Optionally, the machine learning based processing model includes a decision tree model configured (trained) for malware determination and/or classification.
Optionally, the machine learning based processing model includes a LightGBM model configured (trained) for malware determination and/or classification.
Optionally, the machine learning based processing model includes a plurality of models configured (trained) for malware determination and/or classification.
Optionally, the one or more processors are further configured to: processing the PE file to extract information from the PE file; and processing the extracted information of the PE file using the machine learning based processing model; and wherein the one or more processors are configured to: based on the processing of the extracted information of the ASM file and the extracted information of the PE file, it is determined whether the ASM file and/or the PE file is malicious.
Optionally, the extracted information of the PE file includes URL and/or path information.
Optionally, processing the PE file to extract information includes performing URL and/or path information extraction and processing operations including: processing the PE file with a self-encoder model to extract URL and/or path information; wherein the URL and/or path information is configured to be processed by the machine-learned processing model.
Optionally, the one or more processors are further configured to process the PE file to extract information at least by performing URL and/or path information extraction and processing operations comprising: performing binary reading operation on the PE file to extract URL and/or path information; and performing a word frequency-inverse document frequency (TF-IDF) operation based on the extracted URL and/or path information to obtain word-document information (optionally including statistics); wherein the word-document information is configured to be processed by the machine-learned processing model.
Optionally, the word-document information is in the form of a word-document matrix.
Optionally, the one or more processors are further configured to perform a text cleansing operation to facilitate extraction of the URL and/or path information.
Optionally, the PE file does not contain PE header and/or PE header information. Optionally, the PE file contains a PE header and/or PE header information.
Optionally, the PE header and/or PE header information in the PE file is not used to determine whether the PE file and/or the ASM file is malicious. Optionally, the PE header and/or PE header information in the PE file is used to determine whether the PE file and/or the ASM file is malicious.
Optionally, the ASM file is generated from or based on the PE file. In one example, the ASM file is generated from the PE file using IDA Pro.
Optionally, the one or more processors are further configured to: the ASM file is generated from or in accordance with the PE file.
Optionally, the malware detection system further comprises: a display operatively connected to the one or more processors for displaying the determination. The results may include: whether the ASM file and/or the PE file is malicious; whether the ASM file and/or the PE file belong to one or more predetermined malware classifications; and/or to which of the one or more predetermined malware classifications the ASM file and/or the PE file belongs.
Optionally, the malware detection system further comprises: a warning device operatively connected to the one or more processors to provide a warning when the PE file and/or ASM file is determined to be malicious.
Optionally, the malware detection system is implemented on a dedicated computing system.
In a third aspect, there is provided a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to perform the method for detecting malware of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, storing computer instructions that, when executed by a computer, cause the computer to perform the method for detecting malware of the first aspect.
As used herein, "malware" may include malicious electronic files or programs that are adapted to disrupt, damage, or gain unauthorized access to a computer system. An "assembly source code (ASM) file" may include a .asm file. A "Portable Executable (PE) file" may include a .pe, .acm, .ax, .cpl, .dll, .drv, .efi, .exe, .mui, .ocx, .scr, .sys, and/or .tsp file.
Before any embodiments are explained in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of components set forth in the following description or illustrated in the following drawings. The invention is capable of being practiced or of being carried out in various ways. Also, it is to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting.
Other features and aspects of the invention will become apparent by consideration of the detailed description and accompanying drawings. Any feature described herein with respect to one aspect or embodiment may be combined with any other feature described herein with respect to any other aspect or embodiment, where appropriate and applicable.
Drawings
Embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
fig. 1 is a diagram of ASM file data according to one embodiment of the present invention.
FIG. 2 is a diagram of PE file data in accordance with one embodiment of the invention.
FIG. 3 is a flowchart of a computer-implemented method for detecting malware according to one embodiment of the invention.
Fig. 4 is a flow chart of processing ASM files according to one embodiment of the present invention.
Fig. 5 is a diagram illustrating a position ratio of occurrence of a keyword according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating the relationship of keywords to classification categories according to one embodiment of the invention.
FIG. 7 is a flowchart of processing PE files according to one embodiment of the invention.
FIG. 8 is a flowchart of a computer-implemented method for detecting malware according to one embodiment of the invention.
FIG. 9 is a schematic diagram of a computer-implemented method for detecting malware, including a flow of training and use (prediction) of an algorithm model, according to one embodiment of the invention.
Fig. 10 is a flow chart of performing data processing (including dimension reduction) on ASM file and PE file data according to one embodiment of the present invention.
Fig. 11 is a diagram of performing data mapping on ASM file and/or PE file data according to one embodiment of the present invention.
Fig. 12 is a schematic diagram of a self-encoder (autoencoder) model according to one embodiment of the present invention.
FIG. 13 is a functional block diagram of an information handling system configured to implement at least a portion of a computer implemented method for detecting malware or to act as at least a portion of a malware detection system in accordance with an embodiment of the present invention.
FIG. 14 is a functional block diagram of a controller configured to implement at least a portion of a computer-implemented method for detecting malware or to be used as at least a portion of a malware detection system in accordance with an embodiment of the present invention.
Detailed Description
Fig. 1 shows an example of ASM file data. Through research, experimentation, and trials, the inventors of the present application conducted an exploratory analysis of ASM data and found several characteristics: (1) the volume of ASM data is large; (2) ASM data is highly similar overall and has obvious local structure; (3) words and sentences in ASM data are highly repetitive and closely linked in context; and (4) in some places the ASM data contains distinct "key: value" key-value pairs 102, where the key and value may be separated by a colon.
Fig. 2 shows an example of PE file data. Through research, experimentation, and trials, the inventors of the present application conducted an exploratory analysis of PE data and found several characteristics: (1) the volume of PE data is large, the data exists in binary form, and it is generally difficult to read; and (2) after a binary read, some PE files are still found to contain URL-like data (some of which can be accessed directly in a web browser), file paths, commands (such as creating a file), and the like.
From the above analysis, the inventors of the present application concluded that, because of the large amount of data, it is necessary to reduce the dimensionality of the ASM data of the ASM file and the PE data of the PE file in order to determine whether the ASM file and/or the PE file is malicious.
FIG. 3 illustrates a computer-implemented method 300 for detecting malware according to one embodiment of the invention. In general, the method 300 includes processing corresponding assembly source code (ASM) files and Portable Executable (PE) files to detect whether the ASM files and/or PE files belong to malware. The ASM file may be generated from or based on a PE file. In one example, the ASM file is generated from the PE file using IDA Pro.
The method 300 includes step 302A, where ASM files are processed to extract information. In one example, the information extracted from the ASM file includes keyword information. The method 300 further includes step 302B, wherein the PE file is processed to extract information. In one example, the information extracted from the PE file includes URL and/or file path information. The step of extracting information at steps 302A, 302B preferably involves a data dimension reduction operation, as described further below. It should be appreciated that steps 302A, 302B may be performed simultaneously or in any order.
After steps 302A, 302B are completed, the method 300 proceeds to step 304, where the extracted information is processed using a machine learning based processing model. As described above, the extracted information includes keyword information (ASM file) and URL and/or file path information (PE file), as examples. Step 304 may process the extracted information using one or more machine learning based processing models. In one example, the same machine learning based processing model is used to process the extracted keyword information (ASM file) and URL and/or file path information (PE file). In another example, the extracted keyword information (ASM file) and URL and/or file path information (PE file) are processed using different machine learning based processing models. In general, using more processing models (or multi-model fusion strategies) may improve processing accuracy. In one example, the processing model includes a multi-classification model configured for malware determination and/or classification. In one example, the processing model includes a decision tree model configured for malware determination and/or classification, such as a LightGBM model. The processing model is trained (e.g., with keyword information and URL and/or file path information) for malware determination and/or classification.
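For illustration only (not part of the claimed invention), the Python sketch below shows how a LightGBM multi-classification model of the kind mentioned above might be trained and applied to feature vectors extracted from the files; the feature layout, label encoding, and hyperparameters are assumptions made for the example.

```python
# Illustrative sketch: multi-class LightGBM model over extracted file features.
# Feature construction, labels, and parameters are assumptions for illustration.
import lightgbm as lgb
import numpy as np

def train_multiclass_model(features: np.ndarray, labels: np.ndarray, num_classes: int):
    train_set = lgb.Dataset(features, label=labels)
    params = {
        "objective": "multiclass",
        "num_class": num_classes,
        "metric": "multi_logloss",
        "learning_rate": 0.05,
    }
    return lgb.train(params, train_set, num_boost_round=200)

def predict_classification(model, features: np.ndarray) -> np.ndarray:
    # Returns the index of the most probable malware classification per file.
    probabilities = model.predict(features)
    return np.argmax(probabilities, axis=1)
```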
Based on the processing of step 304, the method 300 determines whether the ASM file and/or the PE file is malicious (i.e., may disrupt, damage, or gain unauthorized access to the computer system) at step 306. In one example, step 306 only gives a yes or no answer. In one example, step 306 further includes determining whether the PE file and/or ASM file belongs to one or more predetermined malware classifications.
Optionally, the method 300 further comprises a step 308 of classifying the PE files and/or ASM files as belonging to one or more predetermined malware classifications. In general, different malware classifications have different malicious behavior types and/or code similarities.
In one example, the predetermined malware classifications in method 300 include high-level classifications, such as one, more, or all of the following: viruses, worms, spyware, adware, pornware, riskware, Trojan horses, logic bombs, ransomware, backdoor programs, rootkits, keyloggers, and the like. In another example, the predetermined malware classifications in method 300 include low-level classifications, such as one, more, or all of the following: Short, trickbot, qakbot, empire, beacon, hupigon, delf, onLineGames, banker, zlob, banload, small, swizzor, viru, etc. In other embodiments, other manners of classification are also possible.
Although not shown, the method 300 may also include displaying the results determined at steps 306 and/or 308. The results may include: whether the ASM file and/or the PE file are malicious; whether the ASM file and/or the PE file belong to one or more predetermined malware classifications; and/or to which of the one or more predetermined malware classifications the ASM file and/or PE file belongs. Optionally, the method 300 further comprises: a warning is provided when the PE file and/or ASM file is determined to be malicious. The alert may be provided at a computer system that processes or downloads or transmits the PE file and/or ASM file. The alert may be a visual alert, an audible alert, a tactile alert, or the like.
In one example, the PE header and/or PE header information in the PE file is used to determine whether the PE file and/or the ASM file is malicious.
Fig. 4 illustrates a method 400 of processing ASM files according to one embodiment of the application. Method 400 may (but need not) correspond to step 302A in method 300 of fig. 3.
In this embodiment, the method 400 includes step 402, where keyword extraction and processing operations are performed on the ASM file. The keyword extraction and processing operations are used to extract keyword information (including keywords) from the ASM file. Through research, experimentation, and trials, the inventors of the present application conducted an exploratory analysis of ASM data and found some characteristics of the keywords in ASM data: (1) the keywords tend to appear relatively late in a sentence (see fig. 5, where % represents the position as a percentage of the sentence length, and the scores 0-100 are the relative likelihoods of a keyword appearing at that position, with a higher score indicating a greater likelihood); and (2) the repetition rate of the keywords is high. The inventors of the present application also found that, in some embodiments, the keywords are mainly the values in "key: value" key-value pairs. In the present embodiment, the keyword extraction and processing operation is a keyword extraction and processing operation based on statistical features. In other embodiments, the keyword extraction and processing operations may be word graph model-based keyword extraction and processing operations and/or topic model-based keyword extraction and processing operations. In one example, the keyword extraction and processing operations include word segmentation by space, '\n', and '\t' characters. In one example, the keyword is the value in a key-value pair (of the form "key: value"). In one example, the keyword extraction and processing operations may include: processing the ASM file using a self-encoder model to extract keyword information (e.g., values in key-value pairs). In another example, the keyword extraction and processing operations may include: performing a word frequency-inverse document frequency (TF-IDF) operation on the ASM file to extract keyword information (e.g., values in key-value pairs). The keyword information may include the keywords, the word frequencies of the keywords, the word order, and/or other statistical data related to the keywords. In one example, the keyword extraction and processing operations include keyword analysis based on statistical word frequency, position characteristics, and associated information characteristics of the keywords. In one example, the keyword extraction and processing operations include building a statistical feature matrix based on the extracted keywords. The keyword information obtained at step 402 is configured to be processed by a machine learning based processing model (e.g., at step 304 of fig. 3).
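As a minimal sketch of the statistical idea described above (illustrative only), the fragment below collects the values of "key: value" pairs from ASM lines together with simple frequency and position statistics; the regular expression and the particular statistics kept are assumptions for the example.

```python
# Illustrative sketch: collect "key: value" keywords and simple position/frequency
# statistics from ASM text. The regex and the statistics kept are assumptions.
import re
from collections import Counter

KEY_VALUE = re.compile(r"(\w+)\s*:\s*(\S+)")

def extract_keyword_statistics(asm_lines):
    frequencies = Counter()
    positions = []                      # relative position of each keyword in its line
    for line in asm_lines:
        for match in KEY_VALUE.finditer(line):
            value = match.group(2)      # the keyword is the value of the key-value pair
            frequencies[value] += 1
            positions.append(match.start(2) / max(len(line), 1))
    return frequencies, positions
```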
FIG. 6 is an example illustration of a relationship of keywords to a (malware) classification category. In fig. 6, the classification category is represented by dots and the keyword or keyword feature is represented by lines connected to the dots.
Although not shown, the method 400 may also include performing a text cleansing (data cleansing) operation on the ASM file to facilitate extraction of the keywords. Text cleansing operations may include removing punctuation, keywords with low feature weights, and/or stop words.
FIG. 7 illustrates a method 700 of processing PE files in accordance with one embodiment of the invention. Method 700 may (but need not) correspond to step 302B in method 300 of fig. 3.
In this embodiment, method 700 includes step 702, where URL and/or path information extraction and processing operations are performed on a PE file. In one example, the URL and/or path information extraction and processing operations include binary read operations that facilitate extracting URL and/or path information from a PE file. The URL and/or path information may include a URL and/or a file path. In one example, the URL and/or path information extraction and processing operations include performing a word frequency-inverse document frequency (TF-IDF) operation on the PE file using the extracted URL and/or path information after performing the binary read operation on the PE file. Word frequency-inverse document frequency (TF-IDF) operations facilitate obtaining word-document information from extracted URL and/or path information. In one example, a word frequency-inverse document frequency (TF-IDF) operation includes building a word-document matrix. The obtained word-document information is configured to be processed by a machine learning based processing model (e.g., step 304 of fig. 3). In another example, the URL and/or path information extraction and processing operations include processing the PE file with a self-encoder model to extract URL and/or path information. The extracted URL and/or path information is configured to be processed by a machine learning based processing model (e.g., step 304 of fig. 3).
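For illustration only (not the claimed implementation), the Python sketch below shows one way the binary read, URL/path string extraction, and TF-IDF word-document matrix described above might be realised; the regular expression and vectorizer settings are assumptions made for the example.

```python
# Illustrative sketch: binary read of a PE file, extraction of URL/path-like strings,
# and a TF-IDF word-document matrix. Regexes and parameters are assumptions.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

URL_OR_PATH = re.compile(
    rb"(https?://[\x21-\x7e]+|[A-Za-z]:\\[\x21-\x7e]+|/[\x21-\x7e/]+)"
)

def read_strings_from_pe(pe_path: str) -> str:
    with open(pe_path, "rb") as handle:      # binary reading operation
        data = handle.read()
    matches = URL_OR_PATH.findall(data)
    return " ".join(m.decode("ascii", errors="ignore") for m in matches)

def word_document_matrix(pe_paths):
    """Build a word-document matrix from URL/path strings of several PE files."""
    documents = [read_strings_from_pe(path) for path in pe_paths]
    vectorizer = TfidfVectorizer(token_pattern=r"\S+")
    return vectorizer.fit_transform(documents), vectorizer
```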
Although not shown, the method 700 may further include performing a text cleansing (data cleansing) operation on the PE file to facilitate extraction of the URL and/or path information. Text cleansing operations may include removing punctuation, words with low feature weights, and/or stop words.
FIG. 8 illustrates a computer-implemented method 800 for detecting malware according to one embodiment of the invention. The method is similar to method 300 of fig. 3, but does not use/process PE files corresponding to ASM files (i.e., only ASM files are used/processed).
Method 800 includes step 802, where ASM files are processed to extract information. In one example, the information extracted from the ASM file includes keyword information. As described above, in some examples, the ASM file may be processed using a self-encoder model or a word frequency-inverse document frequency (TF-IDF) operation may be performed on the ASM file to extract information.
After completing step 802, the method 800 proceeds to step 804, where the extracted information is processed using a machine learning based processing model. As described above, the extracted information includes keyword information (ASM file). Step 804 may process the extracted information using one or more machine learning based processing models. In one example, the processing model includes a multi-classification model configured for malware determination and/or classification. In one example, the processing model includes a decision tree model configured for malware determination and/or classification, such as a LightGBM model. The process model is trained (e.g., with keyword information) for malware determination and/or classification.
Based on the processing of step 804, method 800 determines whether the ASM file and/or PE file is malicious (i.e., may disrupt, damage, or gain unauthorized access to the computer system) at step 806. In one example, step 806 only gives a yes or no answer. In one example, step 806 further comprises determining whether the PE file and/or the ASM file belongs to one or more predetermined malware classifications.
Optionally, the method 800 further comprises step 808 of classifying the PE files and/or ASM files as belonging to one or more predetermined malware classifications. In general, different malware classifications have different malicious behavior types and/or code similarities.
FIG. 9 illustrates a computer-implemented method 900 for detecting malware, including the training and use (prediction) of algorithmic models, according to one embodiment of the invention. Method 900 may be considered an embodiment of method 300 of fig. 3.
In the training process of method 900, corresponding assembly source code (ASM) files and Portable Executable (PE) files whose malware types are known are used. In one flow (A), a text cleansing (data cleansing) operation is performed on the ASM files and PE files. The operations include performing an ASM data mapping operation on an ASM file to obtain its tensor data, and performing a PE data mapping operation on a PE file to obtain its tensor data. Then, the tensor data obtained from the ASM file and the tensor data obtained from the PE file are integrated/combined, and the self-encoder model is used to perform data dimension reduction on the integrated/combined tensor data and to obtain keywords and/or keyword features. The output of the self-encoder model is then processed by an encoder. In one example, the encoder is configured to convert categorical variables in the output of the self-encoder model into numerical values for subsequent processing using an algorithm model. In another flow (B), keyword analysis is performed on the ASM files and PE files and text cleansing (data cleansing) operations are performed (e.g., using a cleansing model), followed by word frequency-inverse document frequency (TF-IDF) processing to obtain a word-document matrix or word frequency matrix. In one example, the output of the encoder and the word-document matrix or word frequency matrix are integrated/combined as input for training an LGB (LightGBM) multi-classification processing model. The LGB (LightGBM) multi-classification processing model predicts from the input whether the ASM file and/or PE file is malware and, if so, which type of malware it is. By comparing the prediction results of the LGB (LightGBM) multi-classification processing model with the known classes, the performance of the model and the accuracy of its predictions can be obtained. Based on this effectiveness and prediction accuracy, the parameters of the LGB (LightGBM) multi-classification processing model may be optimized, the PE and/or ASM data mapping operations in flow (A) may be optimized, the parameters of the self-encoder model in flow (A) may be optimized, the text cleansing (data cleansing) operations in flow (B) (e.g., their cleansing rules, cleansing models, etc.) may be optimized, and so on, in order to train or improve the effectiveness of the algorithm and the accuracy of its predictions. In one example, the training accuracy of the algorithm may be displayed (e.g., in real time). In one example, the trained LGB (LightGBM) multi-classification processing model, the self-encoder model, the word frequency matrix, etc., may be stored for use in prediction operations. In the above embodiment, both flow (A) and flow (B) are performed. However, in some other embodiments, only one of flows (A) and (B) may be performed and used to train the LGB (LightGBM) multi-classification processing model. In still other embodiments, only one of the ASM file and the PE file may be processed and used in flow (A) and flow (B) to train the LGB (LightGBM) multi-classification processing model.
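As an illustrative sketch of the training step described above (not part of the claimed invention), the fragment below fuses dimension-reduced features with the TF-IDF word-frequency features, trains the LGB model, and scores its prediction accuracy against the known classes; the shapes, split ratio, and parameters are assumptions, and both inputs are assumed to be dense numpy arrays (e.g., a TF-IDF matrix converted with .toarray()).

```python
# Illustrative sketch of the training flow: fuse dimension-reduced features with the
# word-frequency (TF-IDF) features, train the LGB model, and score its accuracy.
# Shapes, split ratio, and parameters are assumptions for illustration.
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def train_and_score(encoded_features, tfidf_features, labels, num_classes):
    # Both feature blocks are assumed to be dense numpy arrays of equal row count.
    fused = np.hstack([encoded_features, tfidf_features])   # integrate/combine features
    x_train, x_test, y_train, y_test = train_test_split(fused, labels, test_size=0.2)
    model = lgb.train(
        {"objective": "multiclass", "num_class": num_classes},
        lgb.Dataset(x_train, label=y_train),
        num_boost_round=200,
    )
    predictions = np.argmax(model.predict(x_test), axis=1)
    return model, accuracy_score(y_test, predictions)
```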
In the use (prediction) process of the method 900, a text cleansing (data cleansing) operation is first performed on the corresponding ASM file and/or PE file to be tested using the cleansing model. The cleansing model may be the cleansing model optimized in the training process above. After the text cleansing (data cleansing) operation, the data is subjected to feature engineering processing using a feature engineering model. The feature engineering model may be a feature engineering model optimized in the above training process. In one example, the feature engineering model is a self-encoder model. In another example, the feature engineering model is a word frequency-inverse document frequency (TF-IDF) operation model or a word frequency matrix or a related model. In yet another example, the feature engineering model is an embedding feature-based model. The self-encoder model may be used to process ASM files and/or PE files, where the ASM files and PE files may be processed by the same or different self-encoder models. A word frequency-inverse document frequency (TF-IDF) operation model or word frequency matrix or related model may be used to process ASM files and/or PE files, where the ASM files and PE files may be processed by the same or different models. After feature engineering, the data is processed by an algorithm model to predict whether the ASM file and/or PE file is malware and, if so, which type of malware it is. In this embodiment, the algorithm model is an LGB (LightGBM) multi-classification processing model, for example the LGB (LightGBM) multi-classification processing model optimized in the above training process. However, the algorithm model may alternatively be an LSTM model, an XGBoost model, or the like. In one embodiment, the classification results may be displayed (e.g., in real time).
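A brief illustrative sketch of the prediction step follows (not the claimed implementation); the stored artifact names, file formats, and the simplified feature pipeline are assumptions made for the example.

```python
# Illustrative prediction sketch: reload stored artifacts and classify a new sample.
# Artifact names and the feature pipeline are assumptions for illustration.
import pickle
import numpy as np
import lightgbm as lgb

def predict_sample(asm_text: str, pe_strings: str) -> int:
    with open("tfidf_vectorizer.pkl", "rb") as f:            # stored word frequency model
        vectorizer = pickle.load(f)
    booster = lgb.Booster(model_file="lgb_multiclass.txt")    # stored trained LGB model
    features = vectorizer.transform([asm_text + " " + pe_strings]).toarray()
    probabilities = booster.predict(features)
    return int(np.argmax(probabilities, axis=1)[0])           # predicted classification index
```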
Fig. 10 illustrates a flow of performing data processing (including dimension reduction) on ASM file and PE file data according to one embodiment of the invention. This flow is the same as flow (A) in fig. 9 and will not be described again here.
FIG. 11 illustrates data mapping performed on ASM file and/or PE file data according to one embodiment of the invention.
Fig. 12 is a self-encoder (autoencoder) model 1200 according to one embodiment of the present invention. The self-encoder model 1200 is a type of artificial neural network used in semi-supervised and unsupervised learning, and is composed of an encoding part and a decoding part. The main functions of the self-encoder model 1200 include an encoding operation on the original tensor data and a decoding operation that restores the encoded data to the original tensor data. In one embodiment, the ASM data is processed using a self-encoder (in particular, its encoding operation) to reduce the dimensionality of the ASM data and extract keywords from it. Optionally, the dimensionality of the PE data is also reduced. The self-encoder model may need to be trained and modified (e.g., by adding residual connections) to improve its accuracy. The data processed by the self-encoder may then be further processed using the LGB (LightGBM) multi-classification model, as described above.
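For illustration only, a minimal PyTorch sketch of a self-encoder (autoencoder) of this kind is shown below; the layer widths, bottleneck size, and training procedure are assumptions for the example and do not reflect the actual model 1200.

```python
# Illustrative PyTorch sketch of a self-encoder (autoencoder) used for dimension
# reduction. Layer widths and the bottleneck size are assumptions for illustration.
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(            # encoding part: compress the tensor data
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck_dim),
        )
        self.decoder = nn.Sequential(            # decoding part: restore the original data
            nn.Linear(bottleneck_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

# Training minimises reconstruction error; the encoder output is the reduced feature.
def encode(model: AutoEncoder, x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        return model.encoder(x)
```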
Fig. 13 illustrates an information handling system 1300 according to one embodiment of the present invention. The information handling system 1300 is configured to implement a portion or all of a computer-implemented method (e.g., methods 300, 400, 700, 800, 900, 1200) for detecting malware, or to be used as at least a portion of a malware detection system. The information handling system 1300 may be a dedicated information handling system or a general-purpose information handling system configured (programmed) to implement a portion or all of a computer-implemented method (e.g., methods 300, 400, 700, 800, 900, 1200) for detecting malware.
The information handling system 1300 may have different configurations, forms, shapes, sizes, etc., but generally includes the components necessary for receiving, storing, and executing appropriate computer instructions, commands, or code. The major components of the information handling system 1300 include a processor 1302 and a memory 1304. The processor 1302 may be comprised of one or more of the following: a CPU, an MCU, a controller, a logic circuit, a Raspberry Pi chip, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or any other digital or analog circuit configured to interpret and/or execute program instructions and/or process signals and/or information and/or data. The memory 1304 may be comprised of one or more of the following: one or more volatile memory units (e.g., RAM, DRAM, SRAM), one or more non-volatile memory units (e.g., ROM, PROM, EPROM, EEPROM, FRAM, MRAM, FLASH, SSD, NAND, and NVDIMM), or any combination thereof. Suitable computer instructions, commands, code, information, and/or data (e.g., those that perform a portion or all of methods 300, 400, 700, 800, 900, 1200) may be stored in the memory 1304. Optionally, the information handling system 1300 also includes one or more input devices 1306. Examples of input devices 1306 include one or more of the following: a keyboard, a mouse, a stylus, an image scanner, a microphone, a haptic/touch input device (e.g., a touch-sensitive screen), an image/video input device (e.g., a camera), etc. Optionally, the information handling system 1300 also includes one or more output devices 1308. Examples of output devices 1308 include one or more of the following: displays (e.g., monitors, screens, projectors, etc.), speakers, disk drives, headphones, headsets, printers, etc. The display may include an LCD display, an LED/OLED display, or any other suitable display that may or may not be touch sensitive. A display may be used to display the results of a portion or all of the processes of the methods 300, 400, 700, 800, 900, 1200. The information handling system 1300 may also include one or more disk drives 1312, which may include one or more of the following: solid state drives, hard drives, optical drives, flash drives, tape drives, and the like. A suitable operating system may be installed on the information handling system 1300, for example, in the disk drive 1312 or the memory 1304. The memory 1304 and the disk drive 1312 may be operated by the processor 1302. Optionally, the information handling system 1300 also includes a communication apparatus 1310 for establishing one or more communication links (not shown) with one or more other computing devices, such as a server, a personal computer, a terminal, a tablet, a phone, a watch, an IoT device, or another wireless or handheld computing device. The communication apparatus 1310 may include one or more of the following: a modem, a network interface card (NIC), an integrated network interface, an NFC transceiver, a ZigBee transceiver, a Wi-Fi transceiver, a Bluetooth transceiver, a radio frequency transceiver, an optical port, an infrared port, a USB connection, or another wired or wireless communication interface. The transceiver may be implemented by one or more devices (an integrated transmitter and receiver, a separate transmitter and receiver, etc.). The communication links may be wired or wireless communication links for conveying commands, instructions, information, and/or data (e.g., those in methods 300, 400, 700, 800, 900, 1200).
In one example, the processor 1302, the memory 1304, and optionally the input device 1306, the output device 1308, the communication apparatus 1310, and the disk drive 1312 may be connected to each other by a bus, a Peripheral Component Interconnect (PCI) bus such as PCI Express, a Universal Serial Bus (USB), an optical bus, or another similar bus structure. In one embodiment, some of these components may be connected through a network, such as the internet or a cloud computing network. Those skilled in the art will appreciate that the information handling system 1300 shown in FIG. 13 is merely exemplary and that the information handling system 1300 may have different configurations (e.g., more components, fewer components, etc.) in other embodiments.
FIG. 14 illustrates a functional block diagram of a controller 1400, the controller 1400 configured to implement a portion or all of a computer-implemented method (e.g., methods 300, 400, 700, 800, 900, 1200) for detecting malware, to be used as at least a portion of a malware detection system, according to one embodiment of the invention. The controller 1400 is configured (programmed) to implement a portion or all of a computer-implemented method (e.g., methods 300, 400, 700, 800, 900, 1200) for detecting malware.
The main components of the controller 1400 include a processor 1402 and a memory 1404. The processor 1402 may be comprised of one or more of the following: a CPU, an MCU, a controller, a logic circuit, a Raspberry Pi chip, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or any other digital or analog circuit configured to interpret and/or execute program instructions and/or process signals and/or information and/or data. The processor 1402 includes a machine learning processing module and a non-machine learning processing module. The machine learning processing module facilitates processing data using a machine learning method or machine learning model (e.g., stored in the memory 1404). The machine learning processing module may also be used to train a machine learning model. The machine learning model may include any of the models described above. The non-machine learning processing module facilitates processing data using non-machine learning methods or models (e.g., stored in the memory 1404). The memory 1404 may be comprised of one or more of the following: one or more volatile memory units (e.g., RAM, DRAM, SRAM), one or more non-volatile memory units (e.g., ROM, PROM, EPROM, EEPROM, FRAM, MRAM, FLASH, SSD, NAND, and NVDIMM), or any combination thereof. The memory 1404 stores suitable computer instructions, commands, code, machine learning processing models, information, and/or data (e.g., those relating to performing a portion or all of the methods 300, 400, 700, 800, 900, 1200). Those skilled in the art will appreciate that the controller 1400 shown in fig. 14 is merely exemplary, and that the controller 1400 may have different configurations (e.g., more components, fewer components, etc.) in other embodiments.
It should also be understood that while some of the figures show hardware and software located in a particular device, these figures are for illustrative purposes only. The functions described herein as being performed by one component may be performed by multiple components in a distributed fashion. Also, functions performed by multiple components may be combined and performed by a single component. In some implementations, the illustrated components may be combined or divided into separate software, firmware, and/or hardware. For example, the logic and processing may be distributed among multiple electronic processors rather than within and performed by a single electronic processor. Regardless of how they are combined or divided, the hardware and software components may reside on the same computing device or may be distributed among different computing devices connected by one or more networks or other suitable communication links. Similarly, components described as performing a particular function may also perform other functions not described herein. For example, a device or structure that is "configured" in some way is configured in at least that way, but may also be configured in ways that are not explicitly listed.
Although not required, the embodiments described with reference to the figures may be implemented as an Application Programming Interface (API) or a series of libraries for use by a developer, or may be included in another software application, such as a terminal or computer operating system or portable computing device operating system. Generally, program modules include routines, programs, objects, components, and data files that facilitate the execution of particular functions, so those skilled in the art will appreciate that the functions of software applications may be distributed across multiple routines, objects, and/or components to achieve the same functions as desired herein. It should also be appreciated that any suitable computing system architecture may be used where the methods and systems of the present invention are implemented, in whole or in part, by a computing system. This would include stand alone computers, network computers, dedicated or non-dedicated hardware devices. The terms "computing system" and "computing device" and the like are intended to comprise, without being limited to, any suitable arrangement of computer or information processing hardware capable of carrying out the described functions.
Those skilled in the art will appreciate that many variations and/or modifications of the invention as shown in the specific embodiments may be made to provide other embodiments of the invention. The described embodiments of the present invention are, therefore, to be considered in all respects as illustrative and not restrictive.
Some embodiments of the invention include one or more of the following features: modeling using data dimension reduction and a multi-classification model; understanding and extracting features of text data by means of effective visualization techniques; using keywords to perform data dimension reduction and then performing a secondary analysis of the result in order to find key information; and adjusting the model through offline scoring, which makes the model easier to optimize and more stable. Alternatively or additionally, the invention may comprise one or more further features.

Claims (30)

1. A computer-implemented method for detecting malware, comprising:
processing an assembly source code (ASM) file corresponding to a Portable Executable (PE) file to extract information from the ASM file;
processing the extracted information of the ASM file using a machine learning based processing model; and
based on the processing of the extracted information of the ASM file, it is determined whether the ASM file and/or the PE file is malicious.
2. The computer-implemented method for detecting malware according to claim 1, wherein the determining comprises determining whether the PE file and/or the ASM file belongs to one or more predetermined malware classifications.
3. The computer-implemented method for detecting malware according to claim 2, wherein the determining further comprises classifying the PE file and/or the ASM file as belonging to which of the one or more predetermined malware classifications.
4. The computer-implemented method for detecting malware according to claim 1, wherein processing the ASM file to extract information comprises:
performing keyword extraction and processing operations on the ASM file to extract keyword information;
wherein the keyword information is configured to be processed by the machine-learned processing model.
5. The computer-implemented method for detecting malware as in claim 4, wherein performing the keyword extraction and processing operations comprises: the ASM file is processed using a self-encoder model to extract the keyword information.
6. The computer-implemented method for detecting malware as in claim 4, wherein performing the keyword extraction and processing operations comprises: performing a word frequency-inverse document frequency (TF-IDF) operation on the ASM file to extract the keyword information.
7. The computer-implemented method for detecting malware according to any of claims 4 to 6, wherein the keyword information comprises word frequencies and/or word orders of the keywords.
8. The computer-implemented method for detecting malware according to any of claims 4-6, further comprising performing a text cleansing operation to facilitate the keyword extraction and processing operations.
9. The computer-implemented method for detecting malware according to any of claims 1-6, wherein the machine learning based processing model comprises a multi-classification model configured for malware determination and/or classification.
10. The computer-implemented method for detecting malware according to any of claims 1-6, wherein the machine learning based processing model comprises a decision tree model configured for malware determination and/or classification.
11. The computer-implemented method for detecting malware according to any of claims 1-6, wherein the machine learning based processing model comprises a LightGBM model configured for malware determination and/or classification.
12. The computer-implemented method for detecting malware according to any of claims 1-6, further comprising:
Processing the PE file to extract information from the PE file; and
processing the extracted information of the PE file using the machine learning based processing model; and
wherein the determining comprises:
based on the processing of the extracted information of the ASM file and the extracted information of the PE file, it is determined whether the ASM file and/or the PE file is malicious.
13. The computer-implemented method for detecting malware according to claim 12, wherein processing the PE file to extract information comprises performing URL and/or path information extraction and processing operations comprising:
processing the PE file with a self-encoder model to extract URL and/or path information;
wherein the URL and/or path information is configured to be processed by the machine-learned processing model.
14. The computer-implemented method for detecting malware according to claim 12, wherein processing the PE file to extract information comprises performing URL and/or path information extraction and processing operations comprising:
performing a binary reading operation on the PE file to extract URL and/or path information; and
performing a term frequency-inverse document frequency (TF-IDF) operation based on the extracted URL and/or path information to obtain word-document information,
wherein the word-document information is configured to be processed by the machine learning based processing model.
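The binary reading and TF-IDF steps of claim 14 could be approximated by scanning the PE bytes for printable strings, keeping URL- and path-like matches, and vectorizing them; the file names and regular expressions below are hypothetical.

```python
# Illustrative sketch: read a PE file as raw bytes, pull printable strings,
# keep URL / filesystem-path lookalikes, then vectorize them with TF-IDF.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_url_path_strings(pe_path: str, min_len: int = 6) -> list[str]:
    with open(pe_path, "rb") as f:
        raw = f.read()
    strings = re.findall(rb"[\x20-\x7e]{%d,}" % min_len, raw)  # printable runs
    decoded = [s.decode("ascii", errors="ignore") for s in strings]
    pattern = re.compile(r"(https?://\S+|[A-Za-z]:\\\S+|/[\w./-]{4,})")
    return [m.group(0) for s in decoded for m in [pattern.search(s)] if m]

# Hypothetical sample files; each file becomes one "document" of strings.
docs = [" ".join(extract_url_path_strings(p)) for p in ["sample1.exe", "sample2.exe"]]
tfidf = TfidfVectorizer(token_pattern=r"\S+")
word_document_matrix = tfidf.fit_transform(docs)   # word-document information
```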
15. The computer-implemented method for detecting malware according to claim 13, further comprising performing a text cleansing operation to facilitate extraction of the URL and/or path information.
16. A malware detection system, comprising:
one or more processors configured to:
process an assembly source code (ASM) file corresponding to a Portable Executable (PE) file to extract information from the ASM file;
process the extracted information of the ASM file using a machine learning based processing model; and
determine, based on the processing of the extracted information of the ASM file, whether the ASM file and/or the PE file is malicious.
17. The malware detection system of claim 16, wherein the one or more processors are further configured to: determine whether the PE file and/or the ASM file belongs to one or more predetermined malware classifications.
18. The malware detection system of claim 17, wherein the one or more processors are further configured to: classify the PE file and/or the ASM file as belonging to one of the one or more predetermined malware classifications.
19. The malware detection system of claim 16, wherein the one or more processors are configured to process the ASM file to extract information by at least:
performing keyword extraction and processing operations on the ASM file to extract keyword information;
wherein the keyword information is configured to be processed by the machine learning based processing model.
20. The malware detection system of claim 19, wherein the one or more processors are configured to perform the keyword extraction and processing operations by at least: processing the ASM file using an autoencoder model to extract the keyword information.
21. The malware detection system of claim 19, wherein the one or more processors are configured to perform the keyword extraction and processing operations by at least: performing a term frequency-inverse document frequency (TF-IDF) operation on the ASM file to extract the keyword information.
22. The malware detection system according to any of claims 19 to 21, wherein the keyword information comprises the word frequency and/or word order of the keywords.
23. The malware detection system of any of claims 19-21, wherein the one or more processors are further configured to perform a text cleansing operation to facilitate the keyword extraction and processing operations.
24. The malware detection system of any of claims 16-21, wherein the machine learning based processing model comprises a multi-classification model configured for malware determination and/or classification.
25. The malware detection system of any of claims 16-21, wherein the machine learning based processing model comprises a decision tree model configured for malware determination and/or classification.
26. The malware detection system of any of claims 16-21, wherein the machine learning based processing model comprises a LightGBM model configured for malware determination and/or classification.
27. The malware detection system of any of claims 16-21, wherein the one or more processors are further configured to:
process the PE file to extract information from the PE file; and
process the extracted information of the PE file using the machine learning based processing model;
wherein the one or more processors are configured to determine, based on the processing of the extracted information of the ASM file and the extracted information of the PE file, whether the ASM file and/or the PE file is malicious.
28. The malware detection system of claim 27, wherein the one or more processors are further configured to process the PE file to extract information at least by performing URL and/or path information extraction and processing operations comprising:
processing the PE file with an autoencoder model to extract URL and/or path information;
wherein the URL and/or path information is configured to be processed by the machine learning based processing model.
29. The malware detection system of claim 27, wherein the one or more processors are further configured to process the PE file to extract information at least by performing URL and/or path information extraction and processing operations comprising:
performing a binary reading operation on the PE file to extract URL and/or path information; and
performing a term frequency-inverse document frequency (TF-IDF) operation based on the extracted URL and/or path information to obtain word-document information;
wherein the word-document information is configured to be processed by the machine learning based processing model.
30. The malware detection system of claim 28, wherein the one or more processors are further configured to perform a text cleansing operation to facilitate extraction of the URL and/or path information.
CN202210555448.4A 2022-05-20 2022-05-20 System and method for detecting malware Pending CN117150485A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210555448.4A CN117150485A (en) 2022-05-20 2022-05-20 System and method for detecting malware

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210555448.4A CN117150485A (en) 2022-05-20 2022-05-20 System and method for detecting malware

Publications (1)

Publication Number Publication Date
CN117150485A true CN117150485A (en) 2023-12-01

Family

ID=88910586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210555448.4A Pending CN117150485A (en) 2022-05-20 2022-05-20 System and method for detecting malware

Country Status (1)

Country Link
CN (1) CN117150485A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40100572
Country of ref document: HK