CN109359439B

CN109359439B - Software detection method, device, equipment and storage medium

Info

Publication number: CN109359439B
Application number: CN201811257390.5A
Authority: CN
Inventors: 庞瑞; 张宏君
Original assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Current assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Priority date: 2018-10-26
Filing date: 2018-10-26
Publication date: 2019-12-13
Anticipated expiration: 2038-10-26
Also published as: CN109359439A

Abstract

The invention discloses a software detection method, device, equipment and storage medium. The method comprises the following steps: extracting the numerical features and non-numerical features contained in each sample in a software sample library; processing the non-numerical features with N selected non-cryptographic hash algorithms and converting the results into numerical features, where N is an integer greater than 1; constructing a feature matrix from the numerical features contained in each sample and the numerical features obtained by conversion; training a machine learning classifier with the feature matrix; and detecting target software with the machine learning classifier. The method converts the complex string features extracted from malware samples into hash features that a machine learning algorithm can process easily, which reduces the difficulty of model training, markedly increases training speed, reduces space overhead and improves the accuracy of malware discrimination.

Description

Software detection method, device, equipment and storage medium

Technical Field

The present invention relates to the field of detection technologies, and in particular to a software detection method, apparatus, device and storage medium.

Background

Malware mainly comprises destructive computer viruses, worms, Trojan backdoors, vulnerability exploitation programs, advertising and phishing code and the like. It can combine various evasion techniques and security vulnerabilities to break through the monitoring of existing traditional defense systems and seriously damage the interests of users. The purpose of a malware detection system is to discover malware hidden among normal files in time, to take measures autonomously wherever possible before the malware causes damage, and to notify users promptly.

Existing malware detection methods fall into two types: static file analysis and dynamic behavior analysis. Existing static detection technology mainly relies on matching against manually generated signature libraries and rule libraries; even the more advanced heuristic virus detection technology relies on a manually maintained expert knowledge base to assist judgment and identification. However, with the explosive growth of the internet, thousands of hosts and users face threats from malware that comes in many variants and uses polymorphism, packing, obfuscation and similar techniques. How to respond quickly to attacks by variant viruses and malware, and how to process and analyze massive and diverse malware automatically so as to raise the detection rate and lower the false alarm rate, has become the main problem for current malware detection.

Detection methods based on machine learning do not depend on a signature library or an expert knowledge base. They use a trained model to distinguish and identify malware quickly and automatically, can classify malware by family with further training, and therefore have good research and application prospects. Machine learning malware detection mainly depends on two steps. First, a suitable number of samples is selected and features are extracted from them. The extracted numerical and non-numerical values must be screened and cleaned to eliminate missing and erroneous entries; numerical values are standardized and normalized, while non-numerical values are specially encoded, generally with one-hot encoding, to convert them into a numerical form that a computer can process. All extracted features are then combined into a feature matrix. Second, a suitable machine learning model must be selected. Given the problems posed by today's massive volume of malware, traditional methods such as logistic regression, naive Bayes, support vector machines and decision trees are not well suited to malware detection and identification because of slow training, heavy resource consumption and poor model evaluation results.

The traditional malware feature extraction method either applies one-hot encoding or converts the extracted string information into ASCII code values. This approach has the following defects:

1. One-hot encoding is effective only when the number and names of the strings in the string set are fixed. Because the total amount of malware is unbounded and new malware keeps appearing, the string features that can be extracted from malware are also unbounded, so estimating the string set of the whole population from the string set of the training samples introduces a large bias.

2. Converting strings into ASCII codes does turn string features into numerical features. However, the string features extracted from different samples may differ in length, so the number of converted features is inconsistent, and strings in ASCII-code form are difficult to tokenize and segment. An algorithm must still be designed to make the dimensions of the feature matrix fed to the machine learning model consistent, so the complexity remains high.

3. It is difficult for a virus detection engine to withstand the massive obfuscation, string mutation, deliberate interference, junk insertion and similar techniques produced by virus generators.

Existing feature extraction for machine-learning-based malware detection therefore cannot meet these requirements. The technical problem to be solved by the invention is how to convert the complex string features extracted from malware samples into features that a machine learning algorithm can process easily, thereby reducing the difficulty of model training and increasing training speed.

Disclosure of Invention

In view of the foregoing, embodiments of the present invention are proposed to provide a software detection method, apparatus, device and storage medium.

According to an aspect of an embodiment of the present invention, there is provided a software detection method, including:

Extracting numerical features and non-numerical features contained in each sample in a software sample library;

Processing the non-numerical features with N selected non-cryptographic hash algorithms, and converting the processing result into numerical features, where N is an integer greater than 1;

Constructing a feature matrix from the numerical features contained in each sample and the numerical features obtained by conversion;

Training a machine learning classifier with the feature matrix;

Detecting the target software with the machine learning classifier.

According to another aspect of the embodiments of the present invention, there is provided a software detection apparatus, including:

A feature extraction module, configured to extract the numerical features and non-numerical features contained in each sample in the software sample library;

A feature processing module, configured to process the non-numerical features with N selected non-cryptographic hash algorithms and convert the processing result into numerical features, where N is an integer greater than 1;

A matrix construction module, configured to construct a feature matrix from the numerical features contained in each sample and the numerical features obtained by conversion;

A learning and training module, configured to train a machine learning classifier with the feature matrix;

A detection module, configured to detect the target software with the machine learning classifier.

According to a third aspect of embodiments of the present invention, there is provided a computing device, comprising: a memory, a processor, and a communication bus; the communication bus is used for realizing connection communication between the processor and the memory;

The processor is configured to execute a software detection program stored in the memory to implement the method steps of:

Training a machine learning classifier by using the feature matrix;

and detecting the target software by using the machine learning classifier.

According to a fourth aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method steps of:

Training a machine learning classifier by using the feature matrix;

and detecting the target software by using the machine learning classifier.

Compared with the prior art, the invention has the following beneficial effects:

The software detection scheme provided by the embodiments of the invention adopts a software detection method based on hybrid non-cryptographic hash features and a machine learning model. Complex string features extracted from malware samples are converted into hash features that a machine learning algorithm can process easily, which reduces the difficulty of model training, markedly increases training speed, reduces space overhead and improves the accuracy of malware discrimination.

The scheme detects well in most application scenarios, including those that lack an adequate malware expert knowledge base or a complete virus signature library. It can also withstand the variants, polymorphism and other detection-evasion techniques commonly used by malware authors, is strongly resistant to deliberate interference, packing and obfuscation, and offers good interference immunity and robustness.

The foregoing is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features and advantages of the embodiments are easier to understand, specific embodiments of the invention are described in detail below.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the embodiments of the invention. Like reference numerals refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a flowchart of a software detection method according to a first embodiment of the present invention;

FIG. 2 is a flowchart of a software detection method according to a second embodiment of the present invention;

FIG. 3 is a flowchart of a feature processing method based on a hybrid non-cryptographic hash algorithm according to a third embodiment of the present invention;

FIG. 4 is a flowchart of a hybrid splicing and recombining method according to a third embodiment of the present invention;

FIG. 5 is a block diagram of a software detection apparatus according to a fourth embodiment of the present invention;

FIG. 6 is a block diagram of a software detection apparatus according to a fifth embodiment of the present invention;

FIG. 7 is a block diagram of a computing device according to a sixth embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

In a first embodiment of the present invention, a software detection method based on hybrid non-cryptographic hash features is provided to overcome the defects of existing malware detection methods. Specifically, as shown in FIG. 1, the method of this embodiment includes the following steps:

Step S101, extracting numerical characteristics and non-numerical characteristics contained in each sample in a software sample library;

In the embodiment of the invention, software samples are obtained and the software sample library is constructed before this step is executed. Specifically, when a malware sample is obtained, it is marked as a black sample and its malware type is determined; when a normal software sample is obtained, it is marked as a white sample. The software in the sample library can then be used for the subsequent feature extraction and machine learning.

In the embodiment of the present invention, the sample programs in the software sample library are mainly PE (Portable Executable) files, or DLL (Dynamic Link Library) files with a similar file structure. Numerical and non-numerical features can therefore be extracted from the sample program files. Of course, the sample programs may also be files of other types; the present invention is not limited to the PE or DLL file types.

In a particular embodiment of the invention, the numerical type features comprise one or more of the following features: code header field information, code segment information, string statistics, sample population statistics, function lists in import address tables, export function lists, byte statistics, and byte information entropy statistics.

In the embodiment of the invention, the non-numerical type feature mainly refers to character string type data. In a particular embodiment of the invention, the non-numerical type features include one or more of the following features: recognizable character string sequences, all path character string sequences, all uniform resource locator character string sequences, all registry key character string sequences, machine model character strings of the software header information, all software segment name character string sequences, entry segment name character strings and character string sequences consisting of more than Q continuous recognizable characters in all software segments; wherein Q is a positive integer. In one exemplary embodiment, Q is taken to be 5, but Q is not limited to being taken to this value.
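By way of illustration only, the extraction of some of these string features from the raw bytes of a sample might be sketched as follows. The function name, the dictionary keys and the three regular expressions are hypothetical and deliberately loose; the text does not specify how paths, URLs or registry keys are recognized. The sketch reads "more than Q" as at least Q + 1 characters and treats printable ASCII as "recognizable".

```python
import re

Q = 5  # exemplary value from the text

# Runs of more than Q printable ASCII bytes (0x20-0x7E).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{%d,}" % (Q + 1))

# Hypothetical, loose patterns for three of the listed feature types.
PATH = re.compile(r"[A-Za-z]:\\[^\s\"<>|]+")
URL = re.compile(r"https?://[^\s\"<>]+", re.IGNORECASE)
REGISTRY = re.compile(r"HKEY_[A-Z_]+(?:\\[^\s\"<>|]+)*")


def extract_string_features(data: bytes) -> dict:
    """Collect printable strings and a few typed string sequences from raw bytes."""
    strings = [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]
    text = "\n".join(strings)
    return {
        "strings": strings,
        "paths": PATH.findall(text),
        "urls": URL.findall(text),
        "registry_keys": REGISTRY.findall(text),
    }
```

Each returned list corresponds to one string sequence; header and section-name strings would come from a PE parser, which is outside the scope of this sketch.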

It should be noted that those skilled in the art may add or remove features from the above as required, and all such variations fall within the scope of protection of the present invention.

Step S102, processing the non-numerical features with N selected non-cryptographic hash algorithms, and converting the processing result into numerical features, where N is an integer greater than 1;

In the embodiment of the invention, the non-cryptographic hash algorithms are selected so that they complement one another, which avoids the hash collisions and feature loss that result from using only one algorithm.

In an exemplary embodiment of the present invention, three non-cryptographic hash algorithms are selected: the MurmurHash3 algorithm, the SimHash algorithm and the CRC32 algorithm. Of course, those skilled in the art can add or remove algorithms on this basis. The specific algorithms are not the key point of the embodiment; what the invention mainly seeks to protect is the scheme of extracting hybrid features with a mix of non-cryptographic hash algorithms.

In an embodiment of the present invention, processing the non-numerical features with the N selected non-cryptographic hash algorithms and converting the processing result into numerical features specifically includes:

(1) Grouping the non-numerical features according to a set grouping mode;

(2) For each group of non-numerical features, performing hash processing with each of the N non-cryptographic hash algorithms to obtain N hash values, and converting the N hash values into integers;

(3) Splicing the integer features of all groups to obtain the converted numerical features.
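Steps (1) to (3) can be sketched roughly as below, with N = 3. This is a minimal illustration under several assumptions, not the patented implementation: groups are keyed by a hypothetical feature-type name, 32-bit FNV-1a stands in for MurmurHash3 (which is not in the Python standard library), and the SimHash is a simplified 32-bit variant that uses CRC32 as its per-token hash. Empty groups are replaced by the string "0", following the convention stated elsewhere in this description.

```python
import zlib


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a; a stand-in here for MurmurHash3."""
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def simhash_32(tokens: list[str]) -> int:
    """Simplified 32-bit SimHash: per-bit vote over the CRC32 of each token."""
    weights = [0] * 32
    for tok in tokens:
        h = zlib.crc32(tok.encode("utf-8"))
        for bit in range(32):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(32) if weights[bit] > 0)


def hybrid_hash_features(groups: dict[str, list[str]]) -> list[int]:
    """Hash every group with three algorithms and splice the integers together."""
    features = []
    for name in sorted(groups):          # fixed group order gives fixed columns
        strings = groups[name] or ["0"]  # empty entry is replaced by "0"
        joined = "\n".join(strings).encode("utf-8")
        features.extend([
            fnv1a_32(joined),     # stand-in for MurmurHash3
            simhash_32(strings),  # locality-sensitive hash over the tokens
            zlib.crc32(joined),   # CRC32
        ])
    return features
```

Whatever the number or length of the strings in a sample, the output has exactly 3 integers per group, which is what gives the feature matrix a fixed width.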

Step S103, constructing a feature matrix according to the numerical characteristics contained in each sample and the numerical characteristics obtained by conversion;

In a specific embodiment of the present invention, this step is implemented as follows:

Standardizing each numerical feature;

Normalizing the standardized feature data;

Constructing the feature matrix from the normalized feature data.

In an optional embodiment of the present invention, after the feature matrix is constructed, the method further includes performing dimension reduction on the feature matrix according to a set dimension reduction method: feature columns with no evident relevance are removed, and the result is input into the machine learning classifier for training.
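A minimal sketch of the standardize, normalize and assemble sequence is given below. The text does not name the formulas, so the sketch assumes z-score standardization followed by min-max scaling to [0, 1]; the function names are hypothetical, and dimension reduction is not shown.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Z-score: subtract the mean and divide by the population standard deviation."""
    mu, sigma = fmean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


def min_max(column):
    """Scale a column linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def build_feature_matrix(rows):
    """rows: one list of numerical features per sample; returns the scaled matrix."""
    columns = [min_max(standardize(list(col))) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]
```

Constant columns map to all zeros here; such columns carry no information and would be natural candidates for removal in the optional dimension reduction.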

Step S104, training a machine learning classifier by using the feature matrix;

In an embodiment of the present invention, training a machine learning classifier with the feature matrix specifically includes:

Training a first machine learning classifier with a feature matrix constructed from software samples labeled as malware or normal software, so as to classify whether software is malware or normal software;

Training a second machine learning classifier with a feature matrix constructed from malware samples labeled with their different types, so as to classify the type of the malware.

That is, the first machine learning classifier is a two-class machine learning model and the second machine learning classifier is a multi-class machine learning model.

In an optional embodiment of the present invention, after the machine learning classifier is trained, the method further comprises:

Testing the trained machine learning classifier with a test sample set so as to adjust the model parameters of the machine learning classifier.

Step S105, detecting the target software with the machine learning classifier.

Specifically, in the embodiment of the present invention, features are extracted from the target software in the manner of step S102 and input into the machine learning classifier for detection. This is a real-time online testing process, whereas steps S101 to S104 may be carried out offline.

Specifically, the first machine learning classifier classifies software as malware or normal software, and the second machine learning classifier classifies the type of the malware, thereby achieving detection of the target software.

In an optional embodiment of the invention, when the target software is detected to be malware, a warning is generated according to a set warning mode.
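The two-stage detection with an optional warning might be wired together as in the following sketch. The three callables are placeholders for the first (binary) classifier, the second (multi-class) classifier and the configured warning channel; their names and signatures are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence

Features = Sequence[float]


def detect(
    features: Features,
    is_malware: Callable[[Features], bool],  # first classifier (binary)
    family_of: Callable[[Features], str],    # second classifier (multi-class)
    alert: Callable[[str], None],            # configured warning mode
) -> Optional[str]:
    """Return the predicted malware type, or None for normal software."""
    if not is_malware(features):
        return None
    family = family_of(features)
    alert(f"malware detected: {family}")
    return family
```

The second classifier is consulted only for samples that the first one flags, which matches the description of type classification being applied to software already judged to be malware.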

In summary, the software detection scheme provided in the embodiment of the present invention adopts a software detection method based on hybrid non-cryptographic hash features and a machine learning model, so that complex string features extracted from malware samples can be converted into hash features that a machine learning algorithm can process easily, thereby reducing the difficulty of model training, markedly increasing training speed, reducing space overhead and improving the accuracy of malware discrimination.

In the second embodiment of the present invention, a software detection method is provided. Compared with the first embodiment, this embodiment is described in more detail with reference to specific application examples. It should be noted that the many technical details disclosed in this embodiment serve to explain the present invention and are not intended to limit it.

Specifically, as shown in FIG. 2, this embodiment provides a software detection method and, more specifically, a feature processing method based on a hybrid non-cryptographic hash algorithm together with a malware detection approach built on that method and a machine learning algorithm. The method comprises the following steps:

Step S100: collecting training samples and constructing a software sample library;

Specifically, in this embodiment, the malware samples used for machine learning training are obtained, marked as black samples and labeled with the integer 1; a corresponding number of normal program samples are collected, marked as white samples and labeled with the integer 0.

In an exemplary embodiment of the present invention, the collected software samples are scanned one by one with the public antivirus engines on the VirusTotal website (about 60 to 70 in total; the number of available engines varies with the type of file scanned). A sample is judged to be malware if more than 50 antivirus engines flag it as malware, and to be a normal file if no antivirus engine flags it. In this way, 500,000 malware samples and 500,000 normal software samples were collected, with 400,000 malware samples and 400,000 normal software samples used as the training data set and 100,000 malware samples and 100,000 normal software samples used as the test data set. The collected program samples are mainly PE files, or DLL files with a similar file structure. The antivirus engines on VirusTotal are also used to classify the malware: by voting, the type identified by the most antivirus engines is selected as the type and family of each malware sample in the training data.
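The labeling rule and the family vote can be illustrated with the short sketch below. The input is a hypothetical mapping from engine name to the family that engine reports, with `None` meaning the engine considers the file clean; it is not the VirusTotal API format. How samples flagged by between 1 and 50 engines are handled is not stated in the text, so the sketch simply assumes they are excluded.

```python
from collections import Counter
from typing import Optional

MALWARE_THRESHOLD = 50  # "more than 50 engines" in the text


def label_sample(verdicts: dict[str, Optional[str]]) -> Optional[int]:
    """Return 1 (black sample), 0 (white sample), or None if ambiguous."""
    hits = sum(1 for v in verdicts.values() if v is not None)
    if hits > MALWARE_THRESHOLD:
        return 1
    if hits == 0:
        return 0
    return None  # assumption: ambiguous samples are left out of the library


def vote_family(verdicts: dict[str, Optional[str]]) -> Optional[str]:
    """Pick the family named by the largest number of engines."""
    names = [v for v in verdicts.values() if v is not None]
    if not names:
        return None
    return Counter(names).most_common(1)[0][0]
```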

Step S200: extracting data information from each of the collected software training samples;

Specifically, in the embodiment of the present invention, the extracted information includes numerical information (including Boolean values, which are treated as 0 and 1) and non-numerical information (mainly string data). All data information is checked, and any data loss or misalignment is corrected, so that the resulting data information is complete and correct.

In the embodiment of the present invention, the extracted numerical features specifically include: code header field information, code segment information, string statistics, sample population statistics, function lists in import address tables, export function lists, byte statistics, and byte information entropy statistics. Specific feature types are shown in Table 1:

Table 1. Extracted numerical features

In this embodiment:

Malicious code header field information, including: the virtual size of the file, whether debug information is present, whether a signature is present, the timestamp of the PE header, other numerical information in the PE file header, and whether a thread local storage table is present;

Code segment information, including: whether a resource segment is present, the number of segments, the number of zero-size code segments, the number of unnamed code segments, and the number of segments containing 'MEM_WRITE';

String statistics, including: the number of recognizable strings, the average string length, the count of printable strings, and the sum of the information entropy of all characters;

Sample population statistics, including: the number of occurrences of the path identifier 'C:\', the total number of occurrences of 'http(s)://', the number of occurrences of 'HKEY', the number of occurrences of 'MZ', whether a relocation table is present, and the number of symbols in the symbol table;

Function list in the import address table, including: the number of functions in the import address table;

Export function list, including: the number of exported functions;

Byte statistics, including: the number of occurrences of each byte value 0x00 to 0xFF in the whole file, and the total number of bytes in the file;

Byte information entropy statistics, including: the information entropy distribution of the bytes.
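For the last two feature groups, a minimal sketch is shown below: a 256-entry byte histogram and the Shannon entropy of the byte distribution. Computing a single entropy value over the whole file is an assumption; the "entropy distribution" in the text may equally refer to entropy computed over sliding windows.

```python
import math
from collections import Counter


def byte_histogram(data: bytes) -> list[int]:
    """Count of each byte value 0x00 to 0xFF in the data."""
    counts = Counter(data)
    return [counts.get(b, 0) for b in range(256)]


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, between 0.0 and 8.0."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())
```

Packed or encrypted sections tend towards high entropy, which is one reason byte entropy is a commonly used static feature.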

In the embodiment of the present invention, the extracted non-numerical features include: recognizable character string sequences, all path character string sequences, all uniform resource locator character string sequences, all registry key character string sequences, machine model character strings of the software header information, all software segment name character string sequences, entry segment name character strings and character string sequences consisting of more than Q continuous recognizable characters in all software segments; wherein Q is a positive integer. In one exemplary embodiment, Q is taken to be 5, but Q is not limited to this value. Specific feature types are shown in Table 2:

Table 2. Extracted non-numerical features

For the feature description and feature extraction methods listed in tables 1 and 2 above, it is agreed that for numeric features, if the entry is null, then the integer value 0 is substituted, and for non-numeric features, if the entry is null, then the string "0" is substituted.

Step S300: performing hybrid hash feature processing on the non-numerical features based on three non-cryptographic hash algorithms, MurmurHash3, SimHash and CRC32, and converting the hard-to-process string features into a fixed-length numerical feature matrix.

A hash algorithm maps a member to a specific interval. Hash algorithms generally fall into two categories: cryptographic and non-cryptographic. The common MD5 algorithm is a cryptographic hash algorithm that maps a string of any length to a 128-bit (16-byte) hash value, and it has advantages such as wide applicability and an extremely low collision rate. However, a cryptographic hash algorithm is not suitable for feature processing in machine learning models. Machine learning feature processing needs to preserve the commonalities of the original features as far as possible, so that they can be exploited for class discrimination during later training. A cryptographic hash algorithm such as MD5 is very sensitive to changes in the original feature: flipping a single bit changes the MD5 hash value drastically, which destroys the information contained in the original feature and is very detrimental to machine learning training. The embodiment of the invention therefore uses non-cryptographic hash algorithms to process the non-numerical features, which retains the category information of the original features as far as possible and makes for an effective feature processing method.
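The contrast drawn above can be explored with a small experiment such as the following, which measures how many output bits differ when one token of an input changes. It compares MD5 with a simplified 32-bit SimHash (CRC32 as the per-token hash); the helper names are hypothetical, and the sketch covers only SimHash, not the other two algorithms of the embodiment.

```python
import hashlib
import zlib


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two non-negative integers."""
    return bin(a ^ b).count("1")


def md5_int(text: str) -> int:
    """MD5 digest of the text as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")


def simhash_32(tokens) -> int:
    """Simplified 32-bit SimHash: per-bit vote over the CRC32 of each token."""
    weights = [0] * 32
    for tok in tokens:
        h = zlib.crc32(tok.encode("utf-8"))
        for bit in range(32):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(32) if weights[bit] > 0)


def compare(tokens_a, tokens_b):
    """Return (MD5 bit difference out of 128, SimHash bit difference out of 32)."""
    md5_diff = hamming(md5_int(" ".join(tokens_a)), md5_int(" ".join(tokens_b)))
    sim_diff = hamming(simhash_32(tokens_a), simhash_32(tokens_b))
    return md5_diff, sim_diff
```

Calling `compare` on two long token lists that differ in a single token gives a quick feel for the behavior: each token contributes only one vote per bit to the SimHash, so a single change can flip only bits whose vote was already close.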

Step S400: training the machine learning classifier with the feature matrix obtained in step S300 to obtain a machine learning classification model.

Specifically, a binary classification machine learning model can be trained on training data labeled as malware or normal files to distinguish and identify malware; a multi-class machine learning model can be trained on training data labeled with different categories to further determine which family and category a file identified as malware belongs to. In the embodiment of the present invention, malware is divided into 10 categories: adware (Adware), backdoor programs (Backdoor), Trojan horse programs (Trojan), destructive computer viruses (Virus), worms (Worm), ransomware (Ransom), hacker tools (HackTool), rogue software (Rogue), Rootkit, and virus tools (Virus Tool).

The machine learning algorithm adopted by the embodiment of the invention is LightGBM, the Light Gradient Boosting Machine. LightGBM is a boosting method that improves on the traditional gradient boosting decision tree algorithm, offering faster computation, a wider range of application, higher accuracy and lower hardware overhead. LightGBM uses a histogram-based decision tree method, which greatly reduces memory consumption and computation cost. Compared with the traditional pre-sorted algorithm, the histogram-based algorithm consumes only 1/8 of the memory. The time complexity of searching for split points in the decision tree is O(n); for data partitioning, however, unlike the pre-sorted algorithm, all features share the same index table, so only that one table needs to be operated on. In addition, LightGBM can greatly reduce communication cost when a computer cluster is used to accelerate training, saving communication time among the parallel machines and greatly speeding up training. The embodiment of the present invention does not involve training with parallel computer clusters.
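To make the histogram idea concrete, the sketch below finds the best split of a single feature using an equal-width histogram. It is a simplified illustration, not LightGBM's implementation: it uses a squared-error gain on raw targets rather than gradient and Hessian sums, and the function name and the default of 16 bins are assumptions.

```python
def best_histogram_split(values, targets, n_bins=16):
    """Return (threshold, gain) for the best split of one feature."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return None, 0.0
    width = (hi - lo) / n_bins
    count = [0] * n_bins
    total = [0.0] * n_bins
    # One O(n) pass over the samples fills the histogram.
    for v, t in zip(values, targets):
        b = min(int((v - lo) / width), n_bins - 1)
        count[b] += 1
        total[b] += t
    n, s = len(values), sum(targets)
    best_gain, best_threshold = 0.0, None
    left_n, left_s = 0, 0.0
    # Candidate splits are scanned over bin boundaries, not individual samples.
    for b in range(n_bins - 1):
        left_n += count[b]
        left_s += total[b]
        right_n = n - left_n
        if left_n == 0 or right_n == 0:
            continue
        right_s = s - left_s
        gain = left_s ** 2 / left_n + right_s ** 2 / right_n - s ** 2 / n
        if gain > best_gain:
            best_gain, best_threshold = gain, lo + (b + 1) * width
    return best_threshold, best_gain
```

After the single pass that builds the histogram, only the per-bin counts and sums need to be kept for the split search, which is where the memory saving over storing pre-sorted feature values comes from.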

Step S500: testing and evaluating the performance of the trained machine learning classifier with the test sample set to judge whether the trained model meets actual requirements.

Specifically, in the embodiment of the invention, 100,000 malware samples and 100,000 normal software samples are used to test the detection rate and the false alarm rate, and the 100,000 malware samples are also used for classification testing to measure the accuracy of the machine learning classifier.

The specific implementation is as follows. The trained model is measured on the test sample set using performance indicators such as accuracy, recall and the ROC curve/AUC. In addition, hypothesis testing is used to estimate the generalization error from the test error and so obtain the generalization performance of the model. That is, if model A is observed to perform better than model B on the test set, the hypothesis test result indicates whether the generalization performance of A is statistically better than that of B, and with what confidence. Based on this evaluation, it is judged whether the trained model meets the requirements of actual use. If it does, the next step can be carried out; if it does not, the process returns to the training stage, where the training parameters are adjusted, the number of iterations is increased, and different cost functions, regularization terms, learning rates and the like are selected to improve model performance.
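The indicators named above can be computed as in the following sketch, where label 1 denotes malware and 0 denotes normal software. Recall corresponds to the detection rate, and the false alarm rate is the fraction of normal software wrongly flagged. The AUC uses the pairwise ranking definition, which is quadratic in sample count and meant only for illustration; the function names are hypothetical.

```python
def detection_metrics(y_true, y_pred):
    """Accuracy, recall (detection rate) and false alarm rate for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_alarm_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


def auc(y_true, scores):
    """Probability that a random malware sample scores above a random normal one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```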

Step S600: Package the tested model to output a machine learning classification model suitable for subsequent system processing;

In an optional embodiment of the invention, the machine learning classification model is packaged in a human-readable JSON format that includes the model generation date, model type, feature names, feature value ranges, learning rate, number of sub-decision trees, basic information of each sub-tree, feature importance ranking, and so on;

In another optional embodiment of the invention, the machine learning classification model is packaged in a binary format with the same contents as above; binary packaging greatly speeds up model loading and effectively reduces reading and parsing time for models with very large numbers of decision trees.
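The JSON packaging described above might look roughly like the following sketch. The key names, the helper `package_model`, and the sample values are all hypothetical; the text lists which kinds of information are stored but does not define a schema.

```python
import json


def package_model(trees, feature_names, feature_ranges, importances,
                  learning_rate, model_type, generated_on):
    """Serialize model metadata and trees to a readable JSON string.

    All key names below are illustrative, not a format defined by the text.
    """
    ranking = sorted(feature_names,
                     key=lambda name: importances[name], reverse=True)
    payload = {
        "generated_on": generated_on,
        "model_type": model_type,
        "feature_names": feature_names,
        "feature_ranges": feature_ranges,
        "learning_rate": learning_rate,
        "num_trees": len(trees),
        "trees": trees,
        "feature_importance_ranking": ranking,
    }
    return json.dumps(payload, indent=2, sort_keys=True)


text = package_model(
    trees=[{"index": 0, "num_leaves": 3, "max_depth": 2}],
    feature_names=["file_size", "crc32_group0"],
    feature_ranges={"file_size": [0.0, 1.0], "crc32_group0": [0.0, 1.0]},
    importances={"file_size": 12.0, "crc32_group0": 30.5},
    learning_rate=0.1,
    model_type="gbdt-binary",
    generated_on="2018-10-26",
)
restored = json.loads(text)  # the consumer side reads the same structure back
```

A text format like this is easy to inspect and diff, while the binary variant trades that readability for faster loading.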

Step S700: Use the generated machine learning classifier to receive externally input data features of a software file to be tested and judge whether the software file is malicious software; if so, use a family classification model to judge which type of malicious software it belongs to, and issue a malware warning in real time. The warning mode can be selected by the user and includes, but is not limited to, logs, e-mail, pop-up windows, and short messages.

In a third embodiment of the present invention, a feature processing method based on hybrid non-cryptographic hash algorithms is proposed, which describes in detail the implementation of step S300 in the second embodiment. Specifically, as shown in Fig. 3, the method includes the following steps:

Step S301: Extract the non-numerical feature data shown in Table 2;

Step S302: carrying out de-duplication and de-noising on the non-numerical characteristics;

Specifically, since all numerical and non-numerical features have already been cleaned, this step focuses on detecting, within the string sequences, duplicate API and DLL strings and possibly incomplete API and DLL strings, where API function strings generally end with .exe and DLL strings end with .dll.

Step S303: Group the non-numerical features and, for each group, obtain hash values using the hash methods described in steps S3041, S3042, and S3043.

In an exemplary embodiment of the present invention, the non-numerical features in rows 2 to 8 of Table 2, extracted from the PE header, are placed in one group to form a string sequence; the non-numerical features in rows 9 to 15 of Table 2, extracted from the code segments, are placed in a second group to form a string sequence; and the non-numerical features in the last two rows, 16 and 17, extracted from the import address table and the export function table, are placed in a third group to form a string sequence.

Step S3041: Hash the input non-numerical features using the MurmurHash3 algorithm.

MurmurHash is a non-cryptographic hash algorithm characterized by high hashing speed and a low collision rate. The hash value can be 32, 64, or 128 bits long; calculation shows that with a 128-bit hash value, for example, the probability of a hash collision is practically 0 even for tens of millions of data items. The embodiment of the invention uses, by way of example, the MurmurHash3 algorithm with a 128-bit hash value.
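The claim that collisions are practically impossible at 128 bits can be checked with the standard birthday-bound approximation, assuming hash values behave as if uniformly random. The figure of 50 million inputs is an illustrative stand-in for "tens of millions", and `collision_probability` is a name chosen for this sketch.

```python
import math


def collision_probability(n, bits):
    """Approximate chance that at least two of n distinct inputs share
    a hash value of the given bit length (birthday bound)."""
    expected_pairs = n * (n - 1) / float(2 ** (bits + 1))
    # 1 - exp(-x), computed with expm1 so tiny probabilities keep precision.
    return -math.expm1(-expected_pairs)


n = 50_000_000  # illustrative: "tens of millions" of strings
p32 = collision_probability(n, 32)    # a 32-bit hash: collisions are certain
p128 = collision_probability(n, 128)  # a 128-bit hash: on the order of 1e-24
```

Under these assumptions a 32-bit hash is certain to collide at this scale, whereas the 128-bit value leaves a collision probability of a few parts in 10^24.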

Specifically, the MurmurHash3 algorithm uses a sliding window to take groups of 2 consecutive bit blocks, and obtains the final 128-bit hash value through large-integer multiplication, shift operations, XOR operations, first-order linear transformation, cumulative summation, and similar operations.

Step S3042: Hash the input non-numerical features using SimHash.

SimHash is a locality-sensitive hash. It preserves the feature information of the original data well, its hash values are highly comparable, and the similarity between different hash values can be measured well using the Hamming distance. SimHash is generally used for de-duplication of massive document collections; in the embodiment of the invention it is used for feature processing of the extracted character strings.

The method for performing feature processing by using Simhash provided by the embodiment of the invention comprises the following steps:

(1) Convert the original character string into an n-dimensional vector using the 2-gram method, where each dimension of the vector is 2 bytes.

For example, the string "msvcp60.dll" is converted to [ms, sv, vc, cp, p6, 60, 0., .d, dl, ll];

(2) Assign a weight W_i (i = 0, 1, ..., n-1) to each dimension of the n-dimensional vector; if all dimensions are weighted equally, the weights can all be set to the same constant value;

(3) For each dimension of the n-dimensional vector, the hashing method can be chosen freely and is not limited to cryptographic or non-cryptographic hash algorithms; the choice is determined mainly by the number of bits of the hash value to be mapped. In the embodiment of the invention, MD5 is used as the hash method of this step, generating a 128-bit hash value;

(4) Weight the hash value of each dimension bit by bit with W_i: if a bit is 1 it is recorded as W_i, and if a bit is 0 it is recorded as -W_i. Then sum all n weighted hash values bit by bit to obtain a 128-dimensional vector in which each dimension is floating-point data;

(5) In the 128-dimensional vector, a dimension whose value is greater than a threshold σ is marked as 1, a dimension whose value is less than σ is marked as 0, and a dimension whose value is equal to σ is also marked as 0. The 128-dimensional floating-point vector is thereby converted into a 128-bit string, which is the final SimHash result.

The threshold σ is calculated from the bit values B_ij and the weights W_i, where B_ij, which is 1 or 0, is the value of the j-th bit of the hash of the i-th dimension of the vector in step (2), and W_i is as defined in step (2).
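Steps (1) to (5) can be sketched as follows. Several choices here are assumptions rather than details given in the text: all weights are set to 1.0, the threshold σ defaults to 0 (the usual SimHash convention) because the formula for σ does not survive in this text, strings are encoded as UTF-8, and bits are read from most significant to least significant. The function names are illustrative.

```python
import hashlib


def two_grams(s):
    # Step (1): sliding window of two characters over the string.
    return [s[i:i + 2] for i in range(len(s) - 1)]


def simhash128(s, weights=None, sigma=0.0):
    grams = two_grams(s)
    if weights is None:
        weights = [1.0] * len(grams)  # step (2): equal weights (assumed 1.0)

    acc = [0.0] * 128
    for gram, w in zip(grams, weights):
        # Step (3): MD5 gives a 128-bit hash for each 2-gram.
        digest = int.from_bytes(
            hashlib.md5(gram.encode("utf-8")).digest(), "big")
        # Step (4): add +w where the bit is 1, -w where it is 0.
        for j in range(128):
            bit = (digest >> (127 - j)) & 1
            acc[j] += w if bit else -w

    # Step (5): threshold each dimension; values equal to sigma map to 0.
    result = 0
    for v in acc:
        result = (result << 1) | (1 if v > sigma else 0)
    return result


def hamming(a, b):
    # Number of differing bits between two SimHash values.
    return bin(a ^ b).count("1")


grams = two_grams("msvcp60.dll")
h1 = simhash128("msvcp60.dll")
h2 = simhash128("msvcp70.dll")
distance = hamming(h1, h2)
```

Strings that share most of their 2-grams contribute mostly identical terms to the per-bit sums, which is why similar inputs tend to produce values with a small Hamming distance.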

Step S3043: Hash the input non-numerical features using CRC32.

CRC32 is a cyclic redundancy check algorithm that is typically used to check the correctness of data frames during transmission; in the present invention it is used to hash character strings to a length of 32 bits for use in feature processing. The method for performing feature processing using CRC32 comprises the following steps:

(1) Select the following generator polynomial:

C(x) = 1 + x + x² + x⁴ + x⁵ + x⁷ + x⁸ + x¹⁰ + x¹¹ + x¹² + x¹⁶ + x²² + x²³ + x²⁶ + x³²

Its hexadecimal representation (in bit-reversed, least-significant-bit-first form) is 0xEDB88320.

(2) Perform modulo-2 division on the binary form of the original string sequence, using the generator polynomial as the divisor, to obtain a 32-bit remainder, which is the CRC32 hash code.
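The modulo-2 division can be sketched bit by bit as below. This follows the conventional CRC-32 used by zlib, which adds two details the text does not state: an initial register value of 0xFFFFFFFF and a final bit inversion. The constant 0xEDB88320 is the generator polynomial in bit-reversed form, so the data are processed least significant bit first.

```python
import zlib

POLY_REFLECTED = 0xEDB88320  # generator polynomial, bit-reversed form


def crc32_bitwise(data):
    """Bit-by-bit CRC-32 (zlib convention: init and final XOR 0xFFFFFFFF)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # If the low bit is set, "subtract" (XOR) the divisor polynomial.
            if crc & 1:
                crc = (crc >> 1) ^ POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


sample = "kernel32.dll".encode("utf-8")
manual = crc32_bitwise(sample)
library = zlib.crc32(sample)  # same value from the standard library
```

In practice the library routine would be used directly; the loop is shown only to make the division by the generator polynomial visible.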

Step S305: Splice and recombine the results obtained with the three hash algorithms to form a new feature vector and matrix. In this embodiment, each group of hash values is 128+128+32 bits long, and the actual storage format is the byte type.

The hybrid splicing and recombination method proposed by the present invention is shown in Fig. 4:

For each group of hash values, the first 128+128 bits are split into bytes and each byte is converted into an integer, while the last 32 bits are converted as a whole into a long integer, so that each group yields 33 integer features; the three groups are then spliced in sequence to form a feature vector of 99 features.
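The arithmetic above (32 byte-valued integers plus one 32-bit integer per group, times three groups) can be sketched as follows. To keep the example self-contained, MD5 digests stand in for the two 128-bit hash values of each group; they are placeholders, not the MurmurHash3 and SimHash outputs the text calls for. The helper names and group strings are hypothetical.

```python
import hashlib
import zlib


def group_to_integers(hash_a, hash_b, crc):
    """Turn one group's 128+128+32 bits into 33 integer features."""
    if len(hash_a) != 16 or len(hash_b) != 16:
        raise ValueError("expected two 128-bit (16-byte) hash values")
    features = list(hash_a) + list(hash_b)  # 32 integers, each 0..255
    features.append(crc & 0xFFFFFFFF)       # last 32 bits as one integer
    return features


def splice(groups):
    """Concatenate the integer features of all groups in order."""
    vector = []
    for hash_a, hash_b, crc in groups:
        vector.extend(group_to_integers(hash_a, hash_b, crc))
    return vector


def demo_group(text):
    data = text.encode("utf-8")
    stand_in_a = hashlib.md5(data).digest()        # placeholder 128-bit hash
    stand_in_b = hashlib.md5(data[::-1]).digest()  # placeholder 128-bit hash
    return stand_in_a, stand_in_b, zlib.crc32(data)


group_texts = ("pe header strings", "code segment strings",
               "import and export strings")
vector = splice([demo_group(t) for t in group_texts])
```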

Step S306: Extract the numerical features described in Table 1, for a total of 641 integer features. A Boolean type is treated as an integer with a value of 0 or 1.

Step S307: Standardize the 740 extracted features in total (i.e., 641+99) to eliminate the effect of range differences between features. The formula is:

$$x' = \frac{x - E(x)}{\sigma}$$

where x is the original feature value, x' is the standardized value, E(x) is the mean of the feature, and σ is its standard deviation.

The standardized feature data are then normalized, mapping each feature value in each row into the interval [0,1].
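The two-stage scaling can be sketched for a single feature column as below. Standardization subtracts the mean and divides by the standard deviation; for the mapping into [0,1] the sketch assumes min-max scaling, since the text does not say which mapping is used. The population standard deviation and the handling of constant columns are also assumptions, and the function names are illustrative.

```python
import statistics


def standardize(column):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        return [0.0] * len(column)  # constant column: no spread to scale
    return [(x - mean) / std for x in column]


def min_max(column):
    """Map values linearly into [0, 1] (assumed normalization method)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(x - lo) / (hi - lo) for x in column]


raw = [2, 4, 4, 4, 5, 5, 7, 9]  # one feature across eight samples
z = standardize(raw)            # mean 0, standard deviation 1
scaled = min_max(z)             # smallest value -> 0.0, largest -> 1.0
```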

Step S308: The processed data form a feature matrix of dimension M × 740, which is input into the machine learning classifier for training, where M is the number of samples.

Optionally, in the embodiment of the present invention, the processed data form a feature matrix of dimension M × 740; dimension reduction methods such as the Pearson correlation coefficient and the chi-square test are used to reduce the dimension of the feature matrix by removing feature columns with clearly weak correlation, and the result is then input into the machine learning classifier for training.
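One way to apply the Pearson correlation coefficient for dimension reduction is to keep only columns whose correlation with the label exceeds a cutoff. The sketch below assumes that interpretation; the cutoff of 0.1, the function names, and the toy matrix are illustrative, and the chi-square alternative is not shown.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def select_columns(matrix, labels, threshold=0.1):
    """Indices of columns whose |correlation with the label| >= threshold."""
    keep = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        if abs(pearson(column, labels)) >= threshold:
            keep.append(j)
    return keep


# Toy 4 x 3 matrix: column 0 tracks the label, column 1 is constant,
# column 2 alternates independently of the label.
matrix = [[0, 5, 1], [1, 5, 0], [2, 5, 1], [3, 5, 0]]
labels = [0, 0, 1, 1]
kept = select_columns(matrix, labels)
reduced = [[row[j] for j in kept] for row in matrix]
```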

Corresponding to the method according to the first embodiment of the present invention, a fourth embodiment of the present invention provides a software detection apparatus, as shown in fig. 5, specifically including:

a feature extraction module 510, configured to extract numerical features and non-numerical features included in each sample in the software sample library;

a feature processing module 520, configured to process the non-numerical feature by using the selected N non-cryptographic hash algorithms, and convert a processing result into a numerical feature; n is an integer greater than 1;

A matrix constructing module 530, configured to construct a feature matrix according to the numerical features included in each sample and the numerical features obtained by conversion;

a learning training module 540, configured to train a machine learning classifier using the feature matrix;

And a detection module 550, configured to detect the target software by using the machine learning classifier.

Optionally, in this embodiment of the present invention, the learning training module 540 is further configured to test the trained machine learning classifier by using a test sample set, so as to adjust a model parameter of the machine learning classifier.

Optionally, in an embodiment of the present invention, the learning training module 540 is specifically configured to train a first machine learning classifier using a feature matrix constructed from software samples labeled as malware or normal software, so as to classify software as malware or normal software; and to train a second machine learning classifier using a feature matrix constructed from malware samples labeled with different types, so as to classify the type of malware.

Optionally, in an embodiment of the present invention, the numerical type feature includes one or more of the following features: code header field information, code segment information, string statistics, sample population statistics, function lists in import address tables, export function lists, byte statistics, and byte information entropy statistics.

Optionally, in an embodiment of the present invention, the non-numerical type feature includes one or more of the following features: recognizable character string sequences, all path character string sequences, all uniform resource locator character string sequences, all registry key character string sequences, machine model character strings of the software header information, all software segment name character string sequences, entry segment name character strings and character string sequences consisting of more than Q continuous recognizable characters in all software segments; wherein Q is a positive integer.

Optionally, in this embodiment of the present invention, the feature processing module 520 is specifically configured to group the non-numerical features according to a set grouping manner; for each group of non-numerical features, perform hash processing with each of the N non-cryptographic hash algorithms to obtain N hash values, and convert the obtained N hash values into integers; and splice the integer features of each group to obtain the converted numerical features.

Optionally, in an embodiment of the present invention, the matrix constructing module 530 is specifically configured to standardize each of the numerical features; normalize the standardized feature data; and construct the feature matrix from the normalized feature data.

Optionally, in this embodiment of the present invention, the matrix constructing module 530 is further configured to perform dimension reduction processing on the feature matrix according to a set dimension reduction method after constructing the feature matrix.

Optionally, in this embodiment of the present invention, the N kinds of non-cryptographic hash algorithms include at least two of the following algorithms: the MurMurHash3 algorithm, the SimHash algorithm, and the CRC32 algorithm.

The specific implementation process of each module can be referred to in the first and second embodiments, and details are not described in this embodiment.

In summary, the software detection scheme provided in the embodiment of the present invention is based on hybrid non-cryptographic hash features and a machine learning model. It can convert complex character string features extracted from malware samples into hash features that are easy for a machine learning algorithm to process, thereby reducing the difficulty of model training, significantly improving the training speed, reducing the space overhead, and improving malware discrimination accuracy.

in a fifth embodiment of the present invention, a software detection apparatus is provided, as shown in fig. 6, which specifically includes:

The feature extraction module 610 is configured to extract numerical features and non-numerical features included in each sample in the software sample library;

the feature processing module 620 is configured to process the non-numerical feature by using the selected N types of non-cryptographic hash algorithms, and convert a processing result into a numerical feature; n is an integer greater than 1;

A matrix constructing module 630, configured to construct a feature matrix according to the numerical features included in each sample and the numerical features obtained by conversion;

A learning training module 640 for training a machine learning classifier using the feature matrix; optionally, the module may be configured as an offline module, and after offline training is completed, the model is packaged and transmitted to the detection module 670;

The file format judging module 650 is configured to detect whether the input target software is in a software format supported by the apparatus, and if so, trigger the feature extracting module 610, receive the transmitted file of the target software through the feature extracting module 610, extract a numerical feature and/or a non-numerical feature included in the software, and input the extracted numerical feature and/or non-numerical feature to the file pre-scanning module 660;

and the file pre-scanning module 660 is configured to search the matched feature codes according to the existing malware feature code library and the rule library, and screen the malware.

Optionally, the module adopts a traditional feature code matching technology and a yara rule matching technology. If malware is detected by feature code and rule matching, then an alert is sent directly to the result recording and online alert module 680; otherwise, triggering the feature processing module 620, processing the non-numerical features extracted from the target software by the feature processing module 620 by using the selected N non-encrypted hash algorithms, and converting the processing result into numerical features; the matrix construction module 630 is used for constructing a feature matrix according to the numerical features extracted from the target software and the numerical features obtained by conversion, and inputting the feature matrix to the detection module 670;

And the detection module 670 is configured to detect the target software by using the machine learning classifier. Specifically, the module is set as an online module, receives externally input data characteristics of the file to be detected by using the generated detection and classification model, and judges whether the file is malicious software, if so, the family classification model is used for judging which type of malicious software the file belongs to.

Result recording and online alert module 680: used to monitor malware detection results online in real time and to issue malware warnings in real time; the warning mode can be selected by the user and includes, but is not limited to, logs, e-mail, pop-up windows, and short messages.
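The control flow among modules 650 to 680 can be summarized in a short sketch. Every callable passed to `scan` is a hypothetical stand-in for one of the modules described above, and the toy checks (a two-byte header test, a substring signature, a length-based classifier) exist only to make the example runnable.

```python
def scan(sample, supported, prescan, to_row, is_malware, family_of, alert):
    """Format check, then pre-scan, then classifier only if still undecided."""
    if not supported(sample):      # file format judging module
        return "unsupported format"
    if prescan(sample):            # feature code / rule matching pre-scan
        alert(sample, "signature or rule match")
        return "malware"
    row = to_row(sample)           # feature processing + matrix construction
    if not is_malware(row):        # detection model
        return "normal"
    family = family_of(row)        # family classification model
    alert(sample, family)
    return "malware"


# Toy stand-ins so the flow can be exercised end to end.
alerts = []


def toy_supported(sample):
    return sample[:2] == b"MZ"


def toy_prescan(sample):
    return b"evil" in sample


def toy_row(sample):
    return [len(sample)]


def toy_is_malware(row):
    return row[0] > 100


def toy_family(row):
    return "trojan"


def toy_alert(sample, reason):
    alerts.append(reason)


def run(sample):
    return scan(sample, toy_supported, toy_prescan, toy_row,
                toy_is_malware, toy_family, toy_alert)


by_signature = run(b"MZ evil payload")   # caught by the pre-scan
by_model = run(b"MZ" + b"a" * 200)       # caught by the classifier
clean = run(b"MZ small file")            # passes both stages
skipped = run(b"\x7fELF not supported")  # rejected by the format check
```

Putting the cheap signature check first means the hashing and classification stages run only for files the pre-scan could not decide.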

In summary, the software detection scheme provided in the embodiment of the present invention is based on hybrid non-cryptographic hash features and a machine learning model. It can convert complex character string features extracted from malware samples into hash features that are easy for a machine learning algorithm to process, thereby reducing the difficulty of model training, significantly improving the training speed, reducing the space overhead, and improving malware discrimination accuracy. In addition, in the scheme of this embodiment, software is first pre-judged by the file pre-scanning module, and only software that the pre-scan cannot decide is passed to the classification model, which further improves detection efficiency. This embodiment also provides an alert module, which further improves the user experience.

In a sixth embodiment of the present invention, there is provided a computing device, as shown in fig. 7, including: memory 710, processor 720, and communication bus 730; the communication bus 730 is used for realizing connection communication between the processor 720 and the memory 710;

specifically, in the embodiments of the present invention, the Processor 720 may be a general-purpose Processor, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement the embodiments of the present invention. Wherein, the memory 710 is used for storing the executable instructions of the processor 720;

The memory 710 is used for storing program code and transferring the program code to the processor 720. The memory 710 may include volatile memory, such as Random Access Memory (RAM); the memory 710 may also include non-volatile memory, such as Read-Only Memory (ROM), flash memory, a Hard Disk Drive (HDD), or a Solid-State Drive (SSD); the memory 710 may also comprise a combination of the above types of memory.

specifically, in the embodiment of the present invention, the processor 720 is configured to execute a software detection program in the application program stored in the memory 710, so as to implement the following method steps:

Step 1, extracting numerical characteristics and non-numerical characteristics contained in each sample in a software sample library;

step 2, processing the non-numerical characteristics by using N selected non-encrypted hash algorithms, and converting the processing result into numerical characteristics; n is an integer greater than 1;

step 3, constructing a feature matrix according to the numerical characteristics contained in each sample and the numerical characteristics obtained by conversion;

Step 4, training a machine learning classifier by using the feature matrix;

Step 5, detecting the target software by using the machine learning classifier.

The implementation process of each step in this embodiment can be referred to in the first to third embodiments, and is not described in detail in this embodiment.

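In a seventh embodiment of the invention, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the following method steps:

Step 1, extracting numerical features and non-numerical features contained in each sample in a software sample library;

Step 2, processing the non-numerical features by using N selected non-cryptographic hash algorithms, and converting the processing result into numerical features; N is an integer greater than 1;

Step 3, constructing a feature matrix according to the numerical features contained in each sample and the numerical features obtained by conversion;

Step 4, training a machine learning classifier by using the feature matrix;

Step 5, detecting the target software by using the machine learning classifier.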

Wherein the computer storage medium may be RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

in addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

in short, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

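1. A software detection method, comprising:

extracting numerical features and non-numerical features contained in each sample in a software sample library;

processing the non-numerical features by using N selected non-cryptographic hash algorithms, and converting the processing result into numerical features, N being an integer greater than 1;

constructing a feature matrix according to the numerical features contained in each sample and the numerical features obtained by conversion;

training a machine learning classifier by using the feature matrix;

and detecting the target software by using the machine learning classifier.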

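2. The method of claim 1, wherein prior to detecting the target software using the machine learning classifier, the method further comprises: testing the trained machine learning classifier with a test sample set to adjust model parameters of the machine learning classifier.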

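3. The method of claim 1, wherein training a machine learning classifier using the feature matrix specifically comprises: training a first machine learning classifier using a feature matrix constructed from software samples labeled as malware or normal software, so as to classify software as malware or normal software; and training a second machine learning classifier using a feature matrix constructed from malware samples labeled with different types, so as to classify the type of malware.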

4. the method of claim 1, wherein the numerical type features include one or more of the following features: code header field information, code segment information, string statistics, sample population statistics, function lists in import address tables, export function lists, byte statistics, and byte information entropy statistics.

5. the method of claim 1, wherein the non-numerical type features include one or more of the following: recognizable character string sequences, all path character string sequences, all uniform resource locator character string sequences, all registry key character string sequences, machine model character strings of the software header information, all software segment name character string sequences, entry segment name character strings and character string sequences consisting of more than Q continuous recognizable characters in all software segments; wherein Q is a positive integer.

6. The method according to claim 1, wherein processing the non-numerical features using the selected N non-cryptographic hash algorithms and converting the processing result into numerical features comprises:

grouping the non-numerical features according to a set grouping manner;

for each group of non-numerical features, performing hash processing with each of the N non-cryptographic hash algorithms to obtain N hash values, and converting the obtained N hash values into integers;

and splicing the integer features of each group to obtain the converted numerical features.

7. The method according to claim 1, wherein constructing a feature matrix based on the numerical features contained in each sample and the converted numerical features comprises:

standardizing each numerical feature;

normalizing the standardized feature data;

and constructing the feature matrix from the normalized feature data.

8. The method of claim 1, after constructing the feature matrix, further comprising: and performing dimension reduction processing on the feature matrix according to a set dimension reduction method.

9. The method of any one of claims 1 to 8, wherein the N non-cryptographic hash algorithms include at least two of: the MurMurHash3 algorithm, the SimHash algorithm, and the CRC32 algorithm.

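10. A software detection apparatus, comprising:

a feature extraction module, configured to extract numerical features and non-numerical features contained in each sample in a software sample library;

a feature processing module, configured to process the non-numerical features by using N selected non-cryptographic hash algorithms, and convert the processing result into numerical features, N being an integer greater than 1;

a matrix construction module, configured to construct a feature matrix according to the numerical features contained in each sample and the numerical features obtained by conversion;

a learning training module, configured to train a machine learning classifier using the feature matrix;

and a detection module, configured to detect the target software by using the machine learning classifier.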

11. The apparatus of claim 10, wherein the learning training module is further configured to test the trained machine learning classifier with a set of test samples to adjust model parameters of the machine learning classifier.

12. The apparatus of claim 10, wherein the learning training module is specifically configured to train a first machine learning classifier using a feature matrix constructed from software samples labeled as malware or normal software, so as to classify software as malware or normal software; and to train a second machine learning classifier using a feature matrix constructed from malware samples labeled with different types, so as to classify the type of malware.

13. the apparatus of claim 10, wherein the numerical type features include one or more of the following: code header field information, code segment information, string statistics, sample population statistics, function lists in import address tables, export function lists, byte statistics, and byte information entropy statistics.

14. The apparatus of claim 10, wherein the non-numeric feature comprises one or more of the following features: recognizable character string sequences, all path character string sequences, all uniform resource locator character string sequences, all registry key character string sequences, machine model character strings of the software header information, all software segment name character string sequences, entry segment name character strings and character string sequences consisting of more than Q continuous recognizable characters in all software segments; wherein Q is a positive integer.

15. The apparatus according to claim 10, wherein the feature processing module is specifically configured to group the non-numerical features according to a set grouping manner; for each group of non-numerical features, perform hash processing with each of the N non-cryptographic hash algorithms to obtain N hash values, and convert the obtained N hash values into integers; and splice the integer features of each group to obtain the converted numerical features.

16. The apparatus of claim 10, wherein the matrix construction module is specifically configured to standardize each of the numerical features; normalize the standardized feature data; and construct the feature matrix from the normalized feature data.

17. The apparatus of claim 10, wherein the matrix construction module is further configured to perform dimension reduction on the feature matrix according to a set dimension reduction method after constructing the feature matrix.

18. The apparatus of any one of claims 10 to 17, wherein the N non-cryptographic hash algorithms comprise at least two of: the MurMurHash3 algorithm, the SimHash algorithm, and the CRC32 algorithm.

19. A computing device, wherein the computing device comprises: a memory, a processor, and a communication bus; the communication bus is used for realizing connection communication between the processor and the memory;

The processor is configured to execute a software detection program stored in the memory to implement the steps of the software detection method according to any one of claims 1 to 9.

20. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which program, when being executed by a processor, carries out the steps of the software detection method according to any one of claims 1 to 9.