CN113378165A

CN113378165A - Malicious sample similarity judgment method based on Jaccard coefficient

Info

Publication number: CN113378165A
Application number: CN202110711130.6A
Authority: CN
Inventors: 任传伦; 刘文瀚; 吕帅; 夏建民; 张先国; 刘晓影; 王淮; 俞赛赛; 乌吉斯古愣; 孟祥頔
Original assignee: Cetc Cyberspace Security Research Institute Co Ltd; CETC 15 Research Institute; CETC 30 Research Institute
Current assignee: Cetc Cyberspace Security Research Institute Co Ltd; CETC 15 Research Institute; CETC 30 Research Institute
Priority date: 2021-06-25
Filing date: 2021-06-25
Publication date: 2021-09-10
Anticipated expiration: 2041-06-25
Also published as: CN113378165B

Abstract

The invention discloses a method for judging the similarity of malicious samples based on the Jaccard coefficient, which comprises the following steps: analyzing malicious sample I and malicious sample II separately with the String command, extracting the character strings of each sample, and converting the extracted strings into sample string sets A and B, respectively; calculating the Jaccard coefficient between the sample string sets A and B; setting a threshold, and judging that malicious sample I and malicious sample II are strongly similar if the calculated Jaccard coefficient is greater than the threshold; and, for a strongly similar malicious sample I and malicious sample II, using a spatial spectrum function to find the character strings in which the malicious sample is located. The invention provides a novel malicious sample similarity judgment method that does not require complicated operations such as malicious sample feature extraction and can improve the efficiency of malicious sample similarity judgment.

Description

Malicious sample similarity judgment method based on Jaccard coefficient

Technical Field

The invention belongs to the technical field of network security, and particularly relates to a malicious sample similarity judgment method based on a Jaccard coefficient.

Background

Different malicious samples on computer networks generally have different functional characteristics, and these characteristics determine each sample's internal structure, so the similarity between malicious samples can be judged by extracting their features. At present, technical schemes for judging whether malicious samples are similar mainly construct a machine learning model and complete the detection and judgment by extracting features of the malicious samples. In such schemes, features must be extracted from each malicious sample, preprocessed, and converted into feature vectors, which are then input into the machine learning model; conclusions such as whether the malicious samples are similar are drawn from output metrics such as accuracy and precision. These schemes require not only data preprocessing but also continual parameter tuning and model optimization, so the implementation process is complex and a stable, reliable result cannot be obtained quickly.

In addition, to keep their malicious code from being detected, attackers scramble the order of the characters in common strings within a malicious code sample, for example changing Symbol into lbsymo. During malicious sample analysis, seemingly meaningless runs of garbled characters are often encountered, and further analysis shows that such garbled strings are also variants of a malicious code sample. How to detect and locate malicious samples containing out-of-order character strings is therefore also a problem that urgently needs to be solved.

Disclosure of Invention

To address the problems that existing machine-learning-based methods for detecting malicious computer network samples are complex to implement and cannot quickly produce a stable and reliable result, and to locate malicious samples containing out-of-order character strings, the invention discloses a malicious sample similarity judgment method based on the Jaccard coefficient. A larger Jaccard coefficient indicates a higher similarity between two malicious samples. On this basis, to detect and locate malicious samples with out-of-order character strings, the invention uses the statistical characteristics of the sample code to construct the spatial spectrum of the string sequences of the two malicious samples, and locates the out-of-order strings by a spatial spectrum estimation method.

The Jaccard coefficient is used to compare the similarity and difference between finite sample sets; the larger the Jaccard coefficient, the higher the similarity of the corresponding samples. Given two sets A and B, the Jaccard coefficient is the ratio of the size of the intersection of A and B to the size of the union of A and B:

$$J(A,B) = \frac{|A \cap B|}{|A \cup B|},$$

where J(A, B) ∈ [0,1], and J(A, B) is defined as 1 when the sets A and B are both empty.
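As a small worked instance of the definition, consider two hypothetical string sets chosen purely for illustration (they are not taken from the source), each holding three strings, two of which are shared:

```latex
\begin{align*}
A &= \{s_1, s_2, s_3\}, \quad B = \{s_2, s_3, s_4\} \\
A \cap B &= \{s_2, s_3\}, \quad A \cup B = \{s_1, s_2, s_3, s_4\} \\
J(A,B) &= \frac{|A \cap B|}{|A \cup B|} = \frac{2}{4} = 0.5
\end{align*}
```

Identical sets give J(A, B) = 1, and sets with no common element give J(A, B) = 0.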

The method disclosed by the invention does not need to directly extract features or other attributes of the malicious samples. It only needs to parse the content of the malicious samples into sample string sets with the String command, calculate the Jaccard coefficients between the sample string sets, and, after averaging the calculation results, make the final similarity judgment, from which the similarity between the malicious samples is inferred.
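The set comparison and threshold decision can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `jaccard` and `strongly_similar`, the example strings, and the threshold value are hypothetical; the averaging step is not modeled; and treating a coefficient exactly equal to the threshold as "not strongly similar" is an assumption, since the source describes only the greater-than and smaller-than cases.

```python
def jaccard(a, b):
    """Return |A ∩ B| / |A ∪ B|, defined as 1.0 when both sets are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def strongly_similar(a, b, mu):
    """Return True when the Jaccard coefficient is strictly greater than mu."""
    return jaccard(a, b) > mu


# Hypothetical string sets standing in for two parsed samples.
set_a = {"kernel32.dll", "CreateFileA", "WriteFile"}
set_b = {"CreateFileA", "WriteFile", "cmd.exe"}
print(jaccard(set_a, set_b))                # 2 shared strings out of 4 distinct ones
print(strongly_similar(set_a, set_b, 0.4))  # mu = 0.4 is an arbitrary example value
```

Because the coefficient depends only on set membership, no feature vectors or model training are involved, which is the simplification the method relies on.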

The invention discloses a method for judging similarity of malicious samples based on Jaccard coefficients, which specifically comprises the following steps:

S1, analyzing malicious sample I and malicious sample II separately with the String command, extracting the character strings of each malicious sample, and converting the extracted strings into sample string sets A and B, respectively;

S2, calculating the Jaccard coefficient between the sample string sets A and B; the Jaccard coefficient is used to compare the similarity and difference between finite sample sets, and the Jaccard coefficient between string set A and string set B is calculated as:

$$J(A,B) = \frac{|A \cap B|}{|A \cup B|},$$

where |·| denotes the number of elements in a set, J(A, B) ∈ [0,1], and J(A, B) is defined to be 1 when the sets A and B are both empty.

S3, setting a judgment threshold μ according to the Jaccard coefficient value calculated in step S2; if the calculated Jaccard coefficient is greater than μ, malicious sample I and malicious sample II are judged to be strongly similar; if the calculated Jaccard coefficient is smaller than μ, the two malicious samples are judged not to be strongly similar.

S4, for malicious sample I and malicious sample II judged to be strongly similar, converting each character string in the corresponding string set A and string set B into a number, for example by using the atof function to convert the string into a double-precision floating-point number, to obtain two string numerical vectors a and b, and dividing each of a and b equally into N string sub-vectors:

$$a = [a_1, a_2, \ldots, a_N], \quad b = [b_1, b_2, \ldots, b_N],$$

taking the string sub-vectors as basic elements, calculating the cross-correlation matrix R of the two string numerical vectors a and b:

$$R = a^T b,$$

where the element in the i-th row and j-th column of the cross-correlation matrix R is $R_{ij} = a_i b_j^T$, $a_i$ denotes the i-th string sub-vector of the string numerical vector a, i = 1, 2, …, N, and $b_j$ denotes the j-th string sub-vector of the string numerical vector b, j = 1, 2, …, N. Eigendecomposition is performed on the cross-correlation matrix R to obtain N eigenvectors and eigenvalues, and the eigenvectors are screened according to the eigenvalues, for example by selecting the eigenvectors whose eigenvalues are close to 1, or by setting a threshold μ₀ greater than 0.5 and selecting the eigenvectors whose eigenvalues are greater than μ₀. The matrix E formed by the M selected eigenvectors is written as:

$$E = [v_1, v_2, \ldots, v_M],$$

where $v_k$ denotes the k-th eigenvector, k = 1, 2, …, M, and the eigenvectors are column vectors. A spatial spectrum function P(·) is constructed from the string sub-vectors; the spatial spectrum function P(i, j) corresponding to the i-th string sub-vector of a and the j-th string sub-vector of b is calculated as:

$$P(i,j) = a_i E E^H b_j^T,$$

The spatial spectrum function is calculated for every pairwise combination of the string sub-vectors from string set A and string set B, and the index pair (i, j) at which the spatial spectrum function reaches its maximum is found. The string sub-vectors with these indices are then mapped back to the corresponding character strings in string set A and string set B; these are the character strings in which the malicious sample is located.
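The computation in step S4 can be sketched with NumPy as below. This is a hedged illustration under several assumptions that the source does not state. The function name `spatial_spectrum_peak` is hypothetical. The inputs are assumed to be numeric vectors already, so the string-to-number conversion is not shown. Each vector is assumed to hold N·N values, so that every sub-vector has length N and the product $a_i E E^H b_j^T$ is dimensionally well-defined. Eigenvalues are compared with μ₀ by absolute value, because R is not necessarily symmetric and its eigenvalues may be complex. Indices in the code are 0-based.

```python
import numpy as np


def spatial_spectrum_peak(a, b, n, mu0=0.5):
    """Return the 0-based (i, j) maximizing |P(i, j)|, together with |P|."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.size != n * n or b.size != n * n:
        raise ValueError("this sketch assumes each vector holds n * n values")

    A = a.reshape(n, n)  # row i is the sub-vector a_i
    B = b.reshape(n, n)  # row j is the sub-vector b_j
    R = A @ B.T          # R[i, j] = a_i b_j^T

    eigvals, eigvecs = np.linalg.eig(R)  # columns of eigvecs are eigenvectors
    keep = np.abs(eigvals) > mu0         # assumption: screen by magnitude
    if not keep.any():
        raise ValueError("no eigenvalue exceeds mu0")
    E = eigvecs[:, keep]                 # n x M matrix of selected eigenvectors

    # P[i, j] = a_i E E^H b_j^T for every pair of sub-vectors at once.
    P = np.abs(A @ E @ E.conj().T @ B.T)
    i, j = np.unravel_index(np.argmax(P), P.shape)
    return (int(i), int(j)), P


# Hypothetical numeric vectors with n = 2 sub-vectors of length 2 each.
peak, P = spatial_spectrum_peak([1, 0, 0, 2], [1, 0, 0, 2], n=2)
print(peak)
```

The returned index pair identifies which sub-vector of a and which sub-vector of b produce the spectrum peak; mapping those sub-vectors back to the original strings requires keeping track of which strings were placed in each segment.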

Compared with the prior art, the invention has the beneficial effects that: the invention provides a novel method for judging the similarity of malicious samples, which does not need complicated operations such as extraction of the features of the malicious samples and the like, and can improve the efficiency of judging the similarity of the malicious samples.

Drawings

Fig. 1 is a flow chart of malicious sample string set construction.

Detailed Description

For a better understanding of the present disclosure, an example is given here. Fig. 1 is a flow chart of malicious sample string set construction.

The example is described by selecting two malicious samples to perform similarity judgment, and the specific implementation process is as follows:

Step S1 specifically includes:

S11, analyzing malicious sample I with the String command, extracting its character strings, establishing a corresponding sample string set A, and adding the extracted character strings to the sample string set A one by one;

S12, analyzing malicious sample II with the String command, extracting its character strings, establishing a corresponding sample string set B, and adding the extracted character strings to the sample string set B one by one;
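Steps S11 and S12 can be approximated in Python as below. This is a sketch under assumptions: it supposes that the String command behaves like the Unix strings utility, which by default prints runs of at least four printable characters. The helper name `extract_string_set`, the restriction to printable ASCII, and the sample bytes are hypothetical and not taken from the source.

```python
import re
from typing import Set


def extract_string_set(data: bytes, min_len: int = 4) -> Set[str]:
    """Collect runs of printable ASCII bytes of at least min_len into a set."""
    # Bytes 0x20-0x7e are the printable ASCII characters (space through tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {match.decode("ascii") for match in pattern.findall(data)}


# Hypothetical byte content standing in for a sample file.
sample = b"\x00\x01Symbol\x00ab\x00CreateFileA\xff"
string_set = extract_string_set(sample)
print(sorted(string_set))
```

In practice the bytes would be read from the sample file in binary mode. Using a set means repeated strings are stored once, which matches the set-based definition of the Jaccard coefficient.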

$$a = [a_1, a_2, \ldots, a_N], \quad b = [b_1, b_2, \ldots, b_N],$$

$$R = a^T b,$$

where the element in the i-th row and j-th column of the cross-correlation matrix R is $R_{ij} = a_i b_j^T$, $a_i$ denotes the i-th string sub-vector of the string numerical vector a, i = 1, 2, …, N, and $b_j$ denotes the j-th string sub-vector of the string numerical vector b, j = 1, 2, …, N. Eigendecomposition is performed on the cross-correlation matrix R to obtain N eigenvectors and eigenvalues, and the eigenvectors are screened according to the eigenvalues, for example by selecting the eigenvectors whose eigenvalues are close to 1, or by setting a threshold μ₀ greater than 0.5 and selecting the eigenvectors whose eigenvalues are greater than μ₀. The matrix E formed by the selected eigenvectors is written as:

$$E = [v_1, v_2, \ldots, v_M],$$

$$P(i,j) = a_i E E^H b_j^T,$$

The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A malicious sample similarity judgment method based on Jaccard coefficients is characterized by specifically comprising the following steps:

wherein, | · | represents the number of elements in the calculation set, J (A, B) belongs to [0,1], and when the sets A and B are both empty, the value of J (A, B) is defined to be 1;

S3, setting a judgment threshold μ according to the Jaccard coefficient value calculated in step S2; if the calculated Jaccard coefficient is greater than μ, malicious sample I and malicious sample II are judged to be strongly similar; if the calculated Jaccard coefficient is smaller than μ, the two malicious samples are judged not to be strongly similar;

S4, for malicious sample I and malicious sample II judged to be strongly similar, converting each character string in the corresponding string set A and string set B into a number to obtain two string numerical vectors a and b, and dividing each of a and b equally into N string sub-vectors to obtain:

$$a = [a_1, a_2, \ldots, a_N], \quad b = [b_1, b_2, \ldots, b_N],$$

$$R = a^T b,$$

where the element in the i-th row and j-th column of the cross-correlation matrix R is $R_{ij} = a_i b_j^T$, $a_i$ denotes the i-th string sub-vector of the string numerical vector a, i = 1, 2, …, N, and $b_j$ denotes the j-th string sub-vector of the string numerical vector b, j = 1, 2, …, N; performing eigendecomposition on the cross-correlation matrix R to obtain N eigenvectors and eigenvalues, screening the eigenvectors according to the magnitude of the eigenvalues, and writing the matrix E formed by the M selected eigenvectors as:

$$E = [v_1, v_2, \ldots, v_M],$$

$$P(i,j) = a_i E E^H b_j^T,$$

2. The method for determining similarity of malicious samples according to claim 1, wherein the step S1 specifically includes:

and S12, analyzing malicious sample II with the String command, extracting its character strings, establishing a corresponding sample string set B, and adding the extracted character strings to the sample string set B one by one.

3. The method for determining similarity of malicious samples according to claim 1, wherein screening the eigenvectors according to the magnitude of the eigenvalues specifically comprises selecting the eigenvectors whose eigenvalues are close to 1, or setting a threshold μ₀ and selecting the eigenvectors whose eigenvalues are greater than μ₀, the threshold μ₀ being greater than 0.5.