CN113378165A - Malicious sample similarity judgment method based on Jaccard coefficient - Google Patents
Malicious sample similarity judgment method based on Jaccard coefficient
- Publication number
- CN113378165A
- Authority
- CN
- China
- Prior art keywords
- character string
- sample
- malicious
- vectors
- malicious sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/03—Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
- G06F2221/033—Test or assess software
Abstract
The invention discloses a method for judging the similarity of malicious samples based on the Jaccard coefficient, which specifically comprises the following steps: parsing malicious sample I and malicious sample II separately with the String command, extracting the malicious sample character strings, and converting the extracted strings into sample string sets A and B respectively; calculating the Jaccard coefficient between the sample string sets A and B; setting a threshold, and judging that malicious sample I and malicious sample II are strongly similar if the calculated Jaccard coefficient is greater than the threshold; and, for a malicious sample I and a malicious sample II that are strongly similar, finding the character strings in which the malicious sample is located by means of a spatial spectrum function. The invention provides a novel malicious sample similarity judgment method that requires no complicated operations such as malicious sample feature extraction and can improve the efficiency of malicious sample similarity judgment.
Description
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a malicious sample similarity judgment method based on a Jaccard coefficient.
Background
Different computer network malicious samples generally have different functional characteristics, and these characteristics determine each sample's internal structure, so the similarity between malicious samples can be judged by extracting their features. At present, technical schemes for judging whether malicious samples are similar mainly construct a machine learning model and complete the detection and judgment by extracting features of the malicious samples. In such schemes, features must be extracted from every malicious sample, preprocessed, and converted into feature vectors; the feature vectors are then fed into the machine learning model, and conclusions such as whether the samples are similar are drawn from output metrics such as accuracy and precision. These schemes require not only data preprocessing but also continual parameter tuning and optimization of the detection model; the implementation process is complex, and stable, reliable results cannot be obtained quickly.
In addition, to keep their malicious code from being detected, attackers shuffle the characters of some common strings in the malicious code sample; for example, Symbol is changed to lbsymo. During malicious sample analysis, meaningless runs of seemingly garbled characters are often encountered, and further analysis shows that this garbled text is also a variant of the malicious code sample. How to detect and locate malicious samples containing such out-of-order character strings is likewise a problem in urgent need of a solution.
Disclosure of Invention
To address the problems that existing machine-learning-based methods for detecting computer network malicious samples are complex to implement and cannot quickly yield stable and reliable results, and to locate malicious samples containing out-of-order character strings, the invention discloses a malicious sample similarity judgment method based on the Jaccard coefficient. The larger the Jaccard coefficient, the more similar the two malicious samples. On this basis, to detect and locate malicious samples with out-of-order character strings, the invention uses the statistical characteristics of the sample code to construct a spatial spectrum of the two malicious sample string sequences, and locates the malicious samples with out-of-order character strings by a spatial spectrum estimation method.
The Jaccard coefficient is used to compare the similarity and difference between finite sample sets: the larger the Jaccard coefficient, the higher the similarity of the corresponding samples. Given two sets A and B, the Jaccard coefficient is the ratio of the size of the intersection of A and B to the size of the union of A and B, and is calculated as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|},$$
wherein $J(A, B) \in [0, 1]$, and when the sets A and B are both empty, J(A, B) is defined as 1.
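The definition above, including the empty-set convention, can be sketched in a few lines of Python. The function name `jaccard` is chosen here for illustration and does not come from the patent text.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two finite sets."""
    # Convention from the text: two empty sets are treated as identical.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Because the union is never smaller than the intersection, the result always lies in [0, 1], reaching 1 only when the two sets are equal.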
The method disclosed by the invention does not need to extract features or other attributes of the malicious samples directly. It only needs to parse the content of each malicious sample into a sample string set with the String command, calculate the Jaccard coefficients between the sample string sets, and, after averaging the calculation results, make the final similarity judgment, from which the similarity between the malicious samples is inferred.
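As a rough illustration of turning a sample into a string set, the sketch below assumes that the "String command" behaves like a `strings`-style utility, that is, it reports runs of printable ASCII characters of some minimum length. The function name `extract_strings` and the default minimum length of 4 are assumptions made for this example, not details given in the text.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> set:
    """Return the set of printable-ASCII runs of at least min_len bytes."""
    # Bytes 0x20..0x7e are the printable ASCII characters, space included.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # A set drops repeated strings, which is what the Jaccard step needs.
    return {run.decode("ascii") for run in pattern.findall(data)}
```

Reading a sample file in binary mode and passing its bytes to this function yields one string set per sample; two such sets can then be compared with the Jaccard coefficient.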
The invention discloses a method for judging similarity of malicious samples based on Jaccard coefficients, which specifically comprises the following steps:
S1, parsing malicious sample I and malicious sample II separately with the String command, extracting the malicious sample character strings, and converting the extracted strings into sample string sets A and B respectively;
S2, calculating the Jaccard coefficient between the sample string sets A and B; the Jaccard coefficient is used to compare the similarity and difference between finite sample sets, and the Jaccard coefficient between string set A and string set B is calculated as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|},$$
wherein |·| denotes the number of elements in a set, $J(A, B) \in [0, 1]$, and when the sets A and B are both empty, J(A, B) is defined as 1.
S3, setting a judgment threshold μ according to the Jaccard coefficient value calculated in step S2; if the calculated Jaccard coefficient is greater than μ, judging that malicious sample I and malicious sample II are strongly similar; if it is smaller than μ, judging that the two malicious samples are not strongly similar.
S4, for a malicious sample I and a malicious sample II judged to be strongly similar, converting each character string in the corresponding string set A and string set B into a number (for example, using the atof function to convert the string into a double-precision floating-point number) to obtain two string numerical vectors a and b, and dividing each of a and b equally into N string sub-vectors, giving:
$$a = [a_1, a_2, \ldots, a_N], \quad b = [b_1, b_2, \ldots, b_N],$$
taking the string sub-vectors as basic elements, calculating the cross-correlation matrix R of the two string numerical vectors a and b:
$$R = a^{T} b,$$
wherein the element in row i and column j of the cross-correlation matrix R is $R_{ij} = a_i b_j^{T}$, $a_i$ denotes the i-th string sub-vector of the string numerical vector a, i = 1, 2, …, N, and $b_j$ denotes the j-th string sub-vector of the string numerical vector b, j = 1, 2, …, N. Performing eigendecomposition on the cross-correlation matrix R to obtain N eigenvectors and eigenvalues, and screening the eigenvectors according to the eigenvalues, for example selecting the eigenvectors whose eigenvalues are close to 1, or setting a threshold $\mu_0$ greater than 0.5 and selecting the eigenvectors whose eigenvalues exceed $\mu_0$; the matrix E formed by the M selected eigenvectors is written:
$$E = [v_1, v_2, \ldots, v_M],$$
wherein $v_k$ denotes the k-th eigenvector, k = 1, 2, …, M, and the eigenvectors are column vectors; constructing a spatial spectrum function P(·) from the string sub-vectors, wherein the spatial spectrum function P(i, j) corresponding to the i-th string sub-vector of a and the j-th string sub-vector of b is calculated as:
$$P(i, j) = a_i E E^{H} b_j^{T},$$
and, for every pairwise combination of string sub-vectors from string set A and string set B, calculating the corresponding spatial spectrum function value, finding the index combination (i, j) at which the spatial spectrum function reaches its maximum, determining the string sub-vectors with those indices in string set A and string set B, and mapping them back to the corresponding character strings in string set A and string set B, which are the character strings in which the malicious sample is located.
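Step S4 can be sketched with NumPy as follows. This is only one possible reading of the step, under assumptions the text does not state: each numerical vector holds N × N values, so every sub-vector has length N and the product $a_i E E^{H} b_j^{T}$ is dimensionally consistent; eigenvalues are compared by magnitude after scaling by the largest one; and the peak is taken over |P(i, j)|. The function name `locate_peak` is hypothetical, and the string-to-number conversion is left out, so the inputs are already numeric.

```python
import numpy as np


def locate_peak(a, b, n, mu0=0.5):
    """Return (i, j, P) where |P[i, j]| is the largest spatial-spectrum value.

    Assumptions (not stated in the text): a and b each hold n * n numbers,
    so every sub-vector has length n; eigenvalues are compared by magnitude
    after scaling by the largest magnitude.
    """
    sub_a = np.asarray(a, dtype=float).reshape(n, n)  # row i is a_i
    sub_b = np.asarray(b, dtype=float).reshape(n, n)  # row j is b_j

    # Cross-correlation matrix: R[i, j] = a_i b_j^T.
    R = sub_a @ sub_b.T

    # Eigendecomposition; R need not be symmetric, so results may be complex.
    eigvals, eigvecs = np.linalg.eig(R)

    # Keep eigenvectors whose scaled eigenvalue magnitude exceeds mu0.
    mags = np.abs(eigvals)
    if mags.max() > 0:
        mags = mags / mags.max()
    keep = mags > mu0
    if not keep.any():
        keep[np.argmax(mags)] = True  # fall back to the dominant eigenvector

    E = eigvecs[:, keep]  # columns v_1 ... v_M

    # P[i, j] = a_i E E^H b_j^T for every pair (i, j) at once.
    P = sub_a @ (E @ E.conj().T) @ sub_b.T

    i, j = np.unravel_index(np.argmax(np.abs(P)), P.shape)
    return int(i), int(j), P
```

The returned indices identify one sub-vector from each sample. Mapping them back to character strings requires remembering which strings were placed in which sub-vector, a bookkeeping detail this sketch omits.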
Compared with the prior art, the invention has the beneficial effects that: the invention provides a novel method for judging the similarity of malicious samples, which does not need complicated operations such as extraction of the features of the malicious samples and the like, and can improve the efficiency of judging the similarity of the malicious samples.
Drawings
Fig. 1 is a flow chart of malicious sample string set construction.
Detailed Description
For a better understanding of the present disclosure, an example is given here. Fig. 1 is a flow chart of malicious sample string set construction.
The example is described by selecting two malicious samples to perform similarity judgment, and the specific implementation process is as follows:
The invention discloses a method for judging the similarity of malicious samples based on the Jaccard coefficient, which specifically comprises the following steps:
S1, parsing malicious sample I and malicious sample II separately with the String command, extracting the malicious sample character strings, and converting the extracted strings into sample string sets A and B respectively;
Step S1 specifically includes:
S11, parsing malicious sample I with the String command, extracting the malicious sample character strings, creating the corresponding sample string set A, and adding the parsed character strings to the sample string set A one by one;
S12, parsing malicious sample II with the String command, extracting the malicious sample character strings, creating the corresponding sample string set B, and adding the parsed character strings to the sample string set B one by one;
S2, calculating the Jaccard coefficient between the sample string sets A and B; the Jaccard coefficient is used to compare the similarity and difference between finite sample sets, and the Jaccard coefficient between string set A and string set B is calculated as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|},$$
wherein |·| denotes the number of elements in a set, $J(A, B) \in [0, 1]$, and when the sets A and B are both empty, J(A, B) is defined as 1.
S3, setting a judgment threshold μ according to the Jaccard coefficient value calculated in step S2; if the calculated Jaccard coefficient is greater than μ, judging that malicious sample I and malicious sample II are strongly similar; if it is smaller than μ, judging that the two malicious samples are not strongly similar.
S4, for a malicious sample I and a malicious sample II judged to be strongly similar, converting each character string in the corresponding string set A and string set B into a number (for example, using the atof function to convert the string into a double-precision floating-point number) to obtain two string numerical vectors a and b, and dividing each of a and b equally into N string sub-vectors, giving:
$$a = [a_1, a_2, \ldots, a_N], \quad b = [b_1, b_2, \ldots, b_N],$$
taking the string sub-vectors as basic elements, calculating the cross-correlation matrix R of the two string numerical vectors a and b:
$$R = a^{T} b,$$
wherein the element in row i and column j of the cross-correlation matrix R is $R_{ij} = a_i b_j^{T}$, $a_i$ denotes the i-th string sub-vector of the string numerical vector a, i = 1, 2, …, N, and $b_j$ denotes the j-th string sub-vector of the string numerical vector b, j = 1, 2, …, N. Performing eigendecomposition on the cross-correlation matrix R to obtain N eigenvectors and eigenvalues, and screening the eigenvectors according to the eigenvalues, for example selecting the eigenvectors whose eigenvalues are close to 1, or setting a threshold $\mu_0$ greater than 0.5 and selecting the eigenvectors whose eigenvalues exceed $\mu_0$; the matrix E formed by the M selected eigenvectors is written:
$$E = [v_1, v_2, \ldots, v_M],$$
wherein $v_k$ denotes the k-th eigenvector, k = 1, 2, …, M, and the eigenvectors are column vectors; constructing a spatial spectrum function P(·) from the string sub-vectors, wherein the spatial spectrum function P(i, j) corresponding to the i-th string sub-vector of a and the j-th string sub-vector of b is calculated as:
$$P(i, j) = a_i E E^{H} b_j^{T},$$
and, for every pairwise combination of string sub-vectors from string set A and string set B, calculating the corresponding spatial spectrum function value, finding the index combination (i, j) at which the spatial spectrum function reaches its maximum, determining the string sub-vectors with those indices in string set A and string set B, and mapping them back to the corresponding character strings in string set A and string set B, which are the character strings in which the malicious sample is located.
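Steps S2 and S3 can be checked by hand on a toy case. The two string sets and the threshold μ = 0.4 below are hypothetical values chosen only for illustration; the text does not prescribe a particular threshold.

```latex
\begin{align*}
A &= \{\texttt{kernel32.dll},\ \texttt{CreateFileA},\ \texttt{WriteFile},\ \texttt{cmd.exe}\},\\
B &= \{\texttt{kernel32.dll},\ \texttt{CreateFileA},\ \texttt{WriteFile},\ \texttt{RegSetValueExA},\ \texttt{WinExec}\},\\
|A \cap B| &= 3, \qquad |A \cup B| = 4 + 5 - 3 = 6,\\
J(A, B) &= \frac{3}{6} = 0.5 > \mu = 0.4 .
\end{align*}
```

With these assumed values the two samples would be judged strongly similar in step S3 and passed on to step S4.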
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (3)
1. A malicious sample similarity judgment method based on Jaccard coefficients is characterized by specifically comprising the following steps:
S1, parsing malicious sample I and malicious sample II separately with the String command, extracting the malicious sample character strings, and converting the extracted strings into sample string sets A and B respectively;
S2, calculating the Jaccard coefficient between the sample string sets A and B; the Jaccard coefficient is used to compare the similarity and difference between finite sample sets, and the Jaccard coefficient between string set A and string set B is calculated as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|},$$
wherein |·| denotes the number of elements in a set, $J(A, B) \in [0, 1]$, and when the sets A and B are both empty, J(A, B) is defined as 1;
S3, setting a judgment threshold μ according to the Jaccard coefficient value calculated in step S2; if the calculated Jaccard coefficient is greater than μ, judging that malicious sample I and malicious sample II are strongly similar; if it is smaller than μ, judging that the two malicious samples are not strongly similar;
S4, for a malicious sample I and a malicious sample II judged to be strongly similar, converting each character string in the corresponding string set A and string set B into a number to obtain two string numerical vectors a and b, and dividing each of a and b equally into N string sub-vectors, giving:
$$a = [a_1, a_2, \ldots, a_N], \quad b = [b_1, b_2, \ldots, b_N],$$
taking the string sub-vectors as basic elements, calculating the cross-correlation matrix R of the two string numerical vectors a and b:
$$R = a^{T} b,$$
wherein the element in row i and column j of the cross-correlation matrix R is $R_{ij} = a_i b_j^{T}$, $a_i$ denotes the i-th string sub-vector of the string numerical vector a, i = 1, 2, …, N, and $b_j$ denotes the j-th string sub-vector of the string numerical vector b, j = 1, 2, …, N; performing eigendecomposition on the cross-correlation matrix R to obtain N eigenvectors and eigenvalues, screening the eigenvectors according to the magnitude of the eigenvalues, and writing the matrix E formed by the M selected eigenvectors as:
$$E = [v_1, v_2, \ldots, v_M],$$
wherein $v_k$ denotes the k-th eigenvector, k = 1, 2, …, M, and the eigenvectors are column vectors; constructing a spatial spectrum function P(·) from the string sub-vectors, wherein the spatial spectrum function P(i, j) corresponding to the i-th string sub-vector of a and the j-th string sub-vector of b is calculated as:
$$P(i, j) = a_i E E^{H} b_j^{T},$$
and, for every pairwise combination of string sub-vectors from string set A and string set B, calculating the corresponding spatial spectrum function value, finding the index combination (i, j) at which the spatial spectrum function reaches its maximum, determining the string sub-vectors with those indices in string set A and string set B, and mapping them back to the corresponding character strings in string set A and string set B, which are the character strings in which the malicious sample is located.
2. The method for determining similarity of malicious samples according to claim 1, wherein the step S1 specifically includes:
S11, parsing malicious sample I with the String command, extracting the malicious sample character strings, creating the corresponding sample string set A, and adding the parsed character strings to the sample string set A one by one;
S12, parsing malicious sample II with the String command, extracting the malicious sample character strings, creating the corresponding sample string set B, and adding the parsed character strings to the sample string set B one by one.
3. The method for determining similarity of malicious samples according to claim 1, wherein screening the eigenvectors according to the magnitude of the eigenvalues specifically comprises selecting the eigenvectors whose eigenvalues are close to 1, or setting a threshold $\mu_0$ greater than 0.5 and selecting the eigenvectors whose eigenvalues exceed $\mu_0$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110711130.6A CN113378165B (en) | 2021-06-25 | 2021-06-25 | Malicious sample similarity judgment method based on Jaccard coefficient |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110711130.6A CN113378165B (en) | 2021-06-25 | 2021-06-25 | Malicious sample similarity judgment method based on Jaccard coefficient |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113378165A (en) | 2021-09-10 |
CN113378165B (en) | 2021-11-05 |
Family
ID=77579101
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110711130.6A Active CN113378165B (en) | 2021-06-25 | 2021-06-25 | Malicious sample similarity judgment method based on Jaccard coefficient |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113378165B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160269424A1 (en) * | 2015-03-13 | 2016-09-15 | Microsoft Technology Licensing, Llc | Large Scale Malicious Process Detection |
CN105245495A (en) * | 2015-08-27 | 2016-01-13 | 哈尔滨工程大学 | Similarity match based rapid detection method for malicious shellcode |
CN106709345A (en) * | 2015-11-17 | 2017-05-24 | 武汉安天信息技术有限责任公司 | Deep learning method-based method and system for deducing malicious code rules and equipment |
US10437996B1 (en) * | 2017-07-24 | 2019-10-08 | EMC IP Holding Company LLC | Classifying software modules utilizing similarity-based queries |
CN107679403A (en) * | 2017-10-11 | 2018-02-09 | 北京理工大学 | It is a kind of to extort software mutation detection method based on sequence alignment algorithms |
CN107908963A (en) * | 2018-01-08 | 2018-04-13 | 北京工业大学 | A kind of automatic detection malicious code core feature method |
CN110610084A (en) * | 2018-06-15 | 2019-12-24 | 武汉安天信息技术有限责任公司 | Dex file-based sample maliciousness determination method and related device |
CN112948828A (en) * | 2021-01-25 | 2021-06-11 | 厦门服云信息科技有限公司 | Binary program malicious code detection method, terminal device and storage medium |
Non-Patent Citations (3)
Title |
---|
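李莹珠: "Research on malicious code detection technology based on cloud computing", 《中国优秀博硕士学位论文全文数据库(硕士)》 * |
郭宏宇,冷冰,邓永晖: "Malicious code classification and detection based on an improved recurrent neural network", 《信息技术》 * |
霍跃: "Research on malware based on function similarity", 《中国优秀博硕学位论文全文数据库(硕士) 信息科技辑》 * |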
Also Published As
Publication number | Publication date |
---|---|
CN113378165B (en) | 2021-11-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||