CN113378165A - Malicious sample similarity judgment method based on Jaccard coefficient - Google Patents

Info

Publication number
CN113378165A
Authority
CN
China
Prior art keywords
character string
sample
malicious
vectors
malicious sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110711130.6A
Other languages
Chinese (zh)
Other versions
CN113378165B (en)
Inventor
任传伦
刘文瀚
吕帅
夏建民
张先国
刘晓影
王淮
俞赛赛
乌吉斯古愣
孟祥頔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cetc Cyberspace Security Research Institute Co Ltd
CETC 15 Research Institute
CETC 30 Research Institute
Original Assignee
Cetc Cyberspace Security Research Institute Co Ltd
CETC 15 Research Institute
CETC 30 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cetc Cyberspace Security Research Institute Co Ltd, CETC 15 Research Institute, and CETC 30 Research Institute
Priority to CN202110711130.6A (patent CN113378165B)
Publication of CN113378165A
Application granted
Publication of CN113378165B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software


Abstract

The invention discloses a method for judging the similarity of malicious samples based on the Jaccard coefficient, comprising the following steps: parsing malicious sample I and malicious sample II separately with the String command, extracting the character strings of each sample, and converting the extracted strings into sample string sets A and B, respectively; calculating the Jaccard coefficient between the sample string sets A and B; setting a threshold and judging that malicious sample I and malicious sample II are strongly similar if the calculated Jaccard coefficient is greater than the threshold; and, for a malicious sample I and a malicious sample II that are strongly similar, using a spatial spectrum function to find the character strings where the malicious sample is located. The invention provides a new method for judging the similarity of malicious samples that does not require complicated operations such as malicious sample feature extraction and can improve the efficiency of similarity judgment.

Description

Malicious sample similarity judgment method based on Jaccard coefficient
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a malicious sample similarity judgment method based on a Jaccard coefficient.
Background
Different malicious samples on computer networks generally have different functional characteristics, and these functional characteristics determine each sample's internal structure, so the similarity between malicious samples can be judged by extracting their features. At present, technical schemes for judging whether malicious samples are similar mainly build a machine learning model and complete the detection and judgment by extracting features from the malicious samples. In such schemes, features must be extracted from every malicious sample; the extracted features are preprocessed and converted into feature vectors, which are input into the machine learning model, and a conclusion on whether the samples are similar is drawn from output metrics such as accuracy and precision. These schemes require not only data preprocessing but also continual parameter tuning and model optimization, so the implementation process is complex and a stable, reliable result cannot be obtained quickly.
In addition, to keep their malicious code from being detected, attackers scramble the order of characters in some common strings in a malicious code sample, for example changing Symbol into lbsymo. During malicious sample analysis, analysts often encounter meaningless runs of characters that look like garbled text, and further analysis shows that such garbled text is also a variant of a malicious code sample. How to detect and locate malicious samples containing out-of-order strings is therefore also a problem that urgently needs to be solved.
Disclosure of Invention
To address the problems that existing machine-learning-based methods for detecting malicious samples on computer networks are complex to implement and cannot quickly produce stable, reliable results, and also to locate malicious samples containing out-of-order strings, the invention discloses a method for judging the similarity of malicious samples based on the Jaccard coefficient. A larger Jaccard coefficient indicates a higher similarity between two malicious samples. On this basis, to detect and locate malicious samples containing out-of-order strings, the invention uses the statistical characteristics of the sample code to construct a spatial spectrum of the two malicious samples' string sequences and locates the out-of-order strings by a spatial spectrum estimation method.
The Jaccard coefficient is used to compare the similarity and difference between finite sample sets; the larger its value, the higher the similarity of the samples. Given two sets A and B, the Jaccard coefficient is the ratio of the size of the intersection of A and B to the size of their union:
$$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$$
where J(A, B) ∈ [0, 1], and J(A, B) is defined as 1 when both A and B are empty.
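As a minimal sketch of this definition, the following Python function computes the Jaccard coefficient of two collections treated as sets, including the convention that two empty sets give 1. The function name `jaccard` is a hypothetical choice for illustration and does not come from the patent.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    # Convention stated in the text: J(A, B) = 1 when both sets are empty.
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)
```

For instance, two sets that share two of four distinct elements in total give a coefficient of 2/4, while a non-empty set compared with an empty one gives 0.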
The disclosed method does not need to directly extract features or other attributes of the malicious samples. It only needs to parse the content of each malicious sample into a sample string set with the String command, calculate the Jaccard coefficients between the sample string sets according to the Jaccard coefficient principle, average the calculated results, and from that complete the final similarity judgment, thereby inferring the similarity between the malicious samples.
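The parsing step can be sketched in Python under the assumption that the "String command" behaves like a utility that lists runs of printable characters found in a binary file. The function name `extract_strings`, the printable-ASCII range, and the minimum run length of 4 are illustrative assumptions, not details given in the patent.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> set:
    """Return the set of printable-ASCII runs of at least min_len bytes.

    Assumption: this loosely imitates a strings-like utility; the patent
    does not specify the character classes or the minimum length.
    """
    # Bytes 0x20 to 0x7e are the printable ASCII characters (space to tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # Collecting into a set drops duplicates, matching the "string set" idea.
    return {run.decode("ascii") for run in pattern.findall(data)}
```

A sample file could be read with `open(path, "rb").read()` and passed to this function; the two resulting sets then play the roles of A and B.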
The invention discloses a method for judging similarity of malicious samples based on Jaccard coefficients, which specifically comprises the following steps:
S1, parsing malicious sample I and malicious sample II separately with the String command, extracting the character strings of each malicious sample, and converting the extracted strings into sample string sets A and B, respectively;
S2, calculating the Jaccard coefficient between the sample string sets A and B; the Jaccard coefficient is used to compare the similarity and difference between finite sample sets, and the Jaccard coefficient between string set A and string set B is calculated as:
$$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$$
where |·| denotes the number of elements in a set, J(A, B) ∈ [0, 1], and J(A, B) is defined as 1 when both A and B are empty;
S3, setting a judgment threshold μ for the Jaccard coefficient calculated in step S2; if the calculated Jaccard coefficient is greater than μ, malicious sample I and malicious sample II are judged to be strongly similar; if it is less than μ, the two malicious samples are judged not to be strongly similar;
S4, for a malicious sample I and a malicious sample II judged to be strongly similar, converting each character string in the corresponding string sets A and B into a number (for example, using the atof function to convert the string into a double-precision floating-point number) to obtain two string numerical vectors a and b, and dividing each of a and b equally into N string sub-vectors, giving:
$$a = [a_1, a_2, \dots, a_N], \quad b = [b_1, b_2, \dots, b_N],$$
taking the string sub-vectors as basic elements, the cross-correlation matrix R of the two string numerical vectors a and b is computed as:
$$R = a^T b,$$
where the element in the i-th row and j-th column of R is $R_{ij} = a_i b_j^T$, $a_i$ denotes the i-th string sub-vector of a (i = 1, 2, …, N), and $b_j$ denotes the j-th string sub-vector of b (j = 1, 2, …, N). Eigendecomposition of R yields N eigenvectors and eigenvalues, and the eigenvectors are screened according to the magnitude of their eigenvalues, for example by keeping the eigenvectors whose eigenvalues are close to 1, or by setting a threshold $\mu_0$ greater than 0.5 and keeping the eigenvectors whose eigenvalues exceed $\mu_0$. The matrix E formed by the M screened eigenvectors is written as:
$$E = [v_1, v_2, \dots, v_M],$$
where $v_k$ denotes the k-th eigenvector (k = 1, 2, …, M) and each eigenvector is a column vector. A spatial spectrum function P(·) is constructed from the string sub-vectors; for the i-th string sub-vector of a and the j-th string sub-vector of b it is calculated as:
$$P(i,j) = a_i E E^H b_j^T$$
The spatial spectrum function is evaluated for every pairwise combination of string sub-vectors from string set A and string set B; the index pair (i, j) at which it attains its maximum is found, the corresponding string sub-vectors are identified, and they are mapped back to the corresponding strings in string set A and string set B, which are the strings where the malicious sample is located.
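Step S4 can be sketched with NumPy, starting from two numeric vectors a and b that have already been obtained from the strings. Several points are assumptions made only so the sketch is well defined, since the text does not fix them: each vector has N·N entries, so every sub-vector has length N and the product of a sub-vector with E is conformable; eigenvalues are screened by magnitude, because R need not be symmetric and its eigenvalues may be complex; the peak is taken over the magnitude of P; and the function name `spatial_spectrum_peak` is hypothetical.

```python
import numpy as np


def spatial_spectrum_peak(a, b, n_segments, mu0=0.5):
    """Return (i, j, P): the 0-based sub-vector indices where |P(i, j)| peaks.

    Assumption: len(a) == len(b) == n_segments ** 2, so that each sub-vector
    has length N and a_i @ E is defined. The patent does not state this.
    """
    # Row i of A is sub-vector a_i; row j of B is sub-vector b_j.
    A = np.asarray(a, dtype=float).reshape(n_segments, n_segments)
    B = np.asarray(b, dtype=float).reshape(n_segments, n_segments)

    # Cross-correlation matrix with entries R_ij = a_i b_j^T.
    R = A @ B.T

    # Eigendecomposition; columns of eigvecs are the eigenvectors of R.
    eigvals, eigvecs = np.linalg.eig(R)

    # Assumption: screen by eigenvalue magnitude against the threshold mu0.
    E = eigvecs[:, np.abs(eigvals) > mu0]

    # Spatial spectrum P(i, j) = a_i E E^H b_j^T for all pairs at once.
    # If no eigenvector passes the screen, E has no columns and P is all zeros.
    P = A @ E @ E.conj().T @ B.T

    # Index pair at which the magnitude of the spectrum is largest.
    magnitude = np.abs(P)
    i, j = np.unravel_index(np.argmax(magnitude), magnitude.shape)
    return int(i), int(j), P
```

The returned indices are 0-based, so they correspond to sub-vectors i + 1 and j + 1 in the notation of the text. Mapping them back to the original strings depends on how the strings were ordered and converted to numbers, which is left out here.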
Compared with the prior art, the invention has the beneficial effects that: the invention provides a novel method for judging the similarity of malicious samples, which does not need complicated operations such as extraction of the features of the malicious samples and the like, and can improve the efficiency of judging the similarity of the malicious samples.
Drawings
Fig. 1 is a flow chart of malicious sample string set construction.
Detailed Description
For a better understanding of the present disclosure, an example is given here. Fig. 1 is a flow chart of malicious sample string set construction.
This example selects two malicious samples for similarity judgment; the specific implementation process is as follows.
The method for judging the similarity of malicious samples based on the Jaccard coefficient specifically comprises the following steps:
S1, parsing malicious sample I and malicious sample II separately with the String command, extracting the character strings of each malicious sample, and converting the extracted strings into sample string sets A and B, respectively;
step S1 specifically includes:
S11, parsing malicious sample I with the String command, extracting its character strings, establishing the corresponding sample string set A, and adding the parsed strings to sample string set A one by one;
S12, parsing malicious sample II with the String command, extracting its character strings, establishing the corresponding sample string set B, and adding the parsed strings to sample string set B one by one;
S2, calculating the Jaccard coefficient between the sample string sets A and B; the Jaccard coefficient is used to compare the similarity and difference between finite sample sets, and the Jaccard coefficient between string set A and string set B is calculated as:
$$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$$
where |·| denotes the number of elements in a set, J(A, B) ∈ [0, 1], and J(A, B) is defined as 1 when both A and B are empty;
S3, setting a judgment threshold μ for the Jaccard coefficient calculated in step S2; if the calculated Jaccard coefficient is greater than μ, malicious sample I and malicious sample II are judged to be strongly similar; if it is less than μ, the two malicious samples are judged not to be strongly similar;
S4, for a malicious sample I and a malicious sample II judged to be strongly similar, converting each character string in the corresponding string sets A and B into a number (for example, using the atof function to convert the string into a double-precision floating-point number) to obtain two string numerical vectors a and b, and dividing each of a and b equally into N string sub-vectors, giving:
$$a = [a_1, a_2, \dots, a_N], \quad b = [b_1, b_2, \dots, b_N],$$
taking the string sub-vectors as basic elements, the cross-correlation matrix R of the two string numerical vectors a and b is computed as:
$$R = a^T b,$$
where the element in the i-th row and j-th column of R is $R_{ij} = a_i b_j^T$, $a_i$ denotes the i-th string sub-vector of a (i = 1, 2, …, N), and $b_j$ denotes the j-th string sub-vector of b (j = 1, 2, …, N). Eigendecomposition of R yields N eigenvectors and eigenvalues, and the eigenvectors are screened according to the magnitude of their eigenvalues, for example by keeping the eigenvectors whose eigenvalues are close to 1, or by setting a threshold $\mu_0$ greater than 0.5 and keeping the eigenvectors whose eigenvalues exceed $\mu_0$. The matrix E formed by the screened eigenvectors is written as:
$$E = [v_1, v_2, \dots, v_M],$$
where $v_k$ denotes the k-th eigenvector (k = 1, 2, …, M) and each eigenvector is a column vector. A spatial spectrum function P(·) is constructed from the string sub-vectors; for the i-th string sub-vector of a and the j-th string sub-vector of b it is calculated as:
$$P(i,j) = a_i E E^H b_j^T$$
The spatial spectrum function is evaluated for every pairwise combination of string sub-vectors from string set A and string set B; the index pair (i, j) at which it attains its maximum is found, the corresponding string sub-vectors are identified, and they are mapped back to the corresponding strings in string set A and string set B, which are the strings where the malicious sample is located.
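Steps S2 and S3 of this embodiment amount to a threshold test on the Jaccard coefficient of the two string sets, which the following Python sketch illustrates. The names `jaccard` and `judge_similarity`, the default threshold of 0.5, and the sample strings are hypothetical; the patent gives no value for μ and does not say how a coefficient exactly equal to μ is treated, so the sketch treats that case as not strongly similar.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets, defined as 1 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def judge_similarity(strings_a, strings_b, mu=0.5):
    """Return (strongly_similar, coefficient) for two collections of strings.

    Assumption: mu = 0.5 is an illustrative threshold, and a coefficient
    equal to mu counts as not strongly similar.
    """
    coefficient = jaccard(set(strings_a), set(strings_b))
    return coefficient > mu, coefficient


# Hypothetical string sets standing in for the output of step S1.
set_a = {"kernel32.dll", "CreateFileA", "WriteFile", "cmd.exe"}
set_b = {"kernel32.dll", "CreateFileA", "WriteFile", "powershell.exe"}
similar, coefficient = judge_similarity(set_a, set_b)
```

Here the two sets share three of five distinct strings, so the coefficient is 3/5 and the pair would be passed on to step S4 under the assumed threshold.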
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (3)

1. A method for judging the similarity of malicious samples based on the Jaccard coefficient, characterized by comprising the following steps:
S1, parsing malicious sample I and malicious sample II separately with the String command, extracting the character strings of each malicious sample, and converting the extracted strings into sample string sets A and B, respectively;
S2, calculating the Jaccard coefficient between the sample string sets A and B; the Jaccard coefficient is used to compare the similarity and difference between finite sample sets, and the Jaccard coefficient between string set A and string set B is calculated as:
$$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$$
where |·| denotes the number of elements in a set, J(A, B) ∈ [0, 1], and J(A, B) is defined as 1 when both A and B are empty;
S3, setting a judgment threshold μ for the Jaccard coefficient calculated in step S2; if the calculated Jaccard coefficient is greater than μ, malicious sample I and malicious sample II are judged to be strongly similar; if it is less than μ, the two malicious samples are judged not to be strongly similar;
S4, for a malicious sample I and a malicious sample II judged to be strongly similar, converting each character string in the corresponding string sets A and B into a number to obtain two string numerical vectors a and b, and dividing each of a and b equally into N string sub-vectors, giving:
$$a = [a_1, a_2, \dots, a_N], \quad b = [b_1, b_2, \dots, b_N],$$
taking the string sub-vectors as basic elements, and computing the cross-correlation matrix R of the two string numerical vectors a and b as:
$$R = a^T b,$$
where the element in the i-th row and j-th column of R is $R_{ij} = a_i b_j^T$, $a_i$ denotes the i-th string sub-vector of a (i = 1, 2, …, N), and $b_j$ denotes the j-th string sub-vector of b (j = 1, 2, …, N); performing eigendecomposition on the cross-correlation matrix R to obtain N eigenvectors and eigenvalues, screening the eigenvectors according to the magnitude of their eigenvalues, and writing the matrix E formed by the M screened eigenvectors as:
$$E = [v_1, v_2, \dots, v_M],$$
where $v_k$ denotes the k-th eigenvector (k = 1, 2, …, M) and each eigenvector is a column vector; constructing a spatial spectrum function P(·) from the string sub-vectors, where for the i-th string sub-vector of a and the j-th string sub-vector of b it is calculated as:
$$P(i,j) = a_i E E^H b_j^T$$
and evaluating the spatial spectrum function for every pairwise combination of string sub-vectors from string set A and string set B, finding the index pair (i, j) at which it attains its maximum, identifying the corresponding string sub-vectors, and mapping them back to the corresponding strings in string set A and string set B, which are the strings where the malicious sample is located.
2. The method for judging the similarity of malicious samples according to claim 1, wherein step S1 specifically includes:
S11, parsing malicious sample I with the String command, extracting its character strings, establishing the corresponding sample string set A, and adding the parsed strings to sample string set A one by one;
S12, parsing malicious sample II with the String command, extracting its character strings, establishing the corresponding sample string set B, and adding the parsed strings to sample string set B one by one.
3. The method for judging the similarity of malicious samples according to claim 1, wherein screening the eigenvectors according to the magnitude of their eigenvalues specifically means keeping the eigenvectors whose eigenvalues are close to 1, or setting a threshold $\mu_0$ greater than 0.5 and keeping the eigenvectors whose eigenvalues exceed $\mu_0$.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110711130.6A CN113378165B (en) 2021-06-25 2021-06-25 Malicious sample similarity judgment method based on Jaccard coefficient

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110711130.6A CN113378165B (en) 2021-06-25 2021-06-25 Malicious sample similarity judgment method based on Jaccard coefficient

Publications (2)

Publication Number Publication Date
CN113378165A (en) 2021-09-10
CN113378165B (en) 2021-11-05

Family

ID=77579101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110711130.6A Active CN113378165B (en) 2021-06-25 2021-06-25 Malicious sample similarity judgment method based on Jaccard coefficient

Country Status (1)

Country Link
CN (1) CN113378165B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105245495A (en) * 2015-08-27 2016-01-13 哈尔滨工程大学 Similarity match based rapid detection method for malicious shellcode
US20160269424A1 (en) * 2015-03-13 2016-09-15 Microsoft Technology Licensing, Llc Large Scale Malicious Process Detection
CN106709345A (en) * 2015-11-17 2017-05-24 武汉安天信息技术有限责任公司 Deep learning method-based method and system for deducing malicious code rules and equipment
CN107679403A (en) * 2017-10-11 2018-02-09 北京理工大学 It is a kind of to extort software mutation detection method based on sequence alignment algorithms
CN107908963A (en) * 2018-01-08 2018-04-13 北京工业大学 A kind of automatic detection malicious code core feature method
US10437996B1 (en) * 2017-07-24 2019-10-08 EMC IP Holding Company LLC Classifying software modules utilizing similarity-based queries
CN110610084A (en) * 2018-06-15 2019-12-24 武汉安天信息技术有限责任公司 Dex file-based sample maliciousness determination method and related device
CN112948828A (en) * 2021-01-25 2021-06-11 厦门服云信息科技有限公司 Binary program malicious code detection method, terminal device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
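李莹珠: "Research on Malicious Code Detection Technology Based on Cloud Computing", China Excellent Doctoral and Master's Theses Full-text Database (Master's) *
郭宏宇, 冷冰, 邓永晖: "Malicious Code Classification and Detection Based on an Improved Recurrent Neural Network", Information Technology *
霍跃: "Research on Malware Based on Function Similarity", China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology Series *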

Also Published As

Publication number Publication date
CN113378165B (en) 2021-11-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant