CN113468532B - Malicious software family inference method and system - Google Patents

Malicious software family inference method and system

Info

Publication number
CN113468532B
Authority
CN
China
Prior art keywords
family
result
sample
inference
scanning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110820216.2A
Other languages
Chinese (zh)
Other versions
CN113468532A (en)
Inventor
朱宏宇
田建伟
田峥
蒋永康
李生红
杨志邦
黎曦
李琪瑶
张宇翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Hunan Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Hunan Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Hunan Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Hunan Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Hunan Electric Power Co Ltd, Information and Telecommunication Branch of State Grid Hunan Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202110820216.2A priority Critical patent/CN113468532B/en
Publication of CN113468532A publication Critical patent/CN113468532A/en
Application granted granted Critical
Publication of CN113468532B publication Critical patent/CN113468532B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Abstract

The invention discloses a malware family inference method comprising: obtaining executable programs and constructing a sample set; querying the sample set and recording the query results; scanning the sample set with multiple antivirus engines and preprocessing the scan results; inferring the maliciousness of each sample with heuristic rules and outputting the inference result; and performing maximum-likelihood estimation modeling and solving to output the final family inference result. The invention also discloses a system for implementing the malware family inference method. The method addresses the label fluctuation of antivirus engines and their high false-positive rate on packed samples, reduces the influence of family popularity on the inference result, enables automated labeling of large-scale malware family data sets, and offers high accuracy and strong robustness.

Description

Malicious software family inference method and system
Technical Field
The invention belongs to the technical field of computer security, and particularly relates to a method and a system for inferring a malicious software family.
Background
Malware is a long-standing and increasingly serious problem. In recent years, outbreaks such as the Mirai botnet and the WannaCry ransomware have caused great losses to public infrastructure and to users' personal property. The growing complexity and scale of malicious programs have driven the security community to explore better analysis tools and methods.
Classification of malware is a problem that still needs improvement. Malware classification mainly involves two tasks: 1) malware detection, which distinguishes benign software from malicious software; and 2) malware family identification, which distinguishes different types of malware. Accurately detecting the maliciousness of a program can effectively block the spread of malware, and accurately identifying different malware families can effectively reduce manual analysis cost and trace the source of malicious behavior. Currently, antivirus engines mainly rely on feature signatures to accomplish these classification tasks. Feature signatures are essentially string pattern matches specific to already-detected malware; they must be created manually by security analysts and are labor intensive. Consequently, as polymorphic and metamorphic techniques are widely adopted by malware, keeping the signature database up to date becomes difficult.
Machine learning is viewed by the security community as the most promising approach to the malware family classification problem, and researchers are actively exploring machine-learning-based family classification models. As in computer vision and natural language processing, the primary prerequisite for building machine learning models is obtaining more and better labeled data; a good data set is one of the most effective ways to improve the accuracy of a machine learning system.
Manually labeling malware requires a solid professional background, and in the face of a huge number of program samples, ImageNet-style crowdsourced labeling (Jia Deng, 2010) is no longer applicable. At present the security community lacks a widely recognized malware data set, and academia has not studied automatic malware labeling in much depth; the ground truth used to validate most models is derived empirically from antivirus engine scan results. For labeling maliciousness, a commonly adopted strategy is to upload suspicious samples to VirusTotal, have them scanned by the dozens of antivirus engines that VirusTotal integrates, and then decide maliciousness through an N/K threshold strategy (N is the number of engines that flag the sample as malicious, K is the total number of engines). For example, Saxe and Berlin (2015) used the threshold N/K ≥ 0.3, and Incer et al. (2018) used the threshold N ≥ 4. For family labeling, the AVClass tool proposed by Marcos et al. (2016) is widely used; AVClass extracts family tags from many inconsistent antivirus engine results and then identifies the family by plurality vote. Unfortunately, although AVClass can extract family tags, the accuracy of its tags is significantly affected by family popularity.
A 2020 study by Zhu et al. on antivirus engines showed that choosing a reasonable k value can effectively label the maliciousness of samples. The same study also showed that: 1) an engine's maliciousness verdict for a sample fluctuates over time and generally stabilizes only after several months; 2) engines have a high false-positive rate on new samples; and 3) engines have a very high false-positive rate on packed samples.
in conclusion, malware family inference based on antivirus engines remains an open problem.
Disclosure of Invention
The invention aims to provide a malicious software family inference method with high accuracy and strong robustness.
The invention also aims to provide a system for realizing the malware family inference method.
The invention provides a malware family inference method comprising the following steps:
S1, obtaining executable programs and constructing a sample set;
S2, querying each executable program in the sample set and recording the results;
S3, scanning the executable programs in the sample set with multiple antivirus engines and preprocessing the scan results;
S4, inferring the maliciousness of each sample with heuristic rules based on the preprocessing result of step S3, and outputting the inference result;
S5, performing maximum-likelihood estimation modeling and solving based on the preprocessing result of step S3, and outputting and storing the family inference result.
In step S2, querying and recording the executable programs in the sample set specifically means querying whether an inference result already exists for the current program: if so, the inference result is returned directly; if not, the first-seen time and the scanning time of the program are recorded.
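For illustration only, the record keeping of step S2 could be sketched as follows with a simple SQLite table; the schema, the column names and the use of a SHA-256 hash as the sample key are assumptions of this example and are not prescribed by the invention.

```python
import hashlib
import sqlite3
import time

def lookup_or_register(path, conn):
    """Return the cached inference result if the sample is already known;
    otherwise record its first-seen and scan times so it can be scanned and inferred."""
    with open(path, "rb") as fh:
        sha256 = hashlib.sha256(fh.read()).hexdigest()
    row = conn.execute(
        "SELECT maliciousness, family FROM samples WHERE sha256 = ?", (sha256,)
    ).fetchone()
    if row is not None:
        return {"sha256": sha256, "maliciousness": row[0], "family": row[1]}
    now = time.time()
    conn.execute(
        "INSERT INTO samples (sha256, first_seen, scan_time) VALUES (?, ?, ?)",
        (sha256, now, now),   # in practice the first-seen time would come from external metadata
    )
    conn.commit()
    return None   # caller proceeds with scanning, maliciousness inference and family inference

# Example setup with an in-memory database and the assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE samples (sha256 TEXT PRIMARY KEY, first_seen REAL,
                scan_time REAL, maliciousness INTEGER, family TEXT)""")
```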
In step S3, scanning the executable programs in the sample set with multiple antivirus engines and preprocessing the scan results specifically means scanning the sample set with existing antivirus engines and processing the scan results as follows: identical-result filtering, result-label tokenization, generic-string and random-string filtering, malware family alias replacement, maliciousness calculation, and family maximum engine agreement calculation.
Identical-result filtering addresses the fact that different antivirus engines produce different label patterns when scanning the same executable program; if several antivirus engines give the same label to the same executable program, one of those labels is selected at random and kept.
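As an illustration of the identical-result filtering, result-label tokenization and generic/random-string filtering named above, a minimal Python sketch might look like the following; the token list, the randomness heuristic, the engine names and all function names are assumptions of this example rather than part of the invention.

```python
import re

# Hypothetical list of generic tokens that carry no family information.
GENERIC_TOKENS = {"trojan", "virus", "worm", "malware", "generic", "agent",
                  "win32", "w32", "heur", "variant", "application"}

def filter_identical_results(scan_results):
    """Identical-result filtering: keep only one engine label per distinct label string."""
    seen = {}
    for engine, label in scan_results.items():
        if label and label.lower() not in seen:
            seen[label.lower()] = (engine, label)
    return dict(seen.values())

def looks_random(token, digit_ratio=0.4):
    """Crude heuristic for random strings: longer tokens dominated by digits."""
    if len(token) < 6:
        return False
    digits = sum(c.isdigit() for c in token)
    return digits / len(token) >= digit_ratio

def tokenize_label(label):
    """Result-label tokenization plus generic-string and random-string filtering."""
    tokens = re.split(r"[^0-9a-zA-Z]+", label.lower())
    return [t for t in tokens
            if t and t not in GENERIC_TOKENS and not looks_random(t)]

# Example with made-up engine names and labels.
results = {"EngineA": "Trojan.Win32.Zbot.abcd1234",
           "EngineB": "W32/Zbot.variant",
           "EngineC": "Trojan.Win32.Zbot.abcd1234"}
deduped = filter_identical_results(results)
print({engine: tokenize_label(label) for engine, label in deduped.items()})
# -> {'EngineA': ['zbot'], 'EngineB': ['zbot']}
```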
Malware family alias replacement uses a malware family alias database to substitute aliases so that family names are unified. The alias database is learned with the following strategy: let A and B be two family names and A → B the rule mapping family name A to B; define the support support(A → B) = P(AB) as the probability that A and B occur together, and the confidence confidence(A → B) = P(B|A) = P(AB)/P(A) as the probability that B occurs when A occurs. If support(A → B) is above a set threshold, confidence(A → B) is above a set threshold, and A → B is manually judged to be a genuine alias relation, B is taken to be an alias of A; otherwise B is treated as a different name from A.
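The alias-learning strategy above can be illustrated with the following sketch, which counts co-occurring family names across samples and applies the support and confidence thresholds; the threshold values, the toy data and the final manual confirmation step are placeholders of this example, not values fixed by the invention.

```python
from collections import Counter
from itertools import combinations

def learn_alias_candidates(samples_tokens, min_support=0.05, min_confidence=0.9):
    """samples_tokens: list of sets, each holding the family names reported for one sample.
    Returns candidate rules A -> B (B suspected to be an alias of A) whose support and
    confidence exceed the thresholds; each candidate still requires manual review."""
    n = len(samples_tokens)
    single = Counter()          # counts for P(A)
    pair = Counter()            # counts for P(AB)
    for tokens in samples_tokens:
        single.update(tokens)
        pair.update(combinations(sorted(tokens), 2))

    candidates = []
    for (a, b), ab_count in pair.items():
        support = ab_count / n                      # support(A -> B) = P(AB)
        for x, y in ((a, b), (b, a)):               # test both rule directions
            confidence = ab_count / single[x]       # confidence(X -> Y) = P(XY) / P(X)
            if support >= min_support and confidence >= min_confidence:
                candidates.append((x, y, support, confidence))
    return candidates

# Example: three samples whose engines reported these family names.
samples = [{"zbot", "zeus"}, {"zbot", "zeus"}, {"zbot"}]
for a, b, s, c in learn_alias_candidates(samples, 0.3, 0.9):
    print(f"candidate alias rule {a} -> {b}: support={s:.2f}, confidence={c:.2f}")
```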
The maliciousness calculation computes the maliciousness E_1 as
E_1 = n / K
where n is the number of antivirus engines that identify the sample as malicious after result filtering and K is the total number of antivirus engines after result filtering.
The family maximum engine agreement calculation computes the family maximum engine agreement E_2 as
E_2 = max_j ( Σ_i f_{ij} )
where the sum runs over the antivirus engines i after result filtering, and f_{ij} indicates whether antivirus engine i gives family tag j: f_{ij} = 1 if engine i gives family tag j, and f_{ij} = 0 otherwise.
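For illustration, the two scores E_1 and E_2 can be computed directly from the filtered scan results, as in the following sketch; the engine names and the representation of a scan result as an engine-to-label mapping are assumptions of this example.

```python
from collections import Counter

def maliciousness(scan_results):
    """E1 = n / K: fraction of (filtered) engines that flag the sample as malicious.
    scan_results maps engine name -> family label, or None if the engine reports clean."""
    K = len(scan_results)
    n = sum(1 for label in scan_results.values() if label is not None)
    return n / K if K else 0.0

def family_max_agreement(scan_results):
    """E2 = max_j sum_i f_ij: the largest number of engines agreeing on one family tag."""
    votes = Counter(label for label in scan_results.values() if label is not None)
    return max(votes.values()) if votes else 0

# Example with hypothetical engines: 3 of 4 flag the sample, 2 agree on "zbot".
results = {"EngineA": "zbot", "EngineB": "zbot", "EngineC": "gamarue", "EngineD": None}
print(maliciousness(results))        # E1 = 3/4 = 0.75
print(family_max_agreement(results)) # E2 = 2
```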
In step S4, inferring the maliciousness of the sample with heuristic rules based on the preprocessing result of step S3 and outputting the inference result specifically includes the following steps, each of which outputs a Boolean result (True/False):
A. judge whether the difference between the scanning time and the first-seen time of the sample is greater than a set threshold, and output the judgment result r_a;
B. judge whether the sample is packed, and output the judgment result r_b;
C. judge whether the maliciousness of the sample is greater than a set threshold, and output the judgment result r_c;
D. judge whether the family maximum engine agreement of the sample is greater than a set threshold, and output the judgment result r_d;
E. from the judgment results of steps A to D, compute the maliciousness result r of the sample as
r = r_a & r_b & r_c & r_d
where & is the logical AND operation.
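Combining steps A to E, a minimal sketch of the heuristic maliciousness decision might look like the following; the threshold values, the packing check and the field names are illustrative assumptions, and the 0/1 convention for each judgment follows the examples given in the detailed description below.

```python
from dataclasses import dataclass

@dataclass
class SampleInfo:
    scan_time: float        # Unix timestamp of the scan
    first_seen: float       # Unix timestamp when the sample first appeared
    is_packed: bool         # result of a packer check (e.g. a PE entropy heuristic)
    e1: float               # maliciousness E1 from preprocessing
    e2: int                 # family maximum engine agreement E2 from preprocessing

def infer_maliciousness(s, age_threshold=90 * 86400, e1_threshold=0.3, e2_threshold=4):
    """Combine the four Boolean judgments with logical AND, as in r = r_a & r_b & r_c & r_d.
    Following the example convention of the embodiment, each rule outputs 0 when its
    condition holds and 1 otherwise."""
    r_a = 0 if (s.scan_time - s.first_seen) > age_threshold else 1
    r_b = 0 if s.is_packed else 1
    r_c = 0 if s.e1 > e1_threshold else 1
    r_d = 0 if s.e2 > e2_threshold else 1
    return r_a & r_b & r_c & r_d

sample = SampleInfo(scan_time=1_700_000_000, first_seen=1_695_000_000,
                    is_packed=False, e1=0.1, e2=2)
print(infer_maliciousness(sample))   # 1 under these illustrative thresholds
```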
The step S5 of performing maximum-likelihood estimation modeling and solving based on the preprocessing result obtained in step S3 to output the final family inference result specifically includes the following steps:
Suppose there are K antivirus engines, I samples and J families, that the true family of sample i is Y_i, and that Y_i follows a categorical distribution with probabilities θ = [θ_1, θ_2, ..., θ_J]; the corresponding likelihood function is defined as
h(θ, π) = ∏_{i=1}^{I} ∏_{j=1}^{J} [ θ_j ∏_{k=1}^{K} ∏_{l=1}^{J} (π_{jl}^{(k)})^{x_{il}^{(k)}} ]^{T_{ij}}
where T_{ij} is an indicator variable: if q is the true family of sample i then T_{iq} = 1 and T_{ij} = 0 for j ≠ q; θ_j is the probability that sample i comes from family j; π_{jl}^{(k)} is the probability that antivirus engine k identifies a family-j sample as family l; x_{il}^{(k)} indicates whether engine k labels sample i as family l; and both θ_j and π_{jl}^{(k)} are unknown. An expectation-maximization algorithm is used to find the θ_j and π_{jl}^{(k)} that maximize the function h, from which the true family label Ŷ_i of each sample is estimated.
The expectation-maximization algorithm is an iterative optimization process: θ_j and π_{jl}^{(k)} are first estimated from an initial estimate of the indicator T_{ij}, and then h, θ_j and π_{jl}^{(k)} are recomputed in a loop until the function h converges, at which point the algorithm ends and the true family labels are output.
The invention also provides a system for implementing the malware family inference method, comprising a data acquisition module, a database module, a scanning module, a maliciousness inference module and a family inference module connected in series in that order. The data acquisition module obtains executable programs and constructs the sample set; the database module queries each executable program in the sample set and records the results; the scanning module scans the executable programs in the sample set with multiple antivirus engines and preprocesses the scan results; the maliciousness inference module infers the maliciousness of each sample with heuristic rules based on the obtained preprocessing result and outputs the inference result; and the family inference module performs maximum-likelihood estimation modeling and solving based on the obtained preprocessing result to output the final family inference result.
The malware family inference method and system of the invention innovatively model malware family inference as a two-stage task comprising heuristic-rule-based maliciousness inference and expectation-maximization-based family inference. The former effectively handles the label fluctuation of antivirus engines and their high false-positive rate on packed samples; the maximum-likelihood modeling and expectation-maximization method reduce the influence of family popularity on the inference result as much as possible and improve its accuracy. The method can therefore effectively automate the labeling of large-scale malware family data sets, with high accuracy and strong robustness.
Drawings
FIG. 1 is a schematic flow diagram of the method of the present invention.
FIG. 2 is a flow chart of the database query for an executable program in an embodiment of the method of the present invention.
FIG. 3 is a flowchart of scan result preprocessing in an embodiment of the present invention.
FIG. 4 is a flowchart of maliciousness inference in an embodiment of the present invention.
FIG. 5 is a schematic diagram of the probability transitions in malware family inference in an embodiment of the present invention.
FIG. 6 is a functional block diagram of the system of the present invention.
Detailed Description
FIG. 1 is a schematic flow chart of the method of the present invention: the invention provides a malware family inference method, which comprises the following steps:
S1, obtaining executable programs and constructing a sample set;
S2, querying each executable program in the sample set and recording the results; specifically, query whether an inference result already exists for the current program: if so, return the inference result directly; if not, record the first-seen time and the scanning time of the program;
S3, scanning the executable programs in the sample set with multiple antivirus engines and preprocessing the scan results; specifically, the executable programs in the sample set are scanned with existing antivirus engines and the scan results are processed as follows: identical-result filtering, result-label tokenization, generic-string and random-string filtering, malware family alias replacement, maliciousness calculation, and family maximum engine agreement calculation;
In a specific implementation, identical-result filtering addresses the fact that different antivirus engines produce different label patterns when scanning the same executable program; if several antivirus engines give the same label to the same executable program, one of those labels is selected at random and kept;
Malware family alias replacement uses a malware family alias database to substitute aliases so that family names are unified; the alias database is learned with the following strategy: let A and B be two family names and A → B the rule mapping family name A to B; define the support support(A → B) = P(AB) as the probability that A and B occur together, and the confidence confidence(A → B) = P(B|A) = P(AB)/P(A) as the probability that B occurs when A occurs; if support(A → B) is above a set threshold, confidence(A → B) is above a set threshold, and A → B is manually judged to be a genuine alias relation, B is taken to be an alias of A; otherwise B is treated as a different name from A;
The maliciousness calculation computes the maliciousness E_1 as
E_1 = n / K
where n is the number of antivirus engines that identify the sample as malicious after result filtering and K is the total number of antivirus engines after result filtering;
The family maximum engine agreement calculation computes the family maximum engine agreement E_2 as
E_2 = max_j ( Σ_i f_{ij} )
where the sum runs over the antivirus engines i after result filtering, and f_{ij} indicates whether antivirus engine i gives family tag j: f_{ij} = 1 if engine i gives family tag j, and f_{ij} = 0 otherwise;
S4, inferring the maliciousness of the sample with heuristic rules based on the preprocessing result obtained in step S3, and outputting the inference result; this specifically includes the following steps, each of which outputs a Boolean result (True/False):
A. judge whether the difference between the scanning time and the first-seen time of the sample is greater than a set threshold, and output the judgment result r_a; for example, if the difference is greater than the set threshold, r_a = 0, otherwise r_a = 1;
B. judge whether the sample is packed, and output the judgment result r_b; for example, if the sample is packed, r_b = 0, otherwise r_b = 1;
C. judge whether the maliciousness of the sample is greater than a set threshold, and output the judgment result r_c; for example, if the maliciousness is greater than the set threshold, r_c = 0, otherwise r_c = 1;
D. judge whether the family maximum engine agreement of the sample is greater than a set threshold, and output the judgment result r_d; for example, if the family maximum engine agreement is greater than the set threshold, r_d = 0, otherwise r_d = 1;
E. from the judgment results of steps A to D, compute the maliciousness result r of the sample as
r = r_a & r_b & r_c & r_d
where & is the logical AND operation; for example, if any of the judgment results of steps A to D is 0, the final maliciousness result is 0;
S5, performing maximum-likelihood estimation modeling and solving based on the preprocessing result obtained in step S3 to output the final family inference result; this specifically includes the following steps:
Suppose there are K antivirus engines, I samples and J families, that the true family of sample i is Y_i, and that Y_i follows a categorical distribution with probabilities θ = [θ_1, θ_2, ..., θ_J]; the corresponding likelihood function is defined as
h(θ, π) = ∏_{i=1}^{I} ∏_{j=1}^{J} [ θ_j ∏_{k=1}^{K} ∏_{l=1}^{J} (π_{jl}^{(k)})^{x_{il}^{(k)}} ]^{T_{ij}}
where T_{ij} is an indicator variable: if q is the true family of sample i then T_{iq} = 1 and T_{ij} = 0 for j ≠ q; θ_j is the probability that sample i comes from family j; π_{jl}^{(k)} is the probability that antivirus engine k identifies a family-j sample as family l; x_{il}^{(k)} indicates whether engine k labels sample i as family l; and both θ_j and π_{jl}^{(k)} are unknown; an expectation-maximization algorithm is used to find the θ_j and π_{jl}^{(k)} that maximize the function h, from which the true family label Ŷ_i of each sample is estimated;
In a specific implementation, the expectation-maximization algorithm is an iterative optimization process: θ_j and π_{jl}^{(k)} are first estimated from an initial estimate of the indicator T_{ij}, and then h, θ_j and π_{jl}^{(k)} are recomputed in a loop until the function h converges, at which point the algorithm terminates and the true family labels are output;
In a specific implementation, a browser/server architecture is adopted: the modules are deployed on the server side to perform the core functions such as scanning and inference, while the browser side is used to upload malware in bulk and to present the malicious family inference results.
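As an illustration of this browser/server arrangement only, a server-side upload endpoint could be sketched roughly as follows; the use of Flask, the endpoint path and the two placeholder functions are assumptions of this example, not part of the claimed invention.

```python
# Minimal sketch only: Flask, the endpoint path and the placeholder functions below
# are illustrative assumptions, not the system's prescribed implementation.
import hashlib
from flask import Flask, request, jsonify

app = Flask(__name__)

def infer_maliciousness(data: bytes) -> int:
    """Placeholder for the heuristic maliciousness inference module (step S4)."""
    return 1

def infer_family(data: bytes) -> str:
    """Placeholder for the EM-based family inference module (step S5)."""
    return "unknown"

@app.route("/upload", methods=["POST"])
def upload_samples():
    verdicts = []
    for f in request.files.getlist("samples"):      # browser side uploads samples in bulk
        data = f.read()
        sha256 = hashlib.sha256(data).hexdigest()
        malicious = infer_maliciousness(data)
        family = infer_family(data) if malicious else None
        verdicts.append({"sha256": sha256, "malicious": malicious, "family": family})
    return jsonify(verdicts)                        # results are presented on the browser side
```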
The process of the invention is further illustrated below with reference to one example:
The method comprises the following steps:
S1, obtaining executable programs and constructing a sample set;
S2, querying each executable program in the sample set and recording the results; as shown in FIG. 2, this specifically includes:
computing the hash value of the sample and querying it in the database;
if no record exists, scanning the sample, performing maliciousness inference and family inference, and updating the database;
S3, scanning the executable programs in the sample set with multiple antivirus engines and preprocessing the scan results; as shown in FIG. 3, this specifically includes:
calling the antivirus engines to obtain the scan results;
filtering identical results;
tokenizing the result labels;
filtering generic strings and random strings;
replacing malware family aliases;
computing the maliciousness and the family maximum engine agreement;
S4, inferring the maliciousness of the sample with heuristic rules based on the preprocessing result obtained in step S3, and outputting the inference result; as shown in FIG. 4, this specifically includes:
judging the difference between the scanning time and the first-seen time of the sample;
judging whether the sample is packed;
judging the maliciousness of the sample;
judging the family maximum engine agreement;
and S5, based on the preprocessing result obtained in the step S3, performing maximum likelihood estimation modeling and solving, and outputting a final family inference result.
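As an illustration of the expectation-maximization solution of step S5, the following sketch estimates the family priors θ_j and the per-engine confusion matrices π_{jl}^{(k)} from the engine votes (a Dawid-Skene-style estimator); the initialization from normalized vote counts, the convergence tolerance and the toy data are assumptions of this example.

```python
import numpy as np

def em_family_inference(votes, n_iter=100, tol=1e-6):
    """votes[i, k] = index of the family that antivirus engine k assigns to sample i.
    Returns the estimated true family index for each sample."""
    I, K = votes.shape
    J = int(votes.max()) + 1
    # x[i, k, l] = 1 if engine k labels sample i as family l (one-hot encoding).
    x = np.zeros((I, K, J))
    x[np.arange(I)[:, None], np.arange(K)[None, :], votes] = 1.0

    # Initial estimate of T[i, j] from the normalized engine votes.
    T = x.sum(axis=1)
    T /= T.sum(axis=1, keepdims=True)

    prev_ll = -np.inf
    for _ in range(n_iter):
        # M-step: family priors theta_j and per-engine confusion matrices pi[k, j, l].
        theta = T.mean(axis=0)                              # shape (J,)
        pi = np.einsum("ij,ikl->kjl", T, x)                 # shape (K, J, J)
        pi /= pi.sum(axis=2, keepdims=True) + 1e-12
        # E-step: posterior over the true family of each sample, via log-sum-exp.
        log_p = np.log(theta + 1e-12) + np.einsum("ikl,kjl->ij", x, np.log(pi + 1e-12))
        m = log_p.max(axis=1, keepdims=True)
        ll = (m[:, 0] + np.log(np.exp(log_p - m).sum(axis=1))).sum()
        T = np.exp(log_p - m)
        T /= T.sum(axis=1, keepdims=True)
        if abs(ll - prev_ll) < tol:                         # stop once the likelihood converges
            break
        prev_ll = ll
    return T.argmax(axis=1)

# Toy example: 4 samples, 3 engines, 2 families; the third engine is noisy.
votes = np.array([[0, 0, 1],
                  [0, 0, 0],
                  [1, 1, 0],
                  [1, 1, 1]])
print(em_family_inference(votes))   # expected to recover something like [0 0 1 1]
```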
FIG. 6 is a schematic diagram of the functional modules of the system of the present invention: the invention also provides a system for implementing the malware family inference method described above, comprising a data acquisition module, a database module, a scanning module, a maliciousness inference module and a family inference module connected in series in that order; the data acquisition module is used to obtain executable programs and construct the sample set; the database module is used to query each executable program in the sample set and record the query results; the scanning module is used to scan the executable programs in the sample set with multiple antivirus engines and preprocess the scan results; the maliciousness inference module is used to infer the maliciousness of each sample with heuristic rules based on the obtained preprocessing result and output the inference result; and the family inference module is used to perform maximum-likelihood estimation modeling and solving based on the obtained preprocessing result to output the final family inference result.

Claims (5)

1. A malware family inference method comprising the steps of:
S1, obtaining executable programs and constructing a sample set;
S2, querying each executable program in the sample set and recording the results;
S3, scanning the executable programs in the sample set with multiple antivirus engines and preprocessing the scan results; specifically, the executable programs in the sample set are scanned with existing antivirus engines and the scan results are processed as follows: identical-result filtering, result-label tokenization, generic-string and random-string filtering, malware family alias replacement, maliciousness calculation, and family maximum engine agreement calculation;
wherein the malware family alias replacement uses a malware family alias database to substitute aliases so that family names are unified; the alias database is learned with the following strategy: let A and B be two family names and A → B the rule mapping family name A to B; define the support support(A → B) = P(AB) as the probability that A and B occur together, and the confidence confidence(A → B) = P(B|A) = P(AB)/P(A) as the probability that B occurs when A occurs; if support(A → B) is above a set threshold, confidence(A → B) is above a set threshold, and A → B is manually judged to be a genuine alias relation, B is taken to be an alias of A; otherwise B is treated as a different name from A;
the maliciousness calculation computes the maliciousness E_1 as
E_1 = n / K
where n is the number of antivirus engines that identify the sample as malicious after result filtering and K is the total number of antivirus engines after result filtering;
the family maximum engine agreement calculation computes the family maximum engine agreement E_2 as
E_2 = max_j ( Σ_i f_{ij} )
where the sum runs over the antivirus engines i after result filtering, and f_{ij} indicates whether antivirus engine i gives family tag j: f_{ij} = 1 if engine i gives family tag j, and f_{ij} = 0 otherwise;
S4, inferring the maliciousness of the sample with heuristic rules based on the preprocessing result obtained in step S3, and outputting the inference result; this specifically includes the following steps, each of which outputs a Boolean result:
A. judge whether the difference between the scanning time and the first-seen time of the sample is greater than a set threshold, and output the judgment result r_a;
B. judge whether the sample is packed, and output the judgment result r_b;
C. judge whether the maliciousness of the sample is greater than a set threshold, and output the judgment result r_c;
D. judge whether the family maximum engine agreement of the sample is greater than a set threshold, and output the judgment result r_d;
E. from the judgment results of steps A to D, compute the maliciousness result r of the sample as
r = r_a & r_b & r_c & r_d
where & is the logical AND operation;
and S5, based on the preprocessing result obtained in the step S3, performing maximum likelihood estimation modeling and solving, and outputting and storing a family inference result.
2. The malware family inference method of claim 1, wherein in step S2 querying and recording the executable programs in the sample set specifically means querying whether an inference result already exists for the current program: if so, returning the inference result directly; if not, recording the first-seen time and the scanning time of the program.
3. The malware family inference method of claim 2, wherein in step S5, performing maximum-likelihood estimation modeling and solving based on the preprocessing result obtained in step S3 to output the final family inference result specifically comprises:
supposing there are K antivirus engines, I samples and J families, that the true family of sample i is Y_i, and that Y_i follows a categorical distribution with probabilities θ = [θ_1, θ_2, ..., θ_J]; the corresponding likelihood function is defined as
h(θ, π) = ∏_{i=1}^{I} ∏_{j=1}^{J} [ θ_j ∏_{k=1}^{K} ∏_{l=1}^{J} (π_{jl}^{(k)})^{x_{il}^{(k)}} ]^{T_{ij}}
where T_{ij} is an indicator variable: if q is the true family of sample i then T_{iq} = 1 and T_{ij} = 0 for j ≠ q; θ_j is the probability that sample i comes from family j; π_{jl}^{(k)} is the probability that antivirus engine k identifies a family-j sample as family l; x_{il}^{(k)} indicates whether engine k labels sample i as family l; and both θ_j and π_{jl}^{(k)} are unknown; an expectation-maximization algorithm is used to find the θ_j and π_{jl}^{(k)} that maximize the function h, from which the true family label Ŷ_i of the sample is estimated.
4. The malware family inference method of claim 3, wherein the expectation-maximization algorithm is an iterative optimization process in which θ_j and π_{jl}^{(k)} are both estimated from the current estimate of the indicator T_{ij}, and h, θ_j and π_{jl}^{(k)} are recomputed in a loop until the function h converges, at which point the algorithm terminates and the true family labels are output.
5. A system for implementing the malware family inference method of any one of claims 1-4, comprising a data acquisition module, a database module, a scanning module, a maliciousness inference module and a family inference module connected in series in that order; the data acquisition module is used to obtain executable programs and construct a sample set; the database module is used to query each executable program in the sample set and record the results; the scanning module is used to scan the executable programs in the sample set with multiple antivirus engines and preprocess the scan results; the maliciousness inference module is used to infer the maliciousness of each sample with heuristic rules based on the obtained preprocessing result and output the inference result; and the family inference module is used to perform maximum-likelihood estimation modeling and solving based on the obtained preprocessing result to output the final family inference result.
CN202110820216.2A 2021-07-20 2021-07-20 Malicious software family inference method and system Active CN113468532B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110820216.2A CN113468532B (en) 2021-07-20 2021-07-20 Malicious software family inference method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110820216.2A CN113468532B (en) 2021-07-20 2021-07-20 Malicious software family inference method and system

Publications (2)

Publication Number Publication Date
CN113468532A CN113468532A (en) 2021-10-01
CN113468532B (en) 2022-09-23

Family

ID=77881296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110820216.2A Active CN113468532B (en) 2021-07-20 2021-07-20 Malicious software family inference method and system

Country Status (1)

Country Link
CN (1) CN113468532B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280350A (en) * 2018-02-05 2018-07-13 南京航空航天大学 A kind of mobile network's terminal Malware multiple features detection method towards Android
CN111552971A (en) * 2020-04-30 2020-08-18 四川大学 Malicious software family classification evasion method based on deep reinforcement learning
RU2738344C1 (en) * 2020-03-10 2020-12-11 Общество с ограниченной ответственностью «Группа АйБи ТДС» Method and system for searching for similar malware based on results of their dynamic analysis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9853997B2 (en) * 2014-04-14 2017-12-26 Drexel University Multi-channel change-point malware detection

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280350A (en) * 2018-02-05 2018-07-13 南京航空航天大学 A kind of mobile network's terminal Malware multiple features detection method towards Android
RU2738344C1 (en) * 2020-03-10 2020-12-11 Общество с ограниченной ответственностью «Группа АйБи ТДС» Method and system for searching for similar malware based on results of their dynamic analysis
CN111552971A (en) * 2020-04-30 2020-08-18 四川大学 Malicious software family classification evasion method based on deep reinforcement learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Research and Progress on Android Malware Detection; Peng Guojun; Journal of Wuhan University; 2015-02-28; full text *
EM Meets Malicious Data: A Novel Method for Massive Malware Family Inference; Yongkang Jiang; ICBDT 2020: Proceedings of the 2020 3rd International Conference on Big Data Technologies; 2020-10-23; full text *
A Multi-Label Detection Method for Android Malware; Wang Jun et al.; Journal of Chinese Computer Systems; 2017-10-15 (No. 10); full text *
Permission-Based Classification and Detection of Android Malware; Zheng Yanmei et al.; Modern Computer (Professional Edition); 2018-02-25 (No. 06); full text *
Homology Analysis of Malicious Code Based on Static Structure; Chen Qi et al.; Computer Engineering and Applications (No. 14); full text *

Also Published As

Publication number Publication date
CN113468532A (en) 2021-10-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant