CN109472145A - Graph-theory-based code reuse identification method and system
- Publication number
- CN109472145A (application number CN201711489518.6A)
- Authority
- CN
- China
- Prior art keywords
- sample
- feature data
- detected
- known malicious
- code
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
- G06F21/563—Static detection by source code analysis
Abstract
The invention discloses a graph-theory-based code reuse identification method and system. The method comprises: parsing a sample under test and known malicious samples to form abstract syntax trees; converting the abstract syntax trees into a platform-independent intermediate representation language; obtaining the call relations of all functions from the intermediate representation language and drawing function call flow graphs; processing the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database; processing the function call flow graph of the sample under test with a graph algorithm to form graph feature data; and matching the graph feature data of the sample under test against the graph feature database to determine the degree of code reuse between the sample under test and the known malicious samples. The invention judges the homology of malicious code by analyzing code similarity, which can be used both to match malicious code families and to trace attackers.
Description
Technical field
The present invention relates to the field of information security technology, and in particular to a graph-theory-based code reuse identification method and system.
Background
Because malicious attackers generally do not rewrite the same module from scratch each time, the homology of malicious code can be judged from code similarity, which in turn can be used to match malware families and trace attackers.
Current code reuse identification relies on string extraction and the longest common subsequence algorithm. It is relatively slow and has a high error rate, and it cannot derive new features from existing ones; it can only perform rote pattern matching.
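For context, the longest common subsequence baseline mentioned above can be sketched as follows. This is a minimal illustration of the classic dynamic-programming algorithm, not code from the patent; the function names and the normalization by the longer string are assumptions made for this example.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Classic dynamic programming, O(len(a) * len(b)) time, which is one
    reason this kind of comparison is slow on large samples.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Hypothetical normalization: LCS length over the longer input."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))
```

The quadratic cost per pair of strings, multiplied over every extracted string in every known sample, illustrates the speed concern raised in the background.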
Summary of the invention
In view of the above technical problems, the present invention processes the sample under test and known malicious samples with a graph algorithm to form graph feature data, determines code reuse from the graph features, and thereby judges the homology of malicious code and ultimately traces the attacker.
The present invention is implemented by the following method: a graph-theory-based code reuse identification method, comprising:
parsing a sample under test and known malicious samples to form abstract syntax trees;
converting the abstract syntax trees into a platform-independent intermediate representation language;
obtaining the call relations of all functions from the intermediate representation language and drawing function call flow graphs;
processing the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database;
processing the function call flow graph of the sample under test with a graph algorithm to form graph feature data;
matching the graph feature data of the sample under test against the graph feature database to determine the degree of code reuse between the sample under test and the known malicious samples.
Further, matching the graph feature data of the sample under test against the graph feature database to determine the degree of code reuse between the sample under test and the known malicious samples specifically comprises:
if the similarity between the graph feature data of the sample under test and that of a known malicious sample in the graph feature database exceeds a preset threshold, determining that code is reused between the sample under test and that known malicious sample.
Further, the preset threshold is greater than or equal to 35%.
Further, the method also comprises: using open-source code and historical code as training data to improve the recognition accuracy of the graph feature database.
The present invention is implemented by the following system: a graph-theory-based code reuse identification system, comprising:
an abstract syntax tree generation module, configured to parse a sample under test and known malicious samples to form abstract syntax trees;
an intermediate representation language generation module, configured to convert the abstract syntax trees into a platform-independent intermediate representation language;
a function call flow graph drawing module, configured to obtain the call relations of all functions from the intermediate representation language and draw function call flow graphs;
a graph feature database generation module, configured to process the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database;
a graph feature data generation module, configured to process the function call flow graph of the sample under test with a graph algorithm to form graph feature data;
a reuse determination module, configured to match the graph feature data of the sample under test against the graph feature database and determine the degree of code reuse between the sample under test and the known malicious samples.
Further, the reuse determination module is specifically configured to:
if the similarity between the graph feature data of the sample under test and that of a known malicious sample in the graph feature database exceeds a preset threshold, determine that code is reused between the sample under test and that known malicious sample.
Further, the preset threshold is greater than or equal to 35%.
Further, the system also uses open-source code and historical code as training data to improve the recognition accuracy of the graph feature database.
A non-transitory computer-readable storage medium stores a computer program which, when executed by a processor, implements any of the graph-theory-based code reuse identification methods described above.
In summary, the present invention provides a graph-theory-based code reuse identification method and system. The sample under test and the known malicious samples are processed to form, in turn, abstract syntax trees, a platform-independent intermediate representation language, and function call flow graphs. The sample under test is then converted into graph feature data, and a large number of known malicious samples are converted into a graph feature database. The graph feature data is matched against the graph feature database, and code reuse is confirmed according to the degree of matching. Because an author usually does not rewrite the same module from scratch, this property can be used to trace attackers.
Brief description of the drawings
To illustrate the technical solution of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described here show only some of the embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of an embodiment of the graph-theory-based code reuse identification method provided by the invention;
Fig. 2 is a structural diagram of an embodiment of the graph-theory-based code reuse identification system provided by the invention.
Detailed description of the embodiments
The present invention provides embodiments of a graph-theory-based code reuse identification method and system. To help those skilled in the art better understand the technical solution in the embodiments, and to make the above objects, features, and advantages of the invention clearer, the technical solution is described in further detail below with reference to the drawings:
The present invention first provides an embodiment of a graph-theory-based code reuse identification method, as shown in Fig. 1, comprising:
S101: parsing a sample under test and known malicious samples to form abstract syntax trees.
The sample under test and the known malicious samples may be code generated by compilers such as LLVM or GCC.
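The patent does not name a parser for S101. As a stand-in only, the sketch below uses Python's standard `ast` module on Python source text to show what "parsing a sample into an abstract syntax tree" yields; the sample text and the helper names `parse_sample` and `function_names` are assumptions for illustration, and real samples would be compiler output rather than Python source.

```python
import ast

# Hypothetical sample used only for illustration.
SAMPLE = '''
def decrypt(buf):
    return bytes(b ^ 0x5A for b in buf)

def main():
    payload = decrypt(b"abc")
    send(payload)
'''


def parse_sample(source: str) -> ast.Module:
    """Parse source text into an abstract syntax tree."""
    return ast.parse(source)


def function_names(tree: ast.AST) -> list:
    """List the names of all functions defined anywhere in the tree, sorted."""
    return sorted(
        node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)
    )
```

Calling `function_names(parse_sample(SAMPLE))` lists the functions that later steps would treat as graph nodes.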
S102: converting the abstract syntax trees into a platform-independent intermediate representation language.
The intermediate representation language includes, but is not limited to, VEX.
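S102 names VEX only as one option and gives no lowering rules. As a hedged illustration of what converting a syntax tree into a platform-independent form involves, the sketch below lowers a Python expression tree into a toy three-address code. The opcode names and the `lower_expr` helper are assumptions made for this example; this is not VEX.

```python
import ast
import itertools

# Toy opcode table (an assumption for this sketch, not VEX opcodes).
_OPS = {ast.Add: "add", ast.Sub: "sub", ast.Mult: "mul", ast.BitXor: "xor"}


def lower_expr(source: str) -> list:
    """Lower one expression into three-address instructions.

    Each binary operation becomes `tN = op left, right`, with operands
    evaluated left to right and results held in fresh temporaries.
    """
    code = []
    temps = itertools.count()

    def visit(node: ast.AST) -> str:
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            left = visit(node.left)
            right = visit(node.right)
            temp = f"t{next(temps)}"
            code.append(f"{temp} = {_OPS[type(node.op)]} {left}, {right}")
            return temp
        raise NotImplementedError(type(node).__name__)

    visit(ast.parse(source, mode="eval").body)
    return code
```

The point of such a form is that it no longer depends on the surface syntax of the input, which is what allows samples built for different platforms to be compared.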
S103: obtaining the call relations of all functions from the intermediate representation language, and drawing function call flow graphs.
S104: processing the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database.
The graph feature database may be stored in the general-purpose graph database Neo4j, because it supports querying the flow structure of a graph by fuzzy search; in practice, the flow graph data of a submodule can be looked up directly and matched.
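The patent does not give a storage schema for Neo4j. The sketch below only prepares parameterized Cypher `MERGE` statements for the call edges of one sample; the `Function` label, the `CALLS` relationship type, the property names, and the helper `cypher_statements` are all assumptions. No database connection is made here; each (query, parameters) pair would be handed to a Neo4j client session.

```python
from typing import Dict, List, Set, Tuple

# Assumed schema: (:Function {sample, name})-[:CALLS]->(:Function {sample, name})
MERGE_CALL = (
    "MERGE (a:Function {sample: $sample, name: $caller}) "
    "MERGE (b:Function {sample: $sample, name: $callee}) "
    "MERGE (a)-[:CALLS]->(b)"
)


def cypher_statements(
    sample_id: str, call_graph: Dict[str, Set[str]]
) -> List[Tuple[str, dict]]:
    """Return one (query, parameters) pair per call edge, in sorted order."""
    statements = []
    for caller in sorted(call_graph):
        for callee in sorted(call_graph[caller]):
            params = {"sample": sample_id, "caller": caller, "callee": callee}
            statements.append((MERGE_CALL, params))
    return statements
```

Using `MERGE` rather than `CREATE` keeps repeated imports of the same sample from duplicating nodes or edges. Functions with no outgoing calls produce no statement in this sketch and would need a separate node-only statement.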
S105: processing the function call flow graph of the sample under test with a graph algorithm to form graph feature data.
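The patent does not say which graph algorithm produces the graph features. One simple possibility, shown purely as an assumption, is a degree signature: each function is described by its (in-degree, out-degree) pair, and a sample by the multiset of those pairs. The name `graph_features` is hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def graph_features(call_graph: Dict[str, Set[str]]) -> Counter:
    """Multiset of (in_degree, out_degree) pairs over all nodes.

    Function names are deliberately ignored, so renaming functions
    does not change the features.
    """
    nodes = set(call_graph)
    in_degree: Counter = Counter()
    for callees in call_graph.values():
        nodes |= callees
        in_degree.update(callees)
    return Counter(
        (in_degree[node], len(call_graph.get(node, ()))) for node in nodes
    )
```

Because only structure is kept, two samples whose call graphs differ only in symbol names produce identical features; richer signatures would be needed to distinguish graphs that merely share a degree distribution.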
S106: matching the graph feature data of the sample under test against the graph feature database to determine the degree of code reuse between the sample under test and the known malicious samples. Fuzzy search may be used to determine whether the graph feature data of the sample under test is partially similar or identical to the graph feature data of each known malicious sample in the graph feature database.
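The patent does not define how similarity is scored during matching. The sketch below uses a multiset Jaccard index as one assumed measure: 1.0 means identical features, and values in between indicate partial similarity. The feature representation (a `Counter` of hashable features) and the names `similarity` and `match_library` are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def similarity(a: Counter, b: Counter) -> float:
    """Multiset Jaccard index: shared feature count over combined count."""
    union = sum((a | b).values())  # per-feature maximum
    if union == 0:
        return 0.0
    return sum((a & b).values()) / union  # per-feature minimum


def match_library(
    query: Counter, library: Dict[str, Counter]
) -> List[Tuple[str, float]]:
    """Score the query against every known sample, best match first."""
    scores = [(name, similarity(query, feats)) for name, feats in library.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Returning the full ranked list rather than a single best hit leaves room for a sample to match several known families at once.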
Preferably, matching the graph feature data of the sample under test against the graph feature database to determine the degree of code reuse between the sample under test and the known malicious samples specifically comprises:
If the similarity between the graph feature data of the sample under test and that of a known malicious sample in the graph feature database exceeds a preset threshold, it is determined that code is reused between the sample under test and that known malicious sample. If every similarity is below the preset threshold, it is determined that the sample under test reuses no code from any of the malicious samples represented in the graph feature database.
The preset threshold is greater than or equal to 35%.
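The threshold rule can be sketched as below. The text says reuse is found when similarity "exceeds" the threshold and does not say what happens at exact equality, so this sketch assumes a strict comparison; the names `MIN_THRESHOLD` and `judge_reuse` are hypothetical.

```python
from typing import Dict, List

MIN_THRESHOLD = 0.35  # the text requires a preset threshold of at least 35%


def judge_reuse(
    similarities: Dict[str, float], threshold: float = MIN_THRESHOLD
) -> List[str]:
    """Return the known malicious samples whose similarity exceeds the threshold.

    An empty list means no code reuse was found against the library.
    """
    if threshold < MIN_THRESHOLD:
        raise ValueError("preset threshold must be at least 35%")
    return sorted(
        name for name, score in similarities.items() if score > threshold
    )
```

Raising the threshold above 35% trades missed reuse for fewer false matches; the text fixes only the lower bound.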
Preferably, the method further comprises: using open-source code and historical code as training data to improve the recognition accuracy of the graph feature database.
The present invention also provides an embodiment of a graph-theory-based code reuse identification system, as shown in Fig. 2, comprising:
an abstract syntax tree generation module 201, configured to parse a sample under test and known malicious samples to form abstract syntax trees;
an intermediate representation language generation module 202, configured to convert the abstract syntax trees into a platform-independent intermediate representation language;
a function call flow graph drawing module 203, configured to obtain the call relations of all functions from the intermediate representation language and draw function call flow graphs;
a graph feature database generation module 204, configured to process the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database;
a graph feature data generation module 205, configured to process the function call flow graph of the sample under test with a graph algorithm to form graph feature data;
a reuse determination module 206, configured to match the graph feature data of the sample under test against the graph feature database and determine the degree of code reuse between the sample under test and the known malicious samples.
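How modules 201 to 206 fit together can be sketched as a small pipeline object. The patent describes the modules only functionally, so the class name, field names, and the choice to pass each stage in as a callable are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ReuseIdentificationSystem:
    parse: Callable[[Any], Any]                  # module 201
    to_ir: Callable[[Any], Any]                  # module 202
    draw_call_graph: Callable[[Any], Any]        # module 203
    extract_features: Callable[[Any], Any]       # shared by modules 204 and 205
    judge: Callable[[Any, Dict[str, Any]], Any]  # module 206

    def features_of(self, sample: Any) -> Any:
        """Run one sample through stages 201, 202, 203 and feature extraction."""
        graph = self.draw_call_graph(self.to_ir(self.parse(sample)))
        return self.extract_features(graph)

    def build_library(self, known_malicious: Dict[str, Any]) -> Dict[str, Any]:
        """Module 204: features for every known malicious sample."""
        return {name: self.features_of(s) for name, s in known_malicious.items()}

    def identify(self, sample: Any, library: Dict[str, Any]) -> Any:
        """Modules 205 and 206: featurize the sample, then judge reuse."""
        return self.judge(self.features_of(sample), library)
```

Sharing one `features_of` path between the library and the sample under test keeps both sides comparable, which the matching step depends on.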
Preferably, the reuse determination module is specifically configured to:
if the similarity between the graph feature data of the sample under test and that of a known malicious sample in the graph feature database exceeds a preset threshold, determine that code is reused between the sample under test and that known malicious sample.
The preset threshold is greater than or equal to 35%.
Preferably, the system also uses open-source code and historical code as training data to improve the recognition accuracy of the graph feature database.
The present invention also discloses a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the graph-theory-based code reuse identification methods described above.
The embodiments in this specification are described progressively; for parts that are the same or similar across embodiments, refer to the other embodiments, and each embodiment focuses on its differences from the others. The system embodiment in particular is described briefly because it is substantially similar to the method embodiment; for the relevant details, see the corresponding description of the method embodiment.
As described above, the embodiments provide a graph-theory-based code reuse identification method and system. The sample under test and the known malicious samples are processed to form, in turn, abstract syntax trees, a platform-independent intermediate representation language, and function call flow graphs. The sample under test is then converted into graph feature data, and a graph feature database is built from the known malicious samples. The graph feature data of each code module of the sample under test is checked for partial similarity or identity with the graph feature data of any known malicious code in the database; if such a match is found, the sample is determined to reuse code from, and be homologous with, that known sample.
The above embodiments illustrate rather than limit the technical solution of the present invention. Any modification or partial replacement that does not depart from the spirit and scope of the present invention shall fall within the scope of its claims.
Claims (9)
1. A graph-theory-based code reuse identification method, characterized by comprising:
parsing a sample under test and known malicious samples to form abstract syntax trees;
converting the abstract syntax trees into a platform-independent intermediate representation language;
obtaining the call relations of all functions from the intermediate representation language and drawing function call flow graphs;
processing the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database;
processing the function call flow graph of the sample under test with a graph algorithm to form graph feature data;
matching the graph feature data of the sample under test against the graph feature database to determine the degree of code reuse between the sample under test and the known malicious samples.
2. The method according to claim 1, characterized in that matching the graph feature data of the sample under test against the graph feature database to determine the degree of code reuse between the sample under test and the known malicious samples specifically comprises:
if the similarity between the graph feature data of the sample under test and that of a known malicious sample in the graph feature database exceeds a preset threshold, determining that code is reused between the sample under test and that known malicious sample.
3. The method according to claim 2, characterized in that the preset threshold is greater than or equal to 35%.
4. The method according to claim 1, characterized by further comprising: using open-source code and historical code as training data to improve the recognition accuracy of the graph feature database.
5. A graph-theory-based code reuse identification system, characterized by comprising:
an abstract syntax tree generation module, configured to parse a sample under test and known malicious samples to form abstract syntax trees;
an intermediate representation language generation module, configured to convert the abstract syntax trees into a platform-independent intermediate representation language;
a function call flow graph drawing module, configured to obtain the call relations of all functions from the intermediate representation language and draw function call flow graphs;
a graph feature database generation module, configured to process the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database;
a graph feature data generation module, configured to process the function call flow graph of the sample under test with a graph algorithm to form graph feature data;
a reuse determination module, configured to match the graph feature data of the sample under test against the graph feature database and determine the degree of code reuse between the sample under test and the known malicious samples.
6. The system according to claim 5, characterized in that the reuse determination module is specifically configured to:
if the similarity between the graph feature data of the sample under test and that of a known malicious sample in the graph feature database exceeds a preset threshold, determine that code is reused between the sample under test and that known malicious sample.
7. The system according to claim 6, characterized in that the preset threshold is greater than or equal to 35%.
8. The system according to claim 5, characterized by further comprising: using open-source code and historical code as training data to improve the recognition accuracy of the graph feature database.
9. A non-transitory computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the graph-theory-based code reuse identification method according to any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711489518.6A CN109472145A (en) | 2017-12-29 | 2017-12-29 | Graph-theory-based code reuse identification method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109472145A (en) | 2019-03-15 |
Family ID: 65658042
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711489518.6A (Withdrawn; published as CN109472145A) | Graph-theory-based code reuse identification method and system | 2017-12-29 | 2017-12-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109472145A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101266550A (en) * | 2007-12-21 | 2008-09-17 | 北京大学 | Malicious code detection method |
US9166997B1 (en) * | 2013-09-19 | 2015-10-20 | Symantec Corporation | Systems and methods for reducing false positives when using event-correlation graphs to detect attacks on computing systems |
US20150242637A1 (en) * | 2014-02-25 | 2015-08-27 | Verisign, Inc. | Automated vulnerability intelligence generation and application |
CN104407872A (en) * | 2014-12-04 | 2015-03-11 | 北京邮电大学 | Code clone detection method |
CN105046152A (en) * | 2015-07-24 | 2015-11-11 | 四川大学 | Function call graph fingerprint based malicious software detection method |
CN107229563A (en) * | 2016-03-25 | 2017-10-03 | 中国科学院信息工程研究所 | A kind of binary program leak function correlating method across framework |
Non-Patent Citations (3)
Title |
---|
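LIU Xing et al.: "Similarity analysis of function call graphs of malicious code", 《Computer Engineering & Science》 * |
LU Cheng: "Design and implementation of a malware detection system for the Android platform", 《China Master's Theses Full-text Database, Information Science and Technology (monthly)》 * |
QIAN Yucun et al.: "Malicious code homology analysis and family clustering", 《Computer Engineering and Applications》 * |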
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111339531A (en) * | 2020-02-24 | 2020-06-26 | 南开大学 | Malicious code detection method and device, storage medium and electronic equipment |
CN111339531B (en) * | 2020-02-24 | 2023-12-19 | 南开大学 | Malicious code detection method and device, storage medium and electronic equipment |
CN112989345A (en) * | 2021-03-17 | 2021-06-18 | 北京安天网络安全技术有限公司 | Threat handling method and framework |
CN112989345B (en) * | 2021-03-17 | 2024-04-12 | 北京安天网络安全技术有限公司 | Threat handling method and framework |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20190315 |