CN109472145A - Code reuse identification method and system based on graph theory - Google Patents

Code reuse identification method and system based on graph theory

Info

Publication number
CN109472145A
Authority
CN
China
Prior art keywords
sample
feature data
detected
known malicious
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201711489518.6A
Other languages
Chinese (zh)
Inventor
李登峰
李柏松
王小丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ahtech Network Safe Technology Ltd
Original Assignee
Beijing Ahtech Network Safe Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ahtech Network Safe Technology Ltd
Priority to CN201711489518.6A
Publication of CN109472145A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis

Abstract

The invention discloses a graph-theory-based code reuse identification method and system. The method comprises: parsing a sample to be detected and known malicious samples to form abstract syntax trees; converting the abstract syntax trees into a platform-independent intermediate representation language; obtaining the call relationships of all functions from the intermediate representation language and drawing function call flow graphs; processing the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database; processing the function call flow graph of the sample to be detected with the graph algorithm to form graph feature data; and matching the graph feature data of the sample to be detected against the graph feature database to determine the degree of code reuse between the sample to be detected and the known malicious samples. By analyzing code similarity, the invention judges the homology of malicious code, which can be used both to match malicious code families and to trace attackers.

Description

Code reuse identification method and system based on graph theory
Technical field
The present invention relates to the field of information security technology, and in particular to a graph-theory-based code reuse identification method and system.
Background
Because malicious attackers generally do not rewrite the same module from scratch each time, the homology of malicious code can be judged from code similarity, which helps match samples to families and trace attackers.
Current code reuse identification relies on string extraction and the longest common subsequence algorithm. It is relatively slow and has a high error rate, and it cannot derive new features from existing ones; it can only perform monotonous pattern matching.
Summary of the invention
In view of the above technical problems, the present invention processes the sample to be detected and the known malicious samples with a graph algorithm to form graph feature data, then determines code reuse from the features of the graphs, thereby judging the homology of malicious code and ultimately tracing the attacker.
The present invention is implemented by the following method: a graph-theory-based code reuse identification method, comprising:
parsing a sample to be detected and known malicious samples to form abstract syntax trees;
converting the abstract syntax trees into a platform-independent intermediate representation language;
obtaining the call relationships of all functions from the intermediate representation language and drawing function call flow graphs;
processing the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database;
processing the function call flow graph of the sample to be detected with the graph algorithm to form graph feature data;
matching the graph feature data of the sample to be detected against the graph feature database to determine the degree of code reuse between the sample to be detected and the known malicious samples.
Further, matching the graph feature data of the sample to be detected against the graph feature database to determine the degree of code reuse between the sample to be detected and the known malicious samples specifically comprises:
if the similarity between the graph feature data of the sample to be detected and that of a known malicious sample in the graph feature database exceeds a preset threshold, determining that code reuse exists between the sample to be detected and that known malicious sample.
Further, the preset threshold is greater than or equal to 35%.
Further, the method also comprises: using open-source code and historical code as training data to improve the recognition accuracy of the graph feature database.
The present invention is implemented by the following system: a graph-theory-based code reuse identification system, comprising:
an abstract syntax tree generation module, for parsing a sample to be detected and known malicious samples to form abstract syntax trees;
an intermediate representation language generation module, for converting the abstract syntax trees into a platform-independent intermediate representation language;
a function call flow graph drawing module, for obtaining the call relationships of all functions from the intermediate representation language and drawing function call flow graphs;
a graph feature database generation module, for processing the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database;
a graph feature data generation module, for processing the function call flow graph of the sample to be detected with the graph algorithm to form graph feature data;
a reuse determination module, for matching the graph feature data of the sample to be detected against the graph feature database to determine the degree of code reuse between the sample to be detected and the known malicious samples.
Further, the reuse determination module is specifically configured to:
if the similarity between the graph feature data of the sample to be detected and that of a known malicious sample in the graph feature database exceeds a preset threshold, determine that code reuse exists between the sample to be detected and that known malicious sample.
Further, the preset threshold is greater than or equal to 35%.
Further, the system also comprises: using open-source code and historical code as training data to improve the recognition accuracy of the graph feature database.
A non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the graph-theory-based code reuse identification methods described above.
In summary, the present invention provides a graph-theory-based code reuse identification method and system. The sample to be detected and the known malicious samples are processed to form, in turn, abstract syntax trees, a platform-independent intermediate representation language, and function call flow graphs. The sample to be detected is then converted into graph feature data, and a large number of known malicious samples are converted into a graph feature database. Finally, the graph feature data is matched against the graph feature database, and code reuse is confirmed according to the degree of matching. Because an author usually does not rewrite the same module repeatedly, this property can be used to trace attackers.
Brief description of the drawings
To explain the technical solution of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some of the embodiments recorded in the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of an embodiment of the graph-theory-based code reuse identification method provided by the present invention;
Fig. 2 is a structural diagram of an embodiment of the graph-theory-based code reuse identification system provided by the present invention.
Detailed description of the embodiments
The present invention provides embodiments of a graph-theory-based code reuse identification method and system. To help those skilled in the art better understand the technical solutions in the embodiments, and to make the above objects, features, and advantages of the present invention clearer, the technical solutions are described in further detail below with reference to the drawings:
The present invention first provides an embodiment of a graph-theory-based code reuse identification method, as shown in Fig. 1, comprising:
S101: parsing a sample to be detected and known malicious samples to form abstract syntax trees.
The sample to be detected and the known malicious samples may be code generated by compilers such as llvm or gcc.
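As a rough illustration of step S101, the sketch below parses source text into an abstract syntax tree with Python's standard `ast` module. This is only a stand-in: the patent targets code produced by compilers such as llvm or gcc and does not name a parser, and the sample source and the helper `function_names` are hypothetical.

```python
import ast

# Hypothetical sample source; a real sample would come from compiler output.
SAMPLE_SOURCE = """
def decrypt(buf):
    return bytes(b ^ 0x5A for b in buf)

def main():
    payload = decrypt(b"abc")
    send(payload)
"""

def function_names(source):
    """Parse the source into an AST and list the functions it defines."""
    tree = ast.parse(source)
    return [node.name for node in ast.walk(tree)
            if isinstance(node, ast.FunctionDef)]

print(function_names(SAMPLE_SOURCE))
```

The same tree also carries the call expressions inside each function body, which is the information the later steps rely on.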
S102: converting the abstract syntax trees into a platform-independent intermediate representation language.
The intermediate representation language includes, but is not limited to, the vex language.
S103: obtaining the call relationships of all functions from the intermediate representation language and drawing function call flow graphs.
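Step S103 can be sketched as follows, assuming a toy textual intermediate representation in which `func NAME:` opens a function and `call NAME` records a call. This format and the helper `build_call_graph` are invented for illustration; they are not the vex language or anything specified by the patent.

```python
import re

# Toy platform-independent IR (an assumption made for illustration only).
TOY_IR = """
func main:
  call decrypt
  call send
func decrypt:
  call xor_block
func send:
func xor_block:
"""

def build_call_graph(ir_text):
    """Return {caller: set of callees} for every function in the toy IR."""
    graph = {}
    current = None
    for line in ir_text.splitlines():
        line = line.strip()
        m = re.fullmatch(r"func (\w+):", line)
        if m:
            current = m.group(1)
            graph.setdefault(current, set())  # register functions that call nothing
            continue
        m = re.fullmatch(r"call (\w+)", line)
        if m and current is not None:
            callee = m.group(1)
            graph[current].add(callee)
            graph.setdefault(callee, set())   # make sure the callee is a node too
    return graph

call_graph = build_call_graph(TOY_IR)
```

An adjacency mapping like this is the simplest form of a function call graph: nodes are functions, and a directed edge means "calls".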
S104: processing the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database. The graph feature database may be stored in a general-purpose graph database such as Neo4J, because such a database supports querying the flow structure of a graph by fuzzy search; in practice, the flow graph data of a submodule can be looked up directly and matched to its counterpart.
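The patent does not say which graph algorithm produces the graph features. The sketch below assumes one simple, name-independent choice: each call edge is described by the in-degree and out-degree of its two endpoints, and a sample's features are the multiset of these descriptions. The feature definition, the sample names, and the in-memory dictionary (used here in place of a Neo4J store) are all assumptions.

```python
from collections import Counter

def graph_features(call_graph):
    """Describe each call edge by the (in-degree, out-degree) of both endpoints."""
    out_deg = {f: len(callees) for f, callees in call_graph.items()}
    in_deg = Counter(c for callees in call_graph.values() for c in callees)
    features = Counter()
    for caller, callees in call_graph.items():
        for callee in callees:
            key = (in_deg[caller], out_deg[caller],
                   in_deg[callee], out_deg.get(callee, 0))
            features[key] += 1
    return features

# Hypothetical call graphs of two known malicious samples.
known_samples = {
    "family_a": {"main": {"decrypt", "send"}, "decrypt": {"xor"},
                 "send": set(), "xor": set()},
    "family_b": {"start": {"scan"}, "scan": set()},
}

# The feature library maps each known sample to its graph features.
feature_library = {name: graph_features(g) for name, g in known_samples.items()}
```

Degree-based features ignore function names, which matters for malware because symbol names are often stripped or randomized; richer structural features would follow the same pattern.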
S105: processing the function call flow graph of the sample to be detected with the graph algorithm to form graph feature data.
S106: matching the graph feature data of the sample to be detected against the graph feature database to determine the degree of code reuse between the sample to be detected and the known malicious samples. Fuzzy search may be used to determine whether the graph feature data of the sample to be detected is partially similar or completely identical to the graph feature data of each known malicious sample in the graph feature database.
Preferably, matching the graph feature data of the sample to be detected against the graph feature database to determine the degree of code reuse between the sample to be detected and the known malicious samples specifically comprises:
If the similarity between the graph feature data of the sample to be detected and that of a known malicious sample in the graph feature database exceeds a preset threshold, it is determined that code reuse exists between the sample to be detected and that known malicious sample. If every similarity is below the preset threshold, it is determined that no code reuse exists between the sample to be detected and any malicious sample in the graph feature database.
The preset threshold is greater than or equal to 35%.
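The threshold decision can be sketched as below. The patent fixes only the threshold (at least 35%) and leaves the similarity measure open, so the multiset Jaccard similarity used here is an assumption, as are the function names and the toy feature values.

```python
from collections import Counter

PRESET_THRESHOLD = 0.35  # the patent requires a threshold of at least 35%

def similarity(a, b):
    """Multiset Jaccard similarity between two feature Counters (assumed metric)."""
    union = sum((a | b).values())
    if union == 0:
        return 0.0
    return sum((a & b).values()) / union

def find_reuse(sample_features, feature_library, threshold=PRESET_THRESHOLD):
    """Return {known sample: similarity} for every entry above the threshold."""
    hits = {}
    for name, known in feature_library.items():
        score = similarity(sample_features, known)
        if score > threshold:
            hits[name] = score
    return hits

# Hypothetical feature data: labels stand in for real graph features.
library = {
    "family_a": Counter({"f1": 2, "f2": 1, "f3": 1}),
    "family_b": Counter({"f9": 3}),
}
sample = Counter({"f1": 2, "f2": 1, "f4": 1})
result = find_reuse(sample, library)
```

An empty result corresponds to the case where every similarity is below the threshold, i.e. no code reuse with any sample in the library.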
Preferably, the method further comprises: using open-source code and historical code as training data to improve the recognition accuracy of the graph feature database.
The present invention also provides an embodiment of a graph-theory-based code reuse identification system, as shown in Fig. 2, comprising:
an abstract syntax tree generation module 201, for parsing a sample to be detected and known malicious samples to form abstract syntax trees;
an intermediate representation language generation module 202, for converting the abstract syntax trees into a platform-independent intermediate representation language;
a function call flow graph drawing module 203, for obtaining the call relationships of all functions from the intermediate representation language and drawing function call flow graphs;
a graph feature database generation module 204, for processing the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database;
a graph feature data generation module 205, for processing the function call flow graph of the sample to be detected with the graph algorithm to form graph feature data;
a reuse determination module 206, for matching the graph feature data of the sample to be detected against the graph feature database to determine the degree of code reuse between the sample to be detected and the known malicious samples.
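One way to picture how modules 201 to 206 fit together is a small pipeline object whose stages are injected as callables. The class name, field names, and the trivial stand-in stages below are hypothetical; they show only the data flow, not a real parser, IR converter, or graph algorithm.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class ReuseIdentificationSystem:
    """Wires the six modules together; each stage is an injected callable."""
    parse_to_ast: Callable[[Any], Any]            # module 201
    ast_to_ir: Callable[[Any], Any]               # module 202
    ir_to_call_graph: Callable[[Any], Any]        # module 203
    graph_to_features: Callable[[Any], Any]       # modules 204 and 205
    match: Callable[[Any, Dict[str, Any]], Any]   # module 206

    def features_of(self, sample):
        graph = self.ir_to_call_graph(self.ast_to_ir(self.parse_to_ast(sample)))
        return self.graph_to_features(graph)

    def build_library(self, known_samples):
        return {name: self.features_of(s) for name, s in known_samples.items()}

    def identify(self, sample, library):
        return self.match(self.features_of(sample), library)

# Trivial stand-in stages, used only to exercise the data flow.
system = ReuseIdentificationSystem(
    parse_to_ast=lambda s: s.split(),
    ast_to_ir=lambda tokens: [t.lower() for t in tokens],
    ir_to_call_graph=lambda ir: set(zip(ir, ir[1:])),
    graph_to_features=lambda edges: frozenset(edges),
    match=lambda feats, lib: {n: len(feats & f) / len(feats | f)
                              for n, f in lib.items()},
)
library = system.build_library({"known": "A B C D"})
result = system.identify("a b c x", library)
```

Sharing one `features_of` path for both known samples and the sample to be detected keeps the two sides comparable, which is why modules 204 and 205 map to the same callable in this sketch.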
Preferably, the reuse determination module is specifically configured to:
if the similarity between the graph feature data of the sample to be detected and that of a known malicious sample in the graph feature database exceeds a preset threshold, determine that code reuse exists between the sample to be detected and that known malicious sample.
The preset threshold is greater than or equal to 35%.
Preferably, the system further comprises: using open-source code and historical code as training data to improve the recognition accuracy of the graph feature database.
The present invention also discloses a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the graph-theory-based code reuse identification methods described above.
The embodiments in this specification are described progressively; for parts that are the same or similar, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. The system embodiment in particular is described briefly because it is substantially similar to the method embodiment; for the relevant details, refer to the description of the method embodiment.
As described above, the embodiments provide a graph-theory-based code reuse identification method and system. The sample to be detected and the known malicious samples are processed to form, in turn, abstract syntax trees, a platform-independent intermediate representation language, and function call flow graphs. The sample to be detected is then converted into graph feature data, and a graph feature database is formed from the known malicious samples. The method checks whether the graph feature data of a code module of the sample to be detected is partially similar or completely identical to the graph feature data of any known malicious code in the graph feature database; if so, it determines that the sample shares reused code and homology with that known sample.
The above embodiments illustrate rather than limit the technical solution of the present invention. Any modification or partial replacement that does not depart from the spirit and scope of the present invention shall fall within the scope of the claims of the present invention.

Claims (9)

1. A graph-theory-based code reuse identification method, characterized by comprising:
parsing a sample to be detected and known malicious samples to form abstract syntax trees;
converting the abstract syntax trees into a platform-independent intermediate representation language;
obtaining the call relationships of all functions from the intermediate representation language and drawing function call flow graphs;
processing the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database;
processing the function call flow graph of the sample to be detected with the graph algorithm to form graph feature data;
matching the graph feature data of the sample to be detected against the graph feature database to determine the degree of code reuse between the sample to be detected and the known malicious samples.
2. The method according to claim 1, characterized in that matching the graph feature data of the sample to be detected against the graph feature database to determine the degree of code reuse between the sample to be detected and the known malicious samples specifically comprises:
if the similarity between the graph feature data of the sample to be detected and that of a known malicious sample in the graph feature database exceeds a preset threshold, determining that code reuse exists between the sample to be detected and that known malicious sample.
3. The method according to claim 2, characterized in that the preset threshold is greater than or equal to 35%.
4. The method according to claim 1, characterized by further comprising: using open-source code and historical code as training data to improve the recognition accuracy of the graph feature database.
5. A graph-theory-based code reuse identification system, characterized by comprising:
an abstract syntax tree generation module, for parsing a sample to be detected and known malicious samples to form abstract syntax trees;
an intermediate representation language generation module, for converting the abstract syntax trees into a platform-independent intermediate representation language;
a function call flow graph drawing module, for obtaining the call relationships of all functions from the intermediate representation language and drawing function call flow graphs;
a graph feature database generation module, for processing the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database;
a graph feature data generation module, for processing the function call flow graph of the sample to be detected with the graph algorithm to form graph feature data;
a reuse determination module, for matching the graph feature data of the sample to be detected against the graph feature database to determine the degree of code reuse between the sample to be detected and the known malicious samples.
6. The system according to claim 5, characterized in that the reuse determination module is specifically configured to:
if the similarity between the graph feature data of the sample to be detected and that of a known malicious sample in the graph feature database exceeds a preset threshold, determine that code reuse exists between the sample to be detected and that known malicious sample.
7. The system according to claim 6, characterized in that the preset threshold is greater than or equal to 35%.
8. The system according to claim 5, characterized by further comprising: using open-source code and historical code as training data to improve the recognition accuracy of the graph feature database.
9. A non-transitory computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the graph-theory-based code reuse identification method according to any one of claims 1-4.
CN201711489518.6A 2017-12-29 2017-12-29 Code reuse identification method and system based on graph theory Withdrawn CN109472145A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711489518.6A CN109472145A (en) 2017-12-29 2017-12-29 Code reuse identification method and system based on graph theory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711489518.6A CN109472145A (en) 2017-12-29 2017-12-29 Code reuse identification method and system based on graph theory

Publications (1)

Publication Number Publication Date
CN109472145A true CN109472145A (en) 2019-03-15

Family

ID=65658042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711489518.6A Withdrawn CN109472145A (en) 2017-12-29 2017-12-29 Code reuse identification method and system based on graph theory

Country Status (1)

Country Link
CN (1) CN109472145A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339531A (en) * 2020-02-24 2020-06-26 南开大学 Malicious code detection method and device, storage medium and electronic equipment
CN112989345A (en) * 2021-03-17 2021-06-18 北京安天网络安全技术有限公司 Threat handling method and framework

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101266550A (en) * 2007-12-21 2008-09-17 北京大学 Malicious code detection method
CN104407872A (en) * 2014-12-04 2015-03-11 北京邮电大学 Code clone detection method
US20150242637A1 (en) * 2014-02-25 2015-08-27 Verisign, Inc. Automated vulnerability intelligence generation and application
US9166997B1 (en) * 2013-09-19 2015-10-20 Symantec Corporation Systems and methods for reducing false positives when using event-correlation graphs to detect attacks on computing systems
CN105046152A (en) * 2015-07-24 2015-11-11 四川大学 Function call graph fingerprint based malicious software detection method
CN107229563A (en) * 2016-03-25 2017-10-03 中国科学院信息工程研究所 Cross-architecture binary program vulnerability function association method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101266550A (en) * 2007-12-21 2008-09-17 北京大学 Malicious code detection method
US9166997B1 (en) * 2013-09-19 2015-10-20 Symantec Corporation Systems and methods for reducing false positives when using event-correlation graphs to detect attacks on computing systems
US20150242637A1 (en) * 2014-02-25 2015-08-27 Verisign, Inc. Automated vulnerability intelligence generation and application
CN104407872A (en) * 2014-12-04 2015-03-11 北京邮电大学 Code clone detection method
CN105046152A (en) * 2015-07-24 2015-11-11 四川大学 Function call graph fingerprint based malicious software detection method
CN107229563A (en) * 2016-03-25 2017-10-03 中国科学院信息工程研究所 Cross-architecture binary program vulnerability function association method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
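刘星 et al.: "Similarity analysis of function call graphs of malicious code", Computer Engineering & Science *
路程: "Design and implementation of a malware detection system for the Android platform", China Master's Theses Full-text Database, Information Science and Technology (Monthly) *
钱雨村 et al.: "Homology analysis and family clustering of malicious code", Computer Engineering and Applications *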

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339531A (en) * 2020-02-24 2020-06-26 南开大学 Malicious code detection method and device, storage medium and electronic equipment
CN111339531B (en) * 2020-02-24 2023-12-19 南开大学 Malicious code detection method and device, storage medium and electronic equipment
CN112989345A (en) * 2021-03-17 2021-06-18 北京安天网络安全技术有限公司 Threat handling method and framework
CN112989345B (en) * 2021-03-17 2024-04-12 北京安天网络安全技术有限公司 Threat handling method and framework

Similar Documents

Publication Publication Date Title
CN110245496B (en) Source code vulnerability detection method and detector and training method and system thereof
CN109697162B (en) Software defect automatic detection method based on open source code library
CN110737899B (en) Intelligent contract security vulnerability detection method based on machine learning
CN113190849B (en) Webshell script detection method and device, electronic equipment and storage medium
US11609748B2 (en) Semantic code search based on augmented programming language corpus
US11651014B2 (en) Source code retrieval
US20220292200A1 (en) Deep-learning based device and method for detecting source-code vulnerability with improved robustness
CN103914657A (en) Malicious program detection method based on function characteristics
US9870351B2 (en) Annotating embedded tables
CN115146279A (en) Program vulnerability detection method, terminal device and storage medium
CN109472145A (en) Code reuse identification method and system based on graph theory
KR20150122855A (en) Distributed processing system and method for real time question and answer
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN110750297B (en) Python code reference information generation method based on program analysis and text analysis
CN116149669B (en) Binary file-based software component analysis method, binary file-based software component analysis device and binary file-based medium
CN114285587A (en) Domain name identification method and device and domain name classification model acquisition method and device
CN114792092B (en) Text theme extraction method and device based on semantic enhancement
CN115066674A (en) Method for evaluating source code using numeric array representation of source code elements
CN110737469A (en) Source code similarity evaluation method based on semantic information on functional granularities
CN116401145A (en) Source code static analysis processing method and device
CN115828888A (en) Method for semantic analysis and structurization of various weblogs
CN115373982A (en) Test report analysis method, device, equipment and medium based on artificial intelligence
JP2015018372A (en) Expression extraction model learning device, expression extraction model learning method and computer program
US9684691B1 (en) System and method to facilitate the association of structured content in a structured document with unstructured content in an unstructured document
CN113297580A (en) Code semantic analysis-based electric power information system safety protection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20190315
