CN109472145A

CN109472145A - Graph-theory-based code reuse identification method and system

Info

Publication number: CN109472145A
Application number: CN201711489518.6A
Authority: CN
Inventors: 李登峰; 李柏松; 王小丰
Original assignee: Beijing Ahtech Network Safe Technology Ltd
Current assignee: Beijing Ahtech Network Safe Technology Ltd
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2019-03-15

Abstract

The invention discloses a graph-theory-based code reuse identification method and system. The method comprises: parsing a sample under test and known malicious samples to form abstract syntax trees; converting the abstract syntax trees into a platform-independent intermediate representation language; obtaining the call relations of all functions from the intermediate representation language and drawing a function call flow graph; processing the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database; processing the function call flow graph of the sample under test with a graph algorithm to form graph feature data; and matching the graph feature data of the sample under test against the graph feature database to determine the degree of code reuse between the sample under test and the known malicious samples. The invention judges the homology of malicious code by analyzing code similarity, which can be used both to match malicious code families and to trace attackers.

Description

A graph-theory-based code reuse identification method and system

Technical field

The present invention relates to the field of information security technology, and in particular to a graph-theory-based code reuse identification method and system.

Background technique

Because malicious attackers generally do not rewrite the same module from scratch each time, the homology of malicious code can be judged from code similarity, which can then be used to match malware families and trace attackers.

Current code reuse identification relies on string extraction and the longest common subsequence algorithm. It is relatively slow, has a high error rate, cannot derive new features from existing ones, and can only perform rote pattern matching.

Summary of the invention

In view of the above technical problems, the present invention processes the sample under test and known malicious samples with a graph algorithm to form graph feature data, determines code reuse from the features of the graphs, and thereby judges the homology of malicious code and ultimately traces the attacker.

The present invention is implemented by the following method: a graph-theory-based code reuse identification method, comprising:

Parsing the sample under test and known malicious samples to form abstract syntax trees;

Converting the abstract syntax trees into a platform-independent intermediate representation language;

Obtaining the call relations of all functions from the intermediate representation language and drawing a function call flow graph;

Processing the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database;

Processing the function call flow graph of the sample under test with a graph algorithm to form graph feature data;

Matching the graph feature data of the sample under test against the graph feature database to determine the degree of code reuse between the sample under test and the known malicious samples.

Further, matching the graph feature data of the sample under test against the graph feature database to determine the degree of code reuse between the sample under test and the known malicious samples specifically comprises:

If the similarity between the graph feature data of the sample under test and that of a known malicious sample in the graph feature database exceeds a preset threshold, determining that code reuse exists between the sample under test and that known malicious sample.

Further, the preset threshold is greater than or equal to 35%.

Further, the method also comprises: using open-source code and historical code as training data to improve the recognition accuracy of the graph feature database.

The present invention is implemented by the following system: a graph-theory-based code reuse identification system, comprising:

An abstract syntax tree generation module, configured to parse the sample under test and known malicious samples to form abstract syntax trees;

An intermediate representation language generation module, configured to convert the abstract syntax trees into a platform-independent intermediate representation language;

A function call flow graph drawing module, configured to obtain the call relations of all functions from the intermediate representation language and draw a function call flow graph;

A graph feature database generation module, configured to process the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database;

A graph feature data generation module, configured to process the function call flow graph of the sample under test with a graph algorithm to form graph feature data;

A reuse determination module, configured to match the graph feature data of the sample under test against the graph feature database and determine the degree of code reuse between the sample under test and the known malicious samples.

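Further, the reuse determination module is specifically configured to:

If the similarity between the graph feature data of the sample under test and that of a known malicious sample in the graph feature database exceeds a preset threshold, determine that code reuse exists between the sample under test and that known malicious sample.

Further, the preset threshold is greater than or equal to 35%.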

A non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the graph-theory-based code reuse identification methods described above.

In summary, the present invention provides a graph-theory-based code reuse identification method and system. The sample under test and known malicious samples are processed to form, in turn, abstract syntax trees, a platform-independent intermediate representation language, and function call flow graphs. The sample under test is finally converted into graph feature data, and a large number of known malicious samples are converted into a graph feature database. The graph feature data is then matched against the graph feature database, and code reuse is confirmed according to the degree of matching. Because an author usually does not rewrite the same module repeatedly, this property can be used to trace attackers.

Brief description of the drawings

To illustrate the technical solution of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below show only some of the embodiments recorded in the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of an embodiment of the graph-theory-based code reuse identification method provided by the invention;

Fig. 2 is a structural diagram of an embodiment of the graph-theory-based code reuse identification system provided by the invention.

Detailed description of the embodiments

The embodiments of the present invention provide a graph-theory-based code reuse identification method and system. To help those skilled in the art better understand the technical solutions in the embodiments, and to make the above objects, features and advantages of the invention clearer, the technical solutions are described in further detail below with reference to the accompanying drawings:

The present invention first provides an embodiment of a graph-theory-based code reuse identification method, as shown in Fig. 1, comprising:

S101: parse the sample under test and known malicious samples to form abstract syntax trees.

The sample under test and the known malicious samples may be code generated by compilers such as LLVM or GCC.

S102: convert the abstract syntax trees into a platform-independent intermediate representation language.

The intermediate representation language includes, but is not limited to, the VEX language.

S103: obtain the call relations of all functions from the intermediate representation language and draw a function call flow graph.
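Step S103 can be illustrated with a minimal Python sketch. It assumes the call relations have already been recovered from the intermediate representation as (caller, callee) pairs; the function names and the `build_call_graph` helper are hypothetical and are not part of the original disclosure.

```python
def build_call_graph(call_sites):
    """Return {function: set of callees} from (caller, callee) pairs."""
    graph = {}
    for caller, callee in call_sites:
        graph.setdefault(caller, set()).add(callee)
        graph.setdefault(callee, set())  # leaf functions still become nodes
    return graph


# Hypothetical call relations recovered from the intermediate representation.
call_sites = [
    ("main", "decrypt_config"),
    ("main", "connect_server"),
    ("connect_server", "send_beacon"),
    ("decrypt_config", "xor_buffer"),
    ("send_beacon", "xor_buffer"),
]

call_graph = build_call_graph(call_sites)
for func in sorted(call_graph):
    print(func, "->", sorted(call_graph[func]))
```

An adjacency mapping of this kind is one simple way to hold the function call flow graph before any graph algorithm is applied to it.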

S104: process the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database. The graph feature database can be stored in the general-purpose graph database Neo4j, because this database supports querying graph flow structures by fuzzy search, so that in practice the flow graph data of a submodule can be looked up directly and matched to its counterpart.
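As a rough illustration of how call edges might be prepared for a graph database such as Neo4j, the sketch below generates parameterized Cypher `MERGE` statements from a call graph. The node label `Function`, the relationship type `CALLS`, the property names, and the sample identifier are all assumptions made for this example; the source does not describe a schema. The statements are only generated here, not executed, so no database connection is needed.

```python
# Assumed schema: (:Function {sample, name})-[:CALLS]->(:Function {sample, name})
MERGE_CALL = (
    "MERGE (a:Function {sample: $sample, name: $caller}) "
    "MERGE (b:Function {sample: $sample, name: $callee}) "
    "MERGE (a)-[:CALLS]->(b)"
)


def cypher_statements(sample_id, call_graph):
    """Return a list of (query, parameters) pairs, one per call edge."""
    statements = []
    for caller in sorted(call_graph):
        for callee in sorted(call_graph[caller]):
            params = {"sample": sample_id, "caller": caller, "callee": callee}
            statements.append((MERGE_CALL, params))
    return statements


# Hypothetical call graph of one known malicious sample.
known_sample = {
    "main": {"unpack", "inject"},
    "unpack": {"xor_buffer"},
    "inject": set(),
    "xor_buffer": set(),
}

statements = cypher_statements("known-malware-001", known_sample)
for query, params in statements:
    print(params["caller"], "CALLS", params["callee"])
```

Each pair could then be passed to a database session for execution; using `MERGE` rather than `CREATE` keeps repeated imports of the same sample from duplicating nodes or edges.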

S105: process the function call flow graph of the sample under test with a graph algorithm to form graph feature data.
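The source does not specify which graph algorithm produces the graph feature data. As one possible illustration only, the sketch below describes each call edge by the (in-degree, out-degree) of its two endpoints and counts how often each such edge signature occurs. This choice is an assumption; its appeal is that the features do not depend on function names, which are often stripped or changed in malicious code. The `graph_features` helper and both sample graphs are hypothetical.

```python
from collections import Counter


def graph_features(call_graph):
    """Count name-independent edge signatures of a call graph.

    Each edge is described by the (in-degree, out-degree) of its caller
    and of its callee.
    """
    in_degree = Counter(
        callee for callees in call_graph.values() for callee in callees
    )

    def signature(func):
        return (in_degree[func], len(call_graph.get(func, ())))

    return Counter(
        (signature(caller), signature(callee))
        for caller, callees in call_graph.items()
        for callee in callees
    )


# Two hypothetical graphs with the same shape but different function names.
named = {"main": {"a", "b"}, "a": {"c"}, "b": set(), "c": set()}
stripped = {
    "sub_401000": {"sub_401100", "sub_401200"},
    "sub_401100": {"sub_401300"},
    "sub_401200": set(),
    "sub_401300": set(),
}

features = graph_features(named)
print(features == graph_features(stripped))
```

Because only the structure is encoded, the two graphs above yield the same feature data even though no function name is shared.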

S106: match the graph feature data of the sample under test against the graph feature database to determine the degree of code reuse between the sample under test and the known malicious samples. Fuzzy search may be used to determine whether the graph feature data of the sample under test is partially similar or completely identical to the graph feature data of each known malicious sample in the graph feature database.
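The source does not define how similarity between two sets of graph feature data is computed. A minimal sketch, assuming the features are held as multisets and compared with a multiset Jaccard index (shared count divided by combined count), is shown below. The measure, the `feature_similarity` helper, and the feature labels are assumptions chosen for illustration.

```python
from collections import Counter


def feature_similarity(features_a, features_b):
    """Multiset Jaccard similarity in [0, 1] between two feature Counters."""
    union = sum((features_a | features_b).values())  # per-key maximum counts
    if union == 0:
        return 0.0
    shared = sum((features_a & features_b).values())  # per-key minimum counts
    return shared / union


# Hypothetical feature multisets for a sample under test and a known sample.
sample = Counter({"edge_type_1": 2, "edge_type_2": 1})
known = Counter({"edge_type_1": 1, "edge_type_2": 1, "edge_type_3": 1})

similarity = feature_similarity(sample, known)
print(f"similarity = {similarity:.2f}")
```

A score of 1.0 corresponds to completely identical feature data, while intermediate values correspond to the partially similar case.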

Preferably, matching the graph feature data of the sample under test against the graph feature database to determine the degree of code reuse between the sample under test and the known malicious samples specifically comprises:

If the similarity between the graph feature data of the sample under test and that of a known malicious sample in the graph feature database exceeds a preset threshold, it is determined that code reuse exists between the sample under test and that known malicious sample. If every similarity is below the preset threshold, it is determined that no code reuse exists between the sample under test and any of the malicious samples in the graph feature database.

The preset threshold is greater than or equal to 35%.
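The threshold rule can be sketched as follows, assuming similarity scores between the sample under test and each known malicious sample have already been computed as fractions between 0 and 1. The helper name, the sample identifiers, and the scores are hypothetical; only the 35% lower bound on the threshold and the "exceeds" comparison come from the text.

```python
MIN_THRESHOLD = 0.35  # the text requires a preset threshold of at least 35%


def find_code_reuse(similarities, threshold=MIN_THRESHOLD):
    """Return the known samples whose similarity exceeds the threshold."""
    if threshold < MIN_THRESHOLD:
        raise ValueError("threshold must be at least 0.35")
    return sorted(
        name for name, score in similarities.items() if score > threshold
    )


# Hypothetical similarity scores against three known malicious samples.
scores = {
    "known_sample_a": 0.62,
    "known_sample_b": 0.35,
    "known_sample_c": 0.10,
}

matches = find_code_reuse(scores)
print(matches if matches else "no code reuse found")
```

With these scores only `known_sample_a` is reported, since a score equal to the threshold does not exceed it. An empty result corresponds to the case where no code reuse exists with any sample in the database.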

Preferably, the method further comprises: using open-source code and historical code as training data to improve the recognition accuracy of the graph feature database.

The present invention also provides an embodiment of a graph-theory-based code reuse identification system, as shown in Fig. 2, comprising:

An abstract syntax tree generation module 201, configured to parse the sample under test and known malicious samples to form abstract syntax trees;

An intermediate representation language generation module 202, configured to convert the abstract syntax trees into a platform-independent intermediate representation language;

A function call flow graph drawing module 203, configured to obtain the call relations of all functions from the intermediate representation language and draw a function call flow graph;

A graph feature database generation module 204, configured to process the function call flow graphs of the known malicious samples with a graph algorithm to form a graph feature database;

A graph feature data generation module 205, configured to process the function call flow graph of the sample under test with a graph algorithm to form graph feature data;

A reuse determination module 206, configured to match the graph feature data of the sample under test against the graph feature database and determine the degree of code reuse between the sample under test and the known malicious samples.

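Preferably, the reuse determination module is specifically configured to:

If the similarity between the graph feature data of the sample under test and that of a known malicious sample in the graph feature database exceeds a preset threshold, determine that code reuse exists between the sample under test and that known malicious sample.

The preset threshold is greater than or equal to 35%.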

The present invention also discloses a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the graph-theory-based code reuse identification methods described above.

The embodiments in this specification are described progressively. Identical or similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the others. The system embodiment in particular is described briefly because it is substantially similar to the method embodiment; for related details, refer to the description of the method embodiment.

As described above, the embodiments provide a graph-theory-based code reuse identification method and system. The sample under test and known malicious samples are processed to form, in turn, abstract syntax trees, a platform-independent intermediate representation language, and function call flow graphs. The sample under test is finally converted into graph feature data, and a graph feature database is formed from the known malicious samples. The graph feature data of the code modules of the sample under test is then checked for partial similarity or identity with the graph feature data of any known malicious code in the graph feature database. If such a match is found, it is determined that code reuse and homology exist between the sample under test and the known sample.

The above embodiments are intended to illustrate, not limit, the technical solution of the present invention. Any modification or partial replacement that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.

Claims

1. A graph-theory-based code reuse identification method, characterized by comprising:

2. The method according to claim 1, characterized in that matching the graph feature data of the sample under test against the graph feature database to determine the degree of code reuse between the sample under test and the known malicious samples specifically comprises:

3. The method according to claim 2, characterized in that the preset threshold is greater than or equal to 35%.

4. The method according to claim 1, characterized by further comprising: using open-source code and historical code as training data to improve the recognition accuracy of the graph feature database.

5. A graph-theory-based code reuse identification system, characterized by comprising:

6. The system according to claim 5, characterized in that the reuse determination module is specifically configured to:

7. The system according to claim 6, characterized in that the preset threshold is greater than or equal to 35%.

8. The system according to claim 5, characterized by further comprising: using open-source code and historical code as training data to improve the recognition accuracy of the graph feature database.

9. A non-transitory computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the graph-theory-based code reuse identification method according to any one of claims 1-4.