CN112199115A

CN112199115A - Cross-Java byte code and source code line association method based on feature similarity matching

Info

Publication number: CN112199115A
Application number: CN202010998361.5A
Authority: CN
Inventors: 杨珉; 张源; 戴嘉润; 张磊
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2020-09-21
Filing date: 2020-09-21
Publication date: 2021-01-08

Abstract

The invention belongs to the technical field of Android platform application security analysis, and particularly relates to a Java bytecode-source code association method based on feature similarity. The method comprises three steps. (1) Java bytecode line division based on Conditional Random Fields (CRF): a large number of Java bytecode files annotated with debugging information are collected and input into a CRF model for training, and the trained CRF model is then used to automatically and accurately segment bytecode files that carry no source code line information. (2) Cross-language feature extraction: features are extracted from the Java bytecode and the source code. (3) Line matching between the Java bytecode and the source code: the optimal matching between Java bytecode lines and source code lines is solved with a longest common subsequence algorithm. The method accurately establishes a line-to-line mapping between Java bytecode and Java source code, and greatly facilitates the analysis of closed-source software on the Android platform.

Description

Cross-Java byte code and source code line association method based on feature similarity matching

Technical Field

The invention belongs to the technical field of android platform application software reverse analysis, and particularly relates to a cross-Java byte code and source code line correlation method based on feature similarity matching.

Background

When a Java program is compiled into bytecode, the compiler can attach debugging information to the bytecode, usually including line numbers, function names, and program variable names, so that the bytecode is easier to read and debug. The main purpose of the line numbers is to locate the exact source code position when an exception is thrown while the bytecode runs, which helps developers debug the program and fix defects. Line association information is also valuable for association analysis across source code and bytecode, because it enables accurate association at the code line level. For example, when evaluating whether a target bytecode was custom developed from a given version of the source code, line association information helps identify whether a function in the source code is similar to a function in the bytecode and what customizations were made. Line association information is therefore very important for reverse analysis tasks such as program defect diagnosis and program comparison.

Existing line association techniques typically use debugging information to map Java bytecode to the corresponding source code. When debugging information exists in the bytecode file, the target bytecode file can be converted into smali files with tools such as baksmali, and the disassembled smali files contain a large number of ".line xx" markers, where xx is a line number in the source code. Using this line number information, security analysts can quickly locate the source code line corresponding to a smali instruction and establish accurate line association information.

However, in most closed source software, such as Android ROM files customized by various mobile phone manufacturers, commercial versions of mobile phone software, and the like, debugging information in the release system or software is often removed in order to reduce the size of the release system or software or hide source code information. In this case, it would no longer be applicable to associate code in different language hierarchies through debugging information. Therefore, how to correlate Java bytecode and source code in the absence of debugging information becomes a new challenge.

Based on the above analysis, it is necessary to develop a line association technique between Java bytecode and source code that works without debugging information.

References

1. Lafferty J, McCallum A, Pereira F C N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data[C]//Proc. 18th International Conf. on Machine Learning. 2001.

2. Myers E W. An O(ND) difference algorithm and its variations[J]. Algorithmica, 1986.

Disclosure of Invention

In order to overcome the defect that the prior art completely depends on debugging information in byte codes, the invention provides a cross-Java byte code and source code line correlation method based on feature similarity matching, which can realize the establishment of accurate Java byte code and source code line level correlation information when no debugging information exists in the byte codes.

The invention relates to a cross-Java byte code and source code line correlation method based on feature similarity matching, and the overall framework diagram of the cross-Java byte code and source code line correlation method is shown in figure 1. The method comprises the following specific steps:

(1) Java bytecode line division based on Conditional Random Fields (CRF). First, a large number of Java bytecode files annotated with debugging information (line information) are collected and input into a CRF model [1] for training. The trained CRF model is then used to automatically and accurately segment bytecode files that have no source code line information;

(2) Cross-language feature extraction, i.e. extracting features from the Java bytecode and the source code. Note that the features here are features shared between a single source code line and a single Java bytecode aggregation line (obtained by the line division of the previous step);

(3) Line matching between the Java bytecode and the source code, i.e. solving the optimal matching between the Java bytecode lines and the source code lines with a longest common subsequence algorithm [2].

The individual steps are further described below:

(1) Java bytecode line division based on Conditional Random Fields (CRF)

A Conditional Random Field (CRF) is a conditional probability distribution model of a set of output sequences given a set of input sequences, and is widely used in natural language processing. CRFs are often used for part-of-speech tagging: a sentence consists of several words whose parts of speech are drawn from a known set (nouns, verbs, etc.), and once the part of speech of each word is determined, a random field is formed. A CRF model can be trained on such labeled sequences and learn how to label parts of speech for unseen sequences.

Because a CRF can effectively segment sequences, the invention transfers the CRF model to the line division of Java bytecode. The CRF model training and line division process is illustrated in FIG. 2. By analogy with part-of-speech tagging, the invention assigns one of the following 4 tags to each smali instruction: S (a segment with a single instruction), B (the beginning of a segment), M (the middle of a segment) and E (the end of a segment). The labeled smali bytecode files are used to train a CRF model, and a prediction model is derived after enough training rounds. The trained model can then automatically assign a tag (S, B, M or E) to each instruction of a bytecode file that has no source code line information. Using the predicted tags, the invention automatically segments the corresponding bytecode. The segmented bytecode file consists of aggregation lines; each aggregation line consists of one or more consecutive smali instructions and represents one potential Java source code line.
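The tagging scheme above can be illustrated with a minimal Python sketch. It assumes a smali method is given as a list of text lines, that ".line" directives mark the start of each segment, and that other directives (lines starting with ".") can be ignored; the function names `label_bmes` and `split_by_tags` are hypothetical and not part of the invention.

```python
def label_bmes(smali_lines):
    """Tag each smali instruction S/B/M/E, using `.line` directives as segment starts."""
    segments, current = [], []
    for raw in smali_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(".line"):
            if current:
                segments.append(current)
            current = []
        elif line.startswith("."):
            continue  # assumption: other directives carry no instruction
        else:
            current.append(line)
    if current:
        segments.append(current)

    labeled = []
    for seg in segments:
        if len(seg) == 1:
            labeled.append((seg[0], "S"))
        else:
            labeled.append((seg[0], "B"))
            labeled.extend((ins, "M") for ins in seg[1:-1])
            labeled.append((seg[-1], "E"))
    return labeled


def split_by_tags(labeled):
    """Rebuild aggregation lines from (instruction, tag) pairs, e.g. predicted tags."""
    lines, current = [], []
    for ins, tag in labeled:
        if tag in ("B", "S") and current:
            lines.append(current)  # tolerate a missing E in predicted output
            current = []
        current.append(ins)
        if tag in ("S", "E"):
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines
```

`label_bmes` would produce training labels from files that still carry debugging information, while `split_by_tags` shows how predicted tags translate back into aggregation lines.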

(2) Cross-language feature extraction

This step extracts the syntactic features shared by the Java bytecode and the source code. If a Java source code line matches a smali line, the two must share certain features; conversely, Java lines and smali lines with similar features are most likely associated. The invention defines the following two properties for selecting a suitable feature set:

(1) Shareability. The selected feature should be present in both the Java source code line and the smali bytecode. For example, temporary variable names exist only in source code, so they are not suitable as a common feature of the two languages;

(2) Consistency. The feature should be extracted consistently from the source code and the bytecode. For example, a smali array-creation instruction is also generated for method calls with variable-length parameters, where the corresponding source code contains no explicit array creation, so array creation is not a suitable feature.

The invention selects 5 features (as shown in FIG. 3) that satisfy the above two properties:

Constant value: constant strings, and int-type and long-type integer values.

Called function: the function name and the parameter length, covering all function calls except those automatically generated by the compiler, such as toString, valueOf, and append.

Class member variable access: class member variable names.

Object creation: the class name of the object, covering all class names except those automatically generated by the compiler, such as Object and StringBuilder.

Special instructions: instruction types, including throw, monitor, switch, instance-of, and return.

The invention extracts these 5 kinds of features from each Java source code line and each smali aggregation line to form a per-line feature set. These feature sets are used to calculate feature similarity in the matching step.

(3) Matching between Java source code and bytecode

Matching between Java source code and bytecode is a line-level mapping between the feature-annotated Java source code lines and the smali aggregation lines.

The input of the matching module is the feature-annotated sequence of Java source code lines and the feature-annotated sequence of smali aggregation lines. The aim of matching is to find a one-to-one mapping between the two sequences, which is in essence the classic longest common subsequence problem. Before the longest common subsequence can be found, an equivalence principle between sequence elements must be defined; the invention uses feature similarity to approximately measure equivalence between sequence elements. Let s and b be the feature sets extracted from a source code line and a smali aggregation line respectively. Feature similarity is calculated with the Jaccard formula and compared against T_{LineSimilarity}, a predefined threshold between 0 and 1:

IsEquivalent(s, b): Jaccard_Sim(s, b) >= T_{LineSimilarity}

The larger the Jaccard value, the more features the source code line and the smali aggregation line share and the more likely they are to match; the smaller the value, the less likely they are to match. On this basis, feature similarity serves as the equivalence principle in the matching process.
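As a minimal sketch of this equivalence test, the Python below computes the Jaccard similarity of two feature sets and compares it with the threshold. The encoding of features as strings, the default threshold of 0.5, and the treatment of two empty sets as non-matching are assumptions made for illustration; the text does not specify them.

```python
def jaccard_sim(s, b):
    """Jaccard similarity |s & b| / |s | b| of two feature sets."""
    s, b = set(s), set(b)
    if not s and not b:
        return 0.0  # assumption: two empty feature sets are not considered a match
    return len(s & b) / len(s | b)


def is_equivalent(s, b, threshold=0.5):
    """IsEquivalent(s, b): Jaccard_Sim(s, b) >= T_LineSimilarity (threshold is illustrative)."""
    return jaccard_sim(s, b) >= threshold
```

For example, the feature sets {"a", "b"} and {"b", "c"} share one of three distinct features, so their similarity is 1/3 and they would not be treated as equivalent under a threshold of 0.5.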

After the equivalence principle between a source code line and a smali aggregation line is defined, the source code of an entire Java method and the corresponding smali bytecode method are mapped as a whole. The mapping follows three principles:

each source code line is associated with at most one smali aggregation line, and vice versa;

matching preserves order: the smali aggregation line matched to a source code line must not precede the smali aggregation line matched to an earlier source code line;

as many source code lines and smali aggregation lines as possible should be matched.

The invention uses two rounds of matching to find the best matching result, as described below:

First round of matching: the three mapping principles above turn the matching problem into finding the longest common subsequence between sequences in two different languages. The invention therefore uses a longest common subsequence algorithm to optimally match the sequence of Java source code lines against the sequence of smali aggregation lines.
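The first round can be sketched in Python as a longest common subsequence computation in which element equality is replaced by the similarity threshold. This sketch uses the textbook dynamic programming formulation for clarity rather than any particular optimized algorithm; the function names, the string encoding of features, and the threshold value are assumptions.

```python
def jaccard(s, b):
    union = s | b
    return len(s & b) / len(union) if union else 0.0


def lcs_match(src_feats, smali_feats, threshold=0.5):
    """Return order-preserving (source index, smali index) pairs of maximum count."""
    n, m = len(src_feats), len(smali_feats)

    def eq(i, j):
        return jaccard(src_feats[i], smali_feats[j]) >= threshold

    # dp[i][j] = size of the best matching between src_feats[i:] and smali_feats[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if eq(i, j):
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table forward to recover one optimal set of pairs.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if eq(i, j):
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs
```

Each index appears in at most one pair and the pairs are increasing in both components, which corresponds to the one-to-one and order-preserving principles; maximizing the number of pairs corresponds to the third principle.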

Second round of matching: because of compound statements, not all source code lines can be matched to smali statement segments in the first round. For example, one statement line in the Java source code may be divided into two or more aggregation lines in the smali bytecode, so that the feature similarity between each smali aggregation line and the source code line stays below the T_{LineSimilarity} threshold and the match fails. To handle compound statements, the invention traverses the smali statements left unmapped by the first round with a variable-length sliding window. If the feature similarity between the smali statement segment in the window and an unmatched Java source code line exceeds the T_{LineSimilarity} threshold, a mapping is established. As shown in FIG. 4, the sliding window starts scanning from the unpaired smali regions.

Corresponding to the association method, the invention also relates to a cross-Java bytecode and source code line association system based on feature similarity matching. The system includes three modules: (1) a Java bytecode line division module based on Conditional Random Fields (CRF); (2) a cross-language feature extraction module; (3) a line matching module between the Java bytecode and the source code. The three modules perform the three steps of the method respectively.

Compared with existing cross-language association techniques that rely on debugging information, the invention automatically associates Java bytecode that has no debugging information with its source code, fills a gap in this field, and greatly facilitates work such as code vulnerability localization and patch detection.

Drawings

FIG. 1 is the cross-language line association architecture diagram of the invention.

FIG. 2 illustrates the CRF model training and line division process.

FIG. 3 shows the feature classes shared by the two languages.

FIG. 4 shows the two-round matching process using sliding windows.

Detailed Description

The invention designs a Java bytecode line division module based on Conditional Random Fields (CRF), which first trains a CRF model and then uses the model to automatically and accurately divide bytecode that has no source code line number information. The invention also designs a feature extractor for Java source code and smali bytecode, which extracts the necessary syntactic features from each of the two languages; the distance between features represents the similarity between lines of the two languages. Finally, the invention designs a matching module between the Java source code and the bytecode, which optimally matches the source code lines of a whole Java method with the bytecode aggregation lines. This section describes specific implementations of these modules.

1. Automatic bytecode line division

The process segments the smali file without the debugging information. The process mainly comprises the steps of inputting a large number of smali instructions marked with line number information to train a model, and then automatically segmenting lines of the smali instructions without debugging information by using the trained model.

Training of the CRF model: the training set of the CRF model consists of smali files containing debugging information, each of which has a ".line xx" marker at the beginning of each aggregation line, as shown in FIG. 2. Based on these .line markers, the invention first assigns one of the following 4 tags to each smali instruction: S (segment with a single instruction), B (start of a segment), M (middle of a segment) and E (end of a segment). The invention also normalizes the operands of all smali instructions. Because differing operands give smali instructions a practically unbounded number of formats, and so many instruction formats would seriously degrade the training accuracy of the CRF model, the invention removes the register operands of all instructions.
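The register removal described above might look like the following Python sketch. It assumes registers are written as v0, v1, ... or p0, p1, ... and appear only as operands separated by spaces, commas, or braces; the function name `normalize` and the exact output layout are illustrative choices, not the format used by the invention.

```python
import re

# A register operand: v<N> or p<N>, delimited by whitespace, a comma, or braces.
_REGISTER = re.compile(r"(?<=[\s{,])[vp]\d+(?=[\s,}]|$)")
# Braces that held only registers (including ranges such as {v0 .. v5}).
_EMPTY_BRACES = re.compile(r"\{[\s,.]*\}")


def normalize(instruction):
    """Drop register operands so instructions differing only in registers look identical."""
    text = _REGISTER.sub("", instruction.strip())
    text = _EMPTY_BRACES.sub("", text)
    # Note: commas inside string literals are not protected in this sketch.
    text = re.sub(r"\s*,\s*", " ", text)
    return " ".join(text.split())
```

With this normalization, "const/4 v0, 0x0" and "const/4 v3, 0x0" both become "const/4 0x0", which reduces the number of distinct tokens the model has to learn.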

The final training data was built from smali files with debugging information, taken from 23 Android ROMs and 2,064 Maven software packages. The invention extracts about 1 million labeled smali methods as the training set and 10 million labeled smali methods as the test set. The model is implemented with CRF++ (an open source implementation of CRF) and trained with the "cost parameter" and "termination criterion" set to 1 and 0.0001 respectively. Finally, the trained model takes smali instructions as input and outputs labeled smali instructions, which is equivalent to completing the line division.
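CRF++ reads training data as whitespace-separated columns, one token per line, with the label in the last column and a blank line between sequences. The Python sketch below writes labeled methods in that layout. The choice of two feature columns (opcode and remaining operand text) is an assumption for illustration, since the text does not describe the column layout or the feature template that was used. The cost parameter and termination criterion mentioned above presumably correspond to the `-c` and `-e` options of CRF++'s `crf_learn` tool.

```python
def to_crfpp(methods):
    """Serialize methods, each a list of (normalized_instruction, tag), as CRF++ columns."""
    out = []
    for method in methods:
        for instruction, tag in method:
            opcode, _, operand = instruction.partition(" ")
            # Columns must not contain whitespace; "_" stands for "no operand".
            operand = operand.replace(" ", "") or "_"
            out.append(f"{opcode}\t{operand}\t{tag}")
        out.append("")  # a blank line terminates one sequence (one method)
    return "\n".join(out) + "\n"
```

Treating each smali method as one sequence keeps segment boundaries from leaking across methods.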

2. Feature extraction from Java source code and smali

The feature extractor defines 5 shared features in two different languages, constant values, called functions, class member variable access, object creation, and special instructions. The extraction process of the features is divided into feature extraction of Java source code lines and feature extraction of smali aggregation lines.

(1) Java source code feature extraction

The Java language has complex syntax such as anonymous inner classes, nested inner classes, etc. In order to resolve the characteristics of each statement in a Java source code line, a Java project can be compiled into an executable file, and then relevant characteristics in the source code can be extracted by resolving byte code information, but building the project is not a fully automatic process and requires frequent manual intervention. Even with package managers such as Maven and Gradle, it takes a lot of time to manually compile them into binary files.

The invention chooses to extract features by directly parsing the Java grammar. The invention adopts an open source Spoon tool to extract Java grammar characteristics. Firstly, generating Abstract Syntax Tree (AST) from a Java source file, and then giving a Java line to be inquired; the extractor can traverse the AST structure to extract the desired features.

The extractor must also handle additional cases when extracting Java source code features. Static variables in the Java source code may be optimized into constant values by the compiler when compiled into bytecode. For example, "UserHandle.USER_OWNER" may appear as 0x0 in the smali file, so the features of the Java source code differ from those of the bytecode. To handle this case, the Java feature extractor performs text normalization on the source code: it first builds a global constant table by parsing all Java source code files, and if a constant name appears in the table, it uses the constant's value to construct the feature.
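The constant table idea can be sketched in Python as follows. The invention parses Java source with Spoon; this sketch instead uses a regular expression as a stand-in, handles only simple `static final int/long` declarations with literal values, and assumes one top-level class per source text. The feature string format ("field:Class.NAME", "const:value") and all function names are hypothetical.

```python
import re

_CLASS_DECL = re.compile(r"\bclass\s+(\w+)")
_CONST_DECL = re.compile(
    r"static\s+final\s+(?:int|long)\s+(\w+)\s*=\s*"
    r"(-?(?:0[xX][0-9a-fA-F]+|0|[1-9]\d*))L?\s*;"
)


def build_constant_table(sources):
    """Map 'Class.NAME' to its integer value for every constant found in the sources."""
    table = {}
    for text in sources:
        cls = _CLASS_DECL.search(text)
        if not cls:
            continue
        for name, literal in _CONST_DECL.findall(text):
            table[f"{cls.group(1)}.{name}"] = int(literal, 0)
    return table


def normalize_features(features, table):
    """Replace field-access features that name a known constant with the constant value."""
    out = set()
    for feat in features:
        kind, _, name = feat.partition(":")
        if kind == "field" and name in table:
            out.add(f"const:{table[name]}")
        else:
            out.add(feat)
    return out
```

After this normalization, a source line that references the constant by name yields the same constant feature as the smali line in which the compiler has already inlined the value.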

(2) Smali feature extraction

In the process of extracting the features of the smali aggregation line, dexlib is used for analyzing the smali file. dexlib is an open source Java byte code parsing library which provides a rich interface to traverse byte code class, method, instruction and other information. The invention accesses all information in the smali file, such as class, method, instruction, label and the like, through a related interface provided by the dexlib.
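As a rough illustration of what the smali-side extractor produces, the Python sketch below derives four of the five feature kinds from the text of one aggregation line. It is a stand-in based on regular expressions, not the dexlib-based implementation described above. Constant values are left out because they require register tracking. The feature string format, the decision to skip `<init>` calls (on the assumption that object creation already covers them), and the exclusion lists are assumptions for illustration.

```python
import re

_INVOKE = re.compile(r"^invoke-\S+\s+\{[^}]*\},\s*L[^;]+;->([\w$<>]+)\(([^)]*)\)")
_PARAM = re.compile(r"\[*(?:[ZBSCIJFD]|L[^;]+;)")  # one parameter type descriptor
_FIELD = re.compile(r"^[is](?:get|put)\S*\s+.*->([\w$]+):")
_NEW = re.compile(r"^new-instance\s+[vp]\d+,\s*L(?:[^;/]+/)*([^;/]+);")
_SPECIAL = {
    "throw": "throw", "monitor": "monitor", "packed-switch": "switch",
    "sparse-switch": "switch", "instance-of": "instance-of", "return": "return",
}
SKIP_CALLS = {"toString", "valueOf", "append", "<init>"}
SKIP_CLASSES = {"Object", "StringBuilder"}


def smali_features(aggregation_line):
    """Collect call, field, object-creation and special-instruction features."""
    feats = set()
    for raw in aggregation_line:
        ins = raw.strip()
        call, field, new = _INVOKE.match(ins), _FIELD.match(ins), _NEW.match(ins)
        if call:
            if call.group(1) not in SKIP_CALLS:
                nargs = len(_PARAM.findall(call.group(2)))
                feats.add(f"call:{call.group(1)}/{nargs}")
        elif field:
            feats.add(f"field:{field.group(1)}")
        elif new:
            if new.group(1) not in SKIP_CLASSES:
                feats.add(f"new:{new.group(1)}")
        else:
            for prefix, kind in _SPECIAL.items():
                if ins.startswith(prefix):
                    feats.add(f"special:{kind}")
    return feats
```

The Java-side extractor would need to emit feature strings in the same format so that the two feature sets can be compared directly.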

Extraction of constant features also requires additional processing. Constants are typically stored in registers, and smali instructions then reference those registers to access the constant. The smali feature extractor therefore cannot obtain a constant directly from the operand value. To extract constant values from smali instructions, the smali feature extractor uses constant propagation analysis to build the relationship between constants and registers. This process first scans the entire method to build a table that records every virtual register that has been assigned a constant and has not been overwritten by subsequent instructions. When an operand of a scanned smali instruction refers to a constant, the extractor only needs to look it up in this pre-built mapping table.
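A much simplified version of this register tracking can be sketched in Python as a single linear scan. The sketch assumes that any instruction outside a small list of read-only opcode prefixes overwrites its first register operand, and it conservatively forgets all constants at labels because it does not follow control flow. Both rules, the opcode list, and the function name are assumptions for illustration and do not reproduce the analysis used by the invention.

```python
import re

_CONST = re.compile(r"^const(?:-string|-wide)?(?:/\S+)?\s+([vp]\d+),\s*(.+)$")
_REG = re.compile(r"(?<=[\s{,])[vp]\d+(?=[\s,}]|$)")
# Hypothetical simplification: opcodes with these prefixes only read their registers.
_READ_ONLY = ("invoke", "if-", "return", "throw", "iput", "sput", "aput",
              "monitor", "packed-switch", "sparse-switch", "fill-array-data")


def resolve_constants(instructions):
    """For each instruction, list the constant values held by the registers it reads."""
    table = {}      # register -> constant literal currently known to be in it
    resolved = []
    for raw in instructions:
        ins = raw.strip()
        if not ins or ins.startswith(":"):
            if ins.startswith(":"):
                table.clear()  # a label may be reached from elsewhere
            resolved.append([])
            continue
        const = _CONST.match(ins)
        if const:
            table[const.group(1)] = const.group(2)
            resolved.append([])
            continue
        opcode = ins.split(None, 1)[0]
        regs = _REG.findall(ins)
        writes_first = not opcode.startswith(_READ_ONLY)
        reads = regs[1:] if writes_first else regs
        resolved.append([table[r] for r in reads if r in table])
        if writes_first and regs:
            table.pop(regs[0], None)  # the register no longer holds a known constant
    return resolved
```

In the example used to check this sketch, a call that passes v0 and v1 after `const/4 v0, 0x0` and `const-string v1, "hi"` resolves to the two literals, while a later use of v0 after `move-result v0` resolves to nothing.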

3. Matching between Java source code and bytecode

This process completes the cross-language matching of a whole Java method. After the CRF model is trained, it accurately divides the bytecode file without debugging information into lines. The smali feature extractor and the Java feature extractor then extract the common features of every line, producing two ordered, feature-annotated sequences.

The first round of matching in the association process uses Myers' algorithm, a longest common subsequence algorithm, to find the longest common subsequence, i.e. the best match between the source code lines and the smali aggregation lines.

The invention then performs a second round of matching for the source code lines that failed to match. For each unmatched source code line, the correlator searches for matching candidates among all unmatched smali instructions. Specifically, for each source code line the correlator sets a sliding window of variable length to enumerate all possible sequences of smali instructions. The correlator calculates the similarity between every possible sliding window and the source code line, and then selects the window with the highest similarity. If the similarity between the smali instructions in the selected sliding window and the source code line exceeds the predefined T_{LineSimilarity} threshold, they are marked as a match and the instructions within the sliding window are excluded from subsequent searches.
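The second round can be sketched in Python as follows. The sketch assumes the unmatched smali units are given with their original positions so that a window only spans contiguous unmatched units. The `max_window` bound, the default threshold, and the use of `>=` (to agree with the IsEquivalent definition) are assumptions, as are the function and parameter names.

```python
def jaccard(s, b):
    union = s | b
    return len(s & b) / len(union) if union else 0.0


def second_round(unmatched_src, unmatched_smali, threshold=0.5, max_window=3):
    """Match leftover source lines to windows of contiguous leftover smali units.

    unmatched_src:   list of (source_line_no, feature_set)
    unmatched_smali: list of (position, feature_set)
    Returns {source_line_no: [positions in the chosen window]}.
    """
    available = dict(unmatched_smali)  # position -> features not yet claimed
    mapping = {}
    for line_no, feats in unmatched_src:
        best_score, best_window = 0.0, None
        for start in sorted(available):
            merged, window = set(), []
            for pos in range(start, start + max_window):
                if pos not in available:
                    break  # windows never cross matched or missing positions
                merged |= available[pos]
                window.append(pos)
                score = jaccard(feats, merged)
                if score > best_score:
                    best_score, best_window = score, list(window)
        if best_window is not None and best_score >= threshold:
            mapping[line_no] = best_window
            for pos in best_window:
                del available[pos]  # claimed units are excluded from later searches
    return mapping
```

Merging the feature sets inside the window is what lets a compound source statement, whose features are spread over several smali aggregation lines, reach the threshold that no single aggregation line reached in the first round.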

Finally, through the two rounds of the line association algorithm, an accurate mapping table is established between the smali aggregation lines and the Java source code lines of the whole function. Security analysts can use this mapping table for security analysis tasks such as bug fixing and patch detection.

Claims

1. A cross Java byte code and source code line correlation method based on feature similarity matching is characterized by comprising the following specific steps:

(1) Java bytecode line division based on Conditional Random Fields (CRF): first, a large number of Java bytecode files annotated with debugging information are collected and input into a CRF model for training; the trained CRF model is used to automatically and accurately segment bytecode files that have no source code line information;

(2) cross-language feature extraction; extracting features in Java byte codes and source codes; the features here refer to features shared between a single row of source code and a single Java bytecode aggregation row;

(3) line matching between the Java bytecode and the source code, namely solving the optimal matching result between the Java bytecode lines and the source code lines by using a longest common subsequence algorithm.

2. The method according to claim 1, wherein the Conditional Random Field (CRF) in step (1) is a conditional probability distribution model of a set of output sequences given a set of input sequences, and the CRF model is transferred to the line division of Java bytecode on the basis that a CRF can effectively segment sequences; the CRF model training and line division process comprises the following steps: according to the .line markers, one of the following 4 tags is assigned to each smali instruction: S: segment with a single instruction, B: start of a segment, M: middle of a segment, E: end of a segment; the labeled smali bytecode files are used to train a CRF model, and a prediction model is derived after enough training rounds; the trained model can then automatically assign a tag, S, B, M or E, to each instruction of a bytecode file that has no source code line information; using the predicted tags, the segmentation of the corresponding bytecode is completed automatically; the segmented bytecode file consists of aggregation lines, each aggregation line consists of one or more consecutive smali instructions, and one smali aggregation line represents one potential Java source code line.

3. The method according to claim 2, wherein the cross-language feature extraction in step (2) is to extract a syntactic feature shared by the Java bytecode and the source code; if the Java source code line is matched with the corresponding smali line, certain characteristics are necessarily shared between the Java source code line and the corresponding smali line, and conversely, the Java line and the smali line with similar characteristics are most likely to be associated; the following two attributes are defined to select a suitable feature set:

(1) shareability, i.e., the selected feature should be present in both the Java source code line and the smali bytecode;

(2) consistency, i.e., the appropriate features consistently extracted from the source code and the bytecode;

selecting 5 characteristics according with the two attributes:

constant value: containing constant character strings, int type and long type integer values;

the function called: function names and parameter lengths, including all function calls except for functions automatically generated by the compiler;

class member variable access: class member variable names;

object creation: the class name of the object, including all class names except the one automatically generated by the compiler;

the special instructions are: instruction types including throw, monitor, switch, instance-of, return;

5 kinds of features in each line of Java source code and each line of smali aggregation line are extracted and form a line of feature set, and the feature sets are used for calculating feature similarity in the matching step.

4. The method according to claim 3, wherein the matching between the Java source code and the bytecode in the step (3) is mapping between lines on the characteristic-marked Java source code line and the smali line;

the matching aims to find a one-to-one mapping relation between the two sequences, which is the classic longest common subsequence problem; before finding the longest common subsequence, an equivalence principle between sequence elements is defined; that is, the equivalence between the sequence elements is approximately measured by using the feature similarity; assuming that s and b are two feature sets extracted from a source code line and a smali aggregation line respectively, feature similarity is calculated by adopting a Jaccard formula, wherein T_{LineSimilarity} is a predefined threshold between 0 and 1:

IsEquivalent(s, b): Jaccard_Sim(s, b) >= T_{LineSimilarity}

when the value calculated by Jaccard is larger, the code line and the sequence of the smali aggregation line share more common characteristics, the code line and the sequence of the smali aggregation line are most likely to be matched, otherwise, the code line and the sequence of the smali aggregation line are less likely to be matched; according to the principle, the feature similarity is used as an equivalence principle in the matching process;

after an equivalence principle between a code line and a byte code sequence is defined, integrally mapping a source code of the whole Java method and a corresponding smali byte code method; the mapping principle is as follows:

as many source code lines and smali aggregation lines as possible need to be matched;

two rounds of line matching are adopted to find the best matching result, which specifically comprises the following steps:

the first round of matching: according to the three mapping principles, the matching problem can be converted into a problem of finding the longest common subsequence between two different language sequences, namely, the optimal matching of a Java source code line sequence and a smali bytecode sequence is carried out by using the longest common subsequence algorithm;

the second round of matching: due to the existence of compound statements, not all source code lines can be matched with a smali statement segment in the first round of matching; for this purpose, the smali statements that could not be mapped in the first round are traversed with a variable-length sliding window, and if the feature similarity between the smali statement segment in the window and an unmatched Java source code line exceeds the T_{LineSimilarity} threshold, a mapping relationship is established; the sliding window starts scanning from unpaired smali regions.