CN109324971B - Software data flow analysis method based on intermediate language and taint analysis - Google Patents


Info

Publication number
CN109324971B
Authority
CN
China
Prior art keywords
rec
expression
instruction
taint
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811156016.6A
Other languages
Chinese (zh)
Other versions
CN109324971A (en
Inventor
喻波
杨强
乐泰
唐勇
解炜
周旭
罗艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201811156016.6A priority Critical patent/CN109324971B/en
Publication of CN109324971A publication Critical patent/CN109324971A/en
Application granted granted Critical
Publication of CN109324971B publication Critical patent/CN109324971B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/36: Preventing errors by testing or debugging software
    • G06F11/3604: Software analysis for verifying properties of programs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a software data flow analysis method based on an intermediate language and taint analysis, comprising the following steps. Step S1: defining an instruction format and an expression format; describing the general instruction types of the intermediate language uniformly, and constructing temporary variable expressions, register expressions and taint mark expressions for representing data during taint analysis. Step S2: establishing taint propagation rules based on the intermediate language, with the taint mark denoted taint_label. Step S3: dynamically tracking and analyzing the flow of program data based on the intermediate language. Step S4: based on the taint information tracked during program execution, constructing the data flow relationships between the taint source and the global variables T_G, the local variables T_L and the system-call function parameters T_F. The method has the advantages of better accuracy, more comprehensive coverage and richer information.

Description

Software data flow analysis method based on intermediate language and taint analysis
Technical Field
The invention relates mainly to data flow analysis methods based on taint analysis, and in particular to a software data flow analysis method based on an intermediate language and taint analysis.
Background
Taint analysis is a key technology in malicious code detection, software supply chain security, software vulnerability mining and related areas. Taint analysis makes it possible to perform data flow analysis on a software program; the main stages are taint marking of the input data, taint propagation tracking and analysis of the resulting taint information. Taint analysis is applied mainly in three ways: first, in software vulnerability mining and malicious code analysis; second, to construct the relationships among input data, key check points, path constraints and internal variables, in support of software vulnerability mining and supply chain security; third, to establish the relationship between input data and internal system function calls so as to infer behavioral characteristics, mainly for software supply chain security and malicious code analysis. Software data flow analysis is the basis of these specific applications of taint analysis.
However, existing taint analysis techniques have shortcomings when applied to software data flow analysis. Common taint analysis methods design their taint propagation rules on top of assembly language. Although they handle instruction types such as shift, arithmetic and constant instructions correctly, it is difficult to establish accurate taint propagation rules for instructions that implicitly affect the flags register and the control flow, such as jump and comparison instructions. Because instruction sets differ, such methods are also hard to apply to cross-instruction-set software analysis, for example cross-instruction-set vulnerability retrieval. A further shortcoming is that, during taint propagation, common taint analysis methods record only taint sources and taint marks and lose semantic information such as the operational relationships among tainted data.
From the perspective of practical application, behavior analysis of malicious code requires analyzing the relationship between the remote-control input data of the malicious code and its internal system calls in order to construct a malicious behavior model. In software vulnerability mining based on fuzz testing, the relationship between input data and internal data flow has to be mined through software data flow analysis, and the relationships between internal variables have to be recorded and tracked at fine granularity. Existing taint analysis methods cannot meet this need for fine-grained, accurate software data flow analysis. Software data flow analysis must track the taint propagation process accurately and comprehensively, and must also capture richer information than the taint source and the taint mark alone.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the technical problems in the prior art, the invention provides a software data flow analysis method based on an intermediate language and taint analysis that offers better accuracy, more comprehensive coverage and richer information.
In order to solve the technical problems, the invention adopts the following technical scheme:
A software data flow analysis method based on an intermediate language and taint analysis comprises the following steps:
Step S1: defining an instruction format and an expression format; describing the general instruction types of the intermediate language uniformly, and constructing temporary variable expressions, register expressions and taint mark expressions for representing data during taint analysis;
Step S2: establishing taint propagation rules based on the intermediate language, with the taint mark denoted taint_label;
Step S3: dynamically tracking and analyzing the flow of program data based on the intermediate language;
Step S4: based on the taint information tracked during program execution, constructing the data flow relationships between the taint source and the global variables T_G, the local variables T_L and the system-call function parameters T_F.
As a further improvement of the invention, step S1 includes:
S101: representation of the general instruction format of the intermediate language;
arithmetic and shift instruction class: a subtraction instruction Sub, an addition instruction Add, a logical right shift instruction Sar, a logical left shift instruction Shl, and an OR instruction Or;
memory access class: a memory read instruction Load and a memory write instruction Store;
register operation class: a register write instruction Put and a register read instruction Get;
comparison class: a comparison instruction Cmp;
branch class: a conditional branch IF-THEN-ELSE instruction;
data bit width conversion class: a 16-bit to 32-bit instruction 16Uto32, an 8-bit to 32-bit instruction 8Uto32, a 1-bit to 32-bit instruction 1Uto32, and a 32-bit to 1-bit instruction 32to1;
the address of the currently executing instruction is denoted PC;
S102: representation of the expression formats used to record taint information;
storage expression record table: denoted STORE_REC, used to record reads and writes of memory addresses;
stack-bottom pointer record table: denoted BP_REC, used to save the stack-bottom pointer when analyzing code that uses stack-based function calls, and to help identify the parameters passed in and the parameters received by the caller during a function call; BP_REC is a stack structure, where Peek returns the last element on the stack, Pop removes the last element from the stack, and Push appends an element to the end of the stack;
stack-top pointer record table: denoted SP_REC, used to save the stack-top pointer when analyzing code that uses stack-based function calls, and to help identify the parameters passed in and the parameters passed on during a function call; SP_REC is also a stack structure, where Peek returns the last element on the stack, Pop removes the last element from the stack, and Push appends an element to the end of the stack;
register expression record table: denoted REG_REC, used to record register reads and writes during taint tracking; registers are indexed by their sequence number in the table and, because registers are used only temporarily, the table keeps only the latest value for each register;
temporary variable expression record table: denoted TMP_REC, used to record temporary operation information during taint tracking other than memory reads and writes and register accesses;
global taint table: denoted GLOBALTAINT_REC, used to record global taint information during taint tracking.
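The record tables of S102 can be pictured as plain dictionaries and stacks. The following is a minimal Python sketch, not the patent's implementation: the class names `RecStack` and `TaintRecords`, the field names, and the choice of dictionaries and tuples to hold expressions are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


class RecStack:
    """Stack used for BP_REC and SP_REC; the 'last element' is the top."""

    def __init__(self) -> None:
        self._items: List[int] = []

    def push(self, item: int) -> None:
        # Push: append an element to the end of the stack.
        self._items.append(item)

    def pop(self) -> int:
        # Pop: remove and return the last element.
        return self._items.pop()

    def peek(self) -> int:
        # Peek: return the last element without removing it.
        return self._items[-1]


@dataclass
class TaintRecords:
    """Hypothetical container for the six record tables of S102."""
    store_rec: Dict[Any, Any] = field(default_factory=dict)         # STORE_REC
    bp_rec: RecStack = field(default_factory=RecStack)              # BP_REC
    sp_rec: RecStack = field(default_factory=RecStack)              # SP_REC
    reg_rec: Dict[int, Any] = field(default_factory=dict)           # REG_REC
    tmp_rec: Dict[str, Any] = field(default_factory=dict)           # TMP_REC
    global_taint_rec: Dict[Any, int] = field(default_factory=dict)  # GLOBALTAINT_REC


records = TaintRecords()
records.bp_rec.push(0x1000)
records.bp_rec.push(0x2000)
# REG_REC keeps only the latest expression per register index.
records.reg_rec[4] = ("Add", "t1", 8)
records.reg_rec[4] = ("Sub", "t2", 4)
```

Keying REG_REC by register number means a second write simply overwrites the first, which matches the stated rule that only the latest value is kept per register.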
As a further improvement of the invention, the taint propagation rules in step S2 include the following categories:
S201: for ordinary arithmetic instructions and shift instructions, if some source data s carries a taint mark, the destination data t is tainted data, that is, t.taint_label = s.taint_label;
S202: for register read and register write instructions, if some source data s carries a taint mark, the destination data t is tainted data, that is, t.taint_label = s.taint_label;
S203: for a comparison instruction, if some source data s of the comparison carries a taint mark, the destination data t is tainted data, that is, t.taint_label = s.taint_label;
S204: for a conditional branch instruction, if some source data s of the branch condition carries a taint mark, the destination data t is tainted data, that is, t.taint_label = s.taint_label;
S205: for a memory read instruction, if the accessed memory address is a global variable s, the taint mark of the destination data t inherits the taint mark of s, that is, t.taint_label = s.taint_label; if the accessed memory address is an expression and that expression is marked as tainted, the value returned by the read is tainted data, that is, t.taint_label = s.taint_label;
S206: for a memory write instruction, if the accessed memory data s carries a taint mark, the resulting data t is tainted data, that is, t.taint_label = s.taint_label;
S207: for a data bit width conversion instruction, if the source data s carries a taint mark, the destination data t is tainted data, that is, t.taint_label = s.taint_label;
S208: in all other cases, where none of the above conditions holds, the taint mark of the destination data t is cleared, that is, t.taint_label = 0.
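Rules S201 to S207 share one shape: the destination inherits the label of a tainted source, and S208 clears it otherwise. The sketch below illustrates that shape only. It assumes labels are integers with 0 meaning clean, and it merges several tainted sources with a bitwise OR; the text does not say how multiple tainted sources are combined, so the merge and the function name `propagate` are assumptions.

```python
from functools import reduce
from typing import Iterable

CLEAN = 0  # rule S208: untainted data has taint_label 0


def propagate(source_labels: Iterable[int]) -> int:
    """Return the destination taint_label for one instruction.

    With a single tainted source the result equals that source's label,
    as in t.taint_label = s.taint_label. OR-merging several tainted
    sources is an assumption of this sketch.
    """
    return reduce(lambda acc, label: acc | label, source_labels, CLEAN)


# One tainted operand (label 1) and one clean operand.
dest_label = propagate([0, 1])
```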
As a further improvement of the invention, the dynamic tracking analysis process in step S3 includes:
S301: for a given program or a specified function, determining the taint source S, whose types include parameter taint sources S_A, global variable taint sources S_G and user input taint sources S_I; the choice of taint source may be specified by configuration;
S302: converting the target program into the intermediate language with an existing intermediate language conversion tool, and tracking and analyzing how the CPU processes the intermediate language instructions.
As a further improvement of the invention, the taint tracking in step S302 proceeds as follows:
S3021: initializing the storage expression record table STORE_REC to empty, the stack-bottom pointer record stack BP_REC to empty, the stack-top pointer record stack SP_REC to empty, the BP value to the address PC of the first instruction, the register expression record table REG_REC to empty, the temporary variable expression record table TMP_REC to empty, and the global taint table GLOBALTAINT_REC to empty;
S3022: for each executed instruction, performing taint tracking according to the instruction type.
As a further improvement of the invention, the instruction types handled in step S3022 include:
S30221: if the instruction is an arithmetic instruction, a shift instruction or a data bit width conversion instruction, it can be written as value = OP(EXP_A1, EXP_A2, EXP_A3, ..., EXP_An). First, the taint mark of the destination data is computed according to the taint propagation rules by checking the parameters EXP_A1, EXP_A2, EXP_A3, ..., EXP_An. Second, a taint expression is constructed according to the instruction type, that is, the expression TMP_REC[value] = OP(EXP_A1, EXP_A2, EXP_A3, ..., EXP_An) is added to the temporary variable expression record table TMP_REC, where OP is the current instruction symbol;
S30222: for a memory write instruction Store(addr) = value, the destination address being written and the expression type of the data to be written are checked, and the following cases are distinguished:
if the destination address addr has the stack-top form "SP + immvalue" and immvalue is positive, the current instruction is placing a parameter on the stack in preparation for passing it, and the expression STORE_REC[addr] = <SP, immvalue, TMP_REC[value]> is created in the storage expression record table;
if the destination address addr has the form "BP + immvalue" and immvalue is negative, the current instruction is writing a local variable value onto the stack, and the expression STORE_REC[addr] = <BP, immvalue, TMP_REC[value]> is created in the storage expression record table;
if the destination address addr has the form "BP + immvalue" and immvalue is positive, the current instruction is reading the return parameter of a sub-function on the stack, and the expression STORE_REC[addr] = <SP, immvalue, TMP_REC[value]> is created in the storage expression record table;
if the expression of the data to be written has the form "regID" and the destination address has the form "BP - immvalue" with immvalue positive, the operation is writing BP on entry to the sub-function code; the BP value is pushed onto BP_REC and BP is updated to the current instruction address PC;
for all other memory store instructions, the expression STORE_REC[addr] = TMP_REC[value] is created in the storage expression record table;
S30223: for a memory read instruction value = Load(addr), the address being accessed and its expression type are checked, and the following cases are distinguished:
if addr is an immediate of the form "immvalue", data is being loaded from a specific address, and the expression TMP_REC[value] = <Load, addr> is added to the temporary variable expression record table TMP_REC;
if addr has the form "<Add, regSP, immvalue>" and immvalue is positive, there are two cases. First, if the expression <SP, immvalue, BP> exists in STORE_REC, the result of a function is being taken from the stack; the expression <SP, immvalue, BP> is fetched from STORE_REC and used as the result of the Load instruction, that is, the expression TMP_REC[value] = STORE_REC[<SP, immvalue, BP>] is added to TMP_REC. Second, if <SP, immvalue, BP> has no corresponding record in STORE_REC, the expression TMP_REC[value] = Load(TMP_REC[addr]) is added to TMP_REC;
if addr has the form "<Add, regBP, immvalue>" and immvalue is negative, the instruction is reading the value of a local variable, and there are two cases. First, if the expression <regBP, immvalue, BP> has a corresponding record in STORE_REC, that expression is fetched from STORE_REC and used as the result of the Load instruction, that is, the expression TMP_REC[value] = STORE_REC[<regBP, immvalue, BP>] is added to TMP_REC. Second, if <regBP, immvalue, BP> has no corresponding record in STORE_REC, an error is reported and the analysis exits, because the program under analysis is trying to read an uninitialized local variable;
if addr has the form "<Add, regBP, immvalue>" and immvalue is positive, the instruction is reading the value of an incoming parameter, and there are two cases. First, if the expression <regSP, immvalue, BP_REC.Peek()> has a corresponding record in STORE_REC, that expression is fetched from STORE_REC and used as the result of the Load instruction, that is, the expression TMP_REC[value] = STORE_REC[<regSP, immvalue, BP_REC.Peek()>] is added to TMP_REC. Second, if <regSP, immvalue, BP_REC.Peek()> has no corresponding record in STORE_REC, an error is reported and the analysis exits, because the program under analysis is trying to read an uninitialized input parameter;
S30224: for a register write instruction Put(regID) = value, the expression of value is examined. If value is an immediate immvalue, the record REG_REC[regID] = immvalue is added to the register expression record table. If value is not an immediate, the following cases are distinguished:
if regID is the SP register, a push operation is indicated, and the record REG_REC[regID] = TMP_REC[value] is added to the register expression record table;
if regID is the BP register and value is an expression of the form <Load, EXP>, the ebp value is being restored at the end of a function call; the current BP is set to BP_REC.Peek() and the pop operation BP_REC.Pop() is performed, and at the same time SP is set to SP_REC.Peek() and the pop operation SP_REC.Pop() is performed;
if regID is the BP register and value is an expression of the form <Sub, EXP>, the push operation BP_REC.Push(SP) is performed first, and BP and SP are then both set to the current instruction address, that is, BP = SP = PC;
if regID is a register other than SP and BP, then if the expression of value has the form <Load, BP, imm>, REG_REC[regID] = STORE_REC[<regBP, imm>] is set; otherwise REG_REC[regID] = TMP_REC[value];
S30225: for a register read instruction value = Get(regID), it is checked whether regID has a record in REG_REC. If it does, the expression TMP_REC[value] = <REG_REC[regID]> is added to TMP_REC; if it does not, an initial expression regID_INIT is created for that register and the expression TMP_REC[value] = <regID_INIT> is added to the temporary variable expression record table TMP_REC.
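The handlers for S30221, S30224 and S30225 can be sketched with two dictionaries standing in for TMP_REC and REG_REC. This is a simplified illustration, not the patent's implementation. The function names, the tuple encoding of expressions and the register name `eax` are assumptions. The angle brackets around stored expressions are dropped, and the SP and BP special cases and the <Load, BP, imm> case of S30224 are deliberately left out.

```python
from typing import Any, Dict

tmp_rec: Dict[str, Any] = {}  # TMP_REC: temporary name -> expression
reg_rec: Dict[str, Any] = {}  # REG_REC: register name -> latest expression


def on_op(value: str, op: str, *args: Any) -> None:
    """S30221: record TMP_REC[value] = OP(EXP_A1, ..., EXP_An)."""
    tmp_rec[value] = (op, *args)


def on_put(reg_id: str, value: Any) -> None:
    """S30224, general-register case only (SP/BP handling omitted).

    An integer is treated as an immediate; any other value is looked up
    in TMP_REC as a temporary name.
    """
    reg_rec[reg_id] = value if isinstance(value, int) else tmp_rec[value]


def on_get(value: str, reg_id: str) -> None:
    """S30225: reuse the recorded expression, or create regID_INIT."""
    if reg_id in reg_rec:
        tmp_rec[value] = reg_rec[reg_id]
    else:
        tmp_rec[value] = f"{reg_id}_INIT"


# t1 = Get(eax); t2 = Add(t1, 4); Put(eax) = t2; t3 = Get(eax)
on_get("t1", "eax")
on_op("t2", "Add", "t1", 4)
on_put("eax", "t2")
on_get("t3", "eax")
```

After this sequence `t3` carries the `Add` expression rather than only a taint bit, which is the operational information the method aims to retain alongside the taint mark.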
As a further improvement of the invention, step S4 is as follows:
S401: defining the set of information to be analyzed, T_I = T_G ∪ T_L ∪ T_F, where T_G = {T_g1, T_g2, ..., T_gm}, T_L = {T_l1, T_l2, ..., T_lm}, T_F = {T_f1, T_f2, ..., T_fm};
S402: for each element t of T_I, constructing a data flow relationship.
As a further improvement of the invention, the process of step S402 includes:
S4021: if t is a global variable, the storage expression record table STORE_REC is queried for an expression indexed by t. If one exists, exp = STORE_REC[t] is taken; the taint mark of the global variable t is then GLOBALTAINT_REC[exp], and the data flow relationship from the taint source to the global variable t is the computation over the instruction sequence expressed by exp;
S4022: if t is a local variable, the temporary variable expression record table TMP_REC is queried for an expression indexed by t. If one exists, exp = TMP_REC[t] is taken; the taint mark of the local variable t is then GLOBALTAINT_REC[exp], and the data flow relationship from the taint source to the local variable t is the computation over the instruction sequence expressed by exp;
S4023: if t is a parameter of a system function call and t is of register type, the register expression record table REG_REC is queried for an expression indexed by t. If one exists, exp = REG_REC[t] is taken; the taint mark of the parameter t is then GLOBALTAINT_REC[exp], and the data flow relationship from the taint source to the parameter t is the computation over the instruction sequence expressed by exp. If the parameter t is passed on the stack and has the form "<BP, imm>", the expression exp = <SP, TMP_REC[imm]> is constructed and exp1 = STORE_REC[exp] is taken; the taint mark of the parameter t is then GLOBALTAINT_REC[exp1], and the data flow relationship from the taint source to the parameter t is the computation over the instruction sequence expressed by exp1.
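The lookups in S4021 to S4023 follow one pattern: pick the table that matches the kind of t, fetch its expression, then read the taint mark from GLOBALTAINT_REC. A minimal sketch of that pattern follows. The function name `data_flow`, the string kinds, the tuple encoding of expressions and the default label 0 for a missing entry are assumptions, and the stack-passed parameter case of S4023 is omitted.

```python
from typing import Any, Dict, Optional, Tuple


def data_flow(
    t: Any,
    kind: str,
    store_rec: Dict[Any, Any],
    tmp_rec: Dict[Any, Any],
    reg_rec: Dict[Any, Any],
    global_taint_rec: Dict[Any, int],
) -> Optional[Tuple[Any, int]]:
    """Return (exp, taint_label) for t, or None if nothing is recorded.

    kind is "global" (S4021), "local" (S4022) or "reg_param"
    (S4023, register-passed parameters only).
    """
    table = {"global": store_rec, "local": tmp_rec, "reg_param": reg_rec}[kind]
    if t not in table:
        return None
    exp = table[t]
    # Assumption: an expression absent from GLOBALTAINT_REC is clean (0).
    return exp, global_taint_rec.get(exp, 0)


# Hypothetical recorded state: a global at 0x8049000 holds input + 1.
store_rec = {0x8049000: ("Add", "input", 1)}
global_taint_rec = {("Add", "input", 1): 1}
result = data_flow(0x8049000, "global", store_rec, {}, {}, global_taint_rec)
```

The returned expression is the data flow relationship itself: it states how the variable was computed from the taint source, not merely that it is tainted.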
Compared with the prior art, the invention has the following advantages:
Compared with assembly languages such as X86, MIPS and ARM, an intermediate language not only standardizes and unifies the description of operation instructions but also has rich expressive power. An intermediate language such as VEX can use explicit intermediate language instructions to express how a comparison instruction in X86 assembly implicitly affects the flags register, and so captures more detailed instruction semantics. First, the invention uses this strength of the intermediate language in expressing program instruction semantics to define the taint propagation rules. Second, by establishing a fine-grained taint tracking method that records the operational relationships in the data flow, it meets the requirements of software data flow analysis for accuracy, comprehensiveness and rich information.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and specific examples.
As shown in FIG. 1, the software data flow analysis method of the invention, based on an intermediate language and taint analysis, includes the following steps:
Step S1: defining an instruction format and an expression format.
To describe the taint tracking process based on the intermediate language, the general instruction types of the intermediate language are first described uniformly, and formats such as temporary variable expressions, register expressions and taint mark expressions are constructed for representing data during taint analysis.
Step S2: establishing taint propagation rules based on the intermediate language, with the taint mark denoted taint_label;
Step S3: dynamically tracking and analyzing the flow of program data based on the intermediate language;
Step S4: based on the taint information tracked during program execution, constructing the data flow relationships between the taint source and the global variables T_G, the local variables T_L and the system-call function parameters T_F.
In a specific application example, step S1 includes:
S101: representation of the general instruction format of the intermediate language;
arithmetic and shift instruction class: a subtraction instruction Sub, an addition instruction Add, a logical right shift instruction Sar, a logical left shift instruction Shl, and an OR instruction Or;
memory access class: a memory read instruction Load and a memory write instruction Store;
register operation class: a register write instruction Put and a register read instruction Get;
comparison class: a comparison instruction Cmp;
branch class: a conditional branch IF-THEN-ELSE instruction;
data bit width conversion class: a 16-bit to 32-bit instruction 16Uto32, an 8-bit to 32-bit instruction 8Uto32, a 1-bit to 32-bit instruction 1Uto32, a 32-bit to 1-bit instruction 32to1, etc.;
in addition, the address of the currently executing instruction is denoted PC.
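The instruction classes of S101 amount to a lookup from instruction name to class, which is what a tracker would dispatch on. The sketch below shows one way to write that table. The enum `InstrClass`, its member names and the helper `classify` are assumptions; only the instruction names come from the text.

```python
from enum import Enum


class InstrClass(Enum):
    ARITH_SHIFT = "arithmetic and shift"
    MEMORY = "memory access"
    REGISTER = "register operation"
    COMPARE = "comparison"
    BRANCH = "branch"
    WIDTH_CONV = "data bit width conversion"


# Instruction names as listed in S101.
INSTR_CLASS = {
    "Sub": InstrClass.ARITH_SHIFT,
    "Add": InstrClass.ARITH_SHIFT,
    "Sar": InstrClass.ARITH_SHIFT,
    "Shl": InstrClass.ARITH_SHIFT,
    "Or": InstrClass.ARITH_SHIFT,
    "Load": InstrClass.MEMORY,
    "Store": InstrClass.MEMORY,
    "Put": InstrClass.REGISTER,
    "Get": InstrClass.REGISTER,
    "Cmp": InstrClass.COMPARE,
    "IF-THEN-ELSE": InstrClass.BRANCH,
    "16Uto32": InstrClass.WIDTH_CONV,
    "8Uto32": InstrClass.WIDTH_CONV,
    "1Uto32": InstrClass.WIDTH_CONV,
    "32to1": InstrClass.WIDTH_CONV,
}


def classify(name: str) -> InstrClass:
    """Return the S101 class of an instruction; raises KeyError if unknown."""
    return INSTR_CLASS[name]
```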
S102: representation of the expression formats used to record taint information;
storage expression record table: denoted STORE_REC, used to record reads and writes of memory addresses;
stack-bottom pointer record table: denoted BP_REC, used to save the stack-bottom pointer when analyzing code that uses stack-based function calls, and to help identify the parameters passed in and the parameters received by the caller during a function call; BP_REC is a stack structure, where Peek returns the last element on the stack, Pop removes the last element from the stack, and Push appends an element to the end of the stack.
Stack-top pointer record table: denoted SP_REC, used to save the stack-top pointer when analyzing code that uses stack-based function calls, and to help identify the parameters passed in and the parameters passed on during a function call; SP_REC is also a stack structure, where Peek returns the last element on the stack, Pop removes the last element from the stack, and Push appends an element to the end of the stack.
Register expression record table: denoted REG_REC, used to record register reads and writes during taint tracking; registers are indexed by their sequence number in the table and, because registers are used only temporarily, the table keeps only the latest value for each register;
temporary variable expression record table: denoted TMP_REC, used to record temporary operation information during taint tracking other than memory reads and writes and register accesses;
global taint table: denoted GLOBALTAINT_REC, used to record global taint information during taint tracking.
In a specific application example, the taint propagation rules in step S2 include the following categories:
S201: for ordinary arithmetic instructions and shift instructions, if some source data s carries a taint mark, the destination data t is tainted data, that is, t.taint_label = s.taint_label;
S202: for register read and register write instructions, if some source data s carries a taint mark, the destination data t is tainted data, that is, t.taint_label = s.taint_label;
S203: for a comparison instruction, if some source data s of the comparison carries a taint mark, the destination data t is tainted data, that is, t.taint_label = s.taint_label;
S204: for a conditional branch instruction, if some source data s of the branch condition carries a taint mark, the destination data t is tainted data, that is, t.taint_label = s.taint_label;
S205: for a memory read instruction, if the accessed memory address is a global variable s, the taint mark of the destination data t inherits the taint mark of s, that is, t.taint_label = s.taint_label; if the accessed memory address is an expression and that expression is marked as tainted, the value returned by the read is tainted data, that is, t.taint_label = s.taint_label;
S206: for a memory write instruction, if the accessed memory data s carries a taint mark, the resulting data t is tainted data, that is, t.taint_label = s.taint_label;
S207: for a data bit width conversion instruction, if the source data s carries a taint mark, the destination data t is tainted data, that is, t.taint_label = s.taint_label;
S208: in all other cases, where none of the above conditions holds, the taint mark of the destination data t is cleared, that is, t.taint_label = 0.
In a specific application example, the dynamic tracking analysis process in step S3 is as follows:
S301: for a given program or a specified function, determining the taint source S, whose types include parameter taint sources S_A, global variable taint sources S_G and user input taint sources S_I; the choice of taint source may be specified by configuration;
S302: converting the target program into the intermediate language with an existing intermediate language conversion tool, and tracking and analyzing how the CPU processes the intermediate language instructions, where the taint tracking proceeds as follows:
S3021: initializing the storage expression record table STORE_REC to empty, the stack-bottom pointer record stack BP_REC to empty, the stack-top pointer record stack SP_REC to empty, the BP value to the address PC of the first instruction, the register expression record table REG_REC to empty, the temporary variable expression record table TMP_REC to empty, and the global taint table GLOBALTAINT_REC to empty;
S3022: for each executed instruction, performing taint tracking according to the following instruction types:
S30221: if the instruction is an arithmetic instruction, a shift instruction or a data bit width conversion instruction, it can be written as value = OP(EXP_A1, EXP_A2, EXP_A3, ..., EXP_An). First, the taint mark of the destination data is computed according to the taint propagation rules by checking the parameters EXP_A1, EXP_A2, EXP_A3, ..., EXP_An. Second, a taint expression is constructed according to the instruction type, that is, the expression TMP_REC[value] = OP(EXP_A1, EXP_A2, EXP_A3, ..., EXP_An) is added to the temporary variable expression record table TMP_REC, where OP is the current instruction symbol;
S30222: for the memory write instruction Store(addr) = value, the destination address and the expression type of the data to be written are checked, and the following determination is made: if the destination address addr is of the form "SP + immvalue" (stack top) and immvalue is positive, the current instruction is preparing a parameter to be passed on the stack, and the expression STORE_REC[addr] = <SP, immvalue, TMP_REC[value]> is created in the storage expression record table; if the destination address addr is of the form "BP + immvalue" and immvalue is negative, the current instruction is writing the value of a local variable onto the stack, and the expression STORE_REC[addr] = <BP, immvalue, TMP_REC[value]> is created in the storage expression record table; if the destination address addr is of the form "BP + immvalue" and immvalue is positive, the current instruction is reading the return parameter of the subfunction on the stack, and the expression STORE_REC[addr] = <SP, immvalue, TMP_REC[value]> is created in the storage expression record table; if the expression of the data to be written is of the form "regID" and the destination address is of the form "BP - immvalue" with immvalue positive, BP is being written on entry to the subfunction code, so the BP value is pushed onto BP_REC and the BP value is updated to the current instruction address PC; for all other memory store instructions, the expression STORE_REC[addr] = TMP_REC[value] is created in the storage expression record table.
S30223: for the memory read instruction value = Load(addr), the expression type of the address being read is checked, and the following determination is made: if addr is an immediate of the form "immvalue", data is being loaded from a specific address, and the expression TMP_REC[value] = <Load, addr> is added to the temporary variable expression record table TMP_REC; if addr is of the form "<Add, regSP, immvalue>" and immvalue is positive, there are two cases: in the first case, the expression <SP, immvalue, BP> has a record in STORE_REC, which indicates that the result of a function is being fetched from the stack, so the record for <SP, immvalue, BP> is fetched from STORE_REC as the result of the Load instruction, i.e., the expression TMP_REC[value] = STORE_REC[<SP, immvalue, BP>] is added to TMP_REC; in the second case, the expression <SP, immvalue, BP> has no record in STORE_REC, and the expression TMP_REC[value] = Load(TMP_REC[addr]) is added to the temporary variable expression record table TMP_REC; if addr is of the form "<Add, regBP, immvalue>" and immvalue is negative, the instruction is reading the value of a local variable, and there are two cases: if the expression <regBP, immvalue, BP> has a record in STORE_REC, that record is fetched from STORE_REC as the result of the Load instruction, i.e., the expression TMP_REC[value] = STORE_REC[<regBP, immvalue, BP>] is added to TMP_REC; if the expression <regBP, immvalue, BP> has no record in STORE_REC, an error is reported and the analysis exits, because the program under analysis is attempting to read an uninitialized local variable; if addr is of the form "<Add, regBP, immvalue>" and immvalue is positive, the instruction is reading the value of an incoming parameter, and there are two cases: if the expression <regSP, immvalue, BP_REC.Peek()> has a record in STORE_REC, that record is fetched from STORE_REC as the result of the Load instruction, i.e., the expression TMP_REC[value] = STORE_REC[<regSP, immvalue, BP_REC.Peek()>] is added to TMP_REC; if the expression <regSP, immvalue, BP_REC.Peek()> has no record in STORE_REC, an error is reported and the analysis exits, because the program under analysis is attempting to read an uninitialized input parameter.
S30224: for the register write instruction Put(regID) = value, the expression of value is examined. If value is an immediate immvalue, the record REG_REC[regID] = immvalue is added to the register expression record table REG_REC. If value is not an immediate, the determination proceeds by the following categories: if regID is the SP register, a push operation is indicated, and the record REG_REC[regID] = TMP_REC[value] is added to the register expression record table; if regID is the BP register and value is an expression of the form <Load, EXP>, the ebp value is being restored after a function call has ended, so the current BP is set to BP_REC.Peek() and BP_REC is popped with BP_REC.Pop(), and at the same time SP is set to SP_REC.Peek() and SP_REC is popped with SP_REC.Pop(); if regID is the BP register and value is an expression of the form <Sub, EXP>, the push operation BP_REC.Push(SP) is performed first, and then BP and SP are both set to the current instruction address, i.e., BP = SP = PC; if regID is a register other than SP and BP, the determination is as follows: if the expression of value is of the form <Load, BP, imm>, then REG_REC[regID] = STORE_REC[<regBP, imm>]; otherwise REG_REC[regID] = TMP_REC[value].
S30225: for the register read instruction value = Get(regID), it is determined whether regID has a record in REG_REC. If a corresponding record exists, the expression TMP_REC[value] = <REG_REC[regID]> is added to TMP_REC; if no corresponding record exists, an initial expression regID_INIT is created for that register, and the expression TMP_REC[value] = <regID_INIT> is added to the temporary variable expression record table TMP_REC.
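A minimal Python sketch of part of this tracking loop is given below, covering only the initialization (S3021), operator instructions (S30221), register reads (S30225), and two branches of memory reads (S30223). It is an illustration, not the patented implementation: the class `Tracker`, the method names, the tuple encoding of expressions, and the keying of STORE_REC by `(base, imm, BP)` tuples are all assumptions made for the example, and the stack-frame bookkeeping of S30222 and S30224 is deliberately left out.

```python
class UninitializedRead(Exception):
    """Raised when the analyzed program reads a local variable never stored."""


class Tracker:
    def __init__(self, first_pc):
        # S3021: every record table starts empty; BP starts at the first PC.
        self.store_rec = {}         # STORE_REC
        self.bp_rec = []            # BP_REC (used as a stack)
        self.sp_rec = []            # SP_REC (used as a stack)
        self.reg_rec = {}           # REG_REC
        self.tmp_rec = {}           # TMP_REC
        self.global_taint_rec = {}  # GLOBALTAINT_REC
        self.bp = first_pc

    def on_op(self, value, op, *args):
        # S30221: TMP_REC[value] = OP(EXP_A1, ..., EXP_An); operands that are
        # known temporaries are replaced by their recorded expressions.
        self.tmp_rec[value] = (op,) + tuple(self.tmp_rec.get(a, a) for a in args)

    def on_get(self, value, reg_id):
        # S30225: reuse the register's record, or create regID_INIT.
        self.tmp_rec[value] = self.reg_rec.get(reg_id, f"{reg_id}_INIT")

    def on_load(self, value, addr):
        # S30223, immediate address: TMP_REC[value] = <Load, addr>.
        if isinstance(addr, int):
            self.tmp_rec[value] = ("Load", addr)
            return
        op, base, imm = addr
        # S30223, local variable: address of the form <Add, regBP, negative imm>.
        if op == "Add" and base == "regBP" and imm < 0:
            key = ("regBP", imm, self.bp)
            if key not in self.store_rec:
                raise UninitializedRead(f"local at BP{imm:+d} read before write")
            self.tmp_rec[value] = self.store_rec[key]
            return
        raise NotImplementedError("remaining S30223 branches are omitted here")
```

With `t = Tracker(first_pc=0x80483E4)`, calling `t.on_get("t0", "reg28")` records `reg28_INIT` for `t0`, and a following `t.on_op("t4", "Sub32", "t0", 0x10)` records the nested expression `("Sub32", "reg28_INIT", 16)`.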
In a specific application example, the step S4 is as follows:
S401: define the set of information to be analyzed T_I = T_G ∪ T_L ∪ T_F, where T_G = {T_g1, T_g2, ..., T_gm}, T_L = {T_l1, T_l2, ..., T_lm}, and T_F = {T_f1, T_f2, ..., T_fm};
S402: for each element t in T_I, the data flow relationship is constructed as follows:
S4021: if t is a global variable, the storage expression record table STORE_REC is queried for an expression indexed by t. If one exists, take exp = STORE_REC[t]; the taint mark of the global variable t is then GLOBALTAINT_REC[exp], and the data flow relationship from the taint source to the global variable t is the computational relationship represented by the expression exp built from the instruction sequence;
S4022: if t is a local variable, the temporary variable expression record table TMP_REC is queried for an expression indexed by t. If one exists, take exp = TMP_REC[t]; the taint mark of the local variable t is then GLOBALTAINT_REC[exp], and the data flow relationship from the taint source to the local variable t is the computational relationship represented by the expression exp built from the instruction sequence;
S4023: if t is a parameter of a system function call and t is of register type, the register expression record table REG_REC is queried for an expression indexed by t. If one exists, take exp = REG_REC[t]; the taint mark of the parameter t is then GLOBALTAINT_REC[exp], and the data flow relationship from the taint source to the parameter t is the computational relationship represented by the expression exp built from the instruction sequence. If the parameter t is a stack-passed parameter of the form "<BP, imm>", the expression exp = <SP, TMP_REC[imm]> is constructed and exp1 = STORE_REC[exp] is taken; the taint mark of the parameter t is then GLOBALTAINT_REC[exp1], and the data flow relationship from the taint source to the parameter t is the computational relationship represented by the expression exp1 built from the instruction sequence.
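The lookups in S4021 to S4023 amount to choosing a record table by the kind of t and then reading the taint mark of the recorded expression. The sketch below shows this for global variables, local variables, and register-type parameters; the function name `data_flow_of`, the `kind` strings, and the choice to treat a missing GLOBALTAINT_REC entry as 0 are assumptions for illustration, and the stack-passed parameter case is omitted.

```python
def data_flow_of(t, kind, store_rec, tmp_rec, reg_rec, global_taint_rec):
    """Return (expression, taint_label) for t, or None if t has no record.

    kind is "global" (S4021), "local" (S4022) or "register_param" (S4023).
    Expressions are assumed to be hashable (strings or nested tuples).
    """
    table = {
        "global": store_rec,
        "local": tmp_rec,
        "register_param": reg_rec,
    }[kind]
    if t not in table:
        return None
    exp = table[t]
    # Assumption: an expression absent from the taint table is untainted.
    return exp, global_taint_rec.get(exp, 0)
```

The returned expression is the data flow relationship itself; the taint label only says whether that relationship reaches a taint source.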
The following is a specific application example of the present invention. The VEX IR intermediate language with strong expression capability is taken as an example to explain how the invention carries out software data flow analysis based on the proposed taint analysis method. In contrast to assembly language, VEX IR provides a single-entry, multi-exit form of code blocks, similar to the intermediate representation of a compiler. The registers of the VEX IR are denoted by t0, t1, and the like. The VEX IR based taint analysis method and software dataflow analysis process are explained as follows.
Step S1: the VEX IR-based instruction format and expression format are defined.
(1.1) VEX IR Universal instruction Format representation
Arithmetic and shift instruction class: the subtraction instruction Sub, the addition instruction Add, the logical right shift instruction Sar, the logical left shift instruction Shl, the OR instruction Or, and the assignment instruction. Depending on the operand bit width, different extended instructions are used; for example, the Add instruction has the three concrete forms Add16, Add32 and Add64. These extended instructions do not affect the taint propagation rules.
Memory access class: the 32-bit memory read instruction LDle:I32, the 8-bit memory read instruction LDle:I8, and the memory write instruction STle;
Register operation class: the register write instruction Put and the register read instruction Get:I32;
Comparison class: the 32-bit equality compare instruction CmpEQ32 and the 32-bit less-than-or-equal compare instruction CmpLE32;
Branch class: the conditional branch IF-THEN-ELSE instruction, the X86-specific flag register computation instruction x86g_calculate_eflags_c, the ARM-specific flag register computation instruction armg_calculate_flag_z, and so on;
Data bit width conversion class: the 16-bit to 32-bit instruction 16Uto32, the 8-bit to 32-bit instruction 8Uto32, the 1-bit to 32-bit instruction 1Uto32, the 32-bit to 1-bit instruction 32to1, etc.;
in addition, the address of the currently executed instruction is represented by PC.
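Because the bit-width variants share one propagation rule, a tracker can strip the width suffix before choosing a rule. The following sketch shows one way to do that; the class labels, the table `CLASSES`, and the function `classify` are hypothetical names for this example and are not part of VEX or of the method as described.

```python
import re

# Base operation name -> instruction class (labels are illustrative only).
CLASSES = {
    "Sub": "arith_shift", "Add": "arith_shift", "Sar": "arith_shift",
    "Shl": "arith_shift", "Or": "arith_shift",
    "LDle": "memory", "STle": "memory",
    "Put": "register", "PUT": "register",
    "Get": "register", "GET": "register",
    "CmpEQ": "compare", "CmpLE": "compare",
}

# Width conversions such as 16Uto32, 8Uto32, 1Uto32 and 32to1.
WIDTH_CONV = re.compile(r"^\d+[US]?to\d+$")


def classify(op: str) -> str:
    """Map an operation name to its class, ignoring the bit-width suffix."""
    if WIDTH_CONV.match(op):
        return "width_conv"
    base = re.sub(r"(:I)?\d+$", "", op)  # drop a trailing "32" or ":I32"
    return CLASSES.get(base, "other")
```

Under this mapping Add16, Add32 and Add64 all fall into the same class, which matches the statement that the extended instructions do not change the propagation rules.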
(1.2) presentation of taint information record expression Format
According to the description and format definitions in the technical solution, taint information record expressions comprising the following are established: the storage expression record table STORE_REC, the stack-bottom pointer record table BP_REC, the stack-top pointer record table SP_REC, the register expression record table REG_REC, the temporary variable expression table TMP_REC, and the global taint table GLOBALTAINT_REC. The address of the currently executing instruction is denoted by PC.
Step S2: construct the taint propagation rules based on the VEX IR intermediate language. These rules are consistent with the taint propagation rules described in the technical solution.
Step S3: dynamic tracking analysis of program data flow based on the VEX IR intermediate representation. The tracking process depends on the actual program being analyzed, so this part uses the program example shown in the following table to explain the taint tracking process. In the table, part (a) is the C language code, part (b) is the X86 assembly language code, and part (c) is the VEX IR intermediate language representation of the program produced by the VEX conversion tool.
[Figure: (a) C language example code]
[Figure: (b) X86 assembly language example]
[Figure: (c) Corresponding VEX IR representation]
(3.1) This step defines the global variable c as the taint source, corresponding to the VEX code in part (c) of the table. Because c is a pointer to the data structure com, and accesses to the members of c touch the address 0x0804A01C and the adjacent address 0x0804A020, the taint source is defined as a global variable taint source S_G; that is, the two DWORD memory blocks at addresses 0x0804A01C and 0x0804A020 are taint data;
(3.2) The X86 assembly code is converted into a VEX intermediate representation using the existing open-source tool pyVEX. The VEX code in part (c) of the table is executed using the symbolic execution engine of angr, and angr's processing of the VEX instructions is tracked and analyzed at the same time. The specific taint tracking flow is as follows:
(3.2.1) Initialize the storage expression record table STORE_REC to empty, initialize the stack-bottom pointer record stack BP_REC to empty, initialize the stack-top pointer record stack SP_REC to empty, initialize the BP value to the address PC of the first instruction, initialize the register expression record table REG_REC to empty, initialize the temporary variable expression record table TMP_REC to empty, and initialize the global taint table GLOBALTAINT_REC to empty;
(3.2.2) For each executed VEX instruction in part (c) of the table, taint tracking is performed according to the following instruction types:
(3.2.2.1) For the addition instruction Add32, the subtraction instruction Sub32 and the assignment instruction "=", such as the instruction t4 = Sub32(t3, 0x00000010) at 0x80483EA: according to the taint propagation rules, the taint calculation is performed by checking the parameters t3 and 0x00000010, and the destination data t4 is marked accordingly. At the same time, the expression TMP_REC[t3] of t3 is extracted, and the expression TMP_REC[t4] = Sub32(TMP_REC[t3], 0x00000010) is added to the temporary variable expression record table TMP_REC;
(3.2.2.2) For a memory write instruction that writes data t1 to destination address t0, such as the memory write instruction STle(0x0804A01C) = 0x00000001 at 0x80483ED: the taint mark of t1 is checked; if t1 is taint data, then t0 is taint data, and the global taint table is updated so that GLOBALTAINT_REC[TMP_REC[t1]] = 1. Since the address 0x0804A01C is marked as a taint source, the record GLOBALTAINT_REC["0x0804A01C"] = 1 must be added to the global taint table. At the same time, the following checks are made: whether the destination address t0 is an expression of the form "SP + OFFSET" (stack top), whether it is an expression of the form "BP + immvalue" with immvalue negative, and whether it is an expression of the form "BP + immvalue" with immvalue positive. Here the destination address t0 is a constant, so the expression STORE_REC["0x0804A01C"] = '0x00000001' is created directly in the storage expression record table.
(3.2.2.3) For the memory read instruction LDle, such as t5 = LDle:I32(0x0804A01C) at 0x8048401: the expression type of the address being read is checked. Here addr is an immediate of the form "immvalue", meaning that data is loaded from the specific address 0x0804A01C. Taint propagation is performed first: since the expression '0x0804A01C' exists in the global taint table and GLOBALTAINT_REC["0x0804A01C"] = 1, the record GLOBALTAINT_REC[<LDle:I32, "0x0804A01C">] = 1 must be added to the global taint table. At the same time, the expression TMP_REC[t5] = <LDle:I32, "0x0804A01C"> is added to the temporary variable expression record table TMP_REC.
(3.2.2.4) For a register write instruction, such as PUT(offset=44) = t5: it is first determined whether the expression of t5 is an immediate immvalue, the SP register, or the BP register; here t5 is a temporary register. The taint determination is made during the analysis: the expression of t5 is obtained by querying TMP_REC, giving exp = TMP_REC[t5] = <LDle:I32, "0x0804A01C">, and the global taint table is queried for GLOBALTAINT_REC['<LDle:I32, "0x0804A01C">']. Since t5 is taint data, register 44 is also taint data, so the record GLOBALTAINT_REC['reg44'] = 1 is added to the global taint table, and then the record REG_REC[reg44] = TMP_REC[t5] = <LDle:I32, "0x0804A01C"> is added to the register record table.
(3.2.2.5) For a register read instruction, such as t0 = GET:I32(offset=28) at 0x80483E4: it is determined whether register reg28 has a record in REG_REC. If a corresponding record exists, the expression TMP_REC[t0] = <REG_REC[reg28]> is added to TMP_REC; if no corresponding record exists, an initial expression reg28_INIT is created for the register and the expression TMP_REC[t0] = <reg28_INIT> is added to the temporary variable expression table TMP_REC.
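The three concrete instructions discussed above (the GET at 0x80483E4, the LDle at 0x8048401, and the PUT to offset 44) can be replayed with plain dictionaries to see how the tables fill in. This is a hand-written illustration, not output from angr or pyVEX; the helper names `ldle`, `put` and `get`, and the use of tuples for expressions, are assumptions for the example.

```python
# Record tables; the two DWORDs of the global variable c are the taint source.
tmp_rec, reg_rec = {}, {}
global_taint_rec = {"0x0804A01C": 1, "0x0804A020": 1}


def ldle(dst, addr):
    """dst = LDle:I32(addr) with an immediate address (step 3.2.2.3)."""
    expr = ("LDle:I32", addr)
    tmp_rec[dst] = expr
    if global_taint_rec.get(addr):
        global_taint_rec[expr] = 1  # the loaded value inherits the taint


def put(offset, src):
    """PUT(offset) = src, where src is a temporary (step 3.2.2.4)."""
    expr = tmp_rec[src]
    reg = f"reg{offset}"
    reg_rec[reg] = expr
    if global_taint_rec.get(expr):
        global_taint_rec[reg] = 1   # the register is now tainted too


def get(dst, offset):
    """dst = GET:I32(offset) (step 3.2.2.5)."""
    reg = f"reg{offset}"
    tmp_rec[dst] = reg_rec.get(reg, f"{reg}_INIT")


get("t0", 28)             # reg28 has no record yet -> reg28_INIT
ldle("t5", "0x0804A01C")  # load from the taint source
put(44, "t5")             # reg44 receives the tainted expression
```

After these three calls, `reg_rec["reg44"]` holds the expression for the load from 0x0804A01C, and both that expression and `reg44` are marked 1 in the taint table.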
Step S4: during or after execution of the VEX code, the data flow relationships between the taint source and the global variables T_G, the local variables T_L, and the system call function parameters T_F can be constructed from the tracked taint information. Here the data flow relationship between the global variable represented by '0x0804A01C' and the second parameter t1 of the call to the system function printf (in assembly language, the write to [esp+4] at address 0x08048420) is explained. The specific steps are as follows:
(4.1) The variable t1 to be analyzed is the second parameter of the printf function call at address 0x08048427;
(4.2) for the element t1, the data flow relation construction process is as follows:
(4.2.1) firstly judging whether the type of t1 is a global variable, if not, skipping to the step 4.2.2;
(4.2.2) The type of t1 is determined to be a local variable, so the temporary variable expression record table TMP_REC is first queried for an expression indexed by t1. Because an Add32 instruction at address 0x8048420 assigns t1, t1 must have a record in TMP_REC, and exp = TMP_REC[t1] is taken. The memory store instruction at address 0x8048420 is STle(t1) = t2, so the taint mark of t1 is determined by t2; according to the instruction t2 = GET:I32(offset=16) at address 0x8048420, the taint mark of t2 is determined by register reg16; and according to the instructions t0 = LDle:I32(0x0804A020) and PUT(offset=16) = t0 at address 0x8048415, the taint mark of t2 is ultimately determined by the taint mark of memory address 0x0804A020. Since step 3.1 marked 0x0804A020 as a taint source, the taint mark calculated for t1 is 1, i.e., at this point t1 is tainted by the global variable c. Moreover, from the analysis of the instructions during taint tracking, the flow relationship from the taint source c to the local variable t1 is t1 = STle(GET:I32(LDle:I32(0x0804A020))); this expression describes the data flow computation from the taint source to the taint data.
(4.2.3) According to the instruction STle(t1) = t2 at address 0x8048420, the expression of the destination address t1 is Add32(GET:I32(reg24), 0x00000004). According to the parameter passing convention of the program under analysis, the second parameter of the printf function in this example is located at t1 = Add32(GET:I32(reg24), 0x00000004), so t1 points to the second parameter of the printf function. The taint mark of the second parameter of the system function call printf is therefore 1, and the taint relationship from this parameter to the taint source 0x0804A020 is STle(GET:I32(LDle:I32(0x0804A020))).
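The nested expression derived above can be reproduced mechanically by composing the recorded expressions instruction by instruction. In the sketch below the tuple encoding and the helpers `render` and `depends_on` are assumptions made for illustration; only the four instructions and the resulting expression come from the example.

```python
def render(expr):
    """Print a nested (op, arg, ...) tuple in the OP(arg, ...) form used above."""
    if isinstance(expr, tuple):
        op, *args = expr
        return f"{op}({', '.join(render(a) for a in args)})"
    return str(expr)


def depends_on(expr, sources):
    """True if any leaf of the expression is one of the taint source addresses."""
    if isinstance(expr, tuple):
        return any(depends_on(a, sources) for a in expr[1:])
    return expr in sources


tmp, reg = {}, {}
tmp["t0"] = ("LDle:I32", "0x0804A020")  # t0 = LDle:I32(0x0804A020)
reg["reg16"] = tmp["t0"]                # PUT(offset=16) = t0
tmp["t2"] = ("GET:I32", reg["reg16"])   # t2 = GET:I32(offset=16)
flow = ("STle", tmp["t2"])              # STle(t1) = t2

print(render(flow))  # STle(GET:I32(LDle:I32(0x0804A020)))
```

Calling `depends_on(flow, {"0x0804A020"})` returns True, which corresponds to the taint mark 1 computed for the second printf parameter.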
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention.

Claims (8)

1. A software data flow analysis method based on intermediate language and taint analysis is characterized by comprising the following steps:
step S1: defining an instruction format and an expression format; uniformly describing the general instruction type of an intermediate language, and constructing a temporary variable expression, a register expression and a taint mark expression for data representation in the taint analysis process;
step S2: establishing taint propagation rules based on the intermediate language, and expressing a taint mark by using taint_label;
step S3: dynamically tracking and analyzing the flow of program data based on the intermediate language;
step S4: constructing, based on the taint information tracked during program execution, the data flow relationships between the taint source and the global variables T_G, the local variables T_L, and the system call function parameters T_F.
2. The method for analyzing software dataflow based on intermediate language and taint analysis according to claim 1, wherein the step S1 includes:
s101: an intermediate language generic instruction format representation;
arithmetic and shift instruction class: a subtraction instruction Sub, an addition instruction Add, a logical right shift instruction Sar, a logical left shift instruction Shl, and an OR instruction Or;
memory access class: a memory read instruction Load and a memory write instruction Store;
register operation class: a register write instruction Put and a register read instruction Get;
comparison class: a compare instruction Cmp;
branch class: a conditional branch IF-THEN-ELSE instruction;
data bit width conversion class: a 16-bit to 32-bit instruction 16Uto32, an 8-bit to 32-bit instruction 8Uto32, a 1-bit to 32-bit instruction 1Uto32, and a 32-bit to 1-bit instruction 32to1;
representing the address of the current execution instruction by using PC;
S102: taint information record expression format representation;
storage expression record table: denoted by STORE_REC, used for recording reads and writes of memory addresses;
stack-bottom pointer record table: denoted by BP_REC, used for saving the stack-bottom pointer when analyzing code that uses stack-based function calls, and for assisting in identifying the incoming parameters and the parameters received by the caller during a function call; BP_REC is a stack structure, where Peek returns the top element of the stack, Pop removes the top element of the stack, and Push pushes an element onto the top of the stack;
stack-top pointer record table: denoted by SP_REC, used for saving the stack-top pointer when analyzing code that uses stack-based function calls, and for assisting in identifying the incoming parameters and the passed parameters during a function call; SP_REC is also a stack structure, where Peek returns the top element of the stack, Pop removes the top element of the stack, and Push pushes an element onto the top of the stack;
register expression: denoted by REG_REC, used for recording register reads and writes during taint tracking; registers are indexed by sequence number in the register expression record table, and the table records only the latest value of each register;
temporary variable expression: denoted by TMP_REC, used for recording temporary operation information other than memory reads and writes and register accesses during taint tracking;
global taint table: denoted by GLOBALTAINT_REC, used for recording global taint information during taint tracking.
3. The method for analyzing software dataflow based on intermediate language and taint analysis according to claim 1, wherein the taint propagation rules in step S2 include the following categories:
S201: for ordinary arithmetic instructions and shift instructions, if some source data s carries a taint mark, the destination data t is taint data, i.e., t.taint_label = s.taint_label;
S202: for register read and register write instructions, if some source data s carries a taint mark, the destination data t is taint data, i.e., t.taint_label = s.taint_label;
S203: for a compare instruction, if some source data s of the compare instruction carries a taint mark, the destination data t is taint data, i.e., t.taint_label = s.taint_label;
S204: for a conditional branch instruction, if some source data s of the branch condition carries a taint mark, the destination data t is taint data, i.e., t.taint_label = s.taint_label;
S205: for a memory address read instruction, if the accessed memory address is a global variable s, the taint mark of the destination data t inherits the taint mark of the global variable s, i.e., t.taint_label = s.taint_label; if the accessed memory address is an expression and the expression is marked as tainted, the returned read result is taint data, i.e., t.taint_label = s.taint_label;
S206: for a memory address write instruction, if the accessed memory data s carries a taint mark, the returned read result is taint data, i.e., t.taint_label = s.taint_label;
S207: for a data bit width conversion instruction, if the source data s carries a taint mark, the destination data t is taint data, i.e., t.taint_label = s.taint_label;
S208: in all other cases, where none of the above conditions is satisfied, the taint mark of the destination data t is cleared, i.e., t.taint_label = 0.
4. The method for analyzing software dataflow based on intermediate language and taint analysis as claimed in claim 1 or 2, wherein the dynamic trace analysis procedure in step S3 is as follows:
S301: for a given program or specified function, the taint source S is determined, the taint source types comprising parameter taint sources S_A, global variable taint sources S_G, and user input taint sources S_I; the choice of taint source may be specified by configuration;
S302: the target program is converted into the intermediate language using an existing intermediate language conversion tool, and the CPU's processing of the intermediate language instructions is tracked and analyzed.
5. The method for analyzing software dataflow based on intermediate language and taint analysis according to claim 4, wherein the processing flow of taint tracking in step S302 is as follows:
S3021: initialize the storage expression record table STORE_REC to empty, initialize the stack-bottom pointer record stack BP_REC to empty, initialize the stack-top pointer record stack SP_REC to empty, initialize the BP value to the address PC of the first instruction, initialize the register expression record table REG_REC to empty, initialize the temporary variable expression record table TMP_REC to empty, and initialize the global taint table GLOBALTAINT_REC to empty;
S3022: for each executed instruction, taint tracking is performed according to the predetermined instruction types.
6. The method for analyzing software dataflow based on intermediate language and taint analysis according to claim 5, wherein the predetermined instruction type in step S3022 includes:
S30221: if the instruction is an arithmetic instruction, a shift instruction, or a data bit width conversion instruction, it can be expressed as value = OP(EXP_A1, EXP_A2, EXP_A3, ..., EXP_An); first, according to the taint propagation rules, the taint mark of the destination data is calculated by checking the parameters EXP_A1, EXP_A2, EXP_A3, ..., EXP_An; second, a taint expression is constructed according to the instruction type, i.e., the expression TMP_REC[value] = OP(EXP_A1, EXP_A2, EXP_A3, ..., EXP_An) is added to the temporary variable expression record table TMP_REC, where OP is the current instruction symbol;
S30222: for the memory write instruction Store(addr) = value, the destination address and the expression type of the data to be written are checked, and the following determination is made: if the destination address addr is of the form "SP + immvalue" (stack top) and immvalue is positive, the current instruction is preparing a parameter to be passed on the stack, and the expression STORE_REC[addr] = <SP, immvalue, TMP_REC[value]> is created in the storage expression record table; if the destination address addr is of the form "BP + immvalue" and immvalue is negative, the current instruction is writing the value of a local variable onto the stack, and the expression STORE_REC[addr] = <BP, immvalue, TMP_REC[value]> is created in the storage expression record table; if the destination address addr is of the form "BP + immvalue" and immvalue is positive, the current instruction is reading the return parameter of the subfunction on the stack, and the expression STORE_REC[addr] = <SP, immvalue, TMP_REC[value]> is created in the storage expression record table; if the expression of the data to be written is of the form "regID" and the destination address is of the form "BP - immvalue" with immvalue positive, BP is being written on entry to the subfunction code, so the BP value is pushed onto BP_REC and the BP value is updated to the current instruction address PC; for all other memory store instructions, the expression STORE_REC[addr] = TMP_REC[value] is created in the storage expression record table;
S30223: for the memory read instruction value = Load(addr), the expression type of the address being read is checked, and the following determination is made: if addr is an immediate of the form "immvalue", data is being loaded from a specific address, and the expression TMP_REC[value] = <Load, addr> is added to the temporary variable expression record table TMP_REC; if addr is of the form "<Add, regSP, immvalue>" and immvalue is positive, there are two cases: in the first case, the expression <SP, immvalue, BP> has a record in STORE_REC, which indicates that the result of a function is being fetched from the stack, so the record for <SP, immvalue, BP> is fetched from STORE_REC as the result of the Load instruction, i.e., the expression TMP_REC[value] = STORE_REC[<SP, immvalue, BP>] is added to TMP_REC; in the second case, the expression <SP, immvalue, BP> has no record in STORE_REC, and the expression TMP_REC[value] = Load(TMP_REC[addr]) is added to the temporary variable expression record table TMP_REC; if addr is of the form "<Add, regBP, immvalue>" and immvalue is negative, the instruction is reading the value of a local variable, and there are two cases: if the expression <regBP, immvalue, BP> has a record in STORE_REC, that record is fetched from STORE_REC as the result of the Load instruction, i.e., the expression TMP_REC[value] = STORE_REC[<regBP, immvalue, BP>] is added to TMP_REC; if the expression <regBP, immvalue, BP> has no record in STORE_REC, an error is reported and the analysis exits, because the program under analysis is attempting to read an uninitialized local variable; if addr is of the form "<Add, regBP, immvalue>" and immvalue is positive, the instruction is reading the value of an incoming parameter, and there are two cases: if the expression <regSP, immvalue, BP_REC.Peek()> has a record in STORE_REC, that record is fetched from STORE_REC as the result of the Load instruction, i.e., the expression TMP_REC[value] = STORE_REC[<regSP, immvalue, BP_REC.Peek()>] is added to TMP_REC; if the expression <regSP, immvalue, BP_REC.Peek()> has no record in STORE_REC, an error is reported and the analysis exits, because the program under analysis is attempting to read an uninitialized input parameter;
S30224: for the register write instruction Put(regID) = value, the expression of value is examined. If value is an immediate immvalue, the record REG_REC[regID] = immvalue is added to the register expression record table REG_REC. If value is not an immediate, the following cases are distinguished: if regID is the SP register, indicating a push operation, the record REG_REC[regID] = TMP_REC[value] is added to REG_REC; if regID is the BP register and value is an expression of the form <Load, EXP>, indicating that the ebp value is being restored after a function call has ended, the current BP is set to BP_REC.Peek() and BP_REC is popped with BP_REC.Pop(), and at the same time SP is set to SP_REC.Peek() and SP_REC is popped with SP_REC.Pop(); if regID is the BP register and value is an expression of the form <Sub, EXP>, the push operation BP_REC.Push(SP) is performed on BP_REC first, and BP and SP are then both set to the current instruction address, i.e. BP = SP = PC; if regID is a register other than SP and BP, there are two cases: if the expression of value has the form <Load, BP, imm>, REG_REC[regID] = STORE_REC[<regBP, imm>] is set; otherwise REG_REC[regID] = TMP_REC[value] is set;
S30225: for the register read instruction value = Get(regID), it is determined whether regID has a record in REG_REC. If there is a corresponding record, the expression TMP_REC[value] = <REG_REC[regID]> is added to TMP_REC; if there is no corresponding record, an initial expression regID_INIT is created for that register and the expression TMP_REC[value] = <regID_INIT> is added to the temporary variable expression record table TMP_REC.
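The three instruction handlers of steps S30223 to S30225 can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: expressions are modelled as Python tuples, the record tables as dictionaries, BP_REC and SP_REC as lists, and the names State, head, handle_load, handle_put, handle_get, UninitializedRead and the initial tags "BP_INIT" and "SP_INIT" are hypothetical. The final fallback in handle_load, and ignoring BP writes of other forms, are assumptions for cases the claim does not describe.

```python
class UninitializedRead(Exception):
    """The analysed program reads a slot that has no STORE_REC record."""

class State:
    def __init__(self):
        self.tmp_rec = {}    # TMP_REC: temporary variable -> expression
        self.store_rec = {}  # STORE_REC: memory slot key -> expression
        self.reg_rec = {}    # REG_REC: register -> expression
        self.bp_rec = []     # BP_REC: stack of saved frame bases
        self.sp_rec = []     # SP_REC: stack of saved stack pointers
        self.bp = "BP_INIT"  # hypothetical initial frame-base tag
        self.sp = "SP_INIT"  # hypothetical initial stack-pointer tag

def head(expr):
    """Operator of a tuple expression, or None for atoms."""
    return expr[0] if isinstance(expr, tuple) else None

def handle_load(st, value, addr):
    """S30223: value = Load(addr); addr is an immediate or a temporary."""
    if isinstance(addr, int):                      # immediate address
        st.tmp_rec[value] = ("Load", addr)
        return
    expr = st.tmp_rec[addr]
    if head(expr) == "Add" and len(expr) == 3 and isinstance(expr[2], int):
        _, reg, imm = expr
        if reg == "SP" and imm > 0:                # value fetched from the stack
            key = ("SP", imm, st.bp)
            st.tmp_rec[value] = st.store_rec.get(key, ("Load", expr))
            return
        if reg == "BP" and imm < 0:                # local variable
            key = ("BP", imm, st.bp)
            if key not in st.store_rec:
                raise UninitializedRead(f"local variable {key}")
            st.tmp_rec[value] = st.store_rec[key]
            return
        if reg == "BP" and imm > 0:                # incoming parameter
            key = ("SP", imm, st.bp_rec[-1])       # BP_REC.Peek()
            if key not in st.store_rec:
                raise UninitializedRead(f"parameter {key}")
            st.tmp_rec[value] = st.store_rec[key]
            return
    st.tmp_rec[value] = ("Load", expr)             # assumed fallback

def handle_put(st, reg_id, value, pc):
    """S30224: Put(regID) = value; pc is the current instruction address."""
    if isinstance(value, int):                     # immediate
        st.reg_rec[reg_id] = value
        return
    expr = st.tmp_rec[value]
    if reg_id == "SP":                             # push operation
        st.reg_rec[reg_id] = expr
    elif reg_id == "BP":
        if head(expr) == "Load":                   # frame restored after a call
            st.bp = st.bp_rec.pop()                # Peek() followed by Pop()
            st.sp = st.sp_rec.pop()
        elif head(expr) == "Sub":                  # new frame
            st.bp_rec.append(st.sp)                # BP_REC.Push(SP)
            st.bp = st.sp = pc
        # other BP forms are not covered by the claim and are ignored here
    elif head(expr) == "Load" and len(expr) == 3 and expr[1] == "BP":
        st.reg_rec[reg_id] = st.store_rec[("BP", expr[2])]
    else:
        st.reg_rec[reg_id] = expr

def handle_get(st, value, reg_id):
    """S30225: value = Get(regID)."""
    if reg_id in st.reg_rec:
        st.tmp_rec[value] = st.reg_rec[reg_id]
    else:
        st.tmp_rec[value] = f"{reg_id}_INIT"       # initial expression
```

In this sketch a caller would create one State per analysed instruction sequence and dispatch each intermediate-language statement to the matching handler. The claim does not say where SP_REC is pushed, so a caller has to fill st.sp_rec before a frame restore is processed.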
7. The software data flow analysis method based on intermediate language and taint analysis according to claim 1 or 2, wherein step S4 comprises:
S401: defining the set of information to be analyzed T_I = T_G ∪ T_L ∪ T_F, where T_G = {T_g1, T_g2, ..., T_gm}, T_L = {T_l1, T_l2, ..., T_lm}, T_F = {T_f1, T_f2, ..., T_fm};
S402: for each element t of T_I, constructing a data flow relationship.
8. The software data flow analysis method based on intermediate language and taint analysis according to claim 7, wherein step S402 comprises:
S4021: if t is a global variable, the store expression record table STORE_REC is queried for an expression indexed by t; if one exists, exp = STORE_REC[t] is taken, the taint of the global variable t is marked as GLOBALTAINT_REC[exp], and the data flow relationship from the taint source to the global variable t is the computation that the expression exp represents over the instruction sequence;
S4022: if t is a local variable, the temporary variable expression record table TMP_REC is queried for an expression indexed by t; if one exists, exp = TMP_REC[t] is taken, the taint of the local variable t is marked as GLOBALTAINT_REC[exp], and the data flow relationship from the taint source to the local variable t is the computation that the expression exp represents over the instruction sequence;
S4023: if t is a parameter of a system function call, there are two cases. If t is of register type, the register expression record table REG_REC is queried for an expression indexed by t; if one exists, exp = REG_REC[t] is taken, the taint of the parameter t is marked as GLOBALTAINT_REC[exp], and the data flow relationship from the taint source to the parameter t is the computation that the expression exp represents over the instruction sequence. If t is a parameter passed on the stack, of the form "<BP, imm>", the expression exp = <SP, TMP_REC[imm]> is constructed and exp1 = STORE_REC[exp] is taken; the taint of the parameter t is then marked as GLOBALTAINT_REC[exp1], and the data flow relationship from the taint source to the parameter t is the computation that the expression exp1 represents over the instruction sequence.
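The lookup described in steps S401 to S4023 can be illustrated with a short sketch. It assumes that the record tables are dictionaries, that expressions are hashable tuples or strings, and that GLOBALTAINT_REC maps an expression to a taint label; the function names lookup_expression and build_data_flow, the kind tags, and the fallback to the raw offset when TMP_REC has no record for imm are hypothetical choices made for illustration only.

```python
def lookup_expression(t, kind, store_rec, tmp_rec, reg_rec):
    """Return the expression recorded for item t, or None if there is none."""
    if kind == "global":                          # S4021: query STORE_REC
        return store_rec.get(t)
    if kind == "local":                           # S4022: query TMP_REC
        return tmp_rec.get(t)
    if kind == "param":                           # S4023
        if isinstance(t, tuple) and len(t) == 2 and t[0] == "BP":
            _, imm = t                            # stack-passed <BP, imm>
            # Assumption: use the raw offset when TMP_REC has no record for imm.
            exp = ("SP", tmp_rec.get(imm, imm))
            return store_rec.get(exp)             # exp1 = STORE_REC[exp]
        return reg_rec.get(t)                     # register-passed parameter
    raise ValueError(f"unknown kind: {kind}")

def build_data_flow(t_g, t_l, t_f, store_rec, tmp_rec, reg_rec, globaltaint_rec):
    """S401/S402: form T_I = T_G ∪ T_L ∪ T_F and resolve every element."""
    t_i = ([(t, "global") for t in t_g]
           + [(t, "local") for t in t_l]
           + [(t, "param") for t in t_f])
    flows = {}
    for t, kind in t_i:
        exp = lookup_expression(t, kind, store_rec, tmp_rec, reg_rec)
        if exp is not None:
            # The expression is the data flow relation; the label is its taint.
            flows[t] = (exp, globaltaint_rec.get(exp))
    return flows

# Illustrative data only; none of these names come from the patent.
store_rec = {"g_len": ("Add", "INPUT", 1), ("SP", 8): ("Load", "INPUT")}
tmp_rec = {"t7": ("Mul", "INPUT", 2)}
reg_rec = {"EAX": "INPUT"}
globaltaint_rec = {
    ("Add", "INPUT", 1): "src0",
    ("Mul", "INPUT", 2): "src0",
    ("Load", "INPUT"): "src0",
    "INPUT": "src0",
}
flows = build_data_flow(["g_len", "g_unused"], ["t7"], ["EAX", ("BP", 8)],
                        store_rec, tmp_rec, reg_rec, globaltaint_rec)
```

Items with no recorded expression, such as g_unused here, simply produce no data flow entry, which mirrors the "if one exists" condition in each step.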
CN201811156016.6A 2018-09-30 2018-09-30 Software data flow analysis method based on intermediate language and taint analysis Active CN109324971B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811156016.6A CN109324971B (en) 2018-09-30 2018-09-30 Software data flow analysis method based on intermediate language and taint analysis

Publications (2)

Publication Number Publication Date
CN109324971A CN109324971A (en) 2019-02-12
CN109324971B true CN109324971B (en) 2021-06-25

Family

ID=65266237

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811156016.6A Active CN109324971B (en) 2018-09-30 2018-09-30 Software data flow analysis method based on intermediate language and taint analysis

Country Status (1)

Country Link
CN (1) CN109324971B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254945A (en) * 2021-06-08 2021-08-13 中国人民解放军国防科技大学 Static detection method, system and medium for web vulnerability based on taint analysis

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222512B (en) * 2019-05-21 2021-04-20 华中科技大学 Software vulnerability intelligent detection and positioning method and system based on intermediate language
CN111291373B (en) * 2020-02-03 2022-06-14 思客云(北京)软件技术有限公司 Method, apparatus and computer-readable storage medium for analyzing data pollution propagation
CN114968752A (en) * 2021-02-25 2022-08-30 北京嘀嘀无限科技发展有限公司 Method and device for determining assignment element, computer equipment and storage medium
CN113176990B (en) * 2021-03-25 2022-10-18 中国人民解放军战略支援部队信息工程大学 Taint analysis framework and method supporting correlation analysis among data
CN115329346B (en) * 2022-10-09 2023-03-24 支付宝(杭州)信息技术有限公司 Method and device for detecting side channel loophole

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103200203A (en) * 2013-04-24 2013-07-10 中国人民解放军理工大学 Semantic-level protocol format inference method based on execution trace
CN103714288A (en) * 2013-12-26 2014-04-09 华中科技大学 Data stream tracking method
US9245125B2 (en) * 2014-02-27 2016-01-26 Nec Laboratories America, Inc. Duleak: a scalable app engine for high-impact privacy leaks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080184208A1 (en) * 2007-01-30 2008-07-31 Sreedhar Vugranam C Method and apparatus for detecting vulnerabilities and bugs in software applications

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Binary code defect detection based on tainted pointers; Liu Jie et al.; Computer Engineering (《计算机工程》); 20121231; full text *

Also Published As

Publication number Publication date
CN109324971A (en) 2019-02-12

Similar Documents

Publication Publication Date Title
CN109324971B (en) Software data flow analysis method based on intermediate language and taint analysis
CN109426615B (en) Inter-process null pointer dereference detection method, system, device, and medium
CN110287702B (en) Binary vulnerability clone detection method and device
US10839312B2 (en) Warning filter based on machine learning
CN104636256A (en) Memory access abnormity detecting method and memory access abnormity detecting device
US8589888B2 (en) Demand-driven analysis of pointers for software program analysis and debugging
JP5523872B2 (en) Program dynamic analysis method and apparatus
TWI556170B (en) Projecting native application programming interfaces of an operating system into other programming languages (2)
US10354069B2 (en) Automated reverse engineering
CN100440163C (en) Method and system for analysis processing of computer program
CN110213243B (en) Industrial communication protocol reverse analysis method based on dynamic taint analysis
FR2982385A1 (en) DISTRIBUTED COMPILATION OPERATION WITH MANAGEMENT OF A SIGNATURE OF INSTRUCTION
US8510604B2 (en) Static data race detection and analysis
KR20100106409A (en) Multi language software code analysis
KR101979329B1 (en) Method and apparatus for tracking security vulnerable input data of executable binaries thereof
CN112560043A (en) Vulnerability similarity measurement method based on context semantics
CN107526970A (en) The method of bug when being run based on binary detection of platform
CN114168747A (en) Knowledge base construction method and system based on cloud service
CN112948828A (en) Binary program malicious code detection method, terminal device and storage medium
CN113468525A (en) Similar vulnerability detection method and device for binary program
Zhao et al. Haepg: An automatic multi-hop exploitation generation framework
Sen et al. Executable analysis using abstract interpretation with circular linear progressions
CN114663753A (en) Production task online monitoring method and system
CN113778838A (en) Binary program dynamic taint analysis method and device
US20140137083A1 (en) Instrumenting computer program code by merging template and target code methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant