CN101923618A - Hidden Markov model based method for detecting assembler instruction level vulnerability - Google Patents

Publication number
CN101923618A
Authority
CN
China
Prior art keywords
leak
instruction
assembly instruction
assembly
fragment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010102570228A
Other languages
Chinese (zh)
Other versions
CN101923618B (en
Inventor
王崑声
李宁
胡昌振
白昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
No710 Institute Of China Aerospace Science And Technology Corp
Beijing Institute of Technology BIT
Original Assignee
No710 Institute Of China Aerospace Science And Technology Corp
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by No710 Institute Of China Aerospace Science And Technology Corp, Beijing Institute of Technology BIT filed Critical No710 Institute Of China Aerospace Science And Technology Corp
Priority to CN201010257022.8A priority Critical patent/CN101923618B/en
Publication of CN101923618A publication Critical patent/CN101923618A/en
Application granted granted Critical
Publication of CN101923618B publication Critical patent/CN101923618B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention relates to a hidden Markov model (HMM) based method for detecting vulnerabilities at the assembly instruction level, and belongs to the technical field of information security. The method comprises the following steps: (1) constructing a vulnerability instruction library (VIL); (2) for each vulnerability in the library constructed in step (1), selecting a plurality of executable programs that contain the vulnerability as its training data; (3) obtaining the assembly instruction fragments of the training data of each vulnerability in the library; (4) obtaining the numeric code sequences of the training data; (5) obtaining, in turn, the parameters $\lambda_r = \{A_r, B_r, \pi_r\}$ of the hidden Markov model corresponding to each vulnerability in the library; and (6) identifying vulnerabilities in the executable program under test. Compared with the prior art, the method is the first to model context-dependent assembly instructions and to recognize vulnerability features with an HMM, which improves detection efficiency and reduces the false positive rate and the false negative rate.

Description

A hidden Markov model based method for detecting assembly instruction level vulnerabilities
Technical field
The present invention relates to a method for detecting vulnerabilities at the assembly instruction level, in particular a hidden Markov model based method, and belongs to the field of information security technology.
Background art
With the rapid development of computer technology, human society is becoming increasingly information-based, and politics, the economy, the military, culture, and other fields depend more and more on computer information systems. As a result, the security of computer systems is receiving growing attention. Large software systems, however, are written jointly by many programmers: the software is divided into modules, each module is written separately, and the modules are then integrated, tested, patched, and released. Security vulnerabilities in software are therefore almost inevitable. A software vulnerability is a defect in data access, behavioral logic, or similar aspects that is introduced during software design and implementation. Attackers often exploit such vulnerabilities to make a program violate a security policy. In most cases the source code of a software system is unavailable, so the executable program must be analyzed instead. This analysis is very difficult, because typical context dependencies are hard to find in an executable. For these reasons, research on assembly instruction level vulnerability detection is receiving more and more attention.
To detect vulnerabilities, an analysis method needs high coverage and low false positive and false negative rates. Automated dynamic detection methods usually execute the program in a controlled environment (typically a virtual machine) and monitor the executed assembly instructions to analyze the behavior of the executable. The problem with dynamic analysis is that it must run the executable and must execute as many assembly instructions as possible, which incurs substantial system overhead. In particular, when the software enters a dormant state, a dynamic method has to wait for the next execution, which means longer analysis time and more overhead. Because they are limited by the execution traversal strategy, dynamic analysis methods currently work well only on malware, which tends to be small. For software containing a very large number of assembly instructions, a limited instruction execution strategy can neither analyze the behavior of the software effectively nor locate potential security vulnerabilities.
Static detection methods analyze the source code or the assembly instructions of an executable program; they can reduce system overhead while improving analysis coverage. Such methods can find more security vulnerabilities than dynamic analysis. In addition, static analysis can cover all possible assembly instructions.
Csaba Nagy et al. proposed a static analysis method (referred to as "Method 1" in the embodiments) that locates vulnerabilities by identifying and tracking the source code related to user input. The method scans the source code and does not need to run the executable or execute assembly instructions. While scanning, it uses taint marking to locate security vulnerabilities. When the source code is unavailable, the method cannot track the software's data flow; in other words, it is limited to open-source software systems.
Pattabiraman et al. developed a security vulnerability detection system (referred to as "Method 2" in the embodiments). The system analyzes the source code fragments related to critical variables. It uses control flow analysis to decide which fragments need to be analyzed and then constructs detection expressions for those fragments. Because this is also a source code analysis method, it cannot analyze executable programs.
Hassen [surname rendered as an image in the source: Figure BSA00000234795700021] developed an analysis tool suite that combines multiple static analysis methods and tools (referred to as "Method 3" in the embodiments). The suite uses static analysis to analyze the executable program or assembly instructions of the software. It generates an abstract state graph and uses it as an intermediate structure through which the tools share their output and produce analysis results. Although this approach can reduce the false positive rate, it can only analyze malware with a small amount of code and cannot handle larger software.
In the prior art, few static or dynamic analysis methods can effectively analyze executable programs or assembly instructions. Existing assembly instruction analyzers must scan and analyze every assembly instruction, which causes very high system overhead and a large number of false alarms. Moreover, because it is hard to locate security vulnerabilities among a large number of assembly instructions with no obvious correlation, existing methods detect only about 50% of the security vulnerabilities at the assembly instruction level, which does not meet practical needs.
The present invention uses an important piece of prior art: the hidden Markov model (HMM).
A hidden Markov model is a statistical method that effectively describes data sequences whose elements are correlated over discrete time steps.
The theoretical foundation of HMMs was established by Baum et al. around 1970. Baker of CMU and Jelinek et al. of IBM subsequently applied the model to speech recognition. In the mid-1980s, accessible introductions by Rabiner et al. of Bell Laboratories made HMMs widely understood by speech processing researchers around the world, and the model became a recognized research focus.
The hidden Markov model is built on the Markov chain (a kind of finite state machine). Two concepts are introduced first: the state set and the observation sequence. The state set is the set of all states of the model, $(S_1, \ldots, S_i, \ldots, S_N)$, where $N$ is the total number of states, a positive integer. An observation sequence is a data sequence with contextual relationships, written $v_1, \ldots, v_t, \ldots, v_T$, where $v_t = c_m$ means that the element at time $t$ takes the value $c_m$. Here $1 \le m \le M$, and $M$, a positive integer, is the total number of elements (values) that each state can output.
Because practical problems are more complex than a Markov chain model can describe, the elements of the observation sequence do not correspond one-to-one with the states of the Markov chain; instead they are linked to the states through a set of probability distributions. From the observer's point of view, only the observed values are visible, unlike in a Markov chain model, where observations and states correspond one-to-one. The states therefore cannot be seen directly; their existence and characteristics are perceived through a stochastic process. This is the so-called "hidden" Markov model. It is a doubly stochastic process:
The first stochastic process is the Markov chain. It is the underlying process and describes the transitions between states through the transition probability $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$, the probability of moving from state $i$ at time $t$ to state $j$. In $q_t = S_i$, $q_t$ is the state variable at time $t$ and its value is $S_i$; $i$ and $j$ are positive integers with $1 \le i, j \le N$.
The second stochastic process describes the statistical correspondence between the states and the observation sequence. It is given by the state output probability $b_i(c_m) = P(v_t = c_m \mid q_t = S_i)$, the probability that state $i$ outputs the element $v_t = c_m$ of the observation sequence at time $t$.
In addition, describing a stochastic process with an HMM requires specifying how the states start, so each state also has an initial probability $\pi_i$. The parameter set of an HMM is introduced below using a fully connected HMM with 3 states as an example (shown in Fig. 1):
For this HMM, the state-transition matrix $A$ consists of the transition probabilities between all states. Each row gives the transition probabilities from one state to all states, so the elements of each row sum to 1:
$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$
The output probability matrix $B$ consists of the output probabilities of all states. Because each state has $M$ output probabilities, the matrix has size $3 \times M$, and the elements of each row sum to 1:
$$B = \begin{bmatrix} b_1(c_1) & \cdots & b_1(c_M) \\ b_2(c_1) & \cdots & b_2(c_M) \\ b_3(c_1) & \cdots & b_3(c_M) \end{bmatrix}$$
Usually, the model is assumed to be equally likely to start from any state, so the initial probability vector $\pi$ is set to:
$$\pi = \{\pi_1 = 1/3,\ \pi_2 = 1/3,\ \pi_3 = 1/3\}$$
An HMM can thus be described by the parameter set $\lambda = \{A, B, \pi\}$. More intuitively, an HMM has two parts: a Markov chain, described by $\{A, \pi\}$, whose output is a state sequence; and a stochastic process, described by the output probability matrix $\{B\}$, whose output is the sequence of observed values. An observation sequence $v_1, \ldots, v_t, \ldots, v_T$ of length $T$ therefore implicitly corresponds to a state sequence $q_1, \ldots, q_t, \ldots, q_T$, where $q_t \in (S_1, \ldots, S_i, \ldots, S_N)$. Fig. 2 shows the composition of an HMM.
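As a minimal sketch of the parameter set $\lambda = \{A, B, \pi\}$ described above, the following Python snippet stores the three components and checks the row-sum constraints. The class name `HMM`, the method `is_valid`, and all numeric values are illustrative assumptions, not part of the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HMM:
    """Hypothetical container for the parameter set lambda = {A, B, pi}."""
    A: List[List[float]]   # N x N state-transition probabilities
    B: List[List[float]]   # N x M output probabilities
    pi: List[float]        # initial state probabilities

    def is_valid(self, tol: float = 1e-9) -> bool:
        # Every row of A and B, and the vector pi, must sum to 1.
        rows = self.A + self.B + [self.pi]
        return all(abs(sum(row) - 1.0) < tol for row in rows)


# Fully connected 3-state model with M = 2 output symbols (made-up values).
model = HMM(
    A=[[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]],
    B=[[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]],
    pi=[1 / 3, 1 / 3, 1 / 3],
)
```

Here `model.is_valid()` confirms that the matrices satisfy the constraints stated in the text before they are used for training or scoring.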
The process of classifying with HMMs is as follows:
1. Training the HMM parameters:
For a group of observation sequences $O_r = \{o_1, \ldots, o_k, \ldots, o_K\}$ that belong to the same class $r$, where each $o_k = \{v_1, \ldots, v_t, \ldots, v_T\}$, the Baum-Welch algorithm iteratively adjusts the HMM parameters to obtain the optimal parameters $\lambda_r = \{A_r, B_r, \pi_r\}$ that describe the observation data of this class. Here $1 \le r \le R$, where the positive integer $R$ is the total number of classes under study, and $1 \le k \le K$, where the positive integer $K$ is the total number of observation sequences in class $r$. Training is supervised: one HMM is trained on the observation data of each class.
2. Recognizing an observation sequence under test:
Given an observation sequence under test $o_k = \{v_1, \ldots, v_t, \ldots, v_T\}$ and a model parameter set $\lambda_r = \{A_r, B_r, \pi_r\}$, computing the probability $P(o_k \mid \lambda_r)$ that the model generates this sequence is a way to evaluate how well the given model matches the sequence.
To classify an observation sequence under test, it is matched against the HMM of every class ($R$ classes in total) and assigned to the class whose model it matches best.
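The text does not say how $P(o_k \mid \lambda_r)$ is computed. A common choice, assumed here rather than taken from the patent, is the forward algorithm. Writing $\alpha_t(i)$ (a symbol introduced only for this sketch) for the probability of observing $v_1, \ldots, v_t$ and being in state $S_i$ at time $t$, the recursion in the notation above is:

```latex
\begin{aligned}
\alpha_1(i) &= \pi_i \, b_i(v_1), & 1 \le i \le N \\
\alpha_{t+1}(j) &= \Big[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \Big] \, b_j(v_{t+1}), & 1 \le t \le T-1,\ 1 \le j \le N \\
P(o_k \mid \lambda_r) &= \sum_{i=1}^{N} \alpha_T(i) \\
r^{*} &= \arg\max_{1 \le r \le R} P(o_k \mid \lambda_r)
\end{aligned}
```

The last line expresses the classification rule stated above: the sequence is assigned to the class whose model gives it the highest probability.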
In machine language, assembly instructions are strongly related logically; a piece of code is in essence a data sequence with internal correlations. The correlation considered here is that the probability of the current instruction depends only on the preceding instruction, not on earlier instructions or on later ones, which is exactly the Markov property.
Summary of the invention
The object of the invention is to address the deficiencies of the prior art described above by proposing a hidden Markov model based method for detecting assembly instruction level vulnerabilities.
The object of the invention is achieved through the following technical solution.
A hidden Markov model based method for detecting assembly instruction level vulnerabilities comprises the following steps:
Step 1: Construct the vulnerability instruction library.
The vulnerability instruction library stores known vulnerabilities and their features. Each vulnerability has 3 attributes: a vulnerability name, an assembly instruction frame, and numeric codes. Each assembly instruction frame contains one or more assembly instructions. Each numeric code has two parts: one is the unique code of the vulnerability, and the other is the sequence number of the assembly instruction within the assembly instruction frame of that vulnerability. For each vulnerability, the library entry is constructed as follows:
Step 1.1: Use a static disassembly analysis tool to disassemble one or more executable programs that contain a given software vulnerability, and obtain the function structure graphs of all functions in the executable. Each function structure graph is called an assembly instruction fragment.
Each function structure graph corresponds to one function and includes, but is not limited to, the following information: the function name, the function parameters, the return value type, the global and local variables used in the function, the registers used in the function, the system calls used in the function, all assembly instructions contained in the function, the control flow structure of the function, and high-level-language-like comments on key instructions.
The control flow structure of a function is divided at jump instructions.
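The division at jump instructions can be illustrated with a small sketch. It assumes instructions are plain text lines and uses a hypothetical, incomplete set of x86 jump mnemonics; it closes a block after each jump and does not split at jump targets, which a real disassembler would also do.

```python
from typing import List

# Hypothetical subset of x86 jump mnemonics; a real tool would use a fuller list.
JUMP_MNEMONICS = {"jmp", "je", "jne", "jz", "jnz", "ja", "jb", "jg", "jl", "jge", "jle"}


def split_at_jumps(instructions: List[str]) -> List[List[str]]:
    """Split an instruction list into blocks, closing a block after each jump."""
    blocks: List[List[str]] = []
    current: List[str] = []
    for ins in instructions:
        current.append(ins)
        parts = ins.split()
        mnemonic = parts[0].lower() if parts else ""
        if mnemonic in JUMP_MNEMONICS:
            blocks.append(current)
            current = []
    if current:
        # Trailing instructions after the last jump form the final block.
        blocks.append(current)
    return blocks
```

For example, a function body of five instructions with one conditional jump in the middle is split into two blocks, the first ending with the jump.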
Step 1.2: From the assembly instruction fragments obtained in step 1.1, extract the fragments that produce the vulnerability.
(1) The assembly instruction fragment of a stack overflow vulnerability contains the following assembly instruction sequence: 1. register ESP moves toward lower addresses to open up new stack space; 2. register ECX records the length of the content to be copied; 3. register ESI holds a pointer to the start address of the content to be copied; 4. register EDI holds a pointer to the start address of the stack space used as the copy destination; 5. the content pointed to by ESI is copied, in double-word units, into the stack space pointed to by EDI; 6. the remainder of the content is copied, in byte units, into the remaining stack space; 7. ESP points to the nominal return address.
(2) The assembly instruction fragment of a heap overflow vulnerability contains the following assembly instruction sequence: 1. HeapCreate is called to create heap space P, which comprises 2 heap blocks, heap block A and heap block B; 2. HeapAlloc is called to dynamically allocate space for heap block A; 3. content larger than heap block A is copied into heap block A; 4. HeapAlloc is called to dynamically allocate space for heap block B, at which point the block header pointer of heap block B has been overwritten by the content that overflowed heap block A.
(3) The assembly instruction fragment of a format string output vulnerability contains the following assembly instruction sequence: 1. no push offset instruction appears within the three instructions before the call instruction that calls the formatted output function; 2. the number of "%" characters in the comment of the push offset instruction does not match the number of consecutive push instructions before the push offset instruction; 3. "%n" appears in the comment of the push offset instruction.
Step 1.3: Use a dynamic disassembly analysis tool to extract, in order, the assembly instructions that produce the vulnerability from the fragments obtained in step 1.2. The ordered set of these instructions is called the assembly instruction frame. Specifically:
Step 1.3.1: Run the executable program of step 1.1 under normal conditions. Use the dynamic disassembly analysis tool to trace, in turn, the vulnerability-producing fragments extracted in step 1.2, and record the assembly instructions executed in each fragment during this run. The ordered set of these executed instructions is called assembly instruction set 1.
Step 1.3.2: Trigger the vulnerability and run the executable program of step 1.1 again under the dynamic disassembly analysis tool. Trace the vulnerability-producing fragments extracted in step 1.2 and record, in turn, the assembly instructions executed in each fragment during this run. The ordered set of these executed instructions is called assembly instruction set 2.
Step 1.3.3: Compare the assembly instructions in assembly instruction set 1 and assembly instruction set 2 one by one, and handle the following three cases:
Case 1: An assembly instruction that appears in both assembly instruction set 1 and assembly instruction set 2 is considered not to trigger the vulnerability and is deleted.
Case 2: An assembly instruction that appears in assembly instruction set 1 but not in assembly instruction set 2 is discarded. Such instructions are executed during normal operation but not when the vulnerability is triggered, so they are considered not to trigger the vulnerability.
Case 3: The assembly instructions that appear in assembly instruction set 2 but not in assembly instruction set 1 form an ordered set called the assembly instruction frame. The frame stores the assembly instructions related to triggering the vulnerability.
The steps above yield the assembly instruction frame that produces the vulnerability.
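The three cases of step 1.3.3 amount to an ordered set difference. The sketch below assumes that each trace is a list of instruction strings and that two instructions are "the same" when their text is identical; the patent does not specify the comparison, and the function name is hypothetical.

```python
from typing import List


def extract_instruction_frame(normal_trace: List[str], triggered_trace: List[str]) -> List[str]:
    """Keep, in execution order, instructions seen only when the vulnerability is triggered."""
    seen_in_normal = set(normal_trace)   # assembly instruction set 1
    frame: List[str] = []
    for ins in triggered_trace:          # assembly instruction set 2
        # Cases 1 and 2 are dropped implicitly; case 3 is kept once, in order.
        if ins not in seen_in_normal and ins not in frame:
            frame.append(ins)
    return frame
```

Instructions that occur only in the normal run (case 2) never enter the loop, and instructions common to both runs (case 1) are filtered by the membership test.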
Dynamic disassembly analysis tools include, but are not limited to, Paimei and BinNavi from Sabre.
Step 1.4: Construct the vulnerability instruction library.
Assign a numeric code to every assembly instruction in the assembly instruction frame obtained in step 1.3, following the order in which the instructions were extracted.
Repeating steps 1.1 to 1.4 yields a vulnerability instruction library that contains multiple vulnerabilities.
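One possible in-memory layout for the library built in step 1 is sketched below. The class names, the tuple representation of the two-part numeric code, and the rule for choosing the next vulnerability code are assumptions made for illustration; the patent fixes only the three attributes and the two parts of each code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# (vulnerability code, position in frame); this tuple layout is an assumption.
NumericCode = Tuple[int, int]


@dataclass
class VulnerabilityEntry:
    name: str
    vuln_code: int
    frame: List[str]                                   # ordered assembly instruction frame
    codes: List[NumericCode] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Number the instructions in extraction order, starting at 1.
        self.codes = [(self.vuln_code, n) for n in range(1, len(self.frame) + 1)]


class VulnerabilityInstructionLibrary:
    def __init__(self) -> None:
        self.entries: Dict[int, VulnerabilityEntry] = {}

    def add(self, name: str, frame: List[str]) -> VulnerabilityEntry:
        vuln_code = len(self.entries) + 1              # next unused vulnerability code
        entry = VulnerabilityEntry(name, vuln_code, list(frame))
        self.entries[vuln_code] = entry
        return entry
```

Calling `add` once per vulnerability mirrors the repetition of steps 1.1 to 1.4.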
Step 2: For every vulnerability in the library constructed in step 1, select multiple executable programs that contain the vulnerability as its training data. Each vulnerability in the library is denoted $V_i$ and its training data $T_i$, where $i$ is the unique code of the vulnerability in the library.
Step 3: Obtain the assembly instruction fragments of the training data of each vulnerability in the library.
Use a static disassembly analysis tool to disassemble the training data obtained in step 2 for each vulnerability, and obtain the function structure graphs of all functions in the executables. Each function structure graph is called an assembly instruction fragment.
Preferably, the static disassembly analysis tool in steps 1.1 and 3 is IDA Pro.
Step 4: Obtain the numeric code sequences of the training data.
Apply the structuring procedure, in turn, to the assembly instructions in the fragments obtained in step 3 for the training data of each vulnerability, to obtain the numeric code sequences of those fragments.
The structuring procedure comprises the following steps:
Step 4.1: Remove the digits from each assembly instruction in the vulnerability instruction library, in turn, to form new character strings. The instructions in the library are denoted $P_1$ to $P_{sum}$, where the positive integer $sum$ is the number of assembly instructions in the library; the new strings are denoted $Q_1$ to $Q_{sum}$.
Step 4.2: Counter A counts the assembly instructions contained in the vulnerability instruction library; its value is denoted $m$ ($m \ge 1$, a positive integer). Set the initial value of $m$ to 1.
Step 4.3: Check whether the value $m$ of counter A is not greater than $N$, the number of assembly instruction fragments in the training data of the $m$-th vulnerability ($N$ is a positive integer). If so, go to step 4.4; otherwise, stop.
Step 4.4: Remove the digits from each assembly instruction in the assembly instruction fragment of the training data of the $m$-th vulnerability to form new character strings. The instructions are denoted $P'_1$ to $P'_{sum'}$, where the positive integer $sum'$ is the number of assembly instructions in that fragment; the new strings are denoted $Q'_1$ to $Q'_{sum'}$.
Step 4.5: Counter B counts the assembly instructions in the assembly instruction fragment of the training data of the $m$-th vulnerability; its value is denoted $i$ (a positive integer). Set the initial value of $i$ to 1.
Step 4.6: Check whether the value $i$ of counter B is not greater than $sum'$, the number of assembly instructions in the assembly instruction fragment of the training data of the $m$-th vulnerability. If so, go to step 4.7; otherwise, go to step 4.9.
Step 4.7: Compare the digit-free string $Q'_i$, formed from the $i$-th assembly instruction in the fragment of the training data of the $m$-th vulnerability, one by one with the digit-free strings $Q_1$ to $Q_{sum}$ obtained in step 4.1 from the instructions in the vulnerability instruction library, and use character matching to compute the similarity between $Q'_i$ and each of $Q_1$ to $Q_{sum}$, denoted $S_1$ to $S_{sum}$. If all of $S_1$ to $S_{sum}$ are below a preset threshold, the numeric code of $Q'_i$ is 0. Otherwise, if $S_1$ to $S_{sum}$ have a unique maximum $S_{max}$, the numeric code of $Q'_i$ is the library numeric code of the assembly instruction corresponding to $S_{max}$. Otherwise, if more than one similarity attains the maximum $S_{max}$, any one of them may be picked at random, and the numeric code of $Q'_i$ is the library numeric code of the assembly instruction corresponding to the picked maximum.
Step 4.8: Increase the value $i$ of counter B by 1 and repeat steps 4.6 to 4.8.
Step 4.9: Increase the value $m$ of counter A by 1 and repeat steps 4.3 to 4.9.
One complete pass of the structuring procedure yields the numeric code sequences $O_r = \{o_1, \ldots, o_k, \ldots, o_N\}$ of one vulnerability in the library, also called the group of observation sequences of that vulnerability. Here $1 \le k \le N$ with $k$ a positive integer, and $N$ is the number of assembly instruction fragments in the training data of the vulnerability; each fragment produces one observation sequence of the vulnerability. Further, $1 \le r \le N'$, where $N'$ is the number of vulnerabilities in the library; $r$ and $N'$ are positive integers.
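The core of the structuring procedure (strip digits, compare against the library, apply a threshold) can be sketched as follows. The patent says only that similarity is computed by "character matching"; the use of `difflib.SequenceMatcher`, the threshold value 0.8, the flat integer codes, and the function names are all assumptions for illustration. Ties are resolved by taking the first maximum, which is one admissible way of picking among equal maxima.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List


def strip_digits(instruction: str) -> str:
    """Remove every digit from an instruction, as in steps 4.1 and 4.4."""
    return re.sub(r"\d", "", instruction)


def encode_fragment(fragment: List[str], library: Dict[int, str], threshold: float = 0.8) -> List[int]:
    """Map each instruction to the code of its best library match, or 0 if none is close enough."""
    stripped_library = {code: strip_digits(ins) for code, ins in library.items()}
    sequence: List[int] = []
    for ins in fragment:
        q = strip_digits(ins)
        best_code, best_score = 0, 0.0
        for code, ref in stripped_library.items():
            score = SequenceMatcher(None, q, ref).ratio()
            if score > best_score:        # first maximum wins on ties
                best_code, best_score = code, score
        # Code 0 is reserved for "all similarities below the threshold".
        sequence.append(best_code if best_score >= threshold else 0)
    return sequence
```

Each fragment of the training data passed through `encode_fragment` gives one observation sequence $o_k$.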
Step 5: Obtain, in turn, the parameters $\lambda_r = \{A_r, B_r, \pi_r\}$ of the HMM corresponding to each vulnerability in the library.
Train one HMM on the training data of each vulnerability in the library to obtain the parameters $\lambda_r = \{A_r, B_r, \pi_r\}$ of the HMM corresponding to that vulnerability.
For the group of observation sequences $O_r = \{o_1, \ldots, o_k, \ldots, o_N\}$ of each vulnerability in the library, the Baum-Welch algorithm iteratively adjusts the HMM parameters to obtain the optimal parameters $\lambda_r = \{A_r, B_r, \pi_r\}$ that describe this class of vulnerability, so that one HMM is trained for each vulnerability in the library.
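A single Baum-Welch re-estimation step over several observation sequences can be sketched in plain Python as below. This is a textbook, unscaled version suitable only for short sequences; it is not taken from the patent. Observations are assumed to be symbol indices from 0 to M-1 (so the "no match" code 0 is an ordinary symbol), and the function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def forward(A: Matrix, B: Matrix, pi: List[float], obs: List[int]) -> Matrix:
    """alpha[t][i] = P(v_1..v_t, q_t = S_i | lambda)."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    return alpha


def backward(A: Matrix, B: Matrix, obs: List[int]) -> Matrix:
    """beta[t][i] = P(v_{t+1}..v_T | q_t = S_i, lambda)."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
                   for i in range(N)]
    return beta


def baum_welch_step(A: Matrix, B: Matrix, pi: List[float],
                    sequences: List[List[int]]) -> Tuple[Matrix, Matrix, List[float]]:
    """One re-estimation of (A, B, pi) from all sequences of one class."""
    N, M = len(A), len(B[0])
    a_num = [[0.0] * N for _ in range(N)]
    a_den = [0.0] * N
    b_num = [[0.0] * M for _ in range(N)]
    b_den = [0.0] * N
    pi_acc = [0.0] * N
    for obs in sequences:
        alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
        p_obs = sum(alpha[-1])
        if p_obs == 0.0:
            continue  # sequence impossible under the current model; skip it
        T = len(obs)
        for t in range(T):
            # gamma[i] = P(q_t = S_i | obs, lambda)
            gamma = [alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
            for i in range(N):
                if t == 0:
                    pi_acc[i] += gamma[i]
                b_num[i][obs[t]] += gamma[i]
                b_den[i] += gamma[i]
                if t < T - 1:
                    a_den[i] += gamma[i]
                    for j in range(N):
                        # xi_t(i, j): expected transition from S_i to S_j at time t
                        a_num[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                        * beta[t + 1][j] / p_obs)
    used = sum(pi_acc)  # equals the number of sequences that contributed
    new_A = [[a_num[i][j] / a_den[i] if a_den[i] > 0 else A[i][j] for j in range(N)]
             for i in range(N)]
    new_B = [[b_num[i][k] / b_den[i] if b_den[i] > 0 else B[i][k] for k in range(M)]
             for i in range(N)]
    new_pi = [x / used for x in pi_acc] if used > 0 else list(pi)
    return new_A, new_B, new_pi
```

Repeating `baum_welch_step` until the parameters stop changing gives the iterative adjustment described in step 5; a production implementation would add scaling or log-space arithmetic for long sequences.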
Step 6: Use the HMMs obtained in step 5 for the vulnerabilities in the library to identify vulnerabilities in the executable program under test. Specifically:
Step 6.1: Use a static disassembly analysis tool to disassemble the executable program under test and obtain the function structure graphs of all functions in the executable. Each function structure graph is called an assembly instruction fragment.
Step 6.2: Process the assembly instructions in the assembly instruction fragment of the executable program under test, in turn, to obtain its numeric code sequence $o' = \{v_1, \ldots, v_t\}$, also called the observation sequence of the executable program under test. Each of $v_1, \ldots, v_t$ is the numeric code of one assembly instruction in the fragment; the positive integer $t$ is the number of assembly instructions in the fragment. The steps are:
Step 6.2.1: Counter C counts the assembly instructions in the assembly instruction fragment of the executable program under test; its value is denoted $j$, a positive integer. Set the initial value of $j$ to 1.
Step 6.2.2: Check whether the value $j$ of counter C is not greater than the number of assembly instructions in the assembly instruction fragment of the executable program under test. If so, go to step 6.2.3; otherwise, end the operation.
Step 6.2.3: successively each the bar assembly instruction in the assembly instruction fragment of executable program to be measured (is used C jExpression) numeral in is removed, and forms new character strings (with C ' jExpression) numeral with in each the bar assembly instruction in the leak instruction database that obtains with step 4.1 is removed, and forms new character strings Q 1~Q SumContrast one by one, and the method calculating character string C ' of employing character match jRespectively with character string Q 1~Q SumSimilarity (use S ' respectively 1~S ' SumExpression).If similarity S ' 1~S ' SumAll less than a certain pre-set threshold, character string C ' then jCorresponding numerical code is 0; Otherwise, if similarity S ' 1~S ' SumIn a maximal value S ' is only arranged Max, character string C ' then jCorresponding numerical code is maximal value S ' MaxCorresponding assembly instruction is in the numerical code of leak instruction database; Otherwise, if similarity S ' 1~S ' SumIn maximal value S ' MaxMore than one, can any one maximal value S ' of picked at random Max, character string C ' then jCorresponding numerical code is this maximal value S ' MaxCorresponding assembly instruction is in the numerical code of leak instruction database.
Step 6.2.4: the count value j value that will represent the counter C of the assembly instruction in the assembly instruction fragment of executable program to be measured increases 1, and repeated execution of steps 6.2.2 is to step 6.2.4.
Through the operation of step 6.2.1, can obtain the numerical code sequence o '={ v of executable program to be measured to step 6.2.4 1..., v t.
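The mapping of steps 6.2.1 to 6.2.4 can be sketched as follows. The text only says "character matching" and does not name a similarity measure, so the sketch assumes the ratio of Python's difflib.SequenceMatcher and a threshold of 0.8. The database entries, the threshold value and the function names strip_digits, to_code and to_sequence are hypothetical. Numerical codes are written as integers, so 01001 appears as 1001.

```python
import random
import re
from difflib import SequenceMatcher

def strip_digits(instr):
    """Remove every decimal digit from an assembly instruction string."""
    return re.sub(r"\d", "", instr)

def to_code(instr, db, threshold=0.8, rng=random):
    """Return the numerical code of the most similar database instruction, or 0.

    db is a list of (numerical_code, assembly_instruction) pairs.
    """
    target = strip_digits(instr)
    scored = [(SequenceMatcher(None, target, strip_digits(text)).ratio(), code)
              for code, text in db]
    best = max(score for score, _ in scored)
    if best < threshold:
        return 0  # no database instruction is similar enough
    # One candidate if the maximum is unique, otherwise a random pick among ties.
    candidates = [code for score, code in scored if score == best]
    return rng.choice(candidates)

def to_sequence(fragment, db, threshold=0.8):
    """Map an assembly instruction fragment to its numerical code sequence."""
    return [to_code(instr, db, threshold) for instr in fragment]

# Hypothetical database entries, for illustration only.
EXAMPLE_DB = [
    (1001, "mov eax, [ebp+8]"),
    (1002, "call strcpy"),
    (2001, "call malloc"),
]
```

With this toy database, "mov eax, [ebp+12]" maps to code 1001 because the two instructions are identical once digits are removed, while an unrelated instruction such as "xor ebx, ebx" falls below the threshold and maps to 0.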
Step 6.3: using the numerical code sequence o' obtained in step 6.2 and the parameters λ_r = {A_r, B_r, π_r} of the HMM of each vulnerability obtained in step 5, compute the probability P(o'|λ_r) that the observation sequence o' = {v_1, ..., v_t} is generated by the model λ_r; this gives P(o'|λ_1) to P(o'|λ_N').
Step 6.4: take the maximum of P(o'|λ_1) to P(o'|λ_N'). If this maximum is greater than a preset threshold, the executable program under test contains the vulnerability represented by the HMM that attains the maximum.
The above steps detect the vulnerabilities present in the executable program under test.
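Steps 6.3 and 6.4 can be sketched with the forward algorithm. The sketch works with log-probabilities rather than raw probabilities to avoid numerical underflow on long sequences; this is a substitution made here, and it leaves the ranking of the models unchanged because the logarithm is monotonic. Observations are assumed to be symbol indices, and the names log_likelihood and classify are hypothetical.

```python
import math

def log_likelihood(obs, A, B, pi):
    """log P(obs | lambda) by the scaled forward algorithm; -inf if impossible."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        c = sum(alpha)
        if c == 0.0:
            return float("-inf")
        loglik += math.log(c)
        alpha = [a / c for a in alpha]  # rescale so the values stay in range
    return loglik

def classify(obs, models, log_threshold):
    """Return the best-scoring vulnerability name, or None if below the threshold.

    models maps a vulnerability name to its (A, B, pi) parameters.
    """
    scores = {name: log_likelihood(obs, *params) for name, params in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > log_threshold else None
```

A result of None corresponds to the case in which the maximum does not exceed the preset threshold, so no known vulnerability is reported.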
Beneficial effects
Compared with the prior art, the method proposed by the present invention has the following advantages:
(1) The method analyzes assembly instructions directly and can accurately identify a vulnerability within a small range of assembly instructions, which improves the efficiency of vulnerability detection while reducing both the false positive rate and the false negative rate;
(2) The vulnerability instruction database provides a mechanism for mapping unstructured assembly instructions to structured numerical code sequences, which can then be processed by classical pattern recognition models;
(3) For the first time, a hidden Markov model is applied to model context-dependent assembly instructions and to recognize vulnerability features.
Description of the drawings
Fig. 1 is a schematic diagram of a fully connected 3-state hidden Markov model of the prior art;
Fig. 2 is a schematic diagram of the components of a hidden Markov model of the prior art.
Embodiment
The technical solution of the present invention is described in detail below with reference to a specific embodiment.
In this embodiment, the method of the invention is used to detect vulnerabilities in 30 executable programs between 50 KB and 100 KB in size. The steps are as follows:
Step 1: construct the vulnerability instruction database from 30 executable programs of about 10 KB each; the database covers 3 vulnerabilities. Specifically:
Step 1.1: disassemble 10 executable programs that contain one kind of software vulnerability with a static disassembly analysis tool to obtain the function structure graphs of all functions in the executable programs; each function structure graph is called an assembly instruction fragment.
Each function graph corresponds to one function and includes, but is not limited to, the following information: the function name, the function parameters, the return value type, the global and local variables used in the function, the registers used in the function, the system calls used in the function, all assembly instructions contained in the function, the control flow structure of the function, and high-level-language-like annotations on key instructions.
The control flow structure of the function is partitioned on the basis of jump instructions.
Step 1.2: from the assembly instruction fragments obtained in step 1.1, extract the fragments that produce the following 3 kinds of vulnerability: (1) fragments producing a stack overflow vulnerability; (2) fragments producing a heap overflow vulnerability; (3) fragments producing a formatted-output (format string) vulnerability.
Step 1.3: using the PaiMei dynamic disassembly analysis tool, extract in order, from the vulnerability-producing assembly instruction fragments obtained in step 1.2, the assembly instructions that produce the vulnerability; the ordered set of these assembly instructions is called the assembly instruction frame. Specifically:
Step 1.3.1: run the executable program of step 1.1 under normal conditions and use the dynamic disassembly analysis tool to trace, one by one, the vulnerability-producing assembly instruction fragments extracted in step 1.2. Record the assembly instructions executed in each of these fragments during this run; the ordered set of the executed instructions is called assembly instruction set 1.
Step 1.3.2: trigger the vulnerability and run the executable program of step 1.1 again under the dynamic disassembly analysis tool, tracing the vulnerability-producing assembly instruction fragments extracted in step 1.2. Record, in order, the assembly instructions executed in each fragment during this run; the ordered set of the executed instructions is called assembly instruction set 2.
Step 1.3.3: compare the assembly instructions in assembly instruction set 1 and assembly instruction set 2 one by one, distinguishing the following three cases:
Case 1: assembly instructions that appear in both assembly instruction set 1 and assembly instruction set 2 are regarded as not triggering the vulnerability and are deleted.
Case 2: assembly instructions that appear in assembly instruction set 1 but not in assembly instruction set 2 are discarded. These instructions are executed when the program runs normally but not when the vulnerability is triggered, and are therefore regarded as not triggering the vulnerability.
Case 3: the ordered set of assembly instructions that appear in assembly instruction set 2 but not in assembly instruction set 1 is called the assembly instruction frame; it stores the assembly instructions related to triggering the vulnerability.
The above steps yield the assembly instruction frame that produces the vulnerability.
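The three cases of step 1.3.3 amount to an order-preserving difference between the two execution traces. The sketch below assumes each trace is a list of instruction strings as recorded by the dynamic analysis tool and that repeated instructions are kept only at their first occurrence; both points, and the name instruction_frame, are assumptions of this illustration.

```python
def instruction_frame(normal_trace, triggered_trace):
    """Instructions executed only when the vulnerability is triggered, in order.

    normal_trace corresponds to assembly instruction set 1,
    triggered_trace to assembly instruction set 2.
    """
    seen_normal = set(normal_trace)
    frame = []
    for instr in triggered_trace:
        # Cases 1 and 2 never reach the frame: anything executed in the normal
        # run is skipped, and instructions absent from set 2 are never visited.
        if instr not in seen_normal and instr not in frame:
            frame.append(instr)  # case 3
    return frame
```

For example, if the normal run executes "push ebp", "call check_len", "ret" and the triggered run executes "push ebp", "call strcpy", "ret", the frame contains only "call strcpy".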
Step 1.4: construct the vulnerability instruction database.
In the order in which the assembly instructions of the assembly instruction frame were extracted in step 1.3, assign a numerical code to each assembly instruction. A numerical code is a 5-digit decimal number: the two leftmost digits encode the vulnerability type, and the remaining three digits are the serial number of the instruction within the assembly instruction sequence of that vulnerability type.
By repeating steps 1.1 to 1.4, a vulnerability instruction database covering 3 vulnerabilities is constructed.
Through the above steps, one vulnerability instruction database was constructed, containing a total of 35 assembly instructions that represent these 3 kinds of vulnerability.
Table 1. Structure of the vulnerability instruction database
Columns: numerical code; assembly instruction
(The table body is available only as images BSA00000234795700161 and BSA00000234795700171.)
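The 5-digit numerical code of step 1.4 can be sketched as a two-digit vulnerability type followed by a three-digit serial number. Starting the serial numbers at 1 is an assumption of this sketch; the function names and the sample instructions are hypothetical.

```python
def numerical_code(vuln_type, serial):
    """Build the 5-digit code: 2 digits of vulnerability type, 3 digits of serial."""
    if not (0 <= vuln_type <= 99 and 0 <= serial <= 999):
        raise ValueError("type must fit in 2 digits and serial in 3 digits")
    return f"{vuln_type:02d}{serial:03d}"

def build_database(frames):
    """frames maps a vulnerability type code to its ordered assembly instruction frame.

    Returns a list of (numerical_code, assembly_instruction) pairs.
    """
    db = []
    for vuln_type, frame in sorted(frames.items()):
        # Serial numbers follow the extraction order of the frame (assumed 1-based).
        for serial, instr in enumerate(frame, start=1):
            db.append((numerical_code(vuln_type, serial), instr))
    return db
```

For instance, numerical_code(1, 7) yields "01007", the seventh instruction of the vulnerability whose type code is 01.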
Step 2: for each of the 3 vulnerability types in the vulnerability instruction database constructed in step 1, choose 12 executable programs containing that vulnerability as its training data. Each vulnerability in the database is denoted V_i and its training data T_i, where i is the unique code of the vulnerability in the vulnerability instruction database. To make fuller use of the data and obtain an unbiased estimate of the experimental results, cross-validation is used: the executable programs are split at random into two groups, with 5 programs used for training and the remaining 7 for testing. System performance is assessed over 100 such random splits, and the average is taken as the final recognition rate.
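The repeated random split can be sketched as below: 12 programs per vulnerability type, 5 for training and 7 for testing, repeated 100 times and averaged. The evaluate callback, which would train the HMM on one group and return the recognition rate on the other, is left abstract, and all names are hypothetical.

```python
import random

def random_splits(programs, n_train=5, n_rounds=100, seed=0):
    """Yield (train, test) partitions of the program list, one per round."""
    rng = random.Random(seed)
    for _ in range(n_rounds):
        shuffled = programs[:]
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]

def mean_recognition_rate(programs, evaluate, **split_options):
    """Average evaluate(train, test) over all random splits."""
    rates = [evaluate(train, test)
             for train, test in random_splits(programs, **split_options)]
    return sum(rates) / len(rates)
```

A fixed seed is used here only to make the splits reproducible; the text does not say how the random combinations were drawn.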
Step 3: obtain the assembly instruction fragments of the training data of each vulnerability in the vulnerability instruction database.
Disassemble the training data of each vulnerability obtained in step 2 with the IDA Pro static disassembly analysis tool to obtain the function structure graphs of all functions in the executable programs; each function structure graph is called an assembly instruction fragment.
Step 4: obtain the numerical code sequences of the training data.
The assembly instructions in the assembly instruction fragments of the training data of each vulnerability, obtained in step 3, are processed in turn by the structuring method, giving the numerical code sequence of each assembly instruction fragment of the training data of each vulnerability in the vulnerability instruction database.
The specific steps of the structuring method are:
Step 4.1: remove the digits from each assembly instruction in the vulnerability instruction database to form new strings. The assembly instructions in the database are denoted P_1 to P_sum, where sum is a positive integer giving the number of assembly instructions in the database; the resulting strings are denoted Q_1 to Q_sum.
Step 4.2: the number of assembly instructions contained in the vulnerability instruction database is counted with counter A, whose count value is denoted m (m ≥ 1 and m is a positive integer); set the initial value of m to 1.
Step 4.3: check whether the count value m of counter A is not greater than the number N of assembly instruction fragments of the training data of the m-th vulnerability, where N is a positive integer. If so, execute step 4.4; otherwise, end the operation.
Step 4.4: remove the digits from each assembly instruction in the assembly instruction fragment of the training data of the m-th vulnerability to form new strings. The assembly instructions are denoted P'_1 to P'_sum', where sum' is a positive integer giving the number of assembly instructions in the fragment; the resulting strings are denoted Q'_1 to Q'_sum'.
Step 4.5: count the assembly instructions in the assembly instruction fragment of the training data of the m-th vulnerability with counter B, whose count value is denoted i (i is a positive integer); set the initial value of i to 1.
Step 4.6: check whether the count value i of counter B is not greater than the number sum' of assembly instructions in the assembly instruction fragment of the training data of the m-th vulnerability. If so, execute step 4.7; otherwise, execute step 4.9.
Step 4.7: compare the string Q'_i, formed by removing the digits from the i-th assembly instruction of the fragment, one by one with the strings Q_1 to Q_sum obtained in step 4.1, and compute by character matching the similarities S_1 to S_sum between Q'_i and Q_1 to Q_sum. If all of S_1 to S_sum are below a preset threshold, the numerical code of Q'_i is 0. Otherwise, if there is a single maximum S_max among S_1 to S_sum, the numerical code of Q'_i is the numerical code, in the vulnerability instruction database, of the assembly instruction corresponding to S_max. Otherwise, if more than one similarity attains the maximum S_max, one of them is chosen at random, and the numerical code of Q'_i is the numerical code of the corresponding assembly instruction in the vulnerability instruction database.
Step 4.8: increase the count value i of counter B by 1 and repeat steps 4.6 to 4.8.
Step 4.9: increase the count value m of counter A by 1 and repeat steps 4.3 to 4.9.
One complete run of the structuring method yields the numerical code sequences O_r = {o_1, ..., o_k, ..., o_N} belonging to one vulnerability in the vulnerability instruction database, also called the set of observation sequences of that vulnerability. Here 1 ≤ k ≤ N with k a positive integer, N is the number of assembly instruction fragments in the training data of the vulnerability, and each of these fragments produces one observation sequence of the vulnerability. Further, 1 ≤ r ≤ N', where N' is the number of vulnerabilities in the vulnerability instruction database, and r and N' are positive integers.
Step 5: obtain in turn the parameters λ_r = {A_r, B_r, π_r} of the HMM corresponding to each vulnerability in the vulnerability instruction database.
An HMM is trained on the training data of each vulnerability in the vulnerability instruction database, giving the parameters λ_r = {A_r, B_r, π_r} of the HMM corresponding to that vulnerability.
For the set of observation sequences O_r = {o_1, ..., o_k, ..., o_N} of each vulnerability in the vulnerability instruction database, the Baum-Welch algorithm is used to iteratively adjust the parameters of the HMM until the optimal parameters λ_r = {A_r, B_r, π_r} describing that class of vulnerability are obtained. In this way one HMM is trained for each vulnerability in the vulnerability instruction database.
In this embodiment, one HMM is built for each vulnerability type, so 3 sets of HMM parameters are built in total. During initialization, the HMM states are given a uniform distribution. Every element in each row of B_r is set to 1/k, where k = 35 is the number of assembly instructions in the vulnerability instruction database. The transition probabilities between states are likewise all equal. Consequently, when training with the Baum-Welch algorithm, the total number of parameters to be estimated in a 3-state HMM is 114: 9 state transition probabilities plus 3 × 35 output probabilities.
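The uniform initialization and the parameter count can be checked with a short sketch. The count follows the convention of the paragraph above, which counts transition and output probabilities only; the function names are hypothetical.

```python
def uniform_hmm(n_states, n_symbols):
    """Uniformly initialized (A, B, pi) for an HMM."""
    A = [[1.0 / n_states] * n_states for _ in range(n_states)]
    B = [[1.0 / n_symbols] * n_symbols for _ in range(n_states)]
    pi = [1.0 / n_states] * n_states
    return A, B, pi

def parameter_count(n_states, n_symbols):
    """Transition probabilities plus output probabilities."""
    return n_states * n_states + n_states * n_symbols
```

For 3 states and 35 symbols this gives 9 + 105 = 114, matching the figure above. Under the same convention, a 7-state model over 35 symbols would have 49 + 245 = 294 parameters.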
During testing, 7 test programs are used that either contain no vulnerability or do not contain any of these 3 classes of vulnerability, so none of the resulting test scores matches any of the 3 classes. The mean of these scores is taken as the threshold for deciding whether a program belongs to a vulnerability type: in subsequent tests, any score below this value is regarded as not matching a known vulnerability type.
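The threshold rule can be written down in a few lines. The scores stand for the model outputs on the 7 programs without the known vulnerabilities; treating a score equal to the threshold as a match is an assumption of this sketch, and the names are hypothetical.

```python
def rejection_threshold(negative_scores):
    """Mean score of programs known not to contain the modeled vulnerabilities."""
    if not negative_scores:
        raise ValueError("at least one score is required")
    return sum(negative_scores) / len(negative_scores)

def matches_known_type(score, threshold):
    """Scores below the threshold are treated as matching no known type."""
    return score >= threshold
```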
Step 6: use the HMM of each vulnerability in the vulnerability instruction database, obtained in step 5, to identify vulnerabilities in the executable program under test. Specifically:
Step 6.1: disassemble the executable program under test with a static disassembly analysis tool to obtain the function structure graphs of all functions in the program; each function structure graph is called an assembly instruction fragment.
Step 6.2: process the assembly instructions in the assembly instruction fragment of the executable program under test one by one to obtain its numerical code sequence o' = {v_1, ..., v_t}, also called the observation sequence of the program under test. Here v_1, ..., v_t are the numerical codes of the individual assembly instructions in the fragment, and t is a positive integer giving the number of assembly instructions in the fragment. The specific steps are:
Step 6.2.1: count the assembly instructions in the assembly instruction fragment of the executable program under test with counter C, whose count value is denoted j (j is a positive integer); set the initial value of j to 1.
Step 6.2.2: check whether the count value j of counter C is not greater than the number of assembly instructions in the assembly instruction fragment of the executable program under test. If so, execute step 6.2.3; otherwise, end the operation.
Step 6.2.3: remove the digits from the j-th assembly instruction C_j of the fragment to form a new string C'_j. Compare C'_j one by one with the strings Q_1 to Q_sum, which were obtained in step 4.1 by removing the digits from each assembly instruction in the vulnerability instruction database, and compute by character matching the similarities S'_1 to S'_sum between C'_j and Q_1 to Q_sum. If all of S'_1 to S'_sum are below a preset threshold, the numerical code of C'_j is 0. Otherwise, if there is a single maximum S'_max among S'_1 to S'_sum, the numerical code of C'_j is the numerical code, in the vulnerability instruction database, of the assembly instruction corresponding to S'_max. Otherwise, if more than one similarity attains the maximum S'_max, one of them is chosen at random, and the numerical code of C'_j is the numerical code of the corresponding assembly instruction in the vulnerability instruction database.
Step 6.2.4: increase the count value j of counter C by 1 and repeat steps 6.2.2 to 6.2.4.
Steps 6.2.1 to 6.2.4 yield the numerical code sequence o' = {v_1, ..., v_t} of the executable program under test.
Step 6.3: using the numerical code sequence o' obtained in step 6.2 and the parameters λ_r = {A_r, B_r, π_r} of the HMM of each vulnerability obtained in step 5, compute the probability P(o'|λ_r) that the observation sequence o' = {v_1, ..., v_t} is generated by the model λ_r; this gives P(o'|λ_1) to P(o'|λ_N').
Step 6.4: take the maximum of P(o'|λ_1) to P(o'|λ_N'). If this maximum is greater than a preset threshold, the executable program under test contains the vulnerability represented by the HMM that attains the maximum.
The above steps detect the vulnerabilities present in the executable program under test.
Table 2 shows that the recognition rate increases as the number of HMM states increases from 1 to 7. When the number of states exceeds 7, the recognition rate no longer changes significantly. This means that with few states the HMM does not have enough states to model the context-dependent numerical code sequences. Although a larger number of states can deliver a good recognition rate, its computational cost is also high, because more HMM parameters must be computed. We therefore use a 7-state HMM in the experiments. To illustrate the effect of the method of the invention, the same test programs were also analyzed with method 1, method 2 and method 3; because methods 1 and 2 are based on source code analysis, the source code of the 36 test programs was used when testing those two methods.
Table 2. Recognition rates obtained with different numbers of HMM states
(The table body is available only as image BSA00000234795700221.)
Table 3 compares the performance of the method of the invention with that of three other methods for the different vulnerability types. For the stack overflow vulnerability, whose features are relatively easy to recognize, the method of the invention identified 89.1% of the vulnerabilities (a false negative rate of 10.9%) with a false positive rate of 21.9%. By contrast, methods 1, 2 and 3 all have false positive rates above 22% and false negative rates above 11%. For the heap overflow vulnerability, whose features are less distinctive than those of the stack overflow vulnerability, the method of the invention produced a false positive rate of 30.7% and a false negative rate of 21.4%. Even so, its false negative rate is lower than those of methods 1 and 3, and its false positive rate is lower than those of methods 1 and 2. For the formatted-output vulnerability, the method of the invention produced a false negative rate of 20.7%, whereas those of the other methods lie between 20.9% and 22.5%; its false positive rate is 17.7%, lower than the 19.2% of method 1 and the 18.5% of method 2.
Overall, the average false positive rate shows that the method of the invention recognizes vulnerabilities more accurately, and the average false negative rate shows that it achieves a higher vulnerability recognition rate, that is, a lower false negative rate, than the source code testing tools. The method of the invention can therefore serve as a classifier and recognizer for identifying further types of vulnerability.
Table 3. Performance comparison between the method of the invention and other methods
(The table body is available only as image BSA00000234795700231.)
In this embodiment, all experiments were carried out on a desktop PC with a Pentium Dual Core 2.20 GHz CPU, 2 GB of memory and the Windows XP SP2 operating system.
The above is only a preferred embodiment of the present invention. It should be noted that those skilled in the art may, without departing from the principle of the invention, make improvements or replace some of the technical features with equivalents, and such improvements and replacements shall also be regarded as falling within the scope of protection of the invention.

Claims (8)

1. A method for detecting assembly-instruction-level vulnerabilities based on a hidden Markov model, the specific steps of which are as follows:
Step 1: construct the vulnerability instruction database;
The vulnerability instruction database stores known vulnerabilities and their features; each vulnerability has 3 attributes: vulnerability name, assembly instruction frame and numerical code; each assembly instruction frame contains one or more assembly instructions; each numerical code consists of two parts, one being the unique code of the vulnerability and the other the serial number of the assembly instruction within the assembly instruction frame of that vulnerability; for each vulnerability, the construction method is:
Step 1.1: disassemble one or more executable programs that contain one kind of software vulnerability with a static disassembly analysis tool to obtain the function structure graphs of all functions in the executable programs; each function structure graph is called an assembly instruction fragment;
Step 1.2: from the assembly instruction fragments obtained in step 1.1, extract the assembly instruction fragments that produce the vulnerability;
Step 1.3: using a dynamic disassembly analysis tool, extract in order, from the vulnerability-producing assembly instruction fragments obtained in step 1.2, the assembly instructions that produce the vulnerability; the ordered set of these assembly instructions is called the assembly instruction frame; specifically:
Step 1.3.1: run the executable program of step 1.1 under normal conditions and use the dynamic disassembly analysis tool to trace, one by one, the vulnerability-producing assembly instruction fragments extracted in step 1.2; record the assembly instructions executed in each of these fragments during this run; the ordered set of the executed instructions is called assembly instruction set 1;
Step 1.3.2: trigger the vulnerability and run the executable program of step 1.1 again under the dynamic disassembly analysis tool, tracing the vulnerability-producing assembly instruction fragments extracted in step 1.2; record, in order, the assembly instructions executed in each fragment during this run; the ordered set of the executed instructions is called assembly instruction set 2;
Step 1.3.3: compare the assembly instructions in assembly instruction set 1 and assembly instruction set 2 one by one, distinguishing the following three cases:
Case 1: assembly instructions that appear in both assembly instruction set 1 and assembly instruction set 2 are regarded as not triggering the vulnerability and are deleted;
Case 2: assembly instructions that appear in assembly instruction set 1 but not in assembly instruction set 2 are discarded; these instructions are executed when the program runs normally but not when the vulnerability is triggered, and are therefore regarded as not triggering the vulnerability;
Case 3: the ordered set of assembly instructions that appear in assembly instruction set 2 but not in assembly instruction set 1 is called the assembly instruction frame; it stores the assembly instructions related to triggering the vulnerability;
the above steps yield the assembly instruction frame that produces the vulnerability;
Step 1.4: construct the vulnerability instruction database;
in the order in which the assembly instructions of the assembly instruction frame were extracted in step 1.3, assign a numerical code to each assembly instruction;
by repeating steps 1.1 to 1.4, a vulnerability instruction database containing a plurality of vulnerabilities is constructed;
Step 2, at the whole leaks in the leak instruction database of constructing in the step 1, the every kind of leak that is respectively in the leak instruction database is chosen a plurality of training datas that contain the executable program of this leak as this leak; Every kind of leak V in the leak instruction database iExpression, the training data T of this leak iExpression; I represents unique coding of each leak in the leak instruction database;
Step 3, obtain the assembly instruction fragment of the training data of each leak in the leak instruction database;
The training data of each leak in the leak instruction database that obtains at step 2 uses static dis-assembling analysis tool to carry out dis-assembling, obtains in the executable program all function structure figure of functions; Each function structure figure is called an assembly instruction fragment;
Step 4: obtain the numerical code sequences of the training data;
The assembly instructions in the assembly instruction fragments of the training data of each vulnerability, obtained in step 3, are processed in turn with the structuring method, yielding the numerical code sequence of the assembly instruction fragments of the training data of each vulnerability in the vulnerability instruction library;
The concrete operation steps of the structuring method are:
Step 4.1: remove the digits from each assembly instruction in the vulnerability instruction library in turn to form new character strings. The assembly instructions in the library are denoted P_1 to P_sum, where sum is a positive integer giving the number of assembly instructions contained in the library; the new character strings produced from them are denoted Q_1 to Q_sum;
Step 4.2: counter A counts the number of assembly instructions contained in the vulnerability instruction library; its count value is denoted m, where m ≥ 1 and m is a positive integer; the initial value of m is set to 1;
Step 4.3: judge whether the count value m of counter A is not greater than the number N of assembly instruction fragments of the training data of the m-th vulnerability, where N is a positive integer; if the condition holds, execute step 4.4; otherwise, end the operation;
Step 4.4: remove the digits from each assembly instruction in the assembly instruction fragment of the training data of the m-th vulnerability to form new character strings. The assembly instructions in this fragment are denoted P'_1 to P'_sum', where sum' is a positive integer giving the number of assembly instructions in the fragment; the new character strings produced from them are denoted Q'_1 to Q'_sum';
Step 4.5: counter B counts the assembly instructions in the assembly instruction fragment of the training data of the m-th vulnerability; its count value is denoted i, where i is a positive integer; the initial value of i is set to 1;
Step 4.6: judge whether the count value i of counter B is not greater than the number sum' of assembly instructions in the assembly instruction fragment of the training data of the m-th vulnerability; if the condition holds, execute step 4.7; otherwise, execute step 4.9;
Step 4.7: remove the digits from each assembly instruction in the assembly instruction fragment of the training data of the m-th vulnerability in turn, forming the new character string Q'_i; compare Q'_i one by one with the character strings Q_1 to Q_sum obtained in step 4.1 by removing the digits from each assembly instruction in the vulnerability instruction library, and use a character matching method to calculate the similarity of Q'_i to each of Q_1 to Q_sum, denoted S_1 to S_sum. If S_1 to S_sum are all less than a preset threshold, the numerical code corresponding to Q'_i is 0; otherwise, if S_1 to S_sum have a single maximum value S_max, the numerical code corresponding to Q'_i is the numerical code, in the vulnerability instruction library, of the assembly instruction corresponding to S_max; otherwise, if the maximum value S_max is attained more than once, any one of the maxima may be selected at random, and the numerical code corresponding to Q'_i is the numerical code, in the vulnerability instruction library, of the assembly instruction corresponding to the selected maximum;
Step 4.8: increase the count value i of counter B by 1, and repeat steps 4.6 to 4.8;
Step 4.9: increase the count value m of counter A by 1, and repeat steps 4.3 to 4.9;
One complete pass of the structuring method yields the numerical code sequence O_r = {o_1, ..., o_k, ..., o_N} belonging to one vulnerability in the vulnerability instruction library, also called a group of observation sequences of that vulnerability; 1 ≤ k ≤ N and k is a positive integer, N is the number of assembly instruction fragments of the training data of that vulnerability, and each assembly instruction fragment of the training data produces one observation sequence of that vulnerability; 1 ≤ r ≤ N', where N' is the number of vulnerabilities contained in the vulnerability instruction library, and r and N' are positive integers;
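The structuring method of step 4 can be illustrated by the following minimal Python sketch. It is only an illustration under stated assumptions, not the claimed implementation: the claim does not name the character matching method, so `difflib.SequenceMatcher` from the Python standard library is swapped in as the similarity measure; the numerical code of a library instruction is assumed to be its 1-based position in the library; the threshold value 0.8 and the function names `strip_digits` and `encode_fragment` are hypothetical. On ties the sketch takes the first maximum, which is one admissible choice of "any one" maximum.

```python
import re
from difflib import SequenceMatcher


def strip_digits(instr):
    """Remove every digit from an assembly instruction (steps 4.1 and 4.7)."""
    return re.sub(r"\d", "", instr)


def encode_fragment(fragment, library, threshold=0.8):
    """Map each instruction of a fragment to a numerical code.

    library   -- instructions P_1..P_sum; code k (1-based, an assumption)
                 stands for library instruction P_k
    threshold -- hypothetical preset similarity threshold
    Code 0 means that no library instruction is similar enough.
    """
    q = [strip_digits(p) for p in library]          # Q_1..Q_sum
    codes = []
    for instr in fragment:
        q_prime = strip_digits(instr)               # Q'_i
        sims = [SequenceMatcher(None, q_prime, qk).ratio() for qk in q]
        if not sims or max(sims) < threshold:
            codes.append(0)                         # all S_k below threshold
        else:
            # index() returns the first maximum, i.e. one arbitrary choice on ties
            codes.append(sims.index(max(sims)) + 1)
    return codes


library = ["mov ecx, 10h", "rep movsd", "sub esp, 40h"]
fragment = ["sub esp, 100h", "rep movsd", "xor eax, eax"]
print(encode_fragment(fragment, library))           # [3, 2, 0]
```

Removing the digits first makes `sub esp, 100h` and `sub esp, 40h` identical strings, which is why instructions that differ only in immediate values or offsets receive the same numerical code.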
Step 5: obtain in turn the parameters λ_r = {A_r, B_r, π_r} of the hidden Markov model corresponding to each vulnerability in the vulnerability instruction library;
Train a hidden Markov model with the training data of each vulnerability in the vulnerability instruction library, obtaining the parameters λ_r = {A_r, B_r, π_r} of the hidden Markov model corresponding to each vulnerability;
For the group of observation sequences O_r = {o_1, ..., o_k, ..., o_N} of each vulnerability in the vulnerability instruction library, use the Baum-Welch algorithm to iteratively adjust the parameters of the hidden Markov model, obtaining the optimal parameters λ_r = {A_r, B_r, π_r} that describe this class of vulnerability; this is the hidden Markov model trained for each vulnerability in the vulnerability instruction library;
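As a hedged illustration of step 5, the following pure-Python sketch runs Baum-Welch re-estimation for a discrete hidden Markov model over a group of observation sequences. The number of hidden states, the initial values of A, B and π, and the iteration count are not given in the text and are assumptions here; the function names are hypothetical. Observation symbols are taken to be the numerical codes 0..sum produced in step 4. The sketch uses per-step scaling to avoid underflow and assumes that the initial B is strictly positive.

```python
import math


def forward_backward(obs, A, B, pi):
    """Scaled forward-backward pass; returns (alpha, beta, scale)."""
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]
    scale = [0.0] * T
    alpha[0] = [pi[i] * B[i][obs[0]] for i in range(n)]
    for t in range(T):
        if t > 0:
            alpha[t] = [B[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j] for i in range(n))
                        for j in range(n)]
        scale[t] = sum(alpha[t])                    # assumed to be non-zero
        alpha[t] = [a / scale[t] for a in alpha[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   / scale[t + 1] for i in range(n)]
    return alpha, beta, scale


def log_likelihood(obs, A, B, pi):
    """log P(obs | lambda), the sum of the log scaling factors."""
    return sum(math.log(c) for c in forward_backward(obs, A, B, pi)[2])


def baum_welch(sequences, A, B, pi, iterations=10):
    """Re-estimate (A, B, pi) from a group of observation sequences."""
    n, m = len(pi), len(B[0])
    for _ in range(iterations):
        pi_num = [0.0] * n
        a_num = [[0.0] * n for _ in range(n)]
        a_den = [0.0] * n
        b_num = [[0.0] * m for _ in range(n)]
        b_den = [0.0] * n
        for obs in sequences:
            alpha, beta, scale = forward_backward(obs, A, B, pi)
            T = len(obs)
            for t in range(T):
                for i in range(n):
                    g = alpha[t][i] * beta[t][i]    # gamma_t(i)
                    if t == 0:
                        pi_num[i] += g
                    b_num[i][obs[t]] += g
                    b_den[i] += g
                    if t < T - 1:
                        a_den[i] += g
                        for j in range(n):          # xi_t(i, j)
                            a_num[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                            * beta[t + 1][j] / scale[t + 1])
        pi = [x / len(sequences) for x in pi_num]
        # keep the old row when a state was never visited (zero denominator)
        A = [[a_num[i][j] / a_den[i] for j in range(n)] if a_den[i] > 0 else A[i]
             for i in range(n)]
        B = [[b_num[i][k] / b_den[i] for k in range(m)] if b_den[i] > 0 else B[i]
             for i in range(n)]
    return A, B, pi
```

Each iteration cannot decrease the total log-likelihood of the training sequences, so comparing `log_likelihood` before and after training is a simple sanity check; a practical system would also stop once the improvement falls below a tolerance.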
Step 6: use the hidden Markov model of each vulnerability in the vulnerability instruction library, obtained in step 5, to identify vulnerabilities in the executable program under test; specifically:
Step 6.1: use a static disassembly analysis tool to disassemble the executable program under test and obtain the function structure graphs of all functions in the executable program; each function structure graph is called an assembly instruction fragment;
Step 6.2: process the assembly instructions in the assembly instruction fragment of the executable program under test in turn, obtaining the numerical code sequence o' = {v_1, ..., v_t} of the executable program under test, also called the observation sequence of the executable program under test; each of v_1, ..., v_t is the numerical code corresponding to one assembly instruction in the assembly instruction fragment of the executable program under test; t is a positive integer giving the number of assembly instructions in the assembly instruction fragment of the executable program under test. The concrete operation steps are:
Step 6.2.1: counter C counts the assembly instructions in the assembly instruction fragment of the executable program under test; its count value is denoted j, where j is a positive integer; the initial value of j is set to 1;
Step 6.2.2: judge whether the count value j of counter C is not greater than the number of assembly instructions in the assembly instruction fragment of the executable program under test; if the condition holds, execute step 6.2.3; otherwise, end the operation;
Step 6.2.3: remove the digits from each assembly instruction C_j in the assembly instruction fragment of the executable program under test in turn, forming the new character string C'_j; compare C'_j one by one with the character strings Q_1 to Q_sum obtained in step 4.1 by removing the digits from each assembly instruction in the vulnerability instruction library, and use a character matching method to calculate the similarity of C'_j to each of Q_1 to Q_sum, denoted S'_1 to S'_sum. If S'_1 to S'_sum are all less than a preset threshold, the numerical code corresponding to C'_j is 0; otherwise, if S'_1 to S'_sum have a single maximum value S'_max, the numerical code corresponding to C'_j is the numerical code, in the vulnerability instruction library, of the assembly instruction corresponding to S'_max; otherwise, if the maximum value S'_max is attained more than once, any one of the maxima may be selected at random, and the numerical code corresponding to C'_j is the numerical code, in the vulnerability instruction library, of the assembly instruction corresponding to the selected maximum;
Step 6.2.4: increase the count value j of counter C by 1, and repeat steps 6.2.2 to 6.2.4;
Through steps 6.2.1 to 6.2.4, the numerical code sequence o' = {v_1, ..., v_t} of the executable program under test is obtained;
Step 6.3: using the numerical code sequence o' of the executable program under test obtained in step 6.2 and the parameters λ_r = {A_r, B_r, π_r} of the hidden Markov model corresponding to each vulnerability in the vulnerability instruction library obtained in step 5, calculate the probability P(o' | λ_r) that the observation sequence o' = {v_1, ..., v_t} of the executable program under test occurs under the hidden Markov model λ_r = {A_r, B_r, π_r}, thereby obtaining P(o' | λ_1) to P(o' | λ_N');
Step 6.4: take the maximum of P(o' | λ_1) to P(o' | λ_N'); if this maximum is greater than a preset threshold, the executable program under test contains the vulnerability represented by the hidden Markov model corresponding to this maximum;
Through the above steps, the vulnerabilities present in the executable program under test can be detected.
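Steps 6.3 and 6.4 can be sketched as follows. This is an illustrative sketch only: the text does not say how P(o' | λ_r) is computed, and the scaled forward algorithm is assumed here as the standard choice. The comparison is done on log-probabilities to avoid underflow, so the preset threshold is expressed in the log domain; the model names, parameter values and the names `log_prob` and `detect` are hypothetical.

```python
import math


def log_prob(obs, A, B, pi):
    """log P(obs | lambda) computed with the scaled forward algorithm."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [B[j][obs[t]] * sum(alpha[i] * A[i][j] for i in range(n))
                     for j in range(n)]
        c = sum(alpha)
        if c == 0.0:                     # sequence impossible under this model
            return float("-inf")
        total += math.log(c)
        alpha = [a / c for a in alpha]
    return total


def detect(obs, models, log_threshold):
    """Return (best model name or None, all scores).

    models -- dict mapping a vulnerability name to its (A, B, pi)
    """
    scores = {name: log_prob(obs, *lam) for name, lam in models.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] > log_threshold else None), scores


# Two toy single-state models over the symbols {0, 1, 2} (hypothetical values).
models = {
    "stack_overflow": ([[1.0]], [[0.1, 0.8, 0.1]], [1.0]),
    "heap_overflow": ([[1.0]], [[0.1, 0.1, 0.8]], [1.0]),
}
label, scores = detect([1, 1, 1], models, math.log(0.1))
print(label)                             # stack_overflow
```

With the threshold raised to `math.log(0.9)` the same call returns `None`, which corresponds to the case in step 6.4 where the maximum probability does not exceed the preset threshold.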
2. The method for detecting assembly instruction level vulnerabilities based on a hidden Markov model as claimed in claim 1, characterized in that: the function structure graph described in step 1.1 includes but is not limited to the following information: the function name, the function parameters, the return value type, the global variables and local variables used in the function, the registers used in the function, the system calls used in the function, all assembly instructions contained in the function, the control flow structure of the function, and high-level-language-like comments on key instructions.
3. The method for detecting assembly instruction level vulnerabilities based on a hidden Markov model as claimed in claim 2, characterized in that: the control flow structure of the function is divided on the basis of jump instructions.
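A division of a function's instructions by jump instructions might look like the following sketch. It is a simplification under stated assumptions: instructions are plain x86 text lines, a block is closed after every instruction whose mnemonic starts with `j` (jmp and the conditional jumps), and jump targets are not used as additional block boundaries. The function name `split_blocks` is hypothetical.

```python
def split_blocks(instructions):
    """Split a list of x86 instruction strings into blocks ending at jumps."""
    blocks, current = [], []
    for ins in instructions:
        parts = ins.split()
        if not parts:                    # skip blank lines
            continue
        current.append(ins)
        if parts[0].lower().startswith("j"):
            blocks.append(current)       # a jump closes the current block
            current = []
    if current:
        blocks.append(current)           # trailing block without a jump
    return blocks


code = ["cmp eax, 0", "jz short loc_1", "mov ebx, 1", "jmp loc_2", "xor eax, eax"]
for block in split_blocks(code):
    print(block)
```

A full control flow graph would also start a new block at every jump target and record the edges between blocks; the sketch only shows the division step itself.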
4. The method for detecting assembly instruction level vulnerabilities based on a hidden Markov model as claimed in claim 1, characterized in that: the method of extracting the assembly instruction fragment that produces a vulnerability, described in step 1.2, comprises a method of extracting the assembly instruction fragment of a stack overflow vulnerability, specifically:
The assembly instruction fragment of a stack overflow vulnerability comprises the following assembly instruction sequence: (1) register ESP moves toward the low address, opening up new stack space; (2) register ECX records the length of the content to be copied; (3) register ESI records a pointer value that points to the first address of the content to be copied; (4) register EDI records a pointer value that points to the first address of the stack space to be used for the copy; (5) the content to be copied, pointed to by ESI, is copied in units of double words into the stack space pointed to by EDI; (6) the remaining part of the content to be copied is copied in units of bytes into the remaining stack space; (7) ESP points to the theoretical return address.
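The instruction sequence of claim 4 can be checked against a disassembled fragment with an ordered pattern match, as in the following sketch. The regular expressions are assumptions about how items (1) to (6) typically appear in x86 disassembly (`sub esp`, loads of ECX, ESI and EDI, `rep movsd`, `rep movsb`); item (7) is left out because the text does not tie it to a specific instruction. The names `STACK_COPY_PATTERNS` and `matches_in_order` are hypothetical.

```python
import re

# One assumed pattern per item (1)-(6) of the sequence, in order.
STACK_COPY_PATTERNS = [
    r"^sub\s+esp\s*,",          # (1) ESP moves toward the low address
    r"^(mov|lea)\s+ecx\s*,",    # (2) ECX holds the length to copy
    r"^(mov|lea)\s+esi\s*,",    # (3) ESI points to the source
    r"^(mov|lea)\s+edi\s*,",    # (4) EDI points to the stack destination
    r"^rep\s+movsd\b",          # (5) copy in units of double words
    r"^rep\s+movsb\b",          # (6) copy the remainder in units of bytes
]


def matches_in_order(instructions, patterns=STACK_COPY_PATTERNS):
    """True if the patterns occur as an ordered, not necessarily contiguous, subsequence."""
    k = 0
    for ins in instructions:
        if k < len(patterns) and re.search(patterns[k], ins.strip().lower()):
            k += 1
    return k == len(patterns)


fragment = [
    "push ebp", "mov ebp, esp", "sub esp, 40h",
    "mov ecx, [ebp+10h]", "mov esi, [ebp+0Ch]", "lea edi, [ebp-40h]",
    "mov eax, ecx", "shr ecx, 2", "rep movsd",
    "mov ecx, eax", "and ecx, 3", "rep movsb",
    "leave", "retn",
]
print(matches_in_order(fragment))        # True
```

Allowing unrelated instructions between the matched ones keeps the check tolerant of compiler scheduling, at the cost of more false positives than a strict contiguous match.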
5. The method for detecting assembly instruction level vulnerabilities based on a hidden Markov model as claimed in claim 1, characterized in that: the method of extracting the assembly instruction fragment that produces a vulnerability, described in step 1.2, comprises a method of extracting the assembly instruction fragment of a heap overflow vulnerability, specifically:
The assembly instruction fragment of a heap overflow vulnerability comprises the following assembly instruction sequence: (1) the HeapCreate function is called to create heap space P, which comprises two heap blocks, heap block A and heap block B; (2) the HeapAlloc function is called to dynamically allocate space for heap block A; (3) content exceeding the size of heap block A is copied into heap block A; (4) the HeapAlloc function is called to dynamically allocate space for heap block B, at which point the block header pointer of heap block B has been overwritten by the excess content from heap block A.
6. The method for detecting assembly instruction level vulnerabilities based on a hidden Markov model as claimed in claim 1, characterized in that: the method of extracting the assembly instruction fragment that produces a vulnerability, described in step 1.2, comprises a method of extracting the assembly instruction fragment of a format output vulnerability, specifically:
The assembly instruction fragment of a format output vulnerability comprises the following assembly instruction sequence: (1) within the three instructions preceding the call instruction that calls a format output function, there is no push offset instruction; (2) the number of "%" symbols in the comment of the push offset instruction does not match the number of push instructions that occur consecutively before the push offset instruction; (3) "%n" appears in the comment of the push offset instruction.
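The three conditions of claim 6 can be expressed as a small check over the instructions preceding a call, as sketched below. The representation is an assumption: each instruction is a hypothetical `(text, comment)` tuple in which the comment of a `push offset` instruction carries the format string, as a disassembler listing might show it. The finding labels and the function name `check_format_call` are hypothetical, and escaped `%%` sequences are not handled.

```python
def check_format_call(instrs, call_idx):
    """Check the instructions before instrs[call_idx], a call to a format output function.

    instrs -- list of (instruction_text, comment) tuples
    Returns a list of finding labels; an empty list means no condition matched.
    """
    start = max(0, call_idx - 3)
    fmt_positions = [k for k in range(start, call_idx)
                     if instrs[k][0].lower().startswith("push offset")]
    if not fmt_positions:
        return ["missing_format_push"]           # condition (1)
    pos = fmt_positions[-1]                      # push offset nearest to the call
    comment = instrs[pos][1]
    pushes = 0                                   # consecutive pushes before it
    k = pos - 1
    while k >= 0 and instrs[k][0].lower().startswith("push"):
        pushes += 1
        k -= 1
    findings = []
    if comment.count("%") != pushes:
        findings.append("arg_count_mismatch")    # condition (2)
    if "%n" in comment:
        findings.append("percent_n")             # condition (3)
    return findings


safe = [("push eax", ""), ("push ebx", ""),
        ("push offset aDS", '"%d %s"'), ("call printf", "")]
print(check_format_call(safe, 3))                # []
```

Because arguments are pushed before the format string under the usual right-to-left calling convention, the run of push instructions directly before `push offset` is used as the argument count to compare with the number of conversion specifiers.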
7. The method for detecting assembly instruction level vulnerabilities based on a hidden Markov model as claimed in claim 1, characterized in that: preferably, the static disassembly analysis tool described in step 1.1 and step 3 is IDA Pro.
8. The method for detecting assembly instruction level vulnerabilities based on a hidden Markov model as claimed in claim 1, characterized in that: the dynamic disassembly analysis tool described in step 1.3 includes but is not limited to: Paimei, and BinNavi from Sabre.
CN201010257022.8A 2010-08-19 2010-08-19 Hidden Markov model based method for detecting assembler instruction level vulnerability Expired - Fee Related CN101923618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010257022.8A CN101923618B (en) 2010-08-19 2010-08-19 Hidden Markov model based method for detecting assembler instruction level vulnerability

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010257022.8A CN101923618B (en) 2010-08-19 2010-08-19 Hidden Markov model based method for detecting assembler instruction level vulnerability

Publications (2)

Publication Number Publication Date
CN101923618A true CN101923618A (en) 2010-12-22
CN101923618B CN101923618B (en) 2011-12-21

Family

ID=43338548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010257022.8A Expired - Fee Related CN101923618B (en) 2010-08-19 2010-08-19 Hidden Markov model based method for detecting assembler instruction level vulnerability

Country Status (1)

Country Link
CN (1) CN101923618B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622302A (en) * 2011-01-26 2012-08-01 中国科学院高能物理研究所 Recognition method for fragment data type
CN103020529A (en) * 2012-10-31 2013-04-03 中国航天科工集团第二研究院七○六所 Software vulnerability analytical method based on scene model
CN105528286A (en) * 2015-09-28 2016-04-27 北京理工大学 System call-based software behavior assessment method
CN105938532A (en) * 2015-11-25 2016-09-14 北京匡恩网络科技有限责任公司 Large-scale sampling and bug analysis method for firmware samples
CN106709335A (en) * 2015-11-17 2017-05-24 阿里巴巴集团控股有限公司 Vulnerability detection method and apparatus
CN106845226A (en) * 2016-12-26 2017-06-13 中国电子科技集团公司第三十研究所 A kind of rogue program analysis method
CN107526967A (en) * 2017-07-05 2017-12-29 阿里巴巴集团控股有限公司 A kind of risk Address Recognition method, apparatus and electronic equipment
CN108959084A (en) * 2018-06-29 2018-12-07 西北大学 A method of the Markov forecast techniques loophole quantity based on exponential smoothing and similarity
CN111428246A (en) * 2020-03-30 2020-07-17 电子科技大学 Logic vulnerability deep mining method oriented to autonomous chip hardware security
CN111444509A (en) * 2018-12-27 2020-07-24 北京奇虎科技有限公司 CPU vulnerability detection method and system based on virtual machine
CN112579713A (en) * 2019-09-29 2021-03-30 中国移动通信集团辽宁有限公司 Address recognition method and device, computing equipment and computer storage medium
CN113688395A (en) * 2021-07-29 2021-11-23 深圳开源互联网安全技术有限公司 Vulnerability detection method and device for web application program and computer readable storage medium
CN114338806A (en) * 2022-02-28 2022-04-12 湖南云畅网络科技有限公司 Synchronous message processing method and system
CN114500043A (en) * 2022-01-25 2022-05-13 山东省计算中心(国家超级计算济南中心) Internet of things firmware vulnerability detection method and system based on homology analysis

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060023723A1 (en) * 2004-07-27 2006-02-02 Michele Morara Object oriented library for markov chain monte carlo simulation
CN101436128A (en) * 2007-11-16 2009-05-20 北京邮电大学 Software test case automatic generating method and system
CN101571828A (en) * 2009-06-11 2009-11-04 北京航空航天大学 Method for detecting code security hole based on constraint analysis and model checking

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622302B (en) * 2011-01-26 2014-10-29 中国科学院高能物理研究所 Recognition method for fragment data type
CN102622302A (en) * 2011-01-26 2012-08-01 中国科学院高能物理研究所 Recognition method for fragment data type
CN103020529A (en) * 2012-10-31 2013-04-03 中国航天科工集团第二研究院七○六所 Software vulnerability analytical method based on scene model
CN103020529B (en) * 2012-10-31 2015-12-09 中国航天科工集团第二研究院七○六所 Software vulnerability analytical method based on scene model
CN105528286A (en) * 2015-09-28 2016-04-27 北京理工大学 System call-based software behavior assessment method
CN106709335B (en) * 2015-11-17 2020-12-04 阿里巴巴集团控股有限公司 Vulnerability detection method and device
CN106709335A (en) * 2015-11-17 2017-05-24 阿里巴巴集团控股有限公司 Vulnerability detection method and apparatus
CN105938532A (en) * 2015-11-25 2016-09-14 北京匡恩网络科技有限责任公司 Large-scale sampling and bug analysis method for firmware samples
CN105938532B (en) * 2015-11-25 2018-03-16 北京匡恩网络科技有限责任公司 Large-scale sampling and bug analysis method for firmware samples
CN106845226A (en) * 2016-12-26 2017-06-13 中国电子科技集团公司第三十研究所 A kind of rogue program analysis method
CN107526967A (en) * 2017-07-05 2017-12-29 阿里巴巴集团控股有限公司 A kind of risk Address Recognition method, apparatus and electronic equipment
CN108959084A (en) * 2018-06-29 2018-12-07 西北大学 A method of the Markov forecast techniques loophole quantity based on exponential smoothing and similarity
CN108959084B (en) * 2018-06-29 2022-03-25 西北大学 Markov vulnerability prediction quantity method based on smoothing method and similarity
CN111444509A (en) * 2018-12-27 2020-07-24 北京奇虎科技有限公司 CPU vulnerability detection method and system based on virtual machine
CN111444509B (en) * 2018-12-27 2024-05-14 北京奇虎科技有限公司 CPU vulnerability detection method and system based on virtual machine
CN112579713A (en) * 2019-09-29 2021-03-30 中国移动通信集团辽宁有限公司 Address recognition method and device, computing equipment and computer storage medium
CN112579713B (en) * 2019-09-29 2023-11-21 中国移动通信集团辽宁有限公司 Address recognition method, address recognition device, computing equipment and computer storage medium
CN111428246A (en) * 2020-03-30 2020-07-17 电子科技大学 Logic vulnerability deep mining method oriented to autonomous chip hardware security
CN111428246B (en) * 2020-03-30 2023-04-18 电子科技大学 Logic vulnerability deep mining method oriented to autonomous chip hardware security
CN113688395A (en) * 2021-07-29 2021-11-23 深圳开源互联网安全技术有限公司 Vulnerability detection method and device for web application program and computer readable storage medium
CN113688395B (en) * 2021-07-29 2023-08-11 深圳开源互联网安全技术有限公司 Vulnerability detection method and device for web application program and computer readable storage medium
CN114500043A (en) * 2022-01-25 2022-05-13 山东省计算中心(国家超级计算济南中心) Internet of things firmware vulnerability detection method and system based on homology analysis
CN114338806A (en) * 2022-02-28 2022-04-12 湖南云畅网络科技有限公司 Synchronous message processing method and system

Also Published As

Publication number Publication date
CN101923618B (en) 2011-12-21

Similar Documents

Publication Publication Date Title
CN101923618B (en) Hidden Markov model based method for detecting assembler instruction level vulnerability
US20230259621A1 (en) Stacking-ensemble-based apt organization identification method and system, and storage medium
CN110135157B (en) Malicious software homology analysis method and system, electronic device and storage medium
CN108718310B (en) Deep learning-based multilevel attack feature extraction and malicious behavior identification method
CN112132179A (en) Incremental learning method and system based on small number of labeled samples
Murtaza et al. A host-based anomaly detection approach by representing system calls as states of kernel modules
CN111797241B (en) Event Argument Extraction Method and Device Based on Reinforcement Learning
CN106570513A (en) Fault diagnosis method and apparatus for big data network system
CN102291392A (en) Hybrid intrusion detection method based on bagging algorithm
CN102682089A (en) Method for data dimensionality reduction by identifying random neighbourhood embedding analyses
CN101187872A (en) Program kind distinguishing method based on behavior, device and program control method and device
CN102298681B (en) Software identification method based on data stream sliced sheet
CN113742205A (en) Code vulnerability intelligent detection method based on man-machine cooperation
CN114218580A (en) Intelligent contract vulnerability detection method based on multi-task learning
Yu et al. Prediction of protein-coding small ORFs in multi-species using integrated sequence-derived features and the random forest model
CN114416783A (en) Method and device for evaluating dynamic cost of OLAP (on-line analytical processing) query engine
CN112529082A (en) System portrait construction method, device and equipment
Mostofi et al. Explainable safety risk management in construction with unsupervised learning
CN103065047A (en) Terrorism behavior prediction method based on terrorist organization background knowledge subspace
CN117743601A (en) Natural resource knowledge graph completion method, device, equipment and medium
Scharl et al. The stochastic QT-clust algorithm: evaluation of stability and variance on time-course microarray data
CN115240775A (en) Cas protein prediction method based on stacking ensemble learning strategy
Kelil et al. A general measure of similarity for categorical sequences
CN117556425B (en) Intelligent contract vulnerability detection method, system and equipment based on graph neural network
CN113901452B (en) Sub-graph fuzzy matching security event identification method based on information entropy

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20111221

Termination date: 20120819