CN106909844A - Method and device for classifying application program samples - Google Patents

Info

Publication number
CN106909844A
Authority
CN
China
Prior art keywords
application program
function
decompiling
sequence
program sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510971488.7A
Other languages
Chinese (zh)
Inventor
杨康
陈卓
唐海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201510971488.7A
Publication of CN106909844A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/561: Virus type analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

This application discloses a method and device for classifying application program samples, so that samples can be classified automatically. The method includes: obtaining a first virtual machine executable file of a first application program sample in a set of application program samples to be classified; decompiling the first virtual machine executable file to obtain a decompiled first function information structure, and extracting a first function instruction sequence from it; obtaining a second virtual machine executable file of a second application program sample in the set; decompiling the second virtual machine executable file to obtain a decompiled second function information structure, and extracting a second function instruction sequence from it; determining the edit distance between the first function instruction sequence and the second function instruction sequence; judging whether the edit distance is less than a preset threshold; and, if so, classifying the first application program sample and the second application program sample into the same category.

Description

Method and device for classifying application program samples
Technical field
The present application relates to the field of intelligent terminal security technology, and in particular to a method and device for classifying application program samples.
Background
With the development of science and technology, intelligent terminals offer more and more functions. For example, mobile phones have evolved from traditional GSM and TDMA digital handsets into smartphones that can process multimedia resources and provide a variety of information services such as web browsing, teleconferencing and e-commerce. At the same time, increasingly varied malicious-code attacks on mobile phones and increasingly serious personal data security problems have followed, and more and more smartphone users suffer from mobile phone viruses.
At present, to improve the efficiency of mobile phone virus identification, the industry classifies large numbers of application program samples in advance according to the similarity of the applications, obtaining families that each consist of multiple samples with high similarity. When mobile phone virus applications are identified, an application found to belong to the family of a virus application can be judged directly to be a virus sample. In the prior art, large numbers of application program samples are generally classified by manual screening; as applications on mobile platforms multiply, manual classification becomes increasingly inefficient.
Summary of the invention
The embodiments of the present application provide a method and device for classifying application program samples that overcome the above problem or at least partially solve it.
The embodiments of the present application adopt the following technical solutions:
A method for classifying application program samples, including:
obtaining a first virtual machine executable file of a first application program sample in a set of application program samples to be classified;
decompiling the first virtual machine executable file to obtain a decompiled first function information structure, and extracting a first function instruction sequence from the decompiled first function information structure;
obtaining a second virtual machine executable file of a second application program sample in the set of application program samples to be classified;
decompiling the second virtual machine executable file to obtain a decompiled second function information structure, and extracting a second function instruction sequence from the decompiled second function information structure;
determining the edit distance between the first function instruction sequence and the second function instruction sequence;
judging whether the edit distance is less than a preset threshold;
if so, classifying the first application program sample and the second application program sample into the same category.
Preferably, the first function instruction sequence is the sequence formed by the first n characters of each line in the machine code field contained in the first function information structure, and the second function instruction sequence is the sequence formed by the first n characters of each line in the machine code field contained in the second function information structure.
Preferably, before judging whether the edit distance is less than the preset threshold, the method further includes:
determining the total number of characters of the first function instruction sequence or of the second function instruction sequence;
taking the product of the total number of characters and a preset value as the preset threshold, where the preset value is between 0 and 1.
Preferably, before judging whether the edit distance is less than the preset threshold, the method further includes:
determining the sum of the numbers of characters of the first function instruction sequence and the second function instruction sequence;
taking the product of that sum and a preset value as the preset threshold, where the preset value is between 0 and 1.
A method for classifying application program samples, including:
obtaining a first virtual machine executable file of a first application program sample in a set of application program samples to be classified;
decompiling the first virtual machine executable file to obtain a decompiled first function information structure, and extracting a first mnemonic sequence from the decompiled first function information structure;
obtaining a second virtual machine executable file of a second application program sample in the set of application program samples to be classified;
decompiling the second virtual machine executable file to obtain a decompiled second function information structure, and extracting a second mnemonic sequence from the decompiled second function information structure;
determining the edit distance between the first mnemonic sequence and the second mnemonic sequence;
judging whether the edit distance is less than a preset threshold;
if so, classifying the first application program sample and the second application program sample into the same category.
Preferably, the first mnemonic sequence is the sequence formed by the function characters in the code field contained in the first function information structure, and the second mnemonic sequence is the sequence formed by the function characters in the code field contained in the second function information structure.
Preferably, before judging whether the edit distance is less than the preset threshold, the method further includes:
determining the total number of characters of the first mnemonic sequence or of the second mnemonic sequence;
taking the product of the total number of characters and a preset value as the preset threshold, where the preset value is between 0 and 1.
A device for classifying application program samples, including:
a first acquisition unit, configured to obtain a first virtual machine executable file of a first application program sample in a set of application program samples to be classified;
a first extraction unit, configured to decompile the first virtual machine executable file to obtain a decompiled first function information structure, and to extract a first function instruction sequence from the decompiled first function information structure;
a second acquisition unit, configured to obtain a second virtual machine executable file of a second application program sample in the set of application program samples to be classified;
a second extraction unit, configured to decompile the second virtual machine executable file to obtain a decompiled second function information structure, and to extract a second function instruction sequence from the decompiled second function information structure;
a determining unit, configured to determine the edit distance between the first function instruction sequence and the second function instruction sequence;
a judging unit, configured to judge whether the edit distance is less than a preset threshold;
a classification unit, configured to classify the first application program sample and the second application program sample into the same category when the edit distance is less than the preset threshold.
Preferably, the first function instruction sequence is the sequence formed by the first n characters of each line in the machine code field contained in the first function information structure, and the second function instruction sequence is the sequence formed by the first n characters of each line in the machine code field contained in the second function information structure.
Preferably, the device further includes:
a preset threshold determining unit, configured to determine, before it is judged whether the edit distance is less than the preset threshold, the total number of characters of the first function instruction sequence or of the second function instruction sequence, and to take the product of the total number of characters and a preset value as the preset threshold, where the preset value is between 0 and 1.
Preferably, the device further includes:
a preset threshold determining unit, configured to determine, before it is judged whether the edit distance is less than the preset threshold, the sum of the numbers of characters of the first function instruction sequence and the second function instruction sequence, and to take the product of that sum and a preset value as the preset threshold, where the preset value is between 0 and 1.
A device for classifying application program samples, including:
a first acquisition unit, configured to obtain a first virtual machine executable file of a first application program sample in a set of application program samples to be classified;
a first extraction unit, configured to decompile the first virtual machine executable file to obtain a decompiled first function information structure, and to extract a first mnemonic sequence from the decompiled first function information structure;
a second acquisition unit, configured to obtain a second virtual machine executable file of a second application program sample in the set of application program samples to be classified;
a second extraction unit, configured to decompile the second virtual machine executable file to obtain a decompiled second function information structure, and to extract a second mnemonic sequence from the decompiled second function information structure;
a determining unit, configured to determine the edit distance between the first mnemonic sequence and the second mnemonic sequence;
a judging unit, configured to judge whether the edit distance is less than a preset threshold;
a classification unit, configured to classify the first application program sample and the second application program sample into the same category when the edit distance is less than the preset threshold.
Preferably, the first mnemonic sequence is the sequence formed by the function characters in the code field contained in the first function information structure, and the second mnemonic sequence is the sequence formed by the function characters in the code field contained in the second function information structure.
Preferably, the device further includes:
a preset threshold determining unit, configured to determine, before it is judged whether the edit distance is less than the preset threshold, the total number of characters of the first mnemonic sequence or of the second mnemonic sequence, and to take the product of the total number of characters and a preset value as the preset threshold, where the preset value is between 0 and 1.
At least one of the above technical solutions adopted by the embodiments of the present application can achieve the following beneficial effects:
The virtual machine executable files of the first and second application program samples in the set of samples to be classified are analyzed and decompiled to obtain the corresponding first and second function instruction sequences (or mnemonic sequences). An edit distance algorithm is then used to determine the edit distance between the first and second function instruction sequences (or mnemonic sequences). If that edit distance is judged to be less than the preset threshold, the first and second application program samples are classified into the same category. In this way, application program samples in the set that are close in similarity (edit distance less than the preset threshold) are grouped into the same class to obtain family samples, which achieves automatic classification of the samples in the set and improves classification efficiency.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present application and form a part of it. The illustrative embodiments of the application and their descriptions are used to explain the application and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flow chart of the method for classifying application program samples provided in one embodiment of the application;
Fig. 2 is an example of a function information structure obtained by decompiling a dex file in an embodiment of the application;
Fig. 3 is a flow chart of the method for classifying application program samples provided in another embodiment of the application;
Fig. 4 is a module diagram of the device for classifying application program samples provided in one embodiment of the application.
Detailed description of the embodiments
To make the purpose, technical solutions and advantages of the application clearer, the technical solutions of the application are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are evidently only some of the embodiments of the application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the application, without creative work, fall within the scope of protection of the application.
Take the Android operating system as an example. It includes an application layer (app layer) and a system framework layer (framework layer); other layers that may be distinguished by function are not discussed in this application. The app layer can generally be understood as the upper layer, responsible for the interface that interacts with the user, for example application maintenance, or recognizing different types of clicked content on a page so as to show different context menus. The framework layer generally serves as the middle layer. Its main responsibility is to forward user requests obtained by the app layer, such as starting an application, clicking a link, or clicking to save a picture, to the lower layers, and to distribute the content processed by the lower layers to the upper layer, either through messages or through intermediate proxy classes, for display to the user.
Dalvik is the Java virtual machine used on the Android platform. Dalvik is optimized so that multiple virtual machine instances can run simultaneously in limited memory, and each Dalvik application executes as an independent Linux process. Independent processes prevent all programs from being closed when one virtual machine crashes. The Dalvik virtual machine can run Java applications that have been converted into the dex (Dalvik Executable) format, a compressed format designed specifically for Dalvik and suited to systems with limited memory and processor speed.
In the Android system, therefore, a dex file is a virtual machine executable file that can be loaded and run directly in the Dalvik virtual machine (Dalvik VM). With ADT (Android Development Tools), Java source code can be converted into dex files through a multi-step compilation. A dex file is the result of optimization for embedded systems: the instruction codes of the Dalvik virtual machine are not standard Java virtual machine instruction codes but an instruction set of its own. Many class names and constant strings are shared within a dex file, which makes it smaller and more efficient to run.
The technical solutions provided by the embodiments of the application are described in detail below with reference to the drawings.
Fig. 1 shows the flow of the method for classifying application program samples provided in one embodiment of the application, including:
S101: Obtain a first virtual machine executable file of a first application program sample in a set of application program samples to be classified.
The purpose of the application is to classify the application program samples to be classified in an application program sample set Q automatically according to their similarity. The virtual machine executable file is, for example, a dex file. As described above, the Android operating system includes an application layer (app layer) and a system framework layer (framework layer), and the application focuses on research into and improvement of the app layer. A person skilled in the art will understand, however, that when Android starts, the Dalvik VM monitors all programs (APK files) and frameworks and creates a dependency tree for them. Using this dependency tree, the Dalvik VM optimizes the code of each program and stores it in the Dalvik cache (dalvik-cache), so that all programs use the optimized code at run time. When a program (or framework library) changes, the Dalvik VM re-optimizes the code and stores it in the cache again. cache/dalvik-cache stores the dex files generated for programs on system, while data/dalvik-cache stores the dex files generated for data/app. The application focuses on analyzing and processing the dex files generated for data/app, but it should be understood that its theory and operations apply equally to the dex files generated for programs on system.
A dex file can be obtained by parsing the APK (Android Package, the Android installation package). An APK file is in fact a compressed package in zip format whose suffix has been changed to apk, so the dex file can be obtained after decompressing it with UnZip.
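The extraction step just described can be sketched in Python with the standard zipfile module. This is a minimal illustration, not the implementation of the application: the function name extract_dex_files and the entry-name pattern classes.dex, classes2.dex and so on are assumptions made for the example.

```python
import re
import zipfile

# Assumed naming convention: dex entries sit at the archive root and are
# called classes.dex, classes2.dex, ...
_DEX_NAME = re.compile(r"classes\d*\.dex")


def extract_dex_files(apk):
    """Return {entry name: raw bytes} for every dex entry in an APK.

    `apk` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(apk) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if _DEX_NAME.fullmatch(name)
        }
```

Reading the entries directly from the archive avoids unpacking untrusted samples to disk, which is a design choice of this sketch rather than something the source prescribes.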
S102: Decompile the first virtual machine executable file to obtain a decompiled first function information structure, and extract a first function instruction sequence from the decompiled first function information structure. In this embodiment, the first function instruction sequence is the instruction sequence formed by the first n characters of each line in the machine code field contained in the first function information structure.
There are several ways to decompile (that is, disassemble) a dex file.
The first way is to parse the dex file according to the dex file format to obtain the function information structure of each class, and then to determine the position and size of each function in the dex file from the fields in the function information structure, which yields the decompiled function information structure. Specifically, parsing the function information structure gives the bytecode array field, which indicates the position of the function in the dex file, and the list length field, which indicates the size of the function, so that the position and size of the function can be determined.
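As a rough illustration of the first way, the sketch below reads the bytecode of one function from the code_item structure of the standard dex format, in which a 16-byte header ends with insns_size (the length in 16-bit code units) and is followed by the instruction array. The function name read_function_bytecode is hypothetical, and the offset code_off is assumed to have been obtained already from the class data, which is not shown here.

```python
import struct

# code_item header: registers_size, ins_size, outs_size, tries_size (uint16 each),
# then debug_info_off and insns_size (uint32 each), all little-endian.
_CODE_ITEM_HEADER = struct.Struct("<HHHHII")


def read_function_bytecode(dex_bytes, code_off):
    """Return the raw bytecode of the function whose code_item starts at code_off."""
    header = _CODE_ITEM_HEADER.unpack_from(dex_bytes, code_off)
    insns_size = header[5]                      # size field: number of 16-bit code units
    start = code_off + _CODE_ITEM_HEADER.size   # position of the bytecode array
    end = start + 2 * insns_size
    if end > len(dex_bytes):
        raise ValueError("code_item extends past the end of the file")
    return dex_bytes[start:end]
```

The bounds check matters for malware samples, since a malformed or deliberately corrupted dex file may declare a size that runs past the end of the data.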
The second way is to use a dex decompiling tool to decompile the dex file into virtual machine bytecode.
As described above, the Dalvik virtual machine runs Dalvik bytecode, which exists in the form of a dex (Dalvik Executable) file; the Dalvik virtual machine executes code by interpreting the dex file. Several existing tools can disassemble a dex file into Dalvik assembly code. Such dex decompiling tools include: baksmali, Dedexer1.26, dexdump, dexinspecto03-12-12r, IDA Pro, androguard, dex2jar, 010Editor, etc.
Decompiling a dex file thus yields all of its decompiled function information structures. A function information structure contains the executable code of the function, which in the embodiments of the application consists of a virtual machine instruction sequence and a virtual machine mnemonic sequence. In the example below, the function information structure consists of the instruction sequence of the Dalvik VM and the mnemonic sequence of the Dalvik VM.
For example, Fig. 2 shows a function information structure obtained by decompiling a dex file in an embodiment of the application. The dex file is decompiled into the instruction sequence of the Dalvik VM and the mnemonic sequence of the Dalvik VM.
In the example of Fig. 2, the first 2 characters of each line in the machine code field of the decompiled function information structure make up the instruction sequence (circled on the left of the example), and the part corresponding to the instruction sequence is the mnemonic (on the right of the example, partly circled rather than fully selected). Mnemonics mainly make it easier for users to communicate and to write code. In this example, decompiling the dex file gives the following instruction sequence for the function: "12 54 38 71 0c 6e 0c 6e 0a 38 54 54 6e 0c 6e 54 6e 0c 6e 0c 38 72 0a 39 12 38 54 6e 54 71 0e 01 28 54 13 6e".
The mnemonic sequence is:
"const/4 iget-object if-eqz invoke-static move-result-object invoke-virtual move-result-object invoke-virtual move-result if-eqz iget-object iget-object invoke-virtual move-result-object invoke-virtual iget-object invoke-virtual move-result-object invoke-virtual move-result-object if-eqz invoke-interface move-result if-nez const/4 if-eqz iget-object invoke-virtual iget-object invoke-static return-void move goto iget-object const/16 invoke-virtual".
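The extraction of the two sequences from a decompiled listing can be sketched as follows. Since Fig. 2 is not reproduced here, the line layout is an assumption made purely for illustration: each line holds the machine code field, a "|" separator, and the code field whose first token is the mnemonic. The function name extract_sequences and the parameter n are likewise hypothetical.

```python
def extract_sequences(listing_lines, n=2):
    """Build the instruction sequence and the mnemonic sequence of one function.

    Assumed line layout (hypothetical): "<machine code> | <mnemonic> <operands>".
    The instruction sequence takes the first n characters of each machine code
    field; the mnemonic sequence takes the first token of each code field.
    """
    instruction_parts = []
    mnemonics = []
    for line in listing_lines:
        if "|" not in line:
            continue  # skip blank lines and anything that is not an instruction
        machine_code, code = line.split("|", 1)
        machine_code = "".join(machine_code.split())  # drop internal whitespace
        code_fields = code.split()
        if not machine_code or not code_fields:
            continue
        instruction_parts.append(machine_code[:n])
        mnemonics.append(code_fields[0])
    return " ".join(instruction_parts), " ".join(mnemonics)


listing = [
    "1201      | const/4 v1, 0",
    "5423 0a00 | iget-object v3, v2, Lfoo;->bar",
    "3803 0500 | if-eqz v3, 0005",
    "0e00      | return-void",
]
instructions, mnemonics = extract_sequences(listing)
print(instructions)  # 12 54 38 0e
print(mnemonics)     # const/4 iget-object if-eqz return-void
```

With n = 2 each extracted fragment is the opcode byte of the instruction, which is why operands such as register numbers and field references do not affect the comparison.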
S103: Obtain a second virtual machine executable file of a second application program sample in the set of application program samples to be classified.
S104: Decompile the second virtual machine executable file to obtain a decompiled second function information structure, and extract a second function instruction sequence from the decompiled second function information structure. In this embodiment, the second function instruction sequence is the instruction sequence formed by the first n characters of each line in the machine code field contained in the second function information structure.
For the details of steps S103 and S104, refer to steps S101 and S102 above.
S105: Determine the edit distance between the first function instruction sequence and the second function instruction sequence.
The first application program sample and the second application program sample are any two samples in the set Q that have not yet been classified. In practice, when the first sample of the set Q is classified (before any category exists), a new category A can be created and the first sample placed in it. When the second sample of the set Q is classified, it is judged whether the second sample is of the same kind as the first sample. If so, the second sample is placed in category A with the first sample; if not, a new category B is created and the second sample is placed in it, and so on.
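The category-building procedure described above can be sketched as a simple greedy loop. This is only one possible reading: comparing each new sample against the first member of each existing category is an assumption, and the names classify, distance and threshold are hypothetical. The distance and threshold functions are passed in so that the sketch does not depend on any particular edit distance implementation.

```python
def classify(samples, distance, threshold):
    """Greedily group samples into categories.

    distance(a, b)  -> number, e.g. the edit distance between two sequences
    threshold(a, b) -> number, the preset threshold for that pair
    A sample joins the first category whose representative (its first member,
    an assumption of this sketch) is closer than the threshold; otherwise it
    starts a new category.
    """
    categories = []
    for sample in samples:
        for category in categories:
            representative = category[0]
            if distance(sample, representative) < threshold(sample, representative):
                category.append(sample)
                break
        else:
            # No existing category was close enough: create a new one.
            categories.append([sample])
    return categories
```

Because the result depends on the order in which samples are visited and on which member represents a category, a production system might compare against several members or re-cluster periodically; the source does not specify this.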
S106: Judge whether the edit distance is less than a preset threshold.
In the embodiments of the application, the similarity of the first and second function instruction sequences is determined by computing their edit distance. The edit distance (Edit Distance), also known as the Levenshtein distance, is the minimum number of edit operations needed to change one of two strings into the other. For example, to compute the edit distance between cafe and coffee, cafe is transformed into coffee as follows: cafe → caffe → coffe → coffee, which gives an edit distance of 3. In general, the smaller the edit distance between two function instruction sequences, the more similar they are, and the more likely it is that the code of the first and second application program samples to be classified belongs to the same family or category.
S107: If the edit distance is less than the preset threshold, classify the first application program sample and the second application program sample into the same category.
For example, the first function instruction sequence obtained in step S102 is: "12 54 38 71 0c 6e 0c 6e 0a 38 54 54 6e 0c 6e 54 6e 0c 6e 0c 38 72 0a 39 12 38 54 6e 54 71 0e 01 28 54 13 6e".
The second function instruction sequence obtained in step S104 is: "12 38 54 71 0c 6e 0c 6e 0a 38 54 54 6e 0c 6e 54 6e 0c 6e 0c 38 72 0a 39 12 38 54 6e 54 71 0e 01 28 54 13 6e".
Computing the edit distance between the first and second function instruction sequences gives an edit distance of 4. Assuming the preset threshold is 5, comparison shows that the edit distance between the two instruction sequences is less than the preset threshold. It can therefore be determined that the similarity of the first and second function instruction sequences meets the preset requirement, that is, the first and second application program samples belong to the same family or category.
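The edit distance computation can be sketched with the usual dynamic-programming formulation of the Levenshtein distance. The function name edit_distance is an assumption, and so is the reading of the second example sequence as beginning "12 38 54 71". Under that reading, comparing the two sequences character by character with the spaces removed reproduces the value 4 given in the text, whereas comparing them opcode by opcode gives 2; which granularity the source intends is inferred here, not stated.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, using two rows of the DP table."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


common_tail = ("71 0c 6e 0c 6e 0a 38 54 54 6e 0c 6e 54 6e 0c 6e 0c 38 72 0a 39 "
               "12 38 54 6e 54 71 0e 01 28 54 13 6e")
first = "12 54 38 " + common_tail
second = "12 38 54 " + common_tail  # assumed reading of the second sequence

print(edit_distance("cafe", "coffee"))                                  # 3
print(edit_distance(first.replace(" ", ""), second.replace(" ", "")))   # 4
print(edit_distance(first.split(), second.split()))                     # 2

preset_threshold = 5  # value assumed in the example above
same_category = edit_distance(first.replace(" ", ""),
                              second.replace(" ", "")) < preset_threshold
print(same_category)  # True
```

Keeping only two rows of the table holds memory proportional to the shorter dimension, which helps when function instruction sequences are long.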
In the embodiments of the application, to determine the preset threshold more accurately and thereby improve the accuracy of sample classification, the method may further include, before step S106:
determining the total number of characters of the first function instruction sequence or of the second function instruction sequence;
taking the product of the total number of characters and a preset value as the preset threshold, where the preset value is between 0 and 1.
For example, the total number of characters of the first function instruction sequence obtained above, "12 54 38 71 0c 6e 0c 6e 0a 38 54 54 6e 0c 6e 54 6e 0c 6e 0c 38 72 0a 39 12 38 54 6e 54 71 0e 01 28 54 13 6e", or of the second function instruction sequence, "12 38 54 71 0c 6e 0c 6e 0a 38 54 54 6e 0c 6e 54 6e 0c 6e 0c 38 72 0a 39 12 38 54 6e 54 71 0e 01 28 54 13 6e", is determined to be 72. The preset value may then be set to 0.05 (between 0 and 1), so that the preset threshold is finally determined as 72*0.05≈4. The preset value may also be an empirical value. Through these steps, application program samples whose function instruction sequences are at least 95% similar can be classified into the same category.
In another embodiment of the application, the method may further include, before step S107: determining the sum of the numbers of characters of the first function instruction sequence and the second function instruction sequence; and taking the product of that sum and a preset value as the preset threshold, where the preset value is between 0 and 1. For example, if the sum of the numbers of characters of the first and second function instruction sequences is 144, the preset threshold is determined as the product of that sum and the preset value.
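Both ways of deriving the preset threshold can be sketched as below. The function names are hypothetical, characters are counted with the spaces between opcodes excluded (which matches the total of 72 for 36 two-character opcodes), and rounding to the nearest integer is an assumption based on the "72*0.05≈4" example; the source does not say how fractional products are handled.

```python
def char_total(sequence):
    """Number of characters in a sequence, not counting separating spaces."""
    return len(sequence.replace(" ", ""))


def _check_preset_value(preset_value):
    if not 0 <= preset_value <= 1:
        raise ValueError("the preset value must lie between 0 and 1")


def threshold_from_one(sequence, preset_value):
    """Threshold = character total of one sequence * preset value."""
    _check_preset_value(preset_value)
    return round(char_total(sequence) * preset_value)


def threshold_from_both(first, second, preset_value):
    """Threshold = sum of both character totals * preset value."""
    _check_preset_value(preset_value)
    return round((char_total(first) + char_total(second)) * preset_value)


sequence = " ".join(["6e"] * 36)             # stand-in with 36 opcodes, 72 characters
print(threshold_from_one(sequence, 0.05))    # 4
print(threshold_from_both(sequence, sequence, 0.05))  # 7
```

Scaling the threshold with sequence length keeps the tolerated difference proportional to function size, so that long functions are not held to the same absolute limit as short ones.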
It should be noted that the application does not limit which malicious code protection scheme is used to detect malicious code. For example, the sample-feature scanning (signature scanning) described above, virtual-machine-based scanning or heuristic scanning may be used, and similar samples may also be clustered. Nor is the matching algorithm limited; for example, the fuzzy matching algorithm or similarity matching algorithm described above may be used.
Fig. 3 is a flowchart of a method for classifying application program samples provided in another embodiment of the present application. The method includes:
S201: Obtain a first virtual machine executable file of a first application program sample in a set of application program samples to be classified;
S202: Decompile the first virtual machine executable file to obtain a decompiled first function information structure, and extract a first mnemonic sequence from the decompiled first function information structure. In this embodiment, the first mnemonic sequence is the sequence formed by the function characters in the code field of the first function information structure.
S203: Obtain a second virtual machine executable file of a second application program sample in the set of application program samples to be classified;
S204: Decompile the second virtual machine executable file to obtain a decompiled second function information structure, and extract a second mnemonic sequence from the decompiled second function information structure. The second mnemonic sequence is the sequence formed by the function characters in the code field of the second function information structure.
S205: Determine the edit distance between the first mnemonic sequence and the second mnemonic sequence;
S206: Judge whether the edit distance is less than a predetermined threshold;
S207: If the edit distance is less than the predetermined threshold, place the first application program sample and the second application program sample in the same category.
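Steps S202 to S207 can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the method as implemented by the applicant: the disassembly line layout ("mnemonic operands"), the sample lines, and the names `extract_mnemonics`, `edit_distance`, and `same_category` are all hypothetical. The edit distance shown is the standard Levenshtein distance, computed over the characters of the space-joined mnemonic sequence so that the threshold can be taken from its character total.

```python
from typing import List, Sequence


def extract_mnemonics(code_lines: Sequence[str]) -> List[str]:
    """Take the first whitespace-delimited token of each non-empty line as its mnemonic.

    The "mnemonic operands" line layout is an assumption made for illustration.
    """
    return [line.split()[0] for line in code_lines if line.strip()]


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete x
                curr[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),   # substitute, or keep when equal
            ))
        prev = curr
    return prev[-1]


def same_category(first_lines: Sequence[str], second_lines: Sequence[str],
                  preset: float = 0.05) -> bool:
    """S205-S207: compare two mnemonic sequences against a length-based threshold."""
    first = " ".join(extract_mnemonics(first_lines))
    second = " ".join(extract_mnemonics(second_lines))
    threshold = len(first) * preset  # character total of the first sequence x preset value
    return edit_distance(first, second) < threshold


# Hypothetical decompiled code-field lines for two samples.
sample_a = ["const/4 v0, 0", "if-eqz v1, :cond_0", "iget-object v0, p0, La;->b:I",
            "invoke-static {v0}, La;->c()V", "return-void"]
sample_b = ["const/4 v3, 1", "if-eqz v2, :cond_7", "iget-object v3, p0, Lx;->y:I",
            "invoke-static {v3}, Lx;->z()V", "return-void"]

print(same_category(sample_a, sample_b))
```

Because only mnemonics are compared, the two samples above are treated as similar even though their registers and referenced names differ.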
In this embodiment, in line with the description above, before the step of judging whether the edit distance is less than the predetermined threshold, the method further includes: determining the total number of characters of the first mnemonic sequence or the second mnemonic sequence; and determining the product of the total number of characters and a preset value as the predetermined threshold, where the preset value is between 0 and 1. This process can improve the accuracy of sample classification.
As can be seen, in the methods provided by the above embodiments, the virtual machine executable files of the first and second application program samples in the set of application program samples to be classified are analyzed and decompiled to obtain the first and second function instruction sequences (or mnemonic sequences) corresponding to the first and second application program samples. An edit distance algorithm is then used to determine the edit distance between the first and second function instruction sequences (or mnemonic sequences), and it is judged whether this edit distance is less than the predetermined threshold; if it is, the first and second application program samples are placed in the same category. In this way, application program samples in the set to be classified that are highly similar (edit distance less than the predetermined threshold) are placed in the same class to obtain family samples, so that the samples in the set are classified automatically and classification efficiency is improved.
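One way to extend the pairwise rule to a whole sample set is sketched below. The text only specifies the pairwise comparison; merging linked pairs transitively into families with a union-find structure is an assumption made here, as are the names `group_samples`, `app_a`, `app_b`, and `app_c` and the choice to take the threshold from the first sequence of each pair.

```python
from itertools import combinations
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def group_samples(sequences: Dict[str, str], preset: float = 0.05) -> List[List[str]]:
    """Group sample names that are linked by a below-threshold edit distance."""
    parent = {name: name for name in sequences}

    def find(name: str) -> str:
        # Follow parent links to the representative, halving the path on the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(sequences, 2):
        threshold = len(sequences[a]) * preset
        if edit_distance(sequences[a], sequences[b]) < threshold:
            parent[find(a)] = find(b)  # same category: merge the two families

    families: Dict[str, List[str]] = {}
    for name in sequences:
        families.setdefault(find(name), []).append(name)
    return list(families.values())


# The 72-character function instruction sequence from the earlier example, spaces removed.
base = ("12 38 54 71 0c 6e 0c 6e 0a 38 54 54 6e 0c 6e 54 6e 0c 6e 0c "
        "38 72 0a 39 12 38 54 6e 54 71 0e 01 28 54 13 6e").replace(" ", "")
samples = {
    "app_a": base,
    "app_b": base[:-2] + "0e",  # same sequence with the last value changed
    "app_c": "0e" * 36,         # an unrelated sequence of the same length
}
print(group_samples(samples))
```

Comparing every pair costs time quadratic in the number of samples, which is acceptable for a sketch but would need pruning or indexing for a large sample set.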
Fig. 4 is a block diagram of a device for classifying application program samples provided in an embodiment of the present application. The function of each unit in the device is similar to the function of the corresponding step in the above method, so for details of the device, refer to the above method embodiments. The device includes:
A first acquisition unit 401, configured to obtain a first virtual machine executable file of a first application program sample in a set of application program samples to be classified;
A first extraction unit 402, configured to decompile the first virtual machine executable file to obtain a decompiled first function information structure, and to extract a first function instruction sequence from the decompiled first function information structure;
A second acquisition unit 403, configured to obtain a second virtual machine executable file of a second application program sample in the set of application program samples to be classified;
A second extraction unit 404, configured to decompile the second virtual machine executable file to obtain a decompiled second function information structure, and to extract a second function instruction sequence from the decompiled second function information structure;
A determining unit 405, configured to determine the edit distance between the first function instruction sequence and the second function instruction sequence;
A judging unit 406, configured to judge whether the edit distance is less than a predetermined threshold;
A classification unit 407, configured to place the first application program sample and the second application program sample in the same category when the edit distance is less than the predetermined threshold.
In an embodiment of the present application, to determine the predetermined threshold more accurately and thereby improve the accuracy of sample classification, the device further includes:
A predetermined threshold determining unit, configured to, before the first application program sample and the second application program sample are placed in the same category, determine the total number of characters of the first function instruction sequence or the second function instruction sequence, and determine the product of the total number of characters and a preset value as the predetermined threshold, where the preset value is between 0 and 1.
In an embodiment of the present application, to determine the predetermined threshold more accurately and thereby improve the accuracy of sample classification, the device further includes:
A predetermined threshold determining unit, configured to, before the first application program sample and the second application program sample are placed in the same category, determine the sum of the character counts of the first function instruction sequence and the second function instruction sequence, and determine the product of this sum and a preset value as the predetermined threshold, where the preset value is between 0 and 1.
In an embodiment of the present application, the first extraction unit 402 is specifically configured to:
Parse the first virtual machine executable file according to the virtual machine executable file format to obtain the function information structure of each class; and, according to the fields in the function information structure, determine the position and size of the functions of the first virtual machine executable file to obtain the decompiled first function information structure;
The second extraction unit 404 is specifically configured to:
Parse the second virtual machine executable file according to the virtual machine executable file format to obtain the function information structure of each class; and, according to the fields in the function information structure, determine the position and size of the functions of the second virtual machine executable file to obtain the decompiled second function information structure.
As an alternative embodiment, the above device for classifying application program samples includes:
A first acquisition unit 401, configured to obtain a first virtual machine executable file of a first application program sample in a set of application program samples to be classified;
A first extraction unit 402, configured to decompile the first virtual machine executable file to obtain a decompiled first function information structure, and to extract a first mnemonic sequence from the decompiled first function information structure;
A second acquisition unit 403, configured to obtain a second virtual machine executable file of a second application program sample in the set of application program samples to be classified;
A second extraction unit 404, configured to decompile the second virtual machine executable file to obtain a decompiled second function information structure, and to extract a second mnemonic sequence from the decompiled second function information structure;
A determining unit 405, configured to determine the edit distance between the first mnemonic sequence and the second mnemonic sequence;
A judging unit 406, configured to judge whether the edit distance is less than a predetermined threshold;
A classification unit 407, configured to place the first application program sample and the second application program sample in the same category when the edit distance is less than the predetermined threshold.
In an embodiment of the present application, to determine the predetermined threshold more accurately and thereby improve the accuracy of sample classification, the device further includes:
A predetermined threshold determining unit, configured to, before it is judged whether the edit distance is less than the predetermined threshold, determine the total number of characters of the first mnemonic sequence or the second mnemonic sequence, and determine the product of the total number of characters and a preset value as the predetermined threshold, where the preset value is between 0 and 1.
In the devices provided by the above embodiments, the virtual machine executable files of the first and second application program samples in the set of application program samples to be classified are analyzed and decompiled to obtain the first and second function instruction sequences (or mnemonic sequences) corresponding to the first and second application program samples. An edit distance algorithm is then used to determine the edit distance between the first and second function instruction sequences (or mnemonic sequences), and it is judged whether this edit distance is less than the predetermined threshold; if it is, the first and second application program samples are placed in the same category. In this way, application program samples in the set to be classified that are highly similar (edit distance less than the predetermined threshold) are placed in the same class to obtain family samples, so that the samples in the set are classified automatically and classification efficiency is improved.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in computer-readable media, in the form of random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The foregoing is merely embodiments of the present application and is not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims (10)

1. A method for classifying application program samples, characterized by comprising:
Obtaining a first virtual machine executable file of a first application program sample in a set of application program samples to be classified;
Decompiling the first virtual machine executable file to obtain a decompiled first function information structure, and extracting a first function instruction sequence from the decompiled first function information structure;
Obtaining a second virtual machine executable file of a second application program sample in the set of application program samples to be classified;
Decompiling the second virtual machine executable file to obtain a decompiled second function information structure, and extracting a second function instruction sequence from the decompiled second function information structure;
Determining the edit distance between the first function instruction sequence and the second function instruction sequence;
Judging whether the edit distance is less than a predetermined threshold;
If so, placing the first application program sample and the second application program sample in the same category.
2. The method of claim 1, characterized in that the first function instruction sequence is the sequence formed by the first n characters of each line in the machine code field of the first function information structure, and the second function instruction sequence is the sequence formed by the first n characters of each line in the machine code field of the second function information structure.
3. The method of claim 1, characterized in that, before judging whether the edit distance is less than the predetermined threshold, the method further comprises:
Determining the total number of characters of the first function instruction sequence or the second function instruction sequence;
Determining the product of the total number of characters and a preset value as the predetermined threshold, wherein the preset value is between 0 and 1.
4. The method of claim 1, characterized in that, before judging whether the edit distance is less than the predetermined threshold, the method further comprises:
Determining the sum of the character counts of the first function instruction sequence and the second function instruction sequence;
Determining the product of the sum of the character counts and a preset value as the predetermined threshold, wherein the preset value is between 0 and 1.
5. A method for classifying application program samples, characterized by comprising:
Obtaining a first virtual machine executable file of a first application program sample in a set of application program samples to be classified;
Decompiling the first virtual machine executable file to obtain a decompiled first function information structure, and extracting a first mnemonic sequence from the decompiled first function information structure;
Obtaining a second virtual machine executable file of a second application program sample in the set of application program samples to be classified;
Decompiling the second virtual machine executable file to obtain a decompiled second function information structure, and extracting a second mnemonic sequence from the decompiled second function information structure;
Determining the edit distance between the first mnemonic sequence and the second mnemonic sequence;
Judging whether the edit distance is less than a predetermined threshold;
If so, placing the first application program sample and the second application program sample in the same category.
6. The method of claim 5, characterized in that the first mnemonic sequence is the sequence formed by the function characters in the code field of the first function information structure, and the second mnemonic sequence is the sequence formed by the function characters in the code field of the second function information structure.
7. The method of claim 5, characterized in that, before judging whether the edit distance is less than the predetermined threshold, the method further comprises:
Determining the total number of characters of the first mnemonic sequence or the second mnemonic sequence;
Determining the product of the total number of characters and a preset value as the predetermined threshold, wherein the preset value is between 0 and 1.
8. A device for classifying application program samples, characterized by comprising:
A first acquisition unit, configured to obtain a first virtual machine executable file of a first application program sample in a set of application program samples to be classified;
A first extraction unit, configured to decompile the first virtual machine executable file to obtain a decompiled first function information structure, and to extract a first function instruction sequence from the decompiled first function information structure;
A second acquisition unit, configured to obtain a second virtual machine executable file of a second application program sample in the set of application program samples to be classified;
A second extraction unit, configured to decompile the second virtual machine executable file to obtain a decompiled second function information structure, and to extract a second function instruction sequence from the decompiled second function information structure;
A determining unit, configured to determine the edit distance between the first function instruction sequence and the second function instruction sequence;
A judging unit, configured to judge whether the edit distance is less than a predetermined threshold;
A classification unit, configured to place the first application program sample and the second application program sample in the same category when the edit distance is less than the predetermined threshold.
9. The device of claim 8, characterized in that the first function instruction sequence is the sequence formed by the first n characters of each line in the machine code field of the first function information structure, and the second function instruction sequence is the sequence formed by the first n characters of each line in the machine code field of the second function information structure.
10. The device of claim 8, characterized in that the device further comprises:
A predetermined threshold determining unit, configured to, before it is judged whether the edit distance is less than the predetermined threshold, determine the total number of characters of the first function instruction sequence or the second function instruction sequence, and determine the product of the total number of characters and a preset value as the predetermined threshold, wherein the preset value is between 0 and 1.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510971488.7A CN106909844A (en) 2015-12-22 2015-12-22 Method and device for classifying application program samples

Publications (1)

Publication Number Publication Date
CN106909844A true CN106909844A (en) 2017-06-30

Family

ID=59201066

Country Status (1)

Country Link
CN (1) CN106909844A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170630)