CN106909844A - Method and device for classifying application program samples - Google Patents
- Publication number
- CN106909844A (application CN201510971488.7A)
- Authority
- CN
- China
- Prior art keywords
- application program
- function
- decompiling
- sequence
- program sample
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/561—Virus type analysis
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Health & Medical Sciences (AREA)
- Virology (AREA)
- Computer Hardware Design (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Stored Programmes (AREA)
Abstract
This application discloses a method and device for classifying application program samples, so that samples can be classified automatically. The method includes: obtaining a first virtual machine executable file of a first application program sample in a set of application program samples to be classified; decompiling the first virtual machine executable file to obtain a decompiled first function information structure, and extracting a first function instruction sequence from it; obtaining a second virtual machine executable file of a second application program sample in the set; decompiling the second virtual machine executable file to obtain a decompiled second function information structure, and extracting a second function instruction sequence from it; determining the edit distance between the first and second function instruction sequences; judging whether the edit distance is less than a preset threshold; and, if so, assigning the first and second application program samples to the same category.
Description
Technical field
This application relates to the field of intelligent terminal security, and in particular to a method and device for classifying application program samples.
Background
As technology develops, intelligent terminals offer more and more functions. For example, mobile phones have evolved from traditional GSM and TDMA digital handsets into smartphones that can process multimedia resources and provide information services such as web browsing, teleconferencing, and e-commerce. At the same time, increasingly varied malicious code attacks on mobile phones and increasingly serious personal data security problems have followed, and more and more smartphone users suffer from mobile phone viruses.
At present, to identify mobile phone viruses more efficiently, the industry classifies large numbers of application program samples in advance according to the similarity between applications, obtaining families that each consist of highly similar samples. Then, when virus applications are being identified, an application found to belong to the family of a virus application can be directly determined to be a virus sample. In the prior art, large numbers of application program samples are generally classified by manual screening; as the number of applications on mobile platforms grows, manual classification becomes inefficient.
Summary of the invention
The embodiments of this application provide a method and device for classifying application program samples that overcome, or at least partially solve, the above problem.
The embodiments of this application adopt the following technical solutions:
A method for classifying application program samples, including:
Obtain the first virtual machine executable file of the first application program sample in the set of application program samples to be classified;
Decompile the first virtual machine executable file to obtain the decompiled first function information structure, and extract the first function instruction sequence from the decompiled first function information structure;
Obtain the second virtual machine executable file of the second application program sample in the set of application program samples to be classified;
Decompile the second virtual machine executable file to obtain the decompiled second function information structure, and extract the second function instruction sequence from the decompiled second function information structure;
Determine the edit distance between the first function instruction sequence and the second function instruction sequence;
Judge whether the edit distance is less than a preset threshold;
If so, assign the first application program sample and the second application program sample to the same category.
Preferably, the first function instruction sequence is the sequence formed by the first n characters of each line of the machine code field contained in the first function information structure, and the second function instruction sequence is the sequence formed by the first n characters of each line of the machine code field contained in the second function information structure.
Preferably, before judging whether the edit distance is less than the preset threshold, the method further includes:
Determine the total number of characters in the first function instruction sequence or in the second function instruction sequence;
Take the product of that total and a preset value as the preset threshold, where the preset value is between 0 and 1.
Preferably, before judging whether the edit distance is less than the preset threshold, the method further includes:
Determine the sum of the character counts of the first function instruction sequence and the second function instruction sequence;
Take the product of that sum and a preset value as the preset threshold, where the preset value is between 0 and 1.
A method for classifying application program samples, including:
Obtain the first virtual machine executable file of the first application program sample in the set of application program samples to be classified;
Decompile the first virtual machine executable file to obtain the decompiled first function information structure, and extract the first mnemonic sequence from the decompiled first function information structure;
Obtain the second virtual machine executable file of the second application program sample in the set of application program samples to be classified;
Decompile the second virtual machine executable file to obtain the decompiled second function information structure, and extract the second mnemonic sequence from the decompiled second function information structure;
Determine the edit distance between the first mnemonic sequence and the second mnemonic sequence;
Judge whether the edit distance is less than a preset threshold;
If so, assign the first application program sample and the second application program sample to the same category.
Preferably, the first mnemonic sequence is the sequence formed by the function characters in the code field contained in the first function information structure, and the second mnemonic sequence is the sequence formed by the function characters in the code field contained in the second function information structure.
Preferably, before judging whether the edit distance is less than the preset threshold, the method further includes:
Determine the total number of characters in the first mnemonic sequence or in the second mnemonic sequence;
Take the product of that total and a preset value as the preset threshold, where the preset value is between 0 and 1.
A device for classifying application program samples, including:
First acquisition unit, configured to obtain the first virtual machine executable file of the first application program sample in the set of application program samples to be classified;
First extraction unit, configured to decompile the first virtual machine executable file to obtain the decompiled first function information structure, and to extract the first function instruction sequence from the decompiled first function information structure;
Second acquisition unit, configured to obtain the second virtual machine executable file of the second application program sample in the set of application program samples to be classified;
Second extraction unit, configured to decompile the second virtual machine executable file to obtain the decompiled second function information structure, and to extract the second function instruction sequence from the decompiled second function information structure;
Determining unit, configured to determine the edit distance between the first function instruction sequence and the second function instruction sequence;
Judging unit, configured to judge whether the edit distance is less than a preset threshold;
Classification unit, configured to assign the first application program sample and the second application program sample to the same category when the edit distance is less than the preset threshold.
Preferably, the first function instruction sequence is the sequence formed by the first n characters of each line of the machine code field contained in the first function information structure, and the second function instruction sequence is the sequence formed by the first n characters of each line of the machine code field contained in the second function information structure.
Preferably, the device further includes:
Preset threshold determining unit, configured to, before it is judged whether the edit distance is less than the preset threshold, determine the total number of characters in the first function instruction sequence or in the second function instruction sequence, and take the product of that total and a preset value as the preset threshold, where the preset value is between 0 and 1.
Preferably, the device further includes:
Preset threshold determining unit, configured to, before it is judged whether the edit distance is less than the preset threshold, determine the sum of the character counts of the first function instruction sequence and the second function instruction sequence, and take the product of that sum and a preset value as the preset threshold, where the preset value is between 0 and 1.
A device for classifying application program samples, including:
First acquisition unit, configured to obtain the first virtual machine executable file of the first application program sample in the set of application program samples to be classified;
First extraction unit, configured to decompile the first virtual machine executable file to obtain the decompiled first function information structure, and to extract the first mnemonic sequence from the decompiled first function information structure;
Second acquisition unit, configured to obtain the second virtual machine executable file of the second application program sample in the set of application program samples to be classified;
Second extraction unit, configured to decompile the second virtual machine executable file to obtain the decompiled second function information structure, and to extract the second mnemonic sequence from the decompiled second function information structure;
Determining unit, configured to determine the edit distance between the first mnemonic sequence and the second mnemonic sequence;
Judging unit, configured to judge whether the edit distance is less than a preset threshold;
Classification unit, configured to assign the first application program sample and the second application program sample to the same category when the edit distance is less than the preset threshold.
Preferably, the first mnemonic sequence is the sequence formed by the function characters in the code field contained in the first function information structure, and the second mnemonic sequence is the sequence formed by the function characters in the code field contained in the second function information structure.
Preferably, the device further includes:
Preset threshold determining unit, configured to, before it is judged whether the edit distance is less than the preset threshold, determine the total number of characters in the first mnemonic sequence or in the second mnemonic sequence, and take the product of that total and a preset value as the preset threshold, where the preset value is between 0 and 1.
At least one of the above technical solutions adopted by the embodiments of this application can achieve the following beneficial effects:
The virtual machine executable files of the first and second application program samples in the set of application program samples to be classified are analyzed and decompiled to obtain the corresponding first and second function instruction sequences (or mnemonic sequences). An edit distance algorithm then determines the edit distance between the first and second function instruction sequences (or mnemonic sequences), and the edit distance is compared with a preset threshold; when it is less than the threshold, the first and second application program samples are assigned to the same category. In this way, application program samples in the set whose similarity is high (edit distance less than the preset threshold) are grouped into the same class, yielding family samples. The application program samples in the set are thus classified automatically, which improves classification efficiency.
Brief description of the drawings
The drawings described here provide a further understanding of this application and form a part of it. The illustrative embodiments of this application and their descriptions are used to explain this application and do not unduly limit it. In the drawings:
Fig. 1 is a flow chart of the method for classifying application program samples provided in one embodiment of this application;
Fig. 2 is an example of a function information structure obtained by decompiling a dex file in an embodiment of this application;
Fig. 3 is a flow chart of the method for classifying application program samples provided in another embodiment of this application;
Fig. 4 is a module diagram of the device for classifying application program samples provided in one embodiment of this application.
Detailed description
To make the purpose, technical solutions, and advantages of this application clearer, the technical solutions of this application are described clearly and completely below with reference to specific embodiments and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort fall within the scope of protection of this application.
Taking the Android operating system as an example, it includes an application layer (app layer) and a system framework layer (framework layer); other layers that a functional division might include are not discussed in this application. The app layer can usually be understood as the upper layer, responsible for the interface that interacts with the user, for example maintaining the application and, when a page is clicked, recognizing the type of content clicked so as to display the appropriate context menu. The framework layer usually serves as the middle layer. Its main responsibility is to forward user requests obtained by the app layer, such as starting an application, clicking a link, or clicking to save a picture, down to the lower layer, and to distribute content processed by the lower layer back to the upper layer, either by message or through intermediate proxy classes, for display to the user.
Dalvik is the Java virtual machine used on the Android platform. Dalvik is optimized so that multiple virtual machine instances can run simultaneously in limited memory, and each Dalvik application executes as a separate Linux process. Separate processes prevent all programs from being closed when one virtual machine crashes. The Dalvik virtual machine supports running Java applications that have been converted to the dex (Dalvik Executable) format, a compressed format designed specifically for Dalvik and suited to systems with limited memory and processor speed.
It can be seen that, in the Android system, a dex file is a virtual machine executable file that can be loaded and run directly in the Dalvik virtual machine (Dalvik VM). With ADT (Android Development Tools), Java source code can be converted into a dex file through a complex compilation process. A dex file is the result of optimization for embedded systems: the instruction code of the Dalvik virtual machine is not standard Java virtual machine instruction code but uses its own instruction set. A dex file shares many class names and constant strings, which makes it smaller and more efficient to run.
The technical solutions provided by the embodiments of this application are described in detail below with reference to the drawings.
Fig. 1 is the flow of the method for classifying application program samples provided in one embodiment of this application, including:
S101: Obtain the first virtual machine executable file of the first application program sample in the set of application program samples to be classified.
The purpose of this application is to automatically classify, by similarity, the application program samples to be classified in an application program sample set Q. The virtual machine executable file is, for example, a dex file. As mentioned above, the Android operating system includes an application layer (app layer) and a system framework layer (framework layer); this application focuses on the study and improvement of the app layer. However, those skilled in the art will understand that when Android starts, the Dalvik VM monitors all programs (APK files) and frameworks and creates a dependency tree for them. Using this dependency tree, the Dalvik VM optimizes the code of each program and stores it in the Dalvik cache (dalvik-cache), so all programs use the optimized code at run time. When a program (or framework library) changes, the Dalvik VM re-optimizes the code and stores it in the cache again. cache/dalvik-cache stores the dex files generated for programs on system, while data/dalvik-cache stores the dex files generated for data/app. That is, this application focuses on analyzing and processing the dex files generated for data/app, but it should be understood that the theory and operation of this application apply equally to the dex files generated for programs on system.
A dex file can be obtained by parsing the APK (Android Package, the Android installation package). An APK file is in fact a zip-format archive whose suffix has been changed to apk; after it is decompressed with UnZip, the dex file can be obtained.
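Because an APK is an ordinary zip archive, the extraction step can be sketched with Python's standard `zipfile` module. This is only an illustrative sketch, not the patent's implementation: the function name `extract_dex_files` is hypothetical, and checking the leading `dex\n` magic bytes is an added sanity check that the text does not require.

```python
import io
import zipfile

# Dex files begin with the magic bytes "dex\n" followed by a version string.
DEX_MAGIC_PREFIX = b"dex\n"


def extract_dex_files(apk_bytes):
    """Return {entry name: raw bytes} for every dex entry in an APK (zip) archive.

    Hypothetical helper: the patent only says the APK is unzipped to get the dex file.
    """
    dex_files = {}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        for name in apk.namelist():
            if not name.endswith(".dex"):
                continue
            data = apk.read(name)
            # Keep only entries that really look like dex files.
            if data.startswith(DEX_MAGIC_PREFIX):
                dex_files[name] = data
    return dex_files
```

Reading the archive from memory keeps the sketch independent of any particular file path; an APK on disk could be passed in by reading its bytes first.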
S102: Decompile the first virtual machine executable file to obtain the decompiled first function information structure, and extract the first function instruction sequence from the decompiled first function information structure. In this embodiment, the first function instruction sequence is the instruction sequence formed by the first n characters of each line of the machine code field contained in the first function information structure.
There are several ways to decompile (or disassemble) a dex file.
The first way is to parse the dex file according to the dex file format to obtain the function information structure of each class, then use the fields in the function information structure to determine the position and size of each function in the dex file, and so obtain the decompiled function information structure. Specifically, parsing the function information structure yields the bytecode array field, which indicates the position of the function in the dex file, and the list length field, which indicates the size of the function; together they determine the function's position and size.
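As a rough illustration of the first way, the published dex format stores each method body in a `code_item` whose `insns_size` field gives the length of the `insns` bytecode array in 16-bit code units. Treating these two fields as the "list length field" and "bytecode array field" mentioned above is an assumption, and the helper name `read_code_item` is hypothetical; locating the `code_item` offset (via the class data) is not shown.

```python
import struct

# code_item header: registers_size, ins_size, outs_size, tries_size (uint16 each),
# then debug_info_off and insns_size (uint32 each), all little-endian: 16 bytes.
CODE_ITEM_HEADER = struct.Struct("<HHHHII")


def read_code_item(dex, offset):
    """Return (start offset of the bytecode, raw bytecode bytes) for one code_item.

    Hypothetical helper; `offset` must already point at a code_item.
    """
    (_registers_size, _ins_size, _outs_size, _tries_size,
     _debug_info_off, insns_size) = CODE_ITEM_HEADER.unpack_from(dex, offset)
    insns_start = offset + CODE_ITEM_HEADER.size
    # insns_size counts 16-bit code units, so the byte length is twice that.
    insns_end = insns_start + insns_size * 2
    return insns_start, dex[insns_start:insns_end]
```

The returned start offset and byte length correspond to the function's position and size in the dex file.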
The second way is to use a dex decompilation tool to decompile the dex file into virtual machine bytecode.
As introduced above, the Dalvik virtual machine runs Dalvik bytecode, which exists in the form of a dex (Dalvik Executable) executable file; the Dalvik virtual machine executes code by interpreting the dex file. Several existing tools can disassemble a dex file into Dalvik assembly code. Such dex decompilation tools include: baksmali, Dedexer 1.26, dexdump, dexinspecto03-12-12r, IDA Pro, androguard, dex2jar, 010 Editor, and so on.
It can be seen that decompiling a dex file yields all of the decompiled function information structures. A function information structure contains the function's executable code, which in this embodiment consists of a virtual machine instruction sequence and a virtual machine mnemonic sequence; in the example below, the Dalvik VM instruction sequence and the Dalvik VM mnemonic sequence make up the function information structure.
For example, Fig. 2 shows a function information structure obtained by decompiling a dex file in this embodiment. The dex file is decompiled into a Dalvik VM instruction sequence and a Dalvik VM mnemonic sequence.
In the example of Fig. 2, in the machine code field of the decompiled function information structure, the first 2 digits of each line form the instruction sequence (the circled part on the left of the example), and the part corresponding to the instruction sequence is the mnemonic (on the right of the example; only some are circled, not all). Mnemonics exist mainly to make communication and code writing easier for users. In this example, decompiling the dex file yields the following instruction sequence for the function: "12 54 38 71 0c 6e 0c 6e 0a 38 54 54 6e 0c 6e 54 6e 0c 6e 0c 38 72 0a 39 12 38 54 6e 54 71 0e 01 28 54 13 6e".
The mnemonic sequence is:
"const/4 iget-object if-eqz invoke-static move-result-object invoke-virtual move-result-object invoke-virtual move-result if-eqz iget-object iget-object invoke-virtual move-result-object invoke-virtual iget-object invoke-virtual move-result-object invoke-virtual move-result-object if-eqz invoke-interface move-result if-nez const/4 if-eqz iget-object invoke-virtual iget-object invoke-static return-void move goto iget-object const/16 invoke-virtual".
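The extraction of the two sequences can be sketched as follows. Fig. 2 is not reproduced in the text, so the listing layout used here (machine code, a `|` separator, then the assembly text) is an assumption made for illustration, and `extract_sequences` is a hypothetical helper. The sketch takes the first n = 2 characters of each machine code line as the opcode and the first word of the assembly text as the mnemonic.

```python
def extract_sequences(listing_lines, n=2):
    """Return (opcode string, mnemonic list) from a disassembly listing.

    Assumed line layout (hypothetical): "<hex machine code> | <mnemonic> <operands>".
    """
    opcodes = []
    mnemonics = []
    for line in listing_lines:
        machine_code, _, assembly = line.partition("|")
        machine_code = machine_code.strip()
        assembly = assembly.strip()
        if not machine_code or not assembly:
            continue  # skip blank or malformed lines
        opcodes.append(machine_code[:n])        # first n characters of the machine code
        mnemonics.append(assembly.split()[0])   # instruction name without operands
    return "".join(opcodes), mnemonics


# Three made-up lines whose opcodes match the start of the example sequence.
listing = [
    "1203      | const/4 v3, 0",
    "5420 0100 | iget-object v0, v2, field@0001",
    "3800 0500 | if-eqz v0, +5",
]
opcode_string, mnemonic_list = extract_sequences(listing)
```

For these three lines the opcode string is "125438" and the mnemonics are const/4, iget-object, and if-eqz, which agrees with the first three entries of both sequences quoted above.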
S103: Obtain the second virtual machine executable file of the second application program sample in the set of application program samples to be classified.
S104: Decompile the second virtual machine executable file to obtain the decompiled second function information structure, and extract the second function instruction sequence from the decompiled second function information structure. In this embodiment, the second function instruction sequence is the instruction sequence formed by the first n characters of each line of the machine code field contained in the second function information structure.
For the detailed process of steps S103 and S104, refer to steps S101 and S102 above.
S105: Determine the edit distance between the first function instruction sequence and the second function instruction sequence.
The first and second application program samples are any two samples in the application program sample set Q that have not yet been classified. In practice, when the first sample of the set Q is classified (no category exists yet), a new category A can be created and the first sample assigned to it. Then, when the second sample of the set Q is classified, it is judged whether the second sample is similar to the first sample: if so, the second sample is also assigned to category A of the first sample; if not, a new category B is created and the second sample is assigned to category B, and so on.
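The incremental assignment described above can be sketched as a simple greedy loop. This is a minimal sketch under stated assumptions: `classify_samples` and the `is_similar` callback are hypothetical names, and comparing each new sample only with the first member of each existing category is an assumption, since the text does not say which member of a category is used for comparison.

```python
def classify_samples(samples, is_similar):
    """Greedily group samples; returns a list of categories (lists of samples).

    `is_similar(a, b)` is a caller-supplied predicate, for example
    "edit distance below the preset threshold".
    """
    categories = []
    for sample in samples:
        for members in categories:
            # Assumption: the first member acts as the category's representative.
            if is_similar(members[0], sample):
                members.append(sample)
                break
        else:
            # No existing category matched, so open a new one (A, B, ...).
            categories.append([sample])
    return categories
```

Passing the similarity test in as a function keeps the grouping logic separate from the choice of sequence (instruction or mnemonic) and threshold.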
S106: Judge whether the edit distance is less than the preset threshold.
In this embodiment, the similarity of the first and second function instruction sequences is determined by calculating their edit distance. The edit distance (Edit Distance), also known as the Levenshtein distance, is the minimum number of edit operations needed to change one of two strings into the other. For example, to calculate the edit distance between cafe and coffee, cafe is converted into coffee as follows: cafe → caffe → coffe → coffee, so the edit distance is 3. In general, the smaller the edit distance between two function instruction sequences, the more similar they are, and the more likely it is that the code of the first and second application program samples to be classified belongs to the same family or category.
S107: If the edit distance is less than the preset threshold, assign the first application program sample and the second application program sample to the same category.
For example, the first function instruction sequence obtained in step S102 is: "12 54 38 71 0c 6e 0c 6e 0a 38 54 54 6e 0c 6e 54 6e 0c 6e 0c 38 72 0a 39 12 38 54 6e 54 71 0e 01 28 54 13 6e".
The second function instruction sequence obtained in step S104 is: "12 38 54 71 0c 6e 0c 6e 0a 38 54 54 6e 0c 6e 54 6e 0c 6e 0c 38 72 0a 39 12 38 54 6e 54 71 0e 01 28 54 13 6e".
Calculating the edit distance between the first and second function instruction sequences gives an edit distance of 4. Assuming the preset threshold is 5, comparison shows that the edit distance between the two instruction sequences is less than the preset threshold, so the similarity of the first and second function instruction sequences meets the preset requirement; that is, the first and second application program samples belong to the same family or category.
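The edit distance used in steps S105 to S107 can be sketched with the standard dynamic-programming formulation of the Levenshtein distance. The function name `edit_distance` is illustrative, and comparing the sequences character by character with the spaces removed is an assumption; it is the reading under which the two example sequences differ by 4.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # delete item_a
                               current[j - 1] + 1,       # insert item_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


FIRST = ("12 54 38 71 0c 6e 0c 6e 0a 38 54 54 6e 0c 6e 54 6e 0c 6e 0c "
         "38 72 0a 39 12 38 54 6e 54 71 0e 01 28 54 13 6e")
SECOND = ("12 38 54 71 0c 6e 0c 6e 0a 38 54 54 6e 0c 6e 54 6e 0c 6e 0c "
          "38 72 0a 39 12 38 54 6e 54 71 0e 01 28 54 13 6e")

first_chars = FIRST.replace(" ", "")    # assumption: spaces are not counted
second_chars = SECOND.replace(" ", "")

print(edit_distance("cafe", "coffee"))            # 3
print(edit_distance(first_chars, second_chars))   # 4
print(edit_distance(first_chars, second_chars) < 5)  # True for a threshold of 5
```

Only two rows of the table are kept at a time, so memory use grows with the length of one sequence rather than with the product of both lengths.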
In this embodiment, to determine the preset threshold more accurately and thereby improve the accuracy of sample classification, before step S106 the method may further include:
Determine the total number of characters in the first function instruction sequence or in the second function instruction sequence;
Take the product of that total and a preset value as the preset threshold, where the preset value is between 0 and 1.
For example, the first function instruction sequence obtained above, "12 54 38 71 0c 6e 0c 6e 0a 38 54 54 6e 0c 6e 54 6e 0c 6e 0c 38 72 0a 39 12 38 54 6e 54 71 0e 01 28 54 13 6e", or the second function instruction sequence, "12 38 54 71 0c 6e 0c 6e 0a 38 54 54 6e 0c 6e 54 6e 0c 6e 0c 38 72 0a 39 12 38 54 6e 54 71 0e 01 28 54 13 6e", has a total of 72 characters (36 two-character instruction codes, not counting spaces). The preset value can then be set to 0.05 (between 0 and 1), so the preset threshold is finally determined as 72*0.05≈4. The preset value may also be an empirical value. With these steps, application program samples whose function instruction sequences are more than 95% similar can be assigned to the same category.
In another embodiment of this application, before step S106, the method may further include: determining the sum of the character counts of the first function instruction sequence and the second function instruction sequence; and taking the product of that sum and a preset value as the preset threshold, where the preset value is between 0 and 1. For example, if the sum of the character counts of the first and second function instruction sequences is 144, the preset threshold is determined as the product of 144 and the preset value.
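Both ways of deriving the preset threshold reduce to multiplying a character count by the preset value. The sketch below is illustrative only: `preset_threshold` is a hypothetical name, rounding to the nearest integer is an assumption suggested by "72*0.05≈4", and excluding the endpoints 0 and 1 is likewise an assumption.

```python
def preset_threshold(char_count, preset_value):
    """Threshold = character count x preset value, rounded to the nearest integer.

    Assumptions: the rounding rule, and that the preset value lies strictly
    between 0 and 1; the text only says "between 0 and 1".
    """
    if not 0 < preset_value < 1:
        raise ValueError("preset_value must lie between 0 and 1")
    return round(char_count * preset_value)


# Variant 1: character count of one sequence (72 in the worked example).
threshold_single = preset_threshold(72, 0.05)   # 3.6 rounds to 4

# Variant 2: sum of the character counts of both sequences (144).
threshold_sum = preset_threshold(144, 0.05)     # 7.2 rounds to 7
```

Note that with a strict less-than comparison, an edit distance of 4 does not fall below a threshold of 4; the earlier worked example instead assumes a threshold of 5.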
It should be noted that this application does not limit which malicious code protection scheme is used to detect malicious code. For example, the sample signature scanning (feature value scanning) described above, virtual-machine-based scanning, or heuristic scanning may be used, and similar samples may additionally be clustered. Nor is the matching algorithm limited; for example, the fuzzy matching algorithm or similarity matching algorithm described above may be used.
Fig. 3 is a flowchart of the method for classifying application program samples provided in
another embodiment of the present application, including:
S201: Obtain the first virtual machine executable file of the first application program sample
in the set of application program samples to be classified;
S202: Decompile the first virtual machine executable file to obtain a decompiled first function
information structure, and extract the first mnemonic sequence from the decompiled first
function information structure. In this embodiment, the first mnemonic sequence is the sequence
composed of the function characters in the code field of the first function information structure.
S203: Obtain the second virtual machine executable file of the second application program sample
in the set of application program samples to be classified;
S204: Decompile the second virtual machine executable file to obtain a decompiled second function
information structure, and extract the second mnemonic sequence from the decompiled second
function information structure. The second mnemonic sequence is the sequence composed of the
function characters in the code field of the second function information structure.
S205: Determine the edit distance between the first mnemonic sequence and the second mnemonic sequence;
S206: Determine whether the edit distance is less than a preset threshold;
S207: If the edit distance is less than the preset threshold, classify the first application
program sample and the second application program sample into the same category.
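Steps S205 to S207 can be illustrated with a minimal Python sketch. It assumes that the edit distance is the Levenshtein distance (the text names only an edit distance) and that each mnemonic sequence is available as a list of tokens. The names `edit_distance` and `same_category`, the example mnemonics, and the threshold value are all hypothetical and are not taken from the application.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences, using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def same_category(first: Sequence[str], second: Sequence[str], threshold: float) -> bool:
    """S206/S207: the two samples share a category if the distance is below the threshold."""
    return edit_distance(first, second) < threshold


# Illustrative mnemonic sequences (assumed values, one token per instruction).
first_mnemonics = ["const-string", "invoke-virtual", "move-result", "return-void"]
second_mnemonics = ["const-string", "invoke-static", "move-result", "return-void"]

print(edit_distance(first_mnemonics, second_mnemonics))
print(same_category(first_mnemonics, second_mnemonics, threshold=2))
```

Because the function accepts any sequence, the same code also works on plain strings when the sequences are compared character by character.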
In this embodiment, in line with the above, before the step of determining whether the edit
distance is less than the preset threshold, the method further includes: determining the total
number of characters in the first mnemonic sequence or the second mnemonic sequence; and
determining the product of that total and a preset value as the preset threshold, where the
preset value lies between 0 and 1. Deriving the threshold from the sequence length in this way
can improve the accuracy of sample classification.
It can be seen that, in the methods provided by the above embodiments, the virtual machine
executable files of the first and second application program samples in the set of application
program samples to be classified are analyzed and decompiled to obtain the first and second
function instruction sequences (or mnemonic sequences) corresponding to those samples. An edit
distance algorithm then determines the edit distance between the first and second function
instruction sequences (or mnemonic sequences), and if that edit distance is less than the preset
threshold, the first and second application program samples are classified into the same
category. In this way, application program samples in the set that are close in similarity
(edit distance less than the preset threshold) are grouped into one class, yielding a family of
samples. The samples in the set are thus classified automatically, which improves the efficiency
of classification.
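One way to turn the pairwise decision into families over a whole sample set is sketched below. This goes beyond what the text states: the application only describes comparing two samples, so merging samples transitively with a union-find structure is an assumption, as are the Levenshtein distance, the threshold rule (a ratio times the summed sequence lengths, borrowed from the embodiment above), and every name and sample value in the code.

```python
from itertools import combinations
from typing import Dict, List, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def group_into_families(sequences: Dict[str, str], ratio: float) -> List[Set[str]]:
    """Group sample names whose sequences are pairwise close (transitive merge, assumed)."""
    parent = {name: name for name in sequences}

    def find(name: str) -> str:
        # Follow parent links to the representative, halving the path on the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for left, right in combinations(sequences, 2):
        threshold = ratio * (len(sequences[left]) + len(sequences[right]))
        if edit_distance(sequences[left], sequences[right]) < threshold:
            parent[find(left)] = find(right)

    families: Dict[str, Set[str]] = {}
    for name in sequences:
        families.setdefault(find(name), set()).add(name)
    return list(families.values())


# Hypothetical sample names and instruction sequences, for illustration only.
samples = {"a.apk": "1a2b3c4d", "b.apk": "1a2b3c4e", "c.apk": "ffffffff"}
families = group_into_families(samples, ratio=0.2)
print(sorted(sorted(family) for family in families))
```

Comparing every pair costs quadratic time in the number of samples, so a real system would likely need some pre-filtering; that concern is outside what the text covers.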
Fig. 4 is a block diagram of the device for classifying application program samples provided in
an embodiment of the present application. The function of each unit in the device is similar to
that of the corresponding step in the above method, so the details of the method embodiments
apply to the device as well. The device includes:
A first acquisition unit 401, configured to obtain the first virtual machine executable file of
the first application program sample in the set of application program samples to be classified;
A first extraction unit 402, configured to decompile the first virtual machine executable file to
obtain a decompiled first function information structure, and to extract the first function
instruction sequence from the decompiled first function information structure;
A second acquisition unit 403, configured to obtain the second virtual machine executable file of
the second application program sample in the set of application program samples to be classified;
A second extraction unit 404, configured to decompile the second virtual machine executable file
to obtain a decompiled second function information structure, and to extract the second function
instruction sequence from the decompiled second function information structure;
A determining unit 405, configured to determine the edit distance between the first function
instruction sequence and the second function instruction sequence;
A judging unit 406, configured to determine whether the edit distance is less than a preset threshold;
A classification unit 407, configured to classify the first application program sample and the
second application program sample into the same category when the edit distance is less than the
preset threshold.
In this embodiment, to determine the preset threshold more accurately and thereby improve the
accuracy of sample classification, the device further includes:
A preset threshold determining unit, configured to, before the first application program sample
and the second application program sample are classified into the same category, determine the
total number of characters in the first function instruction sequence or the second function
instruction sequence, and determine the product of that total and a preset value as the preset
threshold, where the preset value lies between 0 and 1.
Alternatively, for the same purpose, the device further includes:
A preset threshold determining unit, configured to, before the first application program sample
and the second application program sample are classified into the same category, determine the
sum of the character counts of the first function instruction sequence and the second function
instruction sequence, and determine the product of that sum and a preset value as the preset
threshold, where the preset value lies between 0 and 1.
In this embodiment, the first extraction unit 402 is specifically configured to:
Parse the first virtual machine executable file according to the virtual machine executable file
format to obtain the function information structure of each class; and, according to the fields
in the function information structure, determine the position and size of the functions in the
first virtual machine executable file to obtain the decompiled first function information structure;
The second extraction unit 404 is specifically configured to:
Parse the second virtual machine executable file according to the virtual machine executable file
format to obtain the function information structure of each class; and, according to the fields
in the function information structure, determine the position and size of the functions in the
second virtual machine executable file to obtain the decompiled second function information structure.
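Locating a function by its position and size can be pictured with the following Python sketch. The text does not name the fields of the function information structure, so the `FunctionInfo` record and its `code_offset` and `code_size` fields are hypothetical stand-ins, and the byte buffer is dummy data rather than a real virtual machine executable file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionInfo:
    """Hypothetical stand-in for one parsed function information structure."""
    name: str
    code_offset: int  # assumed field: byte position of the function body in the file
    code_size: int    # assumed field: length of the function body in bytes


def function_bytes(file_data: bytes, info: FunctionInfo) -> bytes:
    """Return the raw bytes of one function, located by its position and size."""
    end = info.code_offset + info.code_size
    if info.code_offset < 0 or info.code_size < 0 or end > len(file_data):
        raise ValueError(f"function {info.name!r} lies outside the file")
    return file_data[info.code_offset:end]


# Dummy 16-byte "file" and a function occupying bytes 4 to 6.
dummy_file = bytes(range(16))
example = FunctionInfo(name="example", code_offset=4, code_size=3)
print(function_bytes(dummy_file, example).hex())  # 040506
```

The bounds check matters in this setting because the samples being classified may be malformed or deliberately crafted.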
As an alternative embodiment, the device for classifying application program samples includes:
A first acquisition unit 401, configured to obtain the first virtual machine executable file of
the first application program sample in the set of application program samples to be classified;
A first extraction unit 402, configured to decompile the first virtual machine executable file to
obtain a decompiled first function information structure, and to extract the first mnemonic
sequence from the decompiled first function information structure;
A second acquisition unit 403, configured to obtain the second virtual machine executable file of
the second application program sample in the set of application program samples to be classified;
A second extraction unit 404, configured to decompile the second virtual machine executable file
to obtain a decompiled second function information structure, and to extract the second mnemonic
sequence from the decompiled second function information structure;
A determining unit 405, configured to determine the edit distance between the first mnemonic
sequence and the second mnemonic sequence;
A judging unit 406, configured to determine whether the edit distance is less than a preset threshold;
A classification unit 407, configured to classify the first application program sample and the
second application program sample into the same category when the edit distance is less than the
preset threshold.
In this embodiment, to determine the preset threshold more accurately and thereby improve the
accuracy of sample classification, the device further includes:
A preset threshold determining unit, configured to, before it is determined whether the edit
distance is less than the preset threshold, determine the total number of characters in the first
mnemonic sequence or the second mnemonic sequence, and determine the product of that total and a
preset value as the preset threshold, where the preset value lies between 0 and 1.
In the devices provided by the above embodiments, the virtual machine executable files of the
first and second application program samples in the set of application program samples to be
classified are analyzed and decompiled to obtain the first and second function instruction
sequences (or mnemonic sequences) corresponding to those samples. An edit distance algorithm then
determines the edit distance between the first and second function instruction sequences (or
mnemonic sequences), and if that edit distance is less than the preset threshold, the first and
second application program samples are classified into the same category. In this way,
application program samples in the set that are close in similarity (edit distance less than the
preset threshold) are grouped into one class, yielding a family of samples. The samples in the
set are thus classified automatically, which improves the efficiency of classification.
Those skilled in the art should understand that the embodiments of the present application may be
provided as a method, a system, or a computer program product. Accordingly, the present
application may take the form of an entirely hardware embodiment, an entirely software
embodiment, or an embodiment combining software and hardware. Moreover, the present application
may take the form of a computer program product implemented on one or more computer-usable
storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage)
that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the
method, device (system), and computer program product according to its embodiments. It should be
understood that each flow and/or block in the flowcharts and/or block diagrams, and any
combination of flows and/or blocks in them, can be implemented by computer program instructions.
These computer program instructions may be provided to the processor of a general-purpose
computer, a special-purpose computer, an embedded processor, or another programmable data
processing device to produce a machine, such that the instructions executed by the processor of
the computer or other programmable data processing device produce means for implementing the
functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory that can
direct a computer or other programmable data processing device to operate in a particular manner,
such that the instructions stored in the computer-readable memory produce an article of
manufacture including instruction means that implement the functions specified in one or more
flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data
processing device, so that a series of operational steps is performed on the computer or other
programmable device to produce computer-implemented processing, whereby the instructions executed
on the computer or other programmable device provide steps for implementing the functions
specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), an
input/output interface, a network interface, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access
memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash
RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media,
and may store information by any method or technology. The information may be computer-readable
instructions, data structures, program modules, or other data. Examples of computer storage
media include, but are not limited to, phase-change memory (PRAM), static random access memory
(SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM),
read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash
memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile
disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and disk storage or other
magnetic storage devices, or any other non-transmission medium that can be used to store
information accessible by a computing device. As defined herein, computer-readable media do not
include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants are intended
to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a
series of elements includes not only those elements but also other elements not expressly listed,
or elements inherent to that process, method, commodity, or device. Without further limitation,
an element introduced by the phrase "including a ..." does not exclude the presence of additional
identical elements in the process, method, commodity, or device that includes the element.
The foregoing describes only embodiments of the present application and is not intended to limit
it. Those skilled in the art may make various modifications and variations to the present
application. Any modification, equivalent replacement, or improvement made within the spirit and
principle of the present application shall fall within the scope of its claims.
Claims (10)
1. A method for classifying application program samples, characterized by comprising:
Obtaining the first virtual machine executable file of a first application program sample in a set
of application program samples to be classified;
Decompiling the first virtual machine executable file to obtain a decompiled first function
information structure, and extracting a first function instruction sequence from the decompiled
first function information structure;
Obtaining the second virtual machine executable file of a second application program sample in
the set of application program samples to be classified;
Decompiling the second virtual machine executable file to obtain a decompiled second function
information structure, and extracting a second function instruction sequence from the decompiled
second function information structure;
Determining the edit distance between the first function instruction sequence and the second
function instruction sequence;
Determining whether the edit distance is less than a preset threshold;
If so, classifying the first application program sample and the second application program
sample into the same category.
2. The method of claim 1, characterized in that the first function instruction sequence is a
sequence composed of the first n characters of each line in the machine code field of the first
function information structure, and the second function instruction sequence is a sequence
composed of the first n characters of each line in the machine code field of the second function
information structure.
3. The method of claim 1, characterized in that, before determining whether the edit distance is
less than the preset threshold, the method further comprises:
Determining the total number of characters in the first function instruction sequence or the
second function instruction sequence;
Determining the product of the total number of characters and a preset value as the preset
threshold, wherein the preset value lies between 0 and 1.
4. The method of claim 1, characterized in that, before determining whether the edit distance is
less than the preset threshold, the method further comprises:
Determining the sum of the character counts of the first function instruction sequence and the
second function instruction sequence;
Determining the product of the sum of the character counts and a preset value as the preset
threshold, wherein the preset value lies between 0 and 1.
5. A method for classifying application program samples, characterized by comprising:
Obtaining the first virtual machine executable file of a first application program sample in a set
of application program samples to be classified;
Decompiling the first virtual machine executable file to obtain a decompiled first function
information structure, and extracting a first mnemonic sequence from the decompiled first
function information structure;
Obtaining the second virtual machine executable file of a second application program sample in
the set of application program samples to be classified;
Decompiling the second virtual machine executable file to obtain a decompiled second function
information structure, and extracting a second mnemonic sequence from the decompiled second
function information structure;
Determining the edit distance between the first mnemonic sequence and the second mnemonic sequence;
Determining whether the edit distance is less than a preset threshold;
If so, classifying the first application program sample and the second application program
sample into the same category.
6. The method of claim 5, characterized in that the first mnemonic sequence is a sequence
composed of the function characters in the code field of the first function information
structure, and the second mnemonic sequence is a sequence composed of the function characters in
the code field of the second function information structure.
7. The method of claim 5, characterized in that, before determining whether the edit distance is
less than the preset threshold, the method further comprises:
Determining the total number of characters in the first mnemonic sequence or the second mnemonic sequence;
Determining the product of the total number of characters and a preset value as the preset
threshold, wherein the preset value lies between 0 and 1.
8. A device for classifying application program samples, characterized by comprising:
A first acquisition unit, configured to obtain the first virtual machine executable file of a
first application program sample in a set of application program samples to be classified;
A first extraction unit, configured to decompile the first virtual machine executable file to
obtain a decompiled first function information structure, and to extract a first function
instruction sequence from the decompiled first function information structure;
A second acquisition unit, configured to obtain the second virtual machine executable file of a
second application program sample in the set of application program samples to be classified;
A second extraction unit, configured to decompile the second virtual machine executable file to
obtain a decompiled second function information structure, and to extract a second function
instruction sequence from the decompiled second function information structure;
A determining unit, configured to determine the edit distance between the first function
instruction sequence and the second function instruction sequence;
A judging unit, configured to determine whether the edit distance is less than a preset threshold;
A classification unit, configured to classify the first application program sample and the second
application program sample into the same category when the edit distance is less than the preset
threshold.
9. The device of claim 8, characterized in that the first function instruction sequence is a
sequence composed of the first n characters of each line in the machine code field of the first
function information structure, and the second function instruction sequence is a sequence
composed of the first n characters of each line in the machine code field of the second function
information structure.
10. The device of claim 8, characterized in that the device further comprises:
A preset threshold determining unit, configured to, before it is determined whether the edit
distance is less than the preset threshold, determine the total number of characters in the first
function instruction sequence or the second function instruction sequence, and determine the
product of the total number of characters and a preset value as the preset threshold, wherein the
preset value lies between 0 and 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510971488.7A CN106909844A (en) | 2015-12-22 | 2015-12-22 | The sorting technique and device of a kind of application program sample |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106909844A true CN106909844A (en) | 2017-06-30 |
Family
ID=59201066
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510971488.7A Pending CN106909844A (en) | 2015-12-22 | 2015-12-22 | The sorting technique and device of a kind of application program sample |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106909844A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170630 |