CN106503559B - The extracting method and device of feature - Google Patents
Feature extraction method and device
- Publication number
- CN106503559B (application number CN201611041703.4A)
- Authority
- CN
- China
- Prior art keywords
- target
- keyword
- malware
- phrase
- initial key
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/561—Virus type analysis
- G06F21/562—Static detection
- G06F21/563—Static detection by source code analysis
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Virology (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a feature extraction method and device and relates to the technical field of feature extraction. The method includes: scanning the source file and preset file of each piece of malware among at least one piece of malware to obtain the initial keyword group of each piece of malware and the frequency with which each initial keyword in that group occurs in the malware; calculating, from the occurrence frequencies, the similarity distance between any two keywords in the initial keyword group to obtain a target distance matrix; screening the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain a target keyword group, wherein the similarity distance between any two keywords in the target keyword group satisfies a preset classification value; and determining the target keyword group to be the feature information of each piece of malware. This addresses the technical problem in the prior art that the detection and removal of malware is incomplete.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a feature extraction method and device.
Background
With the arrival of the Internet era, smartphones are becoming ever more widespread around the world. Smartphone operating systems are mainly Android and iOS, and Android has won a huge market share on the strength of its performance. As smartphones have developed, more and more mobile malware has appeared on the market, endangering users' information security. Major security laboratories have gradually made mobile security a research priority, but how to effectively detect and remove new malware and malware variants has remained a problem.
In the prior art, traditional malware is identified mainly by a signature (feature code) extraction method and is then removed. This signature extraction operates mainly on the program's binary text. A feature extraction method based on program binary text can only detect and remove traditional malware; it cannot detect new malware or malware variants.
Summary of the invention
The purpose of the present invention is to provide a feature extraction method and device that alleviate the technical problem in the prior art that the detection and removal of malware is incomplete.
According to one aspect of the embodiments of the present invention, a feature extraction method is provided, comprising: scanning the source file and preset file of each piece of malware among at least one piece of malware to obtain the initial keyword group of each piece of malware and the occurrence frequency, in each piece of malware, of each initial keyword in the initial keyword group; calculating, from the occurrence frequencies, the similarity distance between any two keywords in the initial keyword group to obtain a target distance matrix; screening the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain a target keyword group, wherein the similarity distance between any two keywords in the target keyword group satisfies a preset classification value; and determining the target keyword group to be the feature information of each piece of malware.
Further, screening the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain the target keyword group comprises: taking each keyword in the initial keyword group as a root node and performing a priority search operation (for example, a depth-first or breadth-first search) on the root node to obtain a target search result, wherein the similarity distance between any two keywords in the target search result is less than or equal to the preset classification value, and any two keywords in the target search result are associated with each other; and searching the target search result for target keywords to form the target keyword group, wherein the target search result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword and the keywords in the target search result belong to the same category.
Further, screening the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain the target keyword group further comprises: performing learning on the initial keyword group by a support vector machine algorithm to obtain a target learning result, wherein the similarity distance between any two keywords in the target learning result is less than or equal to the preset classification value, and any two keywords in the target learning result are associated with each other; and searching the target learning result for target keywords to form the target keyword group, wherein the target learning result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword and the keywords in the target learning result belong to the same category.
Further, calculating the similarity distance between any two keywords in the initial keyword group according to the occurrence frequencies comprises: calculating the similarity distance between any two keywords in the initial keyword group by the Google distance formula
$$\mathrm{NGD}(C1, C2) = \frac{\max\{\lg f_{C1}, \lg f_{C2}\} - \lg f(C1, C2)}{\lg N - \min\{\lg f_{C1}, \lg f_{C2}\}}$$
to obtain the target distance matrix, wherein $f_{C1}$ denotes the occurrence frequency of initial keyword C1 in each piece of malware, $f_{C2}$ denotes the occurrence frequency of initial keyword C2 in each piece of malware, $f(C1, C2)$ denotes the frequency with which initial keyword C1 and initial keyword C2 occur together in each piece of malware, $N$ denotes the total number of searches, and $\lg$ is the base-10 logarithm.
Further, the preset file includes a function configuration file, and before the source file and preset file of the at least one piece of malware are scanned, the method further comprises: decompiling each piece of malware with open-source software to obtain the function configuration file and the source file of each piece of malware.
Further, after the target keyword group is determined to be the feature information of each piece of malware, the method further comprises: obtaining multiple groups of keyword classification results for non-malicious software; comparing each target keyword with the classification results of the same type among the multiple groups of keyword classification results to obtain a comparison result; and determining the authenticity of each target keyword group according to the comparison result.
Further, after the target keyword group is determined to be the feature information of each piece of malware, the method further comprises: obtaining at least one piece of software to be detected; and analyzing the at least one piece of software to be detected using the extracted feature information to determine whether it is malware.
According to an aspect of the embodiments of the present invention, a feature extraction device is provided, comprising: a scanning unit, configured to scan the source file and preset file of each piece of malware among at least one piece of malware to obtain the initial keyword group of each piece of malware and the occurrence frequency, in each piece of malware, of each initial keyword in the initial keyword group; a computing unit, configured to calculate, from the occurrence frequencies, the similarity distance between any two keywords in the initial keyword group to obtain a target distance matrix; a screening unit, configured to screen the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain a target keyword group, wherein the similarity distance between any two keywords in the target keyword group satisfies a preset classification value; and a first determination unit, configured to determine the target keyword group to be the feature information of each piece of malware.
Further, the screening unit includes: a first processing module, configured to take each keyword in the initial keyword group as a root node and perform a priority search operation on the root node to obtain a target search result, wherein the similarity distance between any two keywords in the target search result is less than or equal to the preset classification value, and any two keywords in the target search result are associated with each other; and a first searching module, configured to search the target search result for target keywords to form the target keyword group, wherein the target search result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword and the keywords in the target search result belong to the same category.
Further, the screening unit also includes: a second processing module, configured to perform learning on the initial keyword group by a support vector machine algorithm to obtain a target learning result, wherein the similarity distance between any two keywords in the target learning result is less than or equal to the preset classification value, and any two keywords in the target learning result are associated with each other; and a second searching module, configured to search the target learning result for target keywords to form the target keyword group, wherein the target learning result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword and the keywords in the target learning result belong to the same category.
In the embodiments of the present invention, the source file and function configuration file of the malware are scanned first to obtain the initial keyword group of each piece of malware and the occurrence frequency of each initial keyword in the preset file. The similarity distance between any two keywords in the initial keyword group is then calculated from the occurrence frequencies to obtain the target distance matrix. Next, the initial keyword group is screened according to the similarity distances, and the keywords that meet the requirement form the target keyword group. Finally, the target keyword group is determined to be the feature information of that class of malware. By calculating the similarity distance between keywords in malware, the embodiments can identify not only traditional malware but also malware variants and new malware. Compared with the feature extraction methods of the prior art, this achieves comprehensive malware detection, alleviates the technical problem that the detection and removal of malware is incomplete, and thereby makes malware detection more comprehensive.
Brief description of the drawings
To describe the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a feature extraction method according to an embodiment of the present invention;
Fig. 2 is a comparison chart of call probabilities by keyword type according to an embodiment of the present invention;
Fig. 3 is a feature distance diagram according to an embodiment of the present invention;
Fig. 4 is a flowchart of an optional feature extraction method according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a feature extraction device according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that orientation or positional terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner" and "outer" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they therefore should not be understood as limiting the present invention. In addition, the terms "first", "second" and "third" are used for descriptive purposes only and should not be understood as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "installed", "linked" and "connected" should be understood broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or internal between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific situation.
According to an embodiment of the present invention, an embodiment of a feature extraction method is provided. It should be noted that the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described can be executed in a different order.
Fig. 1 is a flowchart of a feature extraction method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: scan the source file and preset file of each piece of malware among at least one piece of malware to obtain the initial keyword group of each piece of malware and the occurrence frequency, in each piece of malware, of each initial keyword in the initial keyword group.
In this embodiment, the Java source code (that is, the source file above) and the preset file of the malware APK (Android Package) are obtained first. The preset file includes the class library files that ship with Java and the AndroidManifest.xml file (that is, the function configuration file); the source file may include class files written by hand by developers. The Java source code and the preset file (for example, the class library files and AndroidManifest.xml) are then scanned with an improved string matching algorithm. Specifically, the KMP algorithm can be used to scan the Java source code, the class library files and AndroidManifest.xml to obtain the initial keyword group of the malware and the occurrence frequency of each keyword in the malware. The initial keyword group of a piece of malware represents the functions that the malware performs when its program executes.
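The scanning step can be sketched as follows. This is a minimal illustration using the textbook KMP algorithm, not the "improved" variant mentioned above, whose details the text does not give. The function names and the sample keywords (`sendTextMessage`, `getDeviceId`, `READ_SMS`) are hypothetical choices for the example and do not come from the patent.

```python
def build_failure(pattern):
    """Return the KMP failure table: fail[i] is the length of the longest
    proper prefix of pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_count(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    if not pattern:
        return 0
    fail = build_failure(pattern)
    count = 0
    k = 0  # number of pattern characters matched so far
    for ch in text:
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            count += 1
            k = fail[k - 1]  # keep scanning for further matches
    return count


def keyword_frequencies(text, keywords):
    """Map each candidate keyword to its occurrence count in the text."""
    return {kw: kmp_count(text, kw) for kw in keywords}


# Hypothetical decompiled source text and candidate keywords.
sample = "sendTextMessage(a); getDeviceId(); sendTextMessage(b);"
freqs = keyword_frequencies(sample, ["sendTextMessage", "getDeviceId", "READ_SMS"])
```

In this sketch a keyword with a zero count would simply be left out of the initial keyword group for that file.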
It should be noted that, in this embodiment, when multiple pieces of malware of different types are scanned, one initial keyword group is obtained for each piece of malware.
Step S104: calculate, from the occurrence frequencies, the similarity distance between any two keywords in the initial keyword group to obtain the target distance matrix.
In this embodiment, once scanning has yielded the position of each initial keyword and its occurrence frequency at that position, the similarity distance between any two keywords in the initial keyword group can be calculated. After the distances for all keyword pairs are calculated, an n × n matrix is obtained, where n is the number of initial keywords; this is the target distance matrix. The smaller the similarity distance between two keywords, the higher the similarity between them.
Step S106: screen the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain the target keyword group, wherein the similarity distance between any two keywords in the target keyword group satisfies the preset classification value.
In this embodiment, after the target distance matrix is obtained, the initial keyword group can be screened according to the similarity distances in the matrix to obtain the target keyword group. After screening, the similarity distance between any two keywords in the target keyword group satisfies the preset classification value, for example 0.28. That is, the similarity distance between any two keywords in the target keyword group is less than or equal to 0.28.
Specifically, after the Java source code, class library files and AndroidManifest.xml file of the malware are scanned, the resulting initial keyword group contains both keywords that perform malicious operations and keywords that do not. Feature extraction therefore needs to filter the keywords that perform malicious operations out of the initial keywords to form the target keyword group. The screening criterion is the preset classification value above, which is preferably 0.28.
Step S108: determine the target keyword group to be the feature information of each piece of malware.
In this embodiment, once screening has produced the target keyword group, that group can be determined to be the feature information of the malware.
For malware of multiple types, feature extraction follows steps S102 to S108 above in every case.
After features are extracted by the above method, suspected malware can be judged against the extracted feature information to determine whether it is in fact malware. This method can detect not only traditional malware but also, accurately, new malware and malware variants.
In the embodiments of the present invention, the source file and function configuration file of the malware are scanned first to obtain the initial keyword group of each type of malware, the position of each initial keyword in the preset file, and its occurrence frequency at that position. The similarity distance between any two keywords in the initial keyword group is then calculated from the occurrence frequencies to obtain the target distance matrix. Next, the initial keyword group is screened according to the similarity distances, and the keywords that meet the requirement form the target keyword group. Finally, the target keyword group is determined to be the feature information of that class of malware. By calculating the similarity distance between keywords in malware, the embodiments can identify not only traditional malware but also malware variants and new malware. Compared with the feature extraction methods of the prior art, this achieves comprehensive malware detection, alleviates the technical problem that the detection and removal of malware is incomplete, and thereby makes malware detection more comprehensive.
In an optional embodiment of the present invention, screening the keyword group according to the similarity distances recorded in the target distance matrix to obtain the target keyword group includes the following steps:
Step S1061: take each keyword in the initial keyword group as a root node and perform a priority search operation on the root node to obtain a target search result, wherein the similarity distance between any two keywords in the target search result is less than or equal to the preset classification value, and any two keywords in the target search result are associated with each other;
Step S1062: search the target search result for target keywords to form the target keyword group, wherein the target search result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword and the keywords in the target search result belong to the same category.
Many methods are available for screening the initial keyword group. In this embodiment, a priority search algorithm and a support vector machine algorithm are preferred for screening the initial keyword group to obtain the target keyword group. Priority search algorithms include depth-first search, breadth-first search and other priority algorithms; this embodiment does not limit the type of priority algorithm used.
Specifically, steps S1061 and S1062 above screen the initial keyword group by a depth-first search algorithm. Steps S1063 and S1064 below screen the initial keyword group by a support vector machine.
When the initial keyword group is screened by depth-first search, the initial keywords are classified and screened using the similarity distances in the target distance matrix obtained in step S104. Specifically, the priority search algorithm takes each keyword in the target distance matrix obtained in step S104 as a root node for a depth-first search and records together the keywords whose nodes are highly similar, yielding the target search result. That is, keywords whose pairwise similarity distance satisfies the preset classification value (for example, 0.28) are recorded together to obtain the target search result. After the target search result is obtained, target keywords can be searched for in it, ensuring that for every target keyword found there is another keyword in the target search result whose similarity distance to it is less than or equal to the preset classification value (for example, 0.28); the target keyword and the keywords in the target search result belong to one category. Once the target keywords are found, they form the target keyword group. In this way, keywords with a small similarity distance, that is, high similarity, are grouped into one class and stored as one piece of feature information.
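One possible reading of this depth-first screening is sketched below. It assumes that two keywords are "associated" when they can be linked through a chain of pairs whose distance is at most the threshold, so each group is a connected component of the thresholded distance matrix; the text does not spell this out, so this is an interpretation. The function name, the keyword labels and the example matrix are hypothetical.

```python
def screen_keywords(keywords, dist, threshold=0.28):
    """Group keywords by depth-first search over a distance matrix.

    dist[i][j] is the similarity distance between keywords[i] and
    keywords[j]. Two keywords are linked when that distance is at most
    the threshold. Keywords with no close neighbour are discarded.
    """
    n = len(keywords)
    visited = [False] * n
    groups = []
    for root in range(n):
        if visited[root]:
            continue
        # Iterative depth-first search starting from this root keyword.
        stack = [root]
        visited[root] = True
        group = []
        while stack:
            i = stack.pop()
            group.append(keywords[i])
            for j in range(n):
                if j != i and not visited[j] and dist[i][j] <= threshold:
                    visited[j] = True
                    stack.append(j)
        # Keep only groups where each keyword has at least one close partner.
        if len(group) >= 2:
            groups.append(sorted(group))
    return groups


# Hypothetical 4 x 4 distance matrix: A, B and C are chained, D is isolated.
example_keywords = ["A", "B", "C", "D"]
example_dist = [
    [0.0, 0.1, 0.5, 0.9],
    [0.1, 0.0, 0.2, 0.9],
    [0.5, 0.2, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.0],
]
example_groups = screen_keywords(example_keywords, example_dist)
```

If the stricter reading is intended, where every pair inside a group must be within the threshold, the groups would instead have to be cliques, and A and C above (distance 0.5) could not share a group.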
In another optional embodiment of the present invention, screening the keyword group according to the target distance matrix to obtain the target keyword group further includes the following steps:
Step S1063: perform learning on the initial keyword group by a support vector machine algorithm to obtain a target learning result, wherein the similarity distance between any two keywords in the target learning result is less than or equal to the preset classification value, and any two keywords in the target learning result are associated with each other;
Step S1064: search the target learning result for target keywords to form the target keyword group, wherein the target learning result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword and the keywords in the target learning result belong to the same category.
In this embodiment, besides the depth-first search described in steps S1061 and S1062, the initial keyword group can also be learned by a support vector machine. After the support vector machine learns the initial keyword group, the keywords are divided into two classes: one in which the similarity distance is greater than the preset classification value, and another in which the similarity distance is less than or equal to the preset classification value. That is, after the support vector machine has learned the initial keyword group, the keywords whose similarity distance is less than or equal to the preset classification value can be screened out to constitute the target learning result.
After the target learning result is obtained, target keywords can be searched for in it, ensuring that for every target keyword found there is another keyword in the target learning result whose similarity distance to it is less than or equal to the preset classification value (for example, 0.28); the target keyword and the keywords in the target learning result belong to one category. Once the target keywords are found, they form the target keyword group. In this way, keywords with a small similarity distance, that is, high similarity, are grouped into one class and stored as one piece of feature information.
In another optional embodiment of the present invention, calculating the similarity distance between any two keywords in the initial keyword group according to the occurrence frequencies includes step S1041:
Step S1041: calculate the similarity distance between any two keywords in the initial keyword group by the Google distance formula
$$\mathrm{NGD}(C1, C2) = \frac{\max\{\lg f_{C1}, \lg f_{C2}\} - \lg f(C1, C2)}{\lg N - \min\{\lg f_{C1}, \lg f_{C2}\}}$$
to obtain the target distance matrix, wherein $f_{C1}$ denotes the occurrence frequency of initial keyword C1 in each piece of malware, $f_{C2}$ denotes the occurrence frequency of initial keyword C2 in each piece of malware, $f(C1, C2)$ denotes the frequency with which initial keyword C1 and initial keyword C2 occur together in each piece of malware, $N$ denotes the total number of searches, and $\lg$ is the base-10 logarithm.
Specifically, there are many ways to calculate the similarity distance between any two keywords. In this embodiment, the similarity distance between each pair of keywords is preferably calculated by the Google distance algorithm.
The calculation by the Google distance algorithm is as follows. The similarity distance between any two keywords in the initial keyword group is calculated by the formula
$$\mathrm{NGD}(C1, C2) = \frac{\max\{\lg f_{C1}, \lg f_{C2}\} - \lg f(C1, C2)}{\lg N - \min\{\lg f_{C1}, \lg f_{C2}\}}$$
where $f_{C1}$ denotes the number of times initial keyword C1 occurs in each piece of malware, $f_{C2}$ denotes the number of times initial keyword C2 occurs in each piece of malware, $f(C1, C2)$ denotes the number of results returned by the search that contain both initial keyword C1 and initial keyword C2, $N$ denotes the total number of searches, $\lg$ is the base-10 logarithm, and $\mathrm{NGD}(C1, C2)$ is the similarity distance between initial keyword C1 and initial keyword C2.
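As a worked illustration, assume the hypothetical counts $f_{C1} = 100$, $f_{C2} = 1000$, $f(C1, C2) = 100$ and $N = 10^{6}$ (these values are chosen for easy arithmetic and do not come from the patent). Substituting them into the normalized Google distance gives

$$\mathrm{NGD}(C1, C2) = \frac{\max\{\lg 100, \lg 1000\} - \lg 100}{\lg 10^{6} - \min\{\lg 100, \lg 1000\}} = \frac{3 - 2}{6 - 2} = 0.25$$

Since $0.25 \le 0.28$, this pair of keywords would satisfy the preset classification value of 0.28 mentioned above and would be kept together by the screening step.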
The Google distance used in the embodiments of the present invention is calculated from how often words occur together in Web text, which resembles the task of finding closely associated malware keywords in an APK. Each individual APK file can therefore be regarded as a collection of text, and Google distance is used to extract from that text the set of keywords of a given class that is destructive. When the Android programming language was first designed, individual program statements, such as API calls or permission settings, were necessarily harmless in themselves. For a piece of malware to harm the user, several normal statements must be combined before they violate the user's rights. The feature extraction method provided by the embodiments of the present invention extracts such keyword sets from software already identified as malware, forming a malware feature sample that is used to check other APKs.
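The idea of treating an APK as a collection of text can be sketched as follows. The sketch assumes that each decompiled file is one text, that a keyword's count is the number of files containing it, and that N is the number of files; the patent does not fix these counting conventions, so they are assumptions. The function names and the toy data are hypothetical.

```python
import math


def ngd(f1, f2, f12, n):
    """Normalized Google distance from occurrence counts, base-10 logs.

    f1, f2: counts for each keyword; f12: co-occurrence count;
    n: total number of texts searched.
    """
    if f1 == 0 or f2 == 0 or f12 == 0:
        return math.inf  # keywords never occur together: maximally distant
    l1, l2, l12 = math.log10(f1), math.log10(f2), math.log10(f12)
    numerator = max(l1, l2) - l12
    denominator = math.log10(n) - min(l1, l2)
    if denominator == 0:
        # The rarer keyword occurs in every text; avoid dividing by zero.
        return 0.0 if numerator == 0 else math.inf
    return numerator / denominator


def distance_matrix(keywords, documents):
    """Build the n x n distance matrix for keywords over a list of texts."""
    total = len(documents)
    # For each keyword, the set of document indices that contain it.
    present = {
        kw: {i for i, doc in enumerate(documents) if kw in doc}
        for kw in keywords
    }
    matrix = []
    for a in keywords:
        row = []
        for b in keywords:
            if a == b:
                row.append(0.0)
            else:
                both = len(present[a] & present[b])
                row.append(ngd(len(present[a]), len(present[b]), both, total))
        matrix.append(row)
    return matrix


# Toy data: four "files" and three keywords.
docs = ["a b", "a b", "a", "c"]
m = distance_matrix(["a", "b", "c"], docs)
```

The resulting matrix is symmetric with zeros on the diagonal, and pairs that never co-occur get an infinite distance, so they can never pass a finite threshold such as 0.28.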
In another optional embodiment of the present invention, the preset file includes a function configuration file, and before the source file and preset file of the at least one piece of malware are scanned, the method further includes the following step:
Step S1: decompile each piece of malware with open-source software to obtain the function configuration file and the source file of each piece of malware.
In this embodiment, the APK file of the obtained malware is first renamed with a .RAR extension and then decompressed to obtain the Android function configuration file AndroidManifest.xml. Specifically, running the open-source tool AXMLPrinter2.jar on the APK file with the following batch operation yields the function configuration file: "java -jar " + properties.getProperty("user.dir") + "\\AXMLPrinter2.jar " + path + "AndroidManifest.xml > " + path + "AndroidManifest.txt". The function configuration file obtained by this operation is in an operable text format.
Next, the decompressed APK folder contains the source-code binary file classes.dex, which is decompiled with the open source software dex2jar to obtain a jar file. Specifically, decompilation can be performed by invoking the following statement: "cmd /c start "+properties.getProperty("user.dir")+"\\dex2jar-0.0.9.15\\"+"dex2jar"+".bat"+" "+path+"classes.dex". The jar file obtained in this way is not source code, so it must further be restored using the open source software jadnt158, which finally yields the Java source code of the APK file.
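As a rough illustration of the unpacking described above, the following Python sketch relies on the fact that an APK is a ZIP archive, so the two entries of interest can be read directly. The function names (`extract_apk_entries`, `axmlprinter_command`, `dex2jar_command`, `make_demo_apk`) and the directory layout are assumptions made for illustration only; the tool file names are the ones quoted in the text, and the commands are only built, not executed.

```python
import os
import tempfile
import zipfile

# Entries the text says are needed from the unpacked APK.
WANTED = ("AndroidManifest.xml", "classes.dex")


def extract_apk_entries(apk_path, out_dir, wanted=WANTED):
    """Copy the wanted entries out of an APK (a ZIP archive) into out_dir."""
    extracted = []
    with zipfile.ZipFile(apk_path) as apk:
        names = set(apk.namelist())
        for name in wanted:
            if name in names:
                apk.extract(name, out_dir)
                extracted.append(os.path.join(out_dir, name))
    return extracted


def axmlprinter_command(tool_dir, work_dir):
    """Argument list for decoding the binary manifest with AXMLPrinter2.jar.

    The caller is expected to redirect standard output to a text file,
    which is what the '>' in the quoted batch operation does.
    """
    return ["java", "-jar", os.path.join(tool_dir, "AXMLPrinter2.jar"),
            os.path.join(work_dir, "AndroidManifest.xml")]


def dex2jar_command(tool_dir, work_dir):
    """Argument list for converting classes.dex to a jar with dex2jar.bat."""
    return [os.path.join(tool_dir, "dex2jar-0.0.9.15", "dex2jar.bat"),
            os.path.join(work_dir, "classes.dex")]


def make_demo_apk(path):
    """Build a tiny stand-in archive so the sketch runs without a real APK."""
    with zipfile.ZipFile(path, "w") as apk:
        apk.writestr("AndroidManifest.xml", b"\x03\x00\x08\x00")
        apk.writestr("classes.dex", b"dex\n035\x00")
        apk.writestr("res/layout/main.xml", b"")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        apk_path = os.path.join(tmp, "sample.apk")
        make_demo_apk(apk_path)
        print(extract_apk_entries(apk_path, os.path.join(tmp, "out")))
        print(axmlprinter_command("tools", os.path.join(tmp, "out")))
```

Returning the commands as argument lists rather than concatenated strings avoids the missing-space and quoting problems that string concatenation invites.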
In another optional embodiment of the present invention, after the target keyword group is determined to be the feature information of each malware, the method further includes the following steps:
Step S2: obtain multiple groups of keyword classification results for non-malicious software;
Step S3: compare each target keyword with the classification results of the same type among the multiple groups of keyword classification results to obtain a comparison result;
Step S4: determine the authenticity of each target keyword according to the comparison result.
Specifically, in the embodiment of the present invention, in addition to applying the scheme described in steps S102 to S108 to malware, the same steps can be applied to non-malicious software (that is, normal software) to obtain keyword classification results for the non-malicious software. After the multiple groups of keyword classification results are obtained, the target keyword group can be compared with the classification results of the same type among them to determine whether the target keyword group is correct. Under normal circumstances, the similarity distance between any two keywords in non-malicious software is greater than a certain value, for example greater than 0.28.
For example, suppose the non-malicious software contains keyword 1, keyword 2 and keyword 3, and the malware likewise contains keyword 1, keyword 2 and keyword 3. Under normal circumstances, the similarity distance between keyword 1 and keyword 2 in the non-malicious software should be greater than the similarity distance between keyword 1 and keyword 2 in the malware. Therefore, in the embodiment of the present invention, comparing the target keyword group with the multiple groups of keyword classification results makes it possible to determine the authenticity of the target keywords.
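The comparison in the example above can be sketched as follows. This is only an illustration under stated assumptions: the function names `pair` and `validate_pairs`, the dictionary layout, and all distance values are hypothetical, and treating a pair that never occurs in normal software as confirmed is a choice made here, not something the text specifies.

```python
def pair(a, b):
    """Order-independent key for a keyword pair."""
    return frozenset((a, b))


def validate_pairs(malware_dist, benign_dist):
    """Split malware keyword pairs into confirmed and rejected.

    A pair is confirmed when its similarity distance in normal software is
    greater than in malware (the keywords are more tightly coupled in
    malware), or when normal software never uses the pair at all.
    """
    confirmed, rejected = [], []
    for key, d_mal in malware_dist.items():
        d_ben = benign_dist.get(key)
        if d_ben is None or d_ben > d_mal:
            confirmed.append(key)
        else:
            rejected.append(key)
    return confirmed, rejected


# Hypothetical distances, for illustration only.
malware_dist = {
    pair("SEND_SMS", "READ_CONTACTS"): 0.12,
    pair("INTERNET", "VIBRATE"): 0.40,
    pair("INSTALL_PACKAGES", "SEND_SMS"): 0.10,
}
benign_dist = {
    pair("SEND_SMS", "READ_CONTACTS"): 0.35,
    pair("INTERNET", "VIBRATE"): 0.30,
}

confirmed, rejected = validate_pairs(malware_dist, benign_dist)
```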
In another optional embodiment of the present invention, after the target keyword group is determined to be the feature information of each malware, the method further includes the following steps:
Step S5: obtain at least one software to be detected;
Step S6: analyze the at least one software to be detected using the extracted feature information to determine whether the at least one software to be detected is malware.
In the embodiment of the present invention, once the feature information of multiple types of malware has been extracted, software to be detected can be checked against that feature information to determine whether it is malware.
It should be noted that, in the embodiment of the present invention, after at least one software to be detected has been checked against the feature information, the detection result is either detection success or detection failure. Detection success means that the software to be detected has been determined to be either non-malicious software or malware; detection failure means that it cannot be determined whether the software is malware or non-malicious software. Under normal circumstances, detection failure indicates that the software to be detected is a new type of malware or a mutated variant. In this case, in order to enrich the sample database of feature information, the software for which detection failed can be processed as described in steps S102 to S108 to obtain the feature information of the new or mutated malware.
It should further be noted that the at least one software to be detected may consist of multiple suspected malware samples of different types, or of multiple suspected malware samples of the same type.
Once the sample database of feature information has been enriched in this way, new or mutated malware can be detected accurately in subsequent detection.
Fig. 2 is a comparison chart of call probabilities by keyword type. Fig. 2 shows that APIs used to steal users' private information, for example the INTERNET class API for connecting to the network, the READ_CONTACTS class API for reading phone contacts and the READ_PHONE_STATE class API for reading phone state, have a call probability in malware roughly twice that in normal APKs. In addition, some APIs are almost never called in normal software yet are used very frequently in malware, for example the INSTALL_PACKAGES class for installing new software and the SEND_SMS class for sending text messages. INSTALL_PACKAGES can install other malicious or rogue software on the user's phone in the background, while SEND_SMS can run up the user's phone bill by sending messages, or even send user information remotely when no network is available. Reading and writing phone storage through WRITE_EXTERNAL_STORAGE is also one of the most common malware attack techniques and can be used to alter user data arbitrarily and steal information.
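A comparison of the kind plotted in Fig. 2 could be computed along the following lines. This is a minimal sketch that assumes the call probability of a keyword is the fraction of samples in which the keyword appears; the function names and the sample data are hypothetical and are not taken from the figure.

```python
def call_probability(samples, keyword):
    """Fraction of samples whose keyword set contains the keyword."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if keyword in s) / len(samples)


def compare_call_probability(malware, benign, keywords):
    """Return (keyword, p_malware, p_benign, ratio) for each keyword."""
    rows = []
    for kw in keywords:
        p_mal = call_probability(malware, kw)
        p_ben = call_probability(benign, kw)
        # A keyword never seen in normal software has no finite ratio.
        ratio = p_mal / p_ben if p_ben else float("inf")
        rows.append((kw, p_mal, p_ben, ratio))
    return rows


# Hypothetical keyword sets, one set per APK.
malware = [{"INTERNET", "SEND_SMS"}, {"INTERNET"}]
benign = [{"INTERNET"}, {"CAMERA"}]

rows = compare_call_probability(malware, benign, ["INTERNET", "SEND_SMS"])
```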
It should be noted that, for the two categories android.view and android.graphics, malware and normal software generally differ little, so these categories can be removed from the feature set when the classifications are compared. In addition, the call probability of the thread Handler class API is also far higher in malware than in normal software; malware typically runs multiple threads at the same time to violate the user's rights.
Fig. 3 is a feature distance diagram in the embodiment of the present invention, where the feature distance is the similarity distance described above. The diagram shows the keyword feature set selected for malware that attacks mobile users' network traffic. In Fig. 3, a smaller number represents a higher similarity, that is, a shorter similarity distance.
The feature distances shown in Fig. 3 make it possible to analyze the general process by which this type of malware attacks a user. Specifically, the malware first reads the phone's system information and the network GPRS access point settings through WRITE_SETTINGS and WRITE_APN_SETTINGS; then it modifies the phone state through MODIFY_PHONE_STATE; next, it modifies the phone's network state through CHANGE_NETWORK_STATE and its Wi-Fi state through CHANGE_WIFI_STATE in order to connect to the network; further, it pushes junk advertisements to the user through BROADCAST_WAP_PUSH, and even places calls remotely through PROCESS_OUTGOING_CALLS. Similarly, in the embodiment of the present invention, the feature information extracted through the above steps reveals the main attack techniques and attack channels of Android phone malware, which is of considerable help in preventing and predicting malware.
Fig. 4 is a flowchart of an optional feature extraction method according to an embodiment of the present invention. As shown in Fig. 4, the method includes the following steps:
Step S4011: obtain normal software samples; step S4012: obtain malware samples. Specifically, in the embodiment of the present invention, at least one normal software sample and at least one malware sample are obtained.
Step S4021: decompile the APKs of the normal software samples; step S4022: decompile the APKs of the malware samples. Specifically, the APKs of the normal software samples and of the malware samples can be decompiled by the method described in step S1 above, yielding the source code and functional configuration file of each normal software sample and of each malware sample.
Step S4031: collect initial keyword group 1 of the normal software; step S4032: collect initial keyword group 2 of the malware. Specifically, in the embodiment of the present invention, the source code and functional configuration file of each normal software sample can be scanned with the KMP matching algorithm to obtain initial keyword group 1, and the source code and functional configuration file of each malware sample can be scanned with the KMP matching algorithm to obtain initial keyword group 2.
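The scanning step can be illustrated with a standard Knuth-Morris-Pratt matcher that counts how often each candidate keyword occurs in a text. The sketch below is a generic KMP implementation, not the patent's own code; the function names, the candidate keyword list and the sample text are assumptions made for illustration.

```python
def build_failure(pattern):
    """KMP failure table: length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_count(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    if not pattern:
        return 0
    fail = build_failure(pattern)
    count = 0
    k = 0
    for ch in text:
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            count += 1
            k = fail[k - 1]  # keep matching so overlaps are found
    return count


def keyword_frequencies(text, keywords):
    """Frequency of each candidate keyword in one scanned file."""
    return {kw: kmp_count(text, kw) for kw in keywords}


# Hypothetical scanned text and candidate keywords.
text = ("uses-permission SEND_SMS; uses-permission INTERNET; "
        "sendTextMessage(); SEND_SMS")
freq = keyword_frequencies(text, ["SEND_SMS", "INTERNET", "READ_CONTACTS"])
```

Keywords with a nonzero count would form the initial keyword group for that sample.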
Step S4041: calculate similarity distance 1 between any two keywords in initial keyword group 1; step S4042: calculate similarity distance 2 between any two keywords in initial keyword group 2. Specifically, in the embodiment of the present invention, similarity distance 1 between any two keywords in initial keyword group 1 can be calculated with the Google distance algorithm to obtain target distance matrix 1, and similarity distance 2 between any two keywords in initial keyword group 2 can be calculated with the Google distance algorithm to obtain target distance matrix 2.
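The construction of a distance matrix can be sketched as below. The sketch uses the normalized Google distance in its commonly published form, which may differ from the exact formula of the embodiment. It also assumes that the frequency of a keyword is the number of samples containing it and that the normalizing total N is the number of samples; both are assumptions, as are the function names and the sample data.

```python
import math
from itertools import combinations


def ngd(fx, fy, fxy, n):
    """Normalized Google distance from occurrence counts.

    fx, fy: samples containing each keyword; fxy: samples containing both;
    n: total number of samples.
    """
    if fxy == 0:
        return math.inf  # the keywords never occur together
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    denom = math.log(n) - min(lx, ly)
    if denom == 0:
        return 0.0  # both keywords occur in every sample
    return (max(lx, ly) - lxy) / denom


def distance_matrix(samples, keywords):
    """Symmetric keyword-by-keyword distance matrix as nested dicts."""
    n = len(samples)
    freq = {kw: sum(kw in s for s in samples) for kw in keywords}
    matrix = {kw: {kw: 0.0} for kw in keywords}
    for a, b in combinations(keywords, 2):
        both = sum(a in s and b in s for s in samples)
        d = ngd(freq[a], freq[b], both, n)
        matrix[a][b] = matrix[b][a] = d
    return matrix


# Hypothetical keyword sets, one per decompiled sample.
samples = [{"A", "B"}, {"A", "B"}, {"A"}, {"C"}]
matrix = distance_matrix(samples, ["A", "B", "C"])
```

Here `matrix["A"]["B"]` evaluates to log(3/2)/log(2), roughly 0.585, while `matrix["A"]["C"]` is infinite because A and C never occur in the same sample.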
Step S4051: classify initial keyword group 1 according to similarity distance 1 to obtain feature information 1; step S4052: classify initial keyword group 2 according to similarity distance 2 to obtain feature information 2. Specifically, in the embodiment of the present invention, initial keyword group 1 and initial keyword group 2 can be classified by the scheme described in step S1061 and step S1062 above, or by the scheme described in step S1064 and step S1064 above, to obtain target keyword group 1 and target keyword group 2, where target keyword group 1 is feature information 1 and target keyword group 2 is feature information 2.
Step S406: compare feature information 1 with feature information 2. Specifically, feature information 1 can be compared with feature information 2 to determine the authenticity of feature information 2.
Step S407: remove the invalid keywords from feature information 2. Specifically, in the embodiment of the present invention, after feature information 1 and feature information 2 have been compared, the invalid keywords in feature information 2 can be deleted according to the comparison result. It should be noted that, when feature information 1 and feature information 2 are compared, normal software and malware of the same type are usually compared with each other.
Step S408: feed in the software to be detected. Specifically, after step S407 has been performed, the software to be detected can be checked against the resulting feature information 2 to detect whether it is malware.
Step S409: judge whether detection of the software to be detected succeeded. If it succeeded, return to step S408; if it did not succeed, return to step S4032. Specifically, if the detection result for the software to be detected is success, that is, the software can be determined to be either normal software or malware, the method can return to step S408 to detect the next software to be detected. If the detection result is failure, that is, it cannot be determined whether the software to be detected is normal software or malware, the software can be provisionally determined to be new malware or mutated malware. In that case, the method can return to step S4021 to perform feature extraction on the software again, and the resulting feature information is learned by a target learning algorithm, such as a support vector machine, so that the support vector machine records the feature information of the mutated or new malware. New or mutated malware can then be detected accurately in subsequent detection.
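The feedback loop of steps S408 and S409 can be sketched as follows. The text does not specify how a single detection is carried out, so the detector is passed in as a function; `toy_detect`, the verdict strings and the sample data are hypothetical stand-ins, and the support vector machine learning step is omitted.

```python
def run_detection_loop(samples, detect, extract_features, feature_db):
    """Check each sample; on failure, extract its features and store them.

    detect(sample, feature_db) returns "malware", "normal", or None when no
    decision can be made. feature_db is extended in place.
    """
    verdicts = {}
    for name, sample in samples:
        verdict = detect(sample, feature_db)
        if verdict is None:
            # Detection failed: treat the sample as suspected new or
            # mutated malware and enrich the feature database with it.
            feature_db.append(extract_features(sample))
            verdict = "undetermined"
        verdicts[name] = verdict
    return verdicts


def toy_detect(sample, feature_db):
    """Hypothetical detector, for illustration only."""
    if any(group <= sample for group in feature_db):
        return "malware"  # a whole feature group is present
    if all(group.isdisjoint(sample) for group in feature_db):
        return "normal"   # no feature keyword is present
    return None           # partial overlap: cannot decide


feature_db = [frozenset({"SEND_SMS", "INTERNET"})]
samples = [
    ("a", {"SEND_SMS", "INTERNET", "CAMERA"}),
    ("b", {"CAMERA"}),
    ("c", {"SEND_SMS", "READ_CONTACTS"}),
    ("d", {"SEND_SMS", "READ_CONTACTS", "VIBRATE"}),
]
verdicts = run_detection_loop(samples, toy_detect, frozenset, feature_db)
```

In this toy run, sample "c" cannot be decided and its keywords are added to the database, after which the related sample "d" is recognized.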
The embodiment of the present invention proposes a feature extraction method, specifically a similarity-based method for extracting features of Android malware. The method uses Google distance to calculate the similarity between distinctive items in the source code, for example API calls, Android permissions and common parameters; the items are then classified according to similarity; next, a comparison experiment against the keywords in normal software can be carried out to obtain the features of Android malware. Further, in the embodiment of the present invention, machine learning can be performed on the set of feature information with a support vector machine (SVM), giving the method the ability to keep incorporating new software virus samples.
The embodiment of the present invention further provides a feature extraction device, which is mainly used to perform the feature extraction method provided above. The feature extraction device provided by the embodiment of the present invention is described in detail below.
Fig. 5 is a schematic diagram of a feature extraction device according to an embodiment of the present invention. As shown in Fig. 5, the feature extraction device mainly includes a scanning unit 51, a computing unit 52, a screening unit 53 and a first determination unit 54, in which:
The scanning unit is configured to scan the source file and preset file of each malware among at least one malware, obtaining the initial keyword group of each malware and the frequency of occurrence, in each malware, of each initial keyword in the initial keyword group.
The computing unit is configured to calculate, according to the frequency of occurrence, the similarity distance between any two keywords in the initial keyword group to obtain a target distance matrix.
The screening unit is configured to screen the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain a target keyword group, where the similarity distance between any two keywords in the target keyword group satisfies a preset classification value.
The first determination unit is configured to determine that the target keyword group is the feature information of each malware.
In the embodiment of the present invention, once the above target keyword group has been obtained by screening, the target keyword group can be determined to be the feature information of that class of malware.
In the embodiment of the present invention, the source file and functional configuration file of the malware are first scanned to obtain the initial keyword group of each type of malware and the frequency of occurrence of each initial keyword in the preset file and source file; the similarity distance between any two keywords in the initial keyword group is then calculated according to the frequency of occurrence to obtain the target distance matrix; next, the initial keyword group is screened according to the similarity distance, and the keywords that meet the requirement form the target keyword group; finally, the target keyword group is determined to be the feature information of that class of malware. By calculating the similarity distance between keywords in malware, the embodiment of the present invention can identify not only traditional malware but also mutated and new malware. Compared with feature extraction methods in the prior art, this achieves comprehensive malware detection, alleviating the technical problem in the prior art of incomplete detection and removal of malware, and thereby achieving the technical effect of comprehensively improving malware detection.
Optionally, the screening unit includes: a first processing module, configured to take each keyword in the initial keyword group as a root node and perform a priority search operation on the root node to obtain a target search result, where the similarity distance between any two keywords in the target search result is less than or equal to the preset classification value, and any two keywords in the target search result are associated with each other; and a first searching module, configured to search the target search result for target keywords to form the target keyword group, where the target search result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword and the keywords in the target search result belong to the same category.
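One way to read the search-based screening is sketched below. The kind of search is not spelled out in this passage, so a breadth-first expansion from the root is assumed, and a candidate joins the group only if it is within the threshold of every keyword already in the group, which enforces the pairwise condition. The function names, the threshold of 0.28 and the distances are all assumptions made for illustration.

```python
from collections import deque


def search_group(root, matrix, threshold):
    """Grow a keyword group from root by breadth-first expansion.

    matrix is a complete symmetric nested dict of similarity distances.
    """
    group = [root]
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for cand, dist in matrix[current].items():
            if cand in group or dist > threshold:
                continue
            # Pairwise condition: close to every keyword already chosen.
            if all(matrix[cand][m] <= threshold for m in group):
                group.append(cand)
                queue.append(cand)
    return group


def target_groups(matrix, threshold):
    """Use every keyword as a root and keep groups with two or more keywords."""
    groups = set()
    for root in matrix:
        found = frozenset(search_group(root, matrix, threshold))
        if len(found) > 1:
            groups.add(found)
    return groups


# Hypothetical distance matrix: A, B and C are mutually close, D is far away.
matrix = {
    "A": {"A": 0.0, "B": 0.10, "C": 0.20, "D": 0.90},
    "B": {"A": 0.10, "B": 0.0, "C": 0.15, "D": 0.90},
    "C": {"A": 0.20, "B": 0.15, "C": 0.0, "D": 0.90},
    "D": {"A": 0.90, "B": 0.90, "C": 0.90, "D": 0.0},
}
groups = target_groups(matrix, threshold=0.28)
```

Because the expansion is greedy, the groups found depend on iteration order; an exhaustive search for maximal groups would be more expensive but order-independent.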
Optionally, the screening unit further includes: a second processing module, configured to perform learning on the initial keyword group with a support vector machine algorithm to obtain a target learning result, where the similarity distance between any two keywords in the target learning result is less than or equal to the preset classification value, and any two keywords in the target learning result are associated with each other; and a second searching module, configured to search the target learning result for target keywords to form the target keyword group, where the target learning result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword and the keywords in the target learning result belong to the same category.
Optionally, the computing unit includes: a computing module, configured to calculate the similarity distance between any two keywords in the initial keyword group by the Google distance calculation formula to obtain the target distance matrix, where $f_{C1}$ denotes the frequency of occurrence of initial keyword C1 in each malware, $f_{C2}$ denotes the frequency of occurrence of initial keyword C2 in each malware, and $f(C1, C2)$ denotes the frequency with which initial keyword C1 and initial keyword C2 occur together in each malware.
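For reference, the normalized Google distance as it is commonly defined in the literature is given below using the symbols of this paragraph. Whether the embodiment uses exactly this form is an assumption, and $N$, the total number of items over which the frequencies are counted, is an assumed symbol that the definitions here do not mention.

```latex
d(C1, C2) = \frac{\max\{\log f_{C1},\ \log f_{C2}\} - \log f(C1, C2)}
                 {\log N - \min\{\log f_{C1},\ \log f_{C2}\}}
```

Under this definition the distance is 0 when the two keywords always occur together and grows as they co-occur less often, which matches the reading of Fig. 3 in which a smaller number means a higher similarity.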
Optionally, the preset file includes a functional configuration file, and the device further includes: a decompiling unit, configured to decompile each malware with open source software before the source file and preset file of the at least one malware are scanned, obtaining the functional configuration file and the source file of each malware.
Optionally, the device further includes: a first acquisition unit, configured to obtain multiple groups of keyword classification results for non-malicious software after the target keyword group is determined to be the feature information of each malware; a comparison unit, configured to compare each target keyword with the classification results of the same type among the multiple groups of keyword classification results to obtain a comparison result; and a second determination unit, configured to determine the authenticity of each target keyword group according to the comparison result.
Optionally, the device further includes: a second acquisition unit, configured to obtain at least one software to be detected after the target keyword group is determined to be the feature information of each malware; and an analysis unit, configured to analyze the at least one software to be detected using the extracted feature information to determine whether the at least one software to be detected is malware.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A feature extraction method, characterized by comprising:
scanning the source file and preset file of each malware among at least one malware to obtain the initial keyword group of each malware and the frequency of occurrence, in each malware, of each initial keyword in the initial keyword group, wherein the Java source code, class library files and AndroidManifest.xml file are scanned with the KMP algorithm to obtain the initial keyword group of the malware and its frequency of occurrence in the malware, and the initial keyword group of the malware corresponds to the functions the malware performs when its program is executed;
calculating, according to the frequency of occurrence, the similarity distance between any two keywords in the initial keyword group to obtain a target distance matrix;
screening the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain a target keyword group, wherein the similarity distance between any two keywords in the target keyword group is less than or equal to a preset classification value;
determining that the target keyword group is the feature information of each malware.
2. The extraction method according to claim 1, characterized in that screening the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain the target keyword group comprises:
taking each keyword in the initial keyword group as a root node and performing a priority search operation on the root node to obtain a target search result, wherein the similarity distance between any two keywords in the target search result is less than or equal to the preset classification value, and any two keywords in the target search result are associated with each other;
searching the target search result for target keywords to form the target keyword group, wherein the target search result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword and the keywords in the target search result belong to the same category.
3. The extraction method according to claim 1, characterized in that screening the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain the target keyword group further comprises:
performing learning on the initial keyword group with a support vector machine algorithm to obtain a target learning result, wherein the similarity distance between any two keywords in the target learning result is less than or equal to the preset classification value, and any two keywords in the target learning result are associated with each other;
searching the target learning result for target keywords to form the target keyword group, wherein the target learning result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword and the keywords in the target learning result belong to the same category.
4. The extraction method according to any one of claims 1 to 3, characterized in that calculating, according to the frequency of occurrence, the similarity distance between any two keywords in the initial keyword group comprises:
calculating the similarity distance between any two keywords in the initial keyword group by the Google distance calculation formula to obtain the target distance matrix, wherein $f_{C1}$ denotes the frequency of occurrence of initial keyword C1 in each malware, $f_{C2}$ denotes the frequency of occurrence of initial keyword C2 in each malware, and $f(C1, C2)$ denotes the frequency with which the initial keyword C1 and the initial keyword C2 occur together in each malware.
5. The extraction method according to claim 1, characterized in that the preset file includes a functional configuration file, and before the source file and preset file of the at least one malware are scanned, the method further comprises:
decompiling each malware with open source software to obtain the functional configuration file of each malware and the source file of each malware.
6. The extraction method according to claim 1, characterized in that after the target keyword group is determined to be the feature information of each malware, the method further comprises:
obtaining multiple groups of keyword classification results for non-malicious software;
comparing each target keyword with the classification results of the same type among the multiple groups of keyword classification results to obtain a comparison result;
determining the authenticity of each target keyword group according to the comparison result.
7. The extraction method according to claim 1, characterized in that after the target keyword group is determined to be the feature information of each malware, the method further comprises:
obtaining at least one software to be detected;
analyzing the at least one software to be detected using the extracted feature information to determine whether the at least one software to be detected is malware.
8. A feature extraction device, characterized by comprising:
a scanning unit, configured to scan the source file and preset file of each malware among at least one malware to obtain the initial keyword group of each malware and the frequency of occurrence, in each malware, of each initial keyword in the initial keyword group, wherein the Java source code, class library files and AndroidManifest.xml file are scanned with the KMP algorithm to obtain the initial keyword group of the malware and its frequency of occurrence in the malware, and the initial keyword group of the malware corresponds to the functions the malware performs when its program is executed;
a computing unit, configured to calculate, according to the frequency of occurrence, the similarity distance between any two keywords in the initial keyword group to obtain a target distance matrix;
a screening unit, configured to screen the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain a target keyword group, wherein the similarity distance between any two keywords in the target keyword group is less than or equal to a preset classification value;
a first determination unit, configured to determine that the target keyword group is the feature information of each malware.
9. The extraction device according to claim 8, characterized in that the screening unit comprises:
a first processing module, configured to take each keyword in the initial keyword group as a root node and perform a priority search operation on the root node to obtain a target search result, wherein the similarity distance between any two keywords in the target search result is less than or equal to the preset classification value, and any two keywords in the target search result are associated with each other;
a first searching module, configured to search the target search result for target keywords to form the target keyword group, wherein the target search result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword and the keywords in the target search result belong to the same category.
10. The extraction device according to claim 8, characterized in that the screening unit further comprises:
a second processing module, configured to perform learning on the initial keyword group with a support vector machine algorithm to obtain a target learning result, wherein the similarity distance between any two keywords in the target learning result is less than or equal to the preset classification value, and any two keywords in the target learning result are associated with each other;
a second searching module, configured to search the target learning result for target keywords to form the target keyword group, wherein the target learning result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword and the keywords in the target learning result belong to the same category.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611041703.4A CN106503559B (en) | 2016-11-23 | 2016-11-23 | The extracting method and device of feature |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611041703.4A CN106503559B (en) | 2016-11-23 | 2016-11-23 | The extracting method and device of feature |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106503559A CN106503559A (en) | 2017-03-15 |
CN106503559B true CN106503559B (en) | 2019-03-19 |
Family
ID=58327928
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611041703.4A Active CN106503559B (en) | 2016-11-23 | 2016-11-23 | The extracting method and device of feature |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106503559B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN106503559A (en) | 2017-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Han et al. | MalDAE: Detecting and explaining malware based on correlation and fusion of static and dynamic characteristics | |
Tian et al. | An automated classification system based on the strings of trojan and virus families | |
Smutz et al. | Malicious PDF detection using metadata and structural features | |
Choi et al. | Efficient malicious code detection using N-gram analysis and SVM | |
Ye et al. | Hierarchical associative classifier (HAC) for malware detection from the large and imbalanced gray list | |
CN109684840A (en) | Android malware detection method based on sensitive call paths | |
Dai et al. | Efficient Virus Detection Using Dynamic Instruction Sequences. | |
Sun et al. | Malware family classification method based on static feature extraction | |
CN109271788B (en) | Android malicious software detection method based on deep learning | |
CN101924761A (en) | Method for detecting malicious program according to white list | |
CN101751535A (en) | Data loss protection through application data access classification | |
CN102446255B (en) | Method and device for detecting page tamper | |
CN111382439A (en) | Malicious software detection method based on multi-mode deep learning | |
Saccente et al. | Project achilles: A prototype tool for static method-level vulnerability detection of Java source code using a recurrent neural network | |
CN109614795B (en) | Event-aware android malicious software detection method | |
Alsulami et al. | Lightweight behavioral malware detection for windows platforms | |
CN108090360A (en) | Android malicious application classification method and system based on behavioral features | |
CN107239694A (en) | Android application permission inference method and device based on user comments | |
CN109800569A (en) | Program identification method and device | |
KR102516454B1 (en) | Method and apparatus for generating summary of url for url clustering | |
CN112817877B (en) | Abnormal script detection method and device, computer equipment and storage medium | |
Ye et al. | Intelligent file scoring system for malware detection from the gray list | |
CN106503559B (en) | The extracting method and device of feature | |
CN104036189A (en) | Page distortion detecting method and black link database generating method | |
CN113704759B (en) | Adaboost-based android malicious software detection method and system and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||