CN106503559B - Feature extraction method and device - Google Patents

Info

Publication number
CN106503559B
Authority
CN
China
Legal status
Active
Application number
CN201611041703.4A
Other languages
Chinese (zh)
Other versions
CN106503559A (en)
Inventor
孙军梅
杨春雷
Current Assignee
Hangzhou Normal University
Original Assignee
Hangzhou Normal University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/561 Virus type analysis
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis

Abstract

The present invention provides a feature extraction method and device and relates to the technical field of feature extraction. The method comprises: scanning the source file and preset file of each of at least one piece of malware to obtain an initial keyword group for each piece of malware and the frequency with which each initial keyword in that group occurs in the malware; calculating, from the occurrence frequencies, the similarity distance between any two keywords in the initial keyword group to obtain a target distance matrix; screening the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain a target keyword group, wherein the similarity distance between any two keywords in the target keyword group satisfies a preset classification value; and determining the target keyword group to be the feature information of each piece of malware. This addresses the technical problem in the prior art that malware detection and removal is incomplete.

Description

Feature extraction method and device
Technical field
The present invention relates to the technical field of computers, and in particular to a feature extraction method and device.
Background art
With the arrival of the Internet era, smartphone penetration worldwide keeps rising. Smartphone operating systems are mainly Android and iOS, and Android has won a huge market share on the strength of its performance. As smartphones have developed, more and more mobile malware has appeared on the market, endangering users' information security. Major security laboratories have gradually made mobile security a research focus, but effectively detecting and removing new malware and malware variants remains a difficult problem.
In the prior art, traditional malware is detected and removed mainly by identifying it with a signature (feature code) extraction method, which is based chiefly on the program's binary text. A feature extraction method based on program binary text can only detect and remove traditional malware; it cannot detect new malware or malware variants.
Summary of the invention
The purpose of the present invention is to provide a feature extraction method and device that alleviate the technical problem in the prior art that malware detection and removal is incomplete.
According to one aspect of the embodiments of the present invention, a feature extraction method is provided, comprising: scanning the source file and preset file of each of at least one piece of malware to obtain the initial keyword group of each piece of malware and the frequency with which each initial keyword in the initial keyword group occurs in that malware; calculating, according to the occurrence frequencies, the similarity distance between any two keywords in the initial keyword group to obtain a target distance matrix; screening the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain a target keyword group, wherein the similarity distance between any two keywords in the target keyword group satisfies a preset classification value; and determining the target keyword group to be the feature information of each piece of malware.
Further, screening the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain the target keyword group comprises: taking each keyword in the initial keyword group as a root node and performing a priority-search operation on the root node to obtain a target search result, wherein the similarity distance between any two keywords in the target search result is less than or equal to the preset classification value and any two keywords in the target search result are associated with each other; and searching the target search result for target keywords to form the target keyword group, wherein the target search result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword belongs to the same category as the keywords in the target search result.
Further, screening the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain the target keyword group also comprises: performing learning on the initial keyword group with a support vector machine algorithm to obtain a target learning result, wherein the similarity distance between any two keywords in the target learning result is less than or equal to the preset classification value and any two keywords in the target learning result are associated with each other; and searching the target learning result for target keywords to form the target keyword group, wherein the target learning result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword belongs to the same category as the keywords in the target learning result.
Further, calculating the similarity distance between any two keywords in the initial keyword group according to the occurrence frequencies comprises: calculating that distance by the Google distance formula

$$\mathrm{NGD}(C1, C2) = \frac{\max\{\lg f_{C1}, \lg f_{C2}\} - \lg f(C1, C2)}{\lg N - \min\{\lg f_{C1}, \lg f_{C2}\}}$$

to obtain the target distance matrix, wherein $f_{C1}$ denotes the occurrence frequency of initial keyword C1 in each piece of malware, $f_{C2}$ denotes the occurrence frequency of initial keyword C2 in each piece of malware, $f(C1, C2)$ denotes the frequency with which initial keywords C1 and C2 occur together in each piece of malware, and $N$ denotes the total number of searches.
Further, the preset file includes a function configuration file, and before the source file and preset file of the at least one piece of malware are scanned, the method further comprises: decompiling each piece of malware with open-source software to obtain the function configuration file and the source file of each piece of malware.
Further, after the target keyword group is determined to be the feature information of each piece of malware, the method further comprises: obtaining multiple groups of keyword classification results for non-malicious software; comparing each target keyword with the classification results of the same type among the multiple groups of keyword classification results to obtain a comparison result; and determining the authenticity of each target keyword group according to the comparison result.
Further, after the target keyword group is determined to be the feature information of each piece of malware, the method further comprises: obtaining at least one piece of software to be detected; and analyzing the at least one piece of software to be detected using the extracted feature information to determine whether it is malware.
According to another aspect of the embodiments of the present invention, a feature extraction device is provided, comprising: a scanning unit, configured to scan the source file and preset file of each of at least one piece of malware to obtain the initial keyword group of each piece of malware and the frequency with which each initial keyword in the initial keyword group occurs in that malware; a calculation unit, configured to calculate, according to the occurrence frequencies, the similarity distance between any two keywords in the initial keyword group to obtain a target distance matrix; a screening unit, configured to screen the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain a target keyword group, wherein the similarity distance between any two keywords in the target keyword group satisfies a preset classification value; and a first determination unit, configured to determine the target keyword group to be the feature information of each piece of malware.
Further, the screening unit includes: a first processing module, configured to take each keyword in the initial keyword group as a root node and perform a priority-search operation on the root node to obtain a target search result, wherein the similarity distance between any two keywords in the target search result is less than or equal to the preset classification value and any two keywords in the target search result are associated with each other; and a first search module, configured to search the target search result for target keywords to form the target keyword group, wherein the target search result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword belongs to the same category as the keywords in the target search result.
Further, the screening unit also includes: a second processing module, configured to perform learning on the initial keyword group with a support vector machine algorithm to obtain a target learning result, wherein the similarity distance between any two keywords in the target learning result is less than or equal to the preset classification value and any two keywords in the target learning result are associated with each other; and a second search module, configured to search the target learning result for target keywords to form the target keyword group, wherein the target learning result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword belongs to the same category as the keywords in the target learning result.
In the embodiments of the present invention, the source file and function configuration file of the malware are first scanned to obtain the initial keyword group of each piece of malware and the occurrence frequency of each initial keyword in the preset file. The similarity distance between any two keywords in the initial keyword group is then calculated from the occurrence frequencies to obtain the target distance matrix. Next, the initial keyword group is screened by similarity distance, and the keywords that meet the requirement form the target keyword group. Finally, the target keyword group is determined to be the feature information of that type of malware. By calculating the similarity distance between keywords in malware, the embodiments can identify not only traditional malware but also malware variants and new malware. Compared with the feature extraction methods of the prior art, this achieves comprehensive malware detection, alleviates the technical problem that malware detection and removal is incomplete, and thereby improves the comprehensiveness of malware detection.
Brief description of the drawings
To describe the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a feature extraction method according to an embodiment of the present invention;
Fig. 2 is a comparison chart of call probabilities by keyword type according to an embodiment of the present invention;
Fig. 3 is a feature distance diagram according to an embodiment of the present invention;
Fig. 4 is a flowchart of an optional feature extraction method according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a feature extraction device according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that orientation or positional terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner" and "outer" are based on the orientations or positions shown in the drawings. They are used only to simplify the description and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation, so they are not to be construed as limiting the invention. In addition, the terms "first", "second" and "third" are used for description only and are not to be understood as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "installed", "connected" and "coupled" are to be understood broadly. A connection may, for example, be fixed, detachable or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediary, or internal between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific situation.
According to an embodiment of the present invention, an embodiment of a feature extraction method is provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
Fig. 1 is a flowchart of a feature extraction method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: scan the source file and preset file of each of at least one piece of malware to obtain the initial keyword group of each piece of malware and the frequency with which each initial keyword in the initial keyword group occurs in that malware.
In the embodiments of the present invention, the Java source code (the source file mentioned above) and the preset file of the malware APK (Android Package) are obtained first. The preset file includes Java's built-in class library files and the AndroidManifest.xml file (the function configuration file); the source file may include class files written by hand by developers. The Java source code and the preset file (for example, the class library files and the AndroidManifest.xml file) are then scanned with an improved string matching algorithm. Specifically, the KMP algorithm can be used to scan the Java source code, class library files and AndroidManifest.xml file to obtain the initial keyword group of the malware and the occurrence frequency of each keyword in that malware. The initial keyword group of a piece of malware correspondingly indicates the functions the malware performs when the program executes.
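The scan in step S102 can be illustrated with a minimal Python sketch of standard KMP matching used to count keyword occurrences. The patent refers to an "improved" string matching algorithm without giving details, so this shows only plain KMP; the function names and the sample keyword list are assumptions made for illustration.

```python
def build_failure_table(pattern: str) -> list[int]:
    """KMP failure table: table[i] is the length of the longest proper
    prefix of pattern[:i + 1] that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_count(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text (overlapping matches included)."""
    if not pattern:
        return 0
    table = build_failure_table(pattern)
    count = 0
    k = 0  # number of pattern characters currently matched
    for ch in text:
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            count += 1
            k = table[k - 1]  # keep scanning for further matches
    return count


def keyword_frequencies(text: str, keywords: list[str]) -> dict[str, int]:
    """Map every keyword that occurs at least once to its occurrence count."""
    counts = {kw: kmp_count(text, kw) for kw in keywords}
    return {kw: n for kw, n in counts.items() if n > 0}


# Hypothetical candidate keywords and a stand-in for decompiled source text.
source_text = "sendTextMessage(); getDeviceId(); sendTextMessage();"
candidates = ["sendTextMessage", "getDeviceId", "abortBroadcast"]
frequencies = keyword_frequencies(source_text, candidates)
```

Here the keys of `frequencies` play the role of the initial keyword group and the values play the role of the occurrence frequencies.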
It should be noted that, in the embodiments of the present invention, when multiple pieces of malware of different types are scanned, one initial keyword group is obtained for each piece of malware.
Step S104: calculate the similarity distance between any two keywords in the initial keyword group according to the occurrence frequencies to obtain the target distance matrix.
In the embodiments of the present invention, once the scan has obtained the position of each initial keyword and its occurrence frequency at that position, the similarity distance between any two keywords in the initial keyword group can be calculated. After the similarity distance has been calculated for every pair of keywords, an n × n matrix is obtained, where n is the number of initial keywords; this is the target distance matrix. The smaller the similarity distance between two keywords, the higher the similarity between them.
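As a rough illustration of how the n × n target distance matrix can be assembled, the Python sketch below fills a symmetric matrix from any pairwise distance function. The function name `build_distance_matrix`, the lookup table `pair_distances` and the keyword strings are hypothetical; the distance function is passed in as a parameter so the sketch does not depend on a particular formula.

```python
from typing import Callable


def build_distance_matrix(
    keywords: list[str],
    distance: Callable[[str, str], float],
) -> list[list[float]]:
    """Return a symmetric n x n matrix whose entry [i][j] is the similarity
    distance between keywords[i] and keywords[j]; the diagonal stays 0."""
    n = len(keywords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(keywords[i], keywords[j])
            matrix[i][j] = d
            matrix[j][i] = d  # the distance is assumed to be symmetric
    return matrix


# Hypothetical pairwise distances standing in for computed values.
pair_distances = {
    ("getDeviceId", "sendTextMessage"): 0.12,
    ("getDeviceId", "onCreate"): 0.70,
    ("onCreate", "sendTextMessage"): 0.65,
}


def lookup(a: str, b: str) -> float:
    return pair_distances[tuple(sorted((a, b)))]


keywords = ["sendTextMessage", "getDeviceId", "onCreate"]
matrix = build_distance_matrix(keywords, lookup)
```

Only the upper triangle is computed, which halves the number of distance evaluations for large keyword groups.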
Step S106: screen the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain the target keyword group, wherein the similarity distance between any two keywords in the target keyword group satisfies the preset classification value.
In the embodiments of the present invention, after the target distance matrix is obtained, the initial keyword group can be screened according to the similarity distances in the matrix to obtain the target keyword group. After screening, the similarity distance between any two keywords in the target keyword group satisfies the preset classification value, for example 0.28. That is, the similarity distance between any two keywords in the target keyword group is less than or equal to 0.28.
Specifically, scanning the Java source code, class library files and AndroidManifest.xml file of the malware yields an initial keyword group that contains both keywords used to perform malicious operations and keywords that are not. During feature extraction, the keywords that perform malicious operations must therefore be screened out of the initial keywords to form the target keyword group. The screening criterion is the preset classification value, which is preferably chosen as 0.28.
Step S108: determine the target keyword group to be the feature information of each piece of malware.
In the embodiments of the present invention, once screening has produced the target keyword group, that group can be determined to be the feature information of the malware.
For malware of multiple types, features are extracted in every case as described in steps S102 to S108.
After features have been extracted by the above method, suspected malware can be judged against the extracted feature information to determine whether it is malware. This method can detect not only traditional malware but also, accurately, new malware and malware variants.
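The description does not say how the extracted feature information is compared with a suspected sample. One plausible reading, sketched below in Python purely as an assumption, treats each target keyword group as a set and flags a sample when a sufficient share of some group's keywords occurs in it. The function names, the `min_ratio` parameter and the keyword strings are hypothetical.

```python
def matches_feature(
    sample_keywords: set[str],
    feature_group: set[str],
    min_ratio: float = 1.0,
) -> bool:
    """True if at least min_ratio of the feature group's keywords occur
    in the sample (min_ratio=1.0 requires the whole group)."""
    if not feature_group:
        return False
    hits = len(sample_keywords & feature_group)
    return hits / len(feature_group) >= min_ratio


def is_suspicious(
    sample_keywords: set[str],
    feature_groups: list[set[str]],
    min_ratio: float = 1.0,
) -> bool:
    """True if the sample matches any extracted feature group."""
    return any(
        matches_feature(sample_keywords, group, min_ratio)
        for group in feature_groups
    )


# Hypothetical feature information: one target keyword group.
features = [{"sendTextMessage", "getDeviceId"}]
sample_a = {"sendTextMessage", "getDeviceId", "onCreate"}
sample_b = {"getDeviceId", "onCreate"}
```

With these inputs `is_suspicious(sample_a, features)` holds, while `sample_b` is flagged only if `min_ratio` is lowered to 0.5 or below.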
In the embodiments of the present invention, the source file and function configuration file of the malware are first scanned to obtain the initial keyword group of each type of malware, together with the position of each initial keyword in the preset file and its occurrence frequency at that position. The similarity distance between any two keywords in the initial keyword group is then calculated from the occurrence frequencies to obtain the target distance matrix. Next, the initial keyword group is screened by similarity distance, and the keywords that meet the requirement form the target keyword group. Finally, the target keyword group is determined to be the feature information of that type of malware. By calculating the similarity distance between keywords in malware, the embodiments can identify not only traditional malware but also malware variants and new malware. Compared with the feature extraction methods of the prior art, this achieves comprehensive malware detection, alleviates the technical problem that malware detection and removal is incomplete, and thereby improves the comprehensiveness of malware detection.
In an optional embodiment of the present invention, screening the keyword group according to the similarity distances recorded in the target distance matrix to obtain the target keyword group includes the following steps:
Step S1061: take each keyword in the initial keyword group as a root node and perform a priority-search operation on the root node to obtain a target search result, wherein the similarity distance between any two keywords in the target search result is less than or equal to the preset classification value and any two keywords in the target search result are associated with each other;
Step S1062: search the target search result for target keywords to form the target keyword group, wherein the target search result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword belongs to the same category as the keywords in the target search result.
There are many ways to screen the initial keyword group. In the embodiments of the present invention, a priority-search algorithm and a support vector machine algorithm are preferably used to screen the initial keyword group and obtain the target keyword group. Priority-search algorithms include the depth-first algorithm, the breadth-first algorithm and others; the embodiments do not limit which is used.
Specifically, the scheme described in steps S1061 and S1062 screens the initial keyword group by depth-first search, while steps S1063 and S1064 below screen it with a support vector machine.
When the initial keyword group is screened by depth-first search, the initial keywords are classified and screened using the similarity distances in the target distance matrix obtained in step S104. Specifically, each keyword in that matrix is taken as a root node for a depth-first search, and keywords with high similarity between nodes are recorded together to obtain the target search result. In other words, keywords whose pairwise similarity distance satisfies the preset classification value (for example, 0.28) are recorded together. After the target search result is obtained, target keywords are searched for in it such that every target keyword found has another keyword in the target search result whose similarity distance to it is less than or equal to the preset classification value (for example, 0.28), the target keyword and the keywords in the target search result belonging to one category. The target keywords found then form the target keyword group. In this way, keywords with a small similarity distance, that is, high similarity, are classified into one category and stored as one piece of feature information.
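The depth-first screening can be sketched in Python as follows. This is one possible reading rather than the patented procedure: keywords are treated as graph nodes, two keywords are linked when their similarity distance is at most the preset classification value, and each depth-first traversal collects one group of linked keywords. Keywords with no neighbour within the threshold are dropped. The function name `group_keywords` and the example data are assumptions.

```python
def group_keywords(
    keywords: list[str],
    matrix: list[list[float]],
    threshold: float = 0.28,
) -> list[list[str]]:
    """Group keywords reachable from one another through links whose
    similarity distance is <= threshold, using an iterative depth-first
    search. Groups with a single keyword are discarded."""
    n = len(keywords)
    visited = [False] * n
    groups = []
    for root in range(n):
        if visited[root]:
            continue
        visited[root] = True
        stack = [root]  # explicit stack gives depth-first order
        component = []
        while stack:
            node = stack.pop()
            component.append(keywords[node])
            for other in range(n):
                if (
                    other != node
                    and not visited[other]
                    and matrix[node][other] <= threshold
                ):
                    visited[other] = True
                    stack.append(other)
        if len(component) > 1:
            groups.append(sorted(component))
    return groups


# Hypothetical keywords and distances.
keywords = ["a", "b", "c", "d"]
matrix = [
    [0.0, 0.1, 0.5, 0.9],
    [0.1, 0.0, 0.2, 0.9],
    [0.5, 0.2, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.0],
]
groups = group_keywords(keywords, matrix)
```

Note that `a` and `c` end up in the same group through `b` even though their direct distance is 0.5. If every pair in a group must satisfy the threshold, as the wording of step S1061 suggests, an additional pairwise check on each group would be needed.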
In another optional embodiment of the present invention, screening the keyword group according to the target distance matrix to obtain the target keyword group further includes the following steps:
Step S1063: perform learning on the initial keyword group with a support vector machine algorithm to obtain a target learning result, wherein the similarity distance between any two keywords in the target learning result is less than or equal to the preset classification value and any two keywords in the target learning result are associated with each other;
Step S1064: search the target learning result for target keywords to form the target keyword group, wherein the target learning result contains at least one keyword whose similarity distance to the target keyword is less than the preset classification value, and the target keyword belongs to the same category as the keywords in the target learning result.
In the embodiments of the present invention, besides the depth-first search described in steps S1061 and S1062, the initial keyword group can also be learned by a support vector machine. Learning the initial keyword group with a support vector machine produces two classes: one in which the similarity distance is greater than the preset classification value, and one in which it is less than or equal to the preset classification value. That is, after the support vector machine has learned the initial keyword group, the keywords whose similarity distance is less than or equal to the preset classification value can be screened out to form the target learning result.
After the target learning result is obtained, target keywords are searched for in it such that every target keyword found has another keyword in the target learning result whose similarity distance to it is less than or equal to the preset classification value (for example, 0.28), the target keyword and the keywords in the target learning result belonging to one category. The target keywords found then form the target keyword group. In this way, keywords with a small similarity distance, that is, high similarity, are classified into one category and stored as one piece of feature information.
In another optional embodiment of the present invention, calculating the similarity distance between any two keywords in the initial keyword group according to the occurrence frequencies includes step S1041:
Step S1041: calculate the similarity distance between any two keywords in the initial keyword group by the Google distance formula

$$\mathrm{NGD}(C1, C2) = \frac{\max\{\lg f_{C1}, \lg f_{C2}\} - \lg f(C1, C2)}{\lg N - \min\{\lg f_{C1}, \lg f_{C2}\}}$$

to obtain the target distance matrix, wherein $f_{C1}$ denotes the occurrence frequency of initial keyword C1 in each piece of malware, $f_{C2}$ denotes the occurrence frequency of initial keyword C2 in each piece of malware, and $f(C1, C2)$ denotes the frequency with which initial keywords C1 and C2 occur together in each piece of malware.
Specifically, there are many ways to calculate the similarity distance between any two keywords. In the embodiments of the present invention, the pairwise similarity distance between keywords is preferably calculated by the Google distance algorithm.
The calculation is described as follows. In the formula above, $f_{C1}$ denotes the number of times initial keyword C1 occurs in each piece of malware, $f_{C2}$ denotes the number of times initial keyword C2 occurs in each piece of malware, $f(C1, C2)$ denotes the number of common results returned by a search that contains both initial keyword C1 and initial keyword C2, $N$ denotes the total number of searches, $\lg$ is the base-10 logarithm, and $\mathrm{NGD}(C1, C2)$ is the similarity distance between initial keyword C1 and initial keyword C2.
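A short Python sketch of the Google distance calculation follows. It assumes the formula has the standard Normalized Google Distance form, NGD = (max(lg f1, lg f2) - lg f12) / (lg N - min(lg f1, lg f2)), which is consistent with the symbols defined in the text. The function name `ngd`, the argument names and the handling of a zero co-occurrence count are assumptions.

```python
import math


def ngd(f1: int, f2: int, f12: int, n: int) -> float:
    """Normalized Google Distance between two keywords.

    f1, f2: occurrence counts of the two keywords (must be > 0)
    f12:    count of joint occurrences of both keywords
    n:      total count used for normalization (must exceed min(f1, f2))
    """
    if f12 == 0:
        # Assumption: keywords that never occur together are treated as
        # infinitely far apart; the source does not cover this case.
        return math.inf
    log1, log2 = math.log10(f1), math.log10(f2)
    numerator = max(log1, log2) - math.log10(f12)
    denominator = math.log10(n) - min(log1, log2)
    return numerator / denominator


# Worked example: (lg 100 - lg 10) / (lg 1000 - lg 10) = (2 - 1) / (3 - 1)
example = ngd(f1=100, f2=10, f12=10, n=1000)
```

Two keywords that always occur together (`f1 == f2 == f12`) have distance 0, and the distance grows as the joint count shrinks relative to the individual counts.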
Due to the Google distance used in the embodiment of the present invention be the common frequency of occurrences of the vocabulary based on Web text come It calculates, this is similar with the style of work for finding the higher Malware keyword of related information in APK.Therefore, it can incite somebody to action Each individually APK file regards the set of a text as, certain one kind is had to the keyword of destruction using Google's distance Set extracts among text.Android program design language when designing at the beginning, individual program statement, Such as API Calls, it is arranged what the operations such as work permission were necessarily safe from harm.And a Malware causes damages to user, Necessarily several normal sentences combine the destruction just caused to user's right.Feature provided in an embodiment of the present invention mentions Taking method is exactly that the set of such keyword is extracted in qualitative Malware, constitutes the feature sample an of Malware This, for being detected for other APK.
In another optional embodiment of the present invention, the default file includes a functional configuration file, and before the source file and the default file of the at least one malware are scanned, the method further includes the following step:
Step S1: decompile each malware with open source software to obtain the functional configuration file of each malware and the source file of each malware.
In the embodiment of the present invention, the APK file of the acquired malware is first renamed with the .RAR extension; the APK file is then decompressed to obtain the Android functional configuration file AndroidManifest.xml. Specifically, the functional configuration file can be obtained by applying the open source software AXMLPrinter2.jar to the APK file through the following batch operation: "java -jar " + properties.getProperty("user.dir") + "\\AXMLPrinter2.jar " + path + "AndroidManifest.xml > " + path + "AndroidManifest.txt", where the functional configuration file obtained by this operation is in a text format that can be processed directly.
Next, the folder of the decompressed APK file contains the source code binary file classes.dex, which is decompiled with the open source software dex2jar to obtain a jar file. Specifically, the decompilation can be carried out by calling the following statement: "cmd /c start " + properties.getProperty("user.dir") + "\\dex2jar-0.0.9.15\\" + "dex2jar" + ".bat" + " " + path + "classes.dex". The jar file obtained in this way is not yet source code, so the jar file also needs to be restored with the open source software jadnt158, after which the Java source code of the APK file is obtained.
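The unpacking and tool invocation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the patent's implementation: it relies on the fact that an APK uses the ZIP container format (so it opens the file directly instead of renaming it), and the function names, `tool_dir` and the directory layout under it are hypothetical. It only builds the argument lists for AXMLPrinter2 and dex2jar and does not run them.

```python
import zipfile
from pathlib import Path

# Files the description extracts from the APK.
WANTED = ("AndroidManifest.xml", "classes.dex")


def unpack_apk(apk_path, out_dir):
    """Extract the manifest and classes.dex from an APK (a ZIP archive)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(apk_path) as apk:
        present = set(apk.namelist())
        for name in WANTED:
            if name in present:
                apk.extract(name, out)
                extracted.append(out / name)
    return extracted


def build_commands(tool_dir, work_dir):
    """Return argument lists for the two tools named in the text.

    tool_dir and the layout beneath it are assumptions for illustration.
    """
    tool_dir, work_dir = Path(tool_dir), Path(work_dir)
    # AXMLPrinter2 prints readable XML to standard output; the caller is
    # expected to redirect it to AndroidManifest.txt, as the '>' in the
    # original command string does.
    manifest_cmd = ["java", "-jar", str(tool_dir / "AXMLPrinter2.jar"),
                    str(work_dir / "AndroidManifest.xml")]
    # dex2jar converts classes.dex into a jar file.
    dex_cmd = [str(tool_dir / "dex2jar-0.0.9.15" / "dex2jar.bat"),
               str(work_dir / "classes.dex")]
    return manifest_cmd, dex_cmd
```

The argument lists could be passed to `subprocess.run`, with `stdout` pointed at an open AndroidManifest.txt file for the first command; restoring Java source from the resulting jar is left to the separate decompiler named in the text.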
In another optional embodiment of the present invention, after the target keyword group is determined to be the feature information of each malware, the method further includes the following steps:
Step S2: obtain multiple groups of keyword classification results of non-malicious software;
Step S3: compare each target keyword with the classification results of the same type among the multiple groups of keyword classification results to obtain a comparison result;
Step S4: determine the authenticity of each target keyword according to the comparison result.
Specifically, in the embodiment of the present invention, in addition to applying the scheme described in steps S102 to S108 to malware, the same steps can be applied to non-malicious software (that is, normal software) to obtain the keyword classification results of the non-malicious software. After the multiple groups of keyword classification results are obtained, the target keyword group can be compared with the classification results of the same type among them to determine whether the target keyword group is correct. Under normal circumstances, the similarity distance between any two keywords in non-malicious software is greater than a certain value, for example, greater than 0.28.
For example, suppose the non-malicious software contains keyword 1, keyword 2 and keyword 3, and the malware likewise contains keyword 1, keyword 2 and keyword 3. Under normal circumstances, the similarity distance between keyword 1 and keyword 2 in the non-malicious software should be greater than the similarity distance between keyword 1 and keyword 2 in the malware. Therefore, in the embodiment of the present invention, the authenticity of the target keywords can be determined by comparing the target keyword group with the multiple groups of keyword classification results.
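The comparison in steps S2 to S4 can be illustrated with a short Python sketch. The text only states that a keyword pair should be farther apart in normal software than in malware; the concrete rule below (confirm a pair when its malware distance is strictly smaller), the function name and the data layout are assumptions made for illustration.

```python
import math


def validate_pairs(malware_dist, normal_dist):
    """Split malware keyword pairs into confirmed and rejected lists.

    Both arguments map a sorted (keyword_a, keyword_b) tuple to a
    similarity distance. A pair is confirmed when it is closer in malware
    than in normal software; a pair never seen in normal software is
    treated as infinitely distant there.
    """
    confirmed, rejected = [], []
    for pair, d_mal in malware_dist.items():
        d_norm = normal_dist.get(pair, math.inf)
        if d_mal < d_norm:
            confirmed.append(pair)
        else:
            rejected.append(pair)
    return confirmed, rejected
```

With this rule, a pair such as (keyword 1, keyword 2) at distance 0.10 in malware and 0.40 in normal software would be kept, while a pair that is equally close in both would be dropped as not characteristic of malware.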
In another optional embodiment of the present invention, after the target keyword group is determined to be the feature information of each malware, the method further includes the following steps:
Step S5: obtain at least one software to be detected;
Step S6: analyze the at least one software to be detected using the extracted feature information to determine whether the at least one software to be detected is malware.
In the embodiment of the present invention, once the feature information of multiple types of malware has been extracted, the software to be detected can be checked against that feature information to determine whether it is malware.
It should be noted that, in the embodiment of the present invention, after the at least one software to be detected is checked against the feature information, the detection result is either detection success or detection failure. Detection success means that the software to be detected can be determined to be non-malicious software or to be malware; detection failure means that it cannot be determined whether the software is malware or non-malicious software. Under normal circumstances, a detection failure indicates that the software to be detected is new malware or a mutated variant. In this case, in order to enrich the sample database of feature information, the software for which detection failed can be processed as described in steps S102 to S108 to obtain the feature information of the new malware or the mutated malware.
It should be further noted that the at least one software to be detected may be multiple suspected malware of different types at the same time, or multiple suspected malware of the same type at the same time.
After the sample database of feature information has been enriched in this way, new malware or mutated malware can be detected accurately in subsequent detection.
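The three possible outcomes described above (malware, non-malicious, or detection failure) can be sketched as follows. The text does not say how a match against the feature information is decided, so the subset rule used here, together with all function and variable names, is purely an assumption for illustration.

```python
def classify(app_keywords, malware_groups, normal_groups):
    """Return 'malware', 'normal', or None when detection fails.

    Assumed rule: an app matches a feature group when it contains every
    keyword of that group.
    """
    if any(group <= app_keywords for group in malware_groups):
        return "malware"
    if any(group <= app_keywords for group in normal_groups):
        return "normal"
    return None


def run_detection(apps, malware_groups, normal_groups):
    """Classify each app; collect failures for renewed feature extraction."""
    verdicts, needs_extraction = {}, []
    for name, keywords in apps.items():
        verdict = classify(keywords, malware_groups, normal_groups)
        if verdict is None:
            # Possibly new or mutated malware: queue it so its features
            # can be extracted and added to the sample database.
            needs_extraction.append(name)
        else:
            verdicts[name] = verdict
    return verdicts, needs_extraction
```

Apps returned in the second list correspond to the detection failures that the description feeds back into steps S102 to S108.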
Fig. 2 is a comparison chart of call probabilities by keyword type. It can be seen from Fig. 2 that for APIs that steal users' private information, for example, the INTERNET class API for connecting to the network, the READ_CONTACTS class API for reading phone contacts, and the READ_PHONE_STATE class API for reading the phone state, the call probability in malware is about twice that in normal APKs. In addition, some API calls are almost absent from normal software but are used very frequently in malware, for example, the INSTALL_PACKAGES class for installing new software and the SEND_SMS class for sending short messages. INSTALL_PACKAGES can install other malicious rogue software on the user's phone in the background, while SEND_SMS can run up the user's phone bill by sending messages and can even send user information remotely when no network is available. WRITE_EXTERNAL_STORAGE, which reads and writes the phone's storage, is also one of the most common attack means of malware and can be used to alter user data arbitrarily and steal information.
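A comparison of this kind can be computed with a few lines of Python. The sketch below assumes that "call probability" means the fraction of samples in which a keyword appears at least once; that definition and the function names are assumptions, not something the text states.

```python
import math


def call_probability(samples, keyword):
    """Fraction of samples whose keyword set contains `keyword`."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if keyword in s) / len(samples)


def compare_call_probabilities(malware, normal, keywords):
    """Map each keyword to (p_malware, p_normal, ratio)."""
    table = {}
    for kw in keywords:
        p_mal = call_probability(malware, kw)
        p_norm = call_probability(normal, kw)
        # A keyword absent from normal software gets an infinite ratio
        # when malware uses it, and a ratio of 0.0 when nobody uses it.
        if p_norm > 0:
            ratio = p_mal / p_norm
        else:
            ratio = math.inf if p_mal > 0 else 0.0
        table[kw] = (p_mal, p_norm, ratio)
    return table
```

Each sample is represented as the set of keywords found in one APK. A ratio near 2 would correspond to the "about twice" case described for INTERNET, and an infinite ratio to keywords such as SEND_SMS that normal software almost never uses.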
It should be noted that, for the two categories android.view and android.graphics, the difference between malware and normal software is generally small; therefore, they can be removed from the respective feature sets during comparison and classification. In addition, the call probability of the thread Handler class API in malware is also far greater than in normal software; malware generally starts multiple threads at the same time to infringe on the user's rights.
Fig. 3 is a feature distance diagram in the embodiment of the present invention, where the feature distance is the similarity distance described above. For this diagram, the keyword feature set of malware that attacks the network traffic of mobile phone users was selected. In Fig. 3, a smaller number represents a higher similarity, that is, a shorter similarity distance.
From the feature distances shown in Fig. 3, the approximate process by which this type of malware attacks the user can be analyzed. Specifically, the malware first reads the system information of the user's phone and the network GPRS access point settings through WRITE_SETTINGS and WRITE_APN_SETTINGS; it then modifies the phone state through MODIFY_PHONE_STATE; next, it modifies the phone's network state through CHANGE_NETWORK_STATE and the phone's WIFI state through CHANGE_WIFI_STATE in order to connect to the network; further, it pushes junk advertisements to the user through BROADCAST_WAP_PUSH and even places calls remotely through PROCESS_OUTGOING_CALLS. Similarly, in the embodiment of the present invention, extracting the feature information of malware through the above steps reveals the main attack means and attack channels of Android phone malware, which is helpful for the prevention and prediction of malware.
Fig. 4 is a flowchart of an optional feature extraction method according to an embodiment of the present invention. As shown in Fig. 4, the method includes the following steps:
Step S4011: obtain normal software samples; step S4012: obtain malware samples. Specifically, in the embodiment of the present invention, at least one normal software sample is obtained, and likewise at least one malware sample is obtained.
Step S4021: decompile the APK of each normal software sample; step S4022: decompile the APK of each malware sample. Specifically, the APKs of the normal software samples and of the malware samples can be decompiled by the method described in step S1 above to obtain the source code and functional configuration file of each normal software sample and the source code and functional configuration file of each malware sample.
Step S4031: collect initial keyword group 1 of the normal software; step S4032: collect initial keyword group 2 of the malware. Specifically, in the embodiment of the present invention, the source code and functional configuration file of each normal software sample can be scanned by the KMP matching algorithm to obtain initial keyword group 1, and the source code and functional configuration file of each malware sample can be scanned by the KMP matching algorithm to obtain initial keyword group 2.
Step S4041: calculate similarity distance 1 between any two keywords in initial keyword group 1; step S4042: calculate similarity distance 2 between any two keywords in initial keyword group 2. Specifically, in the embodiment of the present invention, similarity distance 1 between any two keywords in initial keyword group 1 can be calculated by the Google distance algorithm to obtain target distance matrix 1, and similarity distance 2 between any two keywords in initial keyword group 2 can be calculated by the Google distance algorithm to obtain target distance matrix 2.
Step S4051: classify initial keyword group 1 according to similarity distance 1 to obtain feature information 1; step S4052: classify initial keyword group 2 according to similarity distance 2 to obtain feature information 2. Specifically, in the embodiment of the present invention, initial keyword group 1 and initial keyword group 2 can be classified by the scheme described in step S1061 and step S1062 above, or by the scheme described in step S1064 and step S1064 above, to obtain target keyword group 1 and target keyword group 2, where target keyword group 1 is feature information 1 and target keyword group 2 is feature information 2.
Step S406: compare feature information 1 with feature information 2. Specifically, feature information 1 can be compared with feature information 2 to determine the authenticity of feature information 2.
Step S407: remove invalid keywords from feature information 2. Specifically, in the embodiment of the present invention, after feature information 1 and feature information 2 are compared, the invalid keywords in feature information 2 can be deleted according to the comparison result. It should be noted that, when feature information 1 and feature information 2 are compared, normal software and malware of the same type are usually compared with each other.
Step S408: input the software to be detected. Specifically, after step S407 is performed, the software to be detected can be checked against the resulting feature information 2 to determine whether it is malware.
Step S409: judge whether detection of the software to be detected succeeded; if so, return to step S408, and if not, return to step S4032. Specifically, if the detection result for the software to be detected is success, that is, the software can be determined to be normal software or malware, the method can return to step S408 to detect the next software to be detected. If the detection result is failure, that is, it cannot be determined whether the software to be detected is normal software or malware, it can be preliminarily determined that the software is new malware or mutated malware; in this case, the method can return to step S4021 to perform feature extraction on the software to be detected again. The resulting feature information is then learned by a target learning algorithm, such as a support vector machine, so that the support vector machine records the feature information of the mutated malware or new malware. As a result, new malware or mutated malware can be detected accurately in subsequent detection.
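Steps S4031 and S4032 above rely on the KMP matching algorithm to scan the decompiled text. A minimal Python sketch of KMP-based keyword counting is given below; the function names and the choice to count overlapping matches are assumptions made for illustration, and the keyword list would in practice come from a dictionary of API names and permissions that the text does not enumerate.

```python
def build_failure(pattern):
    """KMP failure table: longest proper prefix that is also a suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_count(text, pattern):
    """Count occurrences of pattern in text (overlaps included)."""
    if not pattern:
        return 0
    fail = build_failure(pattern)
    count = 0
    k = 0  # number of pattern characters matched so far
    for ch in text:
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            count += 1
            k = fail[k - 1]  # continue searching after a full match
    return count


def keyword_frequencies(text, keywords):
    """Return the keywords present in text with their occurrence counts."""
    counts = {kw: kmp_count(text, kw) for kw in keywords}
    return {kw: c for kw, c in counts.items() if c > 0}
```

Applied to the text of AndroidManifest.txt and the recovered Java source, `keyword_frequencies` would yield both the initial keyword group (the keys) and the frequencies of occurrence (the values) used in the later distance calculation.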
The embodiment of the present invention proposes a feature extraction method, specifically a similarity-based method for extracting features of Android malware. The method uses the Google distance to calculate the similarity between distinctive information in the source code, for example, API calls, Android permissions and common parameters; it then classifies this information according to the similarity; next, it can also run a comparison experiment against the keywords in normal software to obtain the features of Android malware. Further, in the embodiment of the present invention, machine learning can be performed on the set of feature information by an SVM (support vector machine), so that the method is able to keep accommodating new malware samples.
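The classification by similarity mentioned above can be sketched as a grouping over the distance matrix. The description requires that any two keywords in a resulting group lie within the default classification value of each other, but it does not spell out the search order here; the greedy nearest-first growth below, the function names, and the use of 0.28 as the threshold in the usage note are assumptions for illustration only.

```python
import math


def grow_group(root, matrix, threshold):
    """Grow a keyword group from root, nearest candidates first.

    A candidate joins only if its distance to every current member is at
    most `threshold`, so all pairs in the group satisfy the bound.
    matrix[a][b] is the similarity distance between keywords a and b.
    """
    group = [root]
    candidates = sorted((k for k in matrix if k != root),
                        key=lambda k: matrix[root].get(k, math.inf))
    for kw in candidates:
        if all(matrix[kw].get(m, math.inf) <= threshold for m in group):
            group.append(kw)
    return group


def target_groups(matrix, threshold):
    """Use every keyword as a root and keep groups with two or more members."""
    groups = set()
    for root in matrix:
        group = grow_group(root, matrix, threshold)
        if len(group) > 1:
            groups.add(frozenset(group))
    return groups
```

For instance, `target_groups(matrix, 0.28)` would return the sets of keywords that are mutually within 0.28 of each other; keywords with no close neighbour produce no group.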
The embodiment of the present invention also provides a feature extraction device, which is mainly used to carry out the feature extraction method provided above. The feature extraction device provided by the embodiment of the present invention is described in detail below.
Fig. 5 is a schematic diagram of a feature extraction device according to an embodiment of the present invention. As shown in Fig. 5, the feature extraction device mainly includes a scanning unit 51, a computing unit 52, a screening unit 53 and a first determination unit 54, where:
The scanning unit is configured to scan the source file and the default file of each malware in at least one malware to obtain the initial keyword group of each malware and the frequency of occurrence, in each malware, of each initial keyword in the initial keyword group.
The computing unit is configured to calculate, according to the frequency of occurrence, the similarity distance between any two keywords in the initial keyword group to obtain a target distance matrix.
The screening unit is configured to screen the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain a target keyword group, where the similarity distance between any two keywords in the target keyword group satisfies the default classification value.
The first determination unit is configured to determine that the target keyword group is the feature information of each malware.
In the embodiment of the present invention, once the target keyword group has been obtained by screening, it can be determined to be the feature information of that class of malware.
In the embodiment of the present invention, the source file and functional configuration file of the malware are first scanned to obtain the initial keyword group of each type of malware and the frequency of occurrence of each initial keyword in the default file and the source file; then, the similarity distance between any two keywords in the initial keyword group is calculated according to the frequency of occurrence to obtain the target distance matrix; next, the initial keyword group is screened according to the similarity distances, and the keywords that meet the requirement form the target keyword group; finally, the target keyword group is determined to be the feature information of that class of malware. In the embodiment of the present invention, by calculating the similarity distances of keywords in malware, not only conventional malware but also mutated malware and new malware can be identified. Compared with feature extraction methods in the prior art, this achieves comprehensive detection of malware, alleviates the technical problem that prior-art scanning and removal of malware is incomplete, and thereby achieves the technical effect of comprehensively improving malware detection.
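The step from frequencies of occurrence to a target distance matrix can be sketched as follows. The sketch makes several assumptions that the text leaves open: each malware sample is reduced to the set of keywords it contains, a keyword's frequency is the number of samples containing it, the joint frequency is the number of samples containing both keywords, and N is the number of samples. The function names are hypothetical.

```python
import math
from itertools import combinations


def ngd(f1, f2, f12, n):
    """Normalized Google distance with base-10 logarithms."""
    if f12 == 0:
        return math.inf  # the two keywords never occur together
    lg1, lg2 = math.log10(f1), math.log10(f2)
    denom = math.log10(n) - min(lg1, lg2)
    if denom == 0:
        return 0.0  # both keywords occur in every sample
    return (max(lg1, lg2) - math.log10(f12)) / denom


def distance_matrix(samples):
    """Build a symmetric keyword-to-keyword distance matrix.

    `samples` is a list of keyword sets, one per malware sample.
    Returns (sorted keyword list, nested dict of distances).
    """
    n = len(samples)
    keywords = sorted(set().union(*samples))
    freq = {k: sum(1 for s in samples if k in s) for k in keywords}
    matrix = {k: {k: 0.0} for k in keywords}
    for a, b in combinations(keywords, 2):
        both = sum(1 for s in samples if a in s and b in s)
        d = ngd(freq[a], freq[b], both, n)
        matrix[a][b] = matrix[b][a] = d
    return keywords, matrix
```

The nested dictionary plays the role of the target distance matrix: `matrix[a][b]` is the similarity distance between keywords `a` and `b`, and it can be screened against the default classification value in the next step.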
Optionally, the screening unit includes: a first processing module, configured to take each keyword in the initial keyword group as a root node and perform a priority search operation on the root node to obtain a target search result, where the similarity distance between any two keywords in the target search result is less than or equal to the default classification value, and any two keywords in the target search result are associated with each other; and a first searching module, configured to search for target keywords in the target search result to form the target keyword group, where the target search result contains at least one keyword whose similarity distance from the target keyword is less than the default classification value, and the target keyword and the keywords in the target search result belong to the same category.
Optionally, the screening unit further includes: a second processing module, configured to perform learning processing on the initial keyword group by a support vector machine algorithm to obtain a target learning result, where the similarity distance between any two keywords in the target learning result is less than or equal to the default classification value, and any two keywords in the target learning result are associated with each other; and a second searching module, configured to search for target keywords in the target learning result to form the target keyword group, where the target learning result contains at least one keyword whose similarity distance from the target keyword is less than the default classification value, and the target keyword and the keywords in the target learning result belong to the same category.
Optionally, the computing unit includes: a computing module, configured to calculate the similarity distance between any two keywords in the initial keyword group by the Google distance formula

$$\mathrm{NGD}(C1, C2) = \frac{\max\{\lg f_{C1}, \lg f_{C2}\} - \lg f(C1, C2)}{\lg N - \min\{\lg f_{C1}, \lg f_{C2}\}}$$

to obtain the target distance matrix, where $f_{C1}$ denotes the frequency of occurrence of the initial keyword C1 in each malware, $f_{C2}$ denotes the frequency of occurrence of the initial keyword C2 in each malware, and $f(C1, C2)$ denotes the frequency with which the initial keyword C1 and the initial keyword C2 occur together in each malware; $N$ and lg have the meanings given for the method embodiment above.
Optionally, the default file includes a functional configuration file, and the device further includes: a decompiling unit, configured to decompile each malware with open source software, before the source file and the default file of the at least one malware are scanned, to obtain the functional configuration file of each malware and the source file of each malware.
Optionally, the device further includes: a first acquisition unit, configured to obtain multiple groups of keyword classification results of non-malicious software after the target keyword group is determined to be the feature information of each malware; a comparison unit, configured to compare each target keyword with the classification results of the same type among the multiple groups of keyword classification results to obtain a comparison result; and a second determination unit, configured to determine the authenticity of each target keyword group according to the comparison result.
Optionally, the device further includes: a second acquisition unit, configured to obtain at least one software to be detected after the target keyword group is determined to be the feature information of each malware; and an analysis unit, configured to analyze the at least one software to be detected using the extracted feature information to determine whether the at least one software to be detected is malware.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A feature extraction method, characterized by comprising:
scanning the source file and the default file of each malware in at least one malware to obtain the initial keyword group of each malware and the frequency of occurrence, in each malware, of each initial keyword in the initial keyword group, wherein the Java source code, class library files and the AndroidManifest.xml file are scanned by the KMP algorithm to obtain the initial keyword group of the malware and the frequency of occurrence in the malware, and the initial keyword group of the malware correspondingly indicates the functions performed by the malware when the program is executed;
calculating, according to the frequency of occurrence, the similarity distance between any two keywords in the initial keyword group to obtain a target distance matrix;
screening the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain a target keyword group, wherein the similarity distance between any two keywords in the target keyword group is less than or equal to a default classification value;
determining that the target keyword group is the feature information of each malware.
2. The extraction method according to claim 1, characterized in that screening the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain the target keyword group comprises:
taking each keyword in the initial keyword group as a root node and performing a priority search operation on the root node to obtain a target search result, wherein the similarity distance between any two keywords in the target search result is less than or equal to the default classification value, and any two keywords in the target search result are associated with each other;
searching for a target keyword in the target search result to form the target keyword group, wherein the target search result contains at least one keyword whose similarity distance from the target keyword is less than the default classification value, and the target keyword and the keywords in the target search result belong to the same category.
3. The extraction method according to claim 1, characterized in that screening the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain the target keyword group further comprises:
performing learning processing on the initial keyword group by a support vector machine algorithm to obtain a target learning result, wherein the similarity distance between any two keywords in the target learning result is less than or equal to the default classification value, and any two keywords in the target learning result are associated with each other;
searching for a target keyword in the target learning result to form the target keyword group, wherein the target learning result contains at least one keyword whose similarity distance from the target keyword is less than the default classification value, and the target keyword and the keywords in the target learning result belong to the same category.
4. The extraction method according to any one of claims 1 to 3, characterized in that calculating, according to the frequency of occurrence, the similarity distance between any two keywords in the initial keyword group comprises:
calculating the similarity distance between any two keywords in the initial keyword group by the Google distance formula

$$\mathrm{NGD}(C1, C2) = \frac{\max\{\lg f_{C1}, \lg f_{C2}\} - \lg f(C1, C2)}{\lg N - \min\{\lg f_{C1}, \lg f_{C2}\}}$$

to obtain the target distance matrix, wherein $f_{C1}$ denotes the frequency of occurrence of the initial keyword C1 in each malware, $f_{C2}$ denotes the frequency of occurrence of the initial keyword C2 in each malware, and $f(C1, C2)$ denotes the frequency with which the initial keyword C1 and the initial keyword C2 occur together in each malware.
5. The extraction method according to claim 1, characterized in that the default file includes a functional configuration file, and before the source file and the default file of the at least one malware are scanned, the method further comprises:
decompiling each malware with open source software to obtain the functional configuration file of each malware and the source file of each malware.
6. The extraction method according to claim 1, characterized in that after the target keyword group is determined to be the feature information of each malware, the method further comprises:
obtaining multiple groups of keyword classification results of non-malicious software;
comparing each target keyword with the classification results of the same type among the multiple groups of keyword classification results to obtain a comparison result;
determining the authenticity of each target keyword group according to the comparison result.
7. The extraction method according to claim 1, characterized in that after the target keyword group is determined to be the feature information of each malware, the method further comprises:
obtaining at least one software to be detected;
analyzing the at least one software to be detected using the extracted feature information to determine whether the at least one software to be detected is malware.
8. A feature extraction device, characterized by comprising:
a scanning unit, configured to scan the source file and the default file of each malware in at least one malware to obtain the initial keyword group of each malware and the frequency of occurrence, in each malware, of each initial keyword in the initial keyword group, wherein the Java source code, class library files and the AndroidManifest.xml file are scanned by the KMP algorithm to obtain the initial keyword group of the malware and the frequency of occurrence in the malware, and the initial keyword group of the malware correspondingly indicates the functions performed by the malware when the program is executed;
a computing unit, configured to calculate, according to the frequency of occurrence, the similarity distance between any two keywords in the initial keyword group to obtain a target distance matrix;
a screening unit, configured to screen the initial keyword group according to the similarity distances recorded in the target distance matrix to obtain a target keyword group, wherein the similarity distance between any two keywords in the target keyword group is less than or equal to a default classification value;
a first determination unit, configured to determine that the target keyword group is the feature information of each malware.
9. The extraction device according to claim 8, characterized in that the screening unit comprises:
a first processing module, configured to take each keyword in the initial keyword group as a root node and perform a priority search operation on the root node to obtain a target search result, wherein the similarity distance between any two keywords in the target search result is less than or equal to the default classification value, and any two keywords in the target search result are associated with each other;
a first searching module, configured to search for a target keyword in the target search result to form the target keyword group, wherein the target search result contains at least one keyword whose similarity distance from the target keyword is less than the default classification value, and the target keyword and the keywords in the target search result belong to the same category.
10. The extraction device according to claim 8, characterized in that the screening unit further comprises:
a second processing module, configured to perform learning processing on the initial keyword group by a support vector machine algorithm to obtain a target learning result, wherein the similarity distance between any two keywords in the target learning result is less than or equal to the default classification value, and any two keywords in the target learning result are associated with each other;
a second searching module, configured to search for a target keyword in the target learning result to form the target keyword group, wherein the target learning result contains at least one keyword whose similarity distance from the target keyword is less than the default classification value, and the target keyword and the keywords in the target learning result belong to the same category.
CN201611041703.4A 2016-11-23 2016-11-23 The extracting method and device of feature Active CN106503559B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611041703.4A CN106503559B (en) 2016-11-23 2016-11-23 The extracting method and device of feature

Publications (2)

Publication Number Publication Date
CN106503559A CN106503559A (en) 2017-03-15
CN106503559B 2019-03-19

Family

ID=58327928

Country Status (1)

Country Link
CN (1) CN106503559B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110414232B (en) * 2019-06-26 2023-07-25 腾讯科技(深圳)有限公司 Malicious program early warning method and device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473346A (en) * 2013-09-24 2013-12-25 北京大学 Android re-packed application detection method based on application programming interface
CN105893485A (en) * 2016-03-29 2016-08-24 浙江大学 Automatic special subject generating method based on book catalogue
CN106096405A (en) * 2016-04-26 2016-11-09 浙江工业大学 A kind of Android malicious code detecting method abstract based on Dalvik instruction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10783254B2 (en) * 2014-10-02 2020-09-22 Massachusetts Institute Of Technology Systems and methods for risk rating framework for mobile applications

Non-Patent Citations (4)

Title
"Android: Static Analysis Using Similarity Distance";Anthony Desnos;《2012 45th Hawaii International Conference on System Sciences》;20121231;pp. 5394-5403
"Design and Implementation of a Malware Detection System for the Android Platform";Du Hongbo et al.;《Software Guide》;20151231;Vol. 14, No. 21;pp. 104-106
"Exploring Permission-Induced Risk in Android Applications for Malicious Application Detection";Wei Wang et al.;《IEEE Transactions on Information Forensics and Security》;20141130;Vol. 9, No. 11;pp. 1869-1882
"Sensitivity Analysis of Static Features for Android Malware Detection";Samaneh Hosseini Moghaddam et al.;《The 22nd Iranian Conference on Electrical Engineering》;20140522;pp. 920-924

Also Published As

Publication number Publication date
CN106503559A (en) 2017-03-15

Similar Documents

Publication Publication Date Title
Han et al. MalDAE: Detecting and explaining malware based on correlation and fusion of static and dynamic characteristics
Tian et al. An automated classification system based on the strings of trojan and virus families
Smutz et al. Malicious PDF detection using metadata and structural features
Choi et al. Efficient malicious code detection using N-gram analysis and SVM
Ye et al. Hierarchical associative classifier (HAC) for malware detection from the large and imbalanced gray list
CN109684840A (en) Android malware detection method based on sensitive call paths
Dai et al. Efficient Virus Detection Using Dynamic Instruction Sequences.
Sun et al. Malware family classification method based on static feature extraction
CN109271788B (en) Android malicious software detection method based on deep learning
CN101924761A (en) Method for detecting malicious program according to white list
CN101751535A (en) Data loss protection through application data access classification
CN102446255B (en) Method and device for detecting page tamper
CN111382439A (en) Malicious software detection method based on multi-mode deep learning
Saccente et al. Project achilles: A prototype tool for static method-level vulnerability detection of Java source code using a recurrent neural network
CN109614795B (en) Event-aware android malicious software detection method
Alsulami et al. Lightweight behavioral malware detection for windows platforms
CN108090360A (en) Android malicious application classification method and system based on behavior features
CN107239694A (en) Android application permission inference method and device based on user comments
CN109800569A (en) Program identification method and device
KR102516454B1 (en) Method and apparatus for generating summary of url for url clustering
CN112817877B (en) Abnormal script detection method and device, computer equipment and storage medium
Ye et al. Intelligent file scoring system for malware detection from the gray list
CN106503559B (en) The extracting method and device of feature
CN104036189A (en) Page distortion detecting method and black link database generating method
CN113704759B (en) Adaboost-based android malicious software detection method and system and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant