CN112434291A - Application program identification method and device, equipment and storage medium - Google Patents

Application program identification method and device, equipment and storage medium Download PDF

Info

Publication number
CN112434291A
Authority
CN
China
Prior art keywords
feature
application program
application
features
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910792044.5A
Other languages
Chinese (zh)
Inventor
李爽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Suzhou Software Technology Co Ltd
Priority to CN201910792044.5A
Publication of CN112434291A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]

Abstract

The embodiments of the application disclose an application program identification method, apparatus, device, and storage medium. The method comprises: performing feature extraction on the acquired installation package of the application program to be identified to obtain an installation package feature set of the application program to be identified; obtaining scores of the features in the installation package feature set, and sorting the scored features according to the scoring results; performing feature selection according to the sorting result to determine a target feature set; and classifying the target feature set multiple times, and determining the security attribute category of the application program to be identified according to the results of the multiple classifications. The method and apparatus can improve the identification rate of the security attribute categories of application programs.

Description

Application program identification method and device, equipment and storage medium
Technical Field
The present application relates to computer technology, and more particularly, but not exclusively, to a method, an apparatus, a device, and a storage medium for identifying an application program.
Background
Malware detection for the Android operating system falls into three classes, according to whether the installation package file is run: static detection, dynamic detection, and hybrid detection that combines the two.
Static detection does not require executing the Android software; Java (a computer programming language) source code is restored, or Dalvik (the virtual machine of the Android operating system) bytecode is obtained, through reverse analysis. The Java source code and Dalvik bytecode are analyzed to obtain attributes such as the Application Programming Interfaces (APIs), system calls, and permissions used by the Android software, and Android malware is then detected by matching the attribute features of known malware or by constructing a machine learning classifier. Dynamic detection requires running the Android software: features are collected by monitoring real-time state information during execution, such as data flow, method calls, and power consumption. During detection, rules can be defined to match malicious behaviors, or a machine learning algorithm can be used to identify malware.
Key features can be extracted through dynamic and static analysis; a feature screening algorithm (such as the information gain algorithm or the ReliefF algorithm) can reduce the number of features and may even improve the classification detection effect; and classification detection is carried out by a classification algorithm (such as a support vector machine, a random forest, or a deep learning algorithm).
At present, existing detection methods suffer from low detection precision and a low detection rate.
Disclosure of Invention
In view of this, embodiments of the present application provide an application program identification method, apparatus, device, and storage medium that can improve the identification rate of the security attribute categories of application programs.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides an application program identification method, which comprises the following steps:
performing feature extraction on the acquired installation package of the application program to be identified to obtain an installation package feature set of the application program to be identified;
obtaining scores of the features in the installation package feature set, and sorting the scored features according to the scoring results;
performing feature selection according to the sorting result to determine a target feature set;
and classifying the target feature set multiple times, and determining the security attribute category of the application program to be identified according to the results of the multiple classifications.
In the foregoing solution, the performing feature selection according to the sorting result to determine a target feature set includes:
selecting the features for multiple times according to the sorting result to form multiple feature subsets comprising different numbers of features;
and training a classifier by using the feature subset, and determining a target feature set according to a training result.
In the foregoing solution, performing feature selection multiple times according to the sorting result includes:
selecting the top-ranked features multiple times from the sorting result in descending order, wherein each selection starts from the first position of the sorting result.
In the above scheme, obtaining scores of the features in the installation package feature set and sorting the scored features according to the scoring results includes:
scoring the features in the installation package feature set by using different feature scoring methods;
and sorting the scores produced by each feature scoring method to obtain a corresponding sorting result.
In the foregoing solution, performing feature selection multiple times according to the sorting results includes:
selecting, multiple times, a set number of top-ranked features from each sorting result in descending order;
and taking the intersection of the set number of features selected from each sorting result.
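The intersection step can be illustrated with a minimal Python sketch. It assumes each scoring method has already produced a ranking (best feature first); the function name `top_k_intersection` and the feature names are hypothetical and do not come from the application.

```python
def top_k_intersection(rankings, k):
    """Keep the features that appear in the top k of every ranking.

    `rankings` is a list of feature-name lists, each sorted best-first by
    one scoring method. The result preserves the order of the first ranking.
    """
    if not rankings:
        raise ValueError("at least one ranking is required")
    common = set(rankings[0][:k])
    for ranking in rankings[1:]:
        common &= set(ranking[:k])
    return [name for name in rankings[0][:k] if name in common]


# Two hypothetical rankings of the same four features.
ranking_a = ["f2", "f1", "f3", "f4"]
ranking_b = ["f1", "f4", "f2", "f3"]
shared = top_k_intersection([ranking_a, ranking_b], k=3)
```

Repeating the call with different values of `k` gives the multiple feature subsets of different sizes that the scheme describes.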
In the above scheme, the training the classifier using the feature subset and determining the target feature set according to the training result includes:
the feature subsets correspond to the classifiers one by one, and the corresponding classifiers are trained by utilizing the feature subsets;
determining harmonic mean values of different feature subset training results;
and determining a target feature set according to the sizes of the harmonic mean values.
In the above scheme, the classifying the target feature set for multiple times and determining the security attribute category of the application to be identified according to the multiple classification results includes:
classifying the target feature set for multiple times respectively by utilizing a plurality of base classifiers in the trained integrated classifier;
and determining the security attribute category of the application program to be identified according to the number of different security attribute categories in the multi-time classification result.
In the above scheme, the classifying the target feature set for multiple times and determining the security attribute category of the application to be identified according to the multiple classification results includes:
determining an application category of the application program to be identified;
determining a corresponding integrated classifier according to the application category of the application program to be identified;
classifying the target feature set for multiple times respectively by utilizing a plurality of base classifiers in the integrated classifier;
and determining the security attribute category of the application program to be identified according to the result of the multiple classification.
In the above solution, the type of the feature in the installation package feature set includes at least one of the following:
installation package permission type; installation package intent type; installation package component type; number of installation package components.
In the above scheme, the method further comprises:
acquiring an application program with marked information as a training sample;
determining a plurality of training sample subsets, wherein the number of application programs with the same marking information in each training sample subset is the same;
the training feature subsets correspond to the base classifiers one to one, and the corresponding base classifiers are trained by utilizing the training feature subsets;
and forming a plurality of base classifiers into an integrated classifier.
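One possible reading of the balanced training subsets is sketched below in Python: every subset draws the same number of samples per label, with that number set by the rarest label. This under-sampling interpretation, the function name `balanced_subsets`, and the sample names are assumptions made for illustration; the application does not specify how the subsets are drawn.

```python
import random


def balanced_subsets(samples, labels, n_subsets, seed=0):
    """Draw n_subsets training subsets with equal counts per label.

    Each subset is a list of (sample, label) pairs. The per-label count is
    the size of the rarest label, so the rare class is never over-sampled.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    per_label = min(len(group) for group in by_label.values())
    subsets = []
    for _ in range(n_subsets):
        subset = []
        for label, group in by_label.items():
            subset.extend((s, label) for s in rng.sample(group, per_label))
        subsets.append(subset)
    return subsets


# Hypothetical imbalanced data: six trusted apps, two untrusted apps.
apps = ["a1", "a2", "a3", "a4", "a5", "a6", "m1", "m2"]
marks = ["trusted"] * 6 + ["untrusted"] * 2
training_subsets = balanced_subsets(apps, marks, n_subsets=3)
```

Each subset would then train one base classifier, and the trained base classifiers together form the integrated (ensemble) classifier.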
In the above scheme, the application categories to which the application programs in the training samples belong are the same.
In the above scheme, the training of the corresponding base classifier using the training feature subset includes:
performing feature extraction on a training sample subset to determine an installation package feature set of an application program in the training sample subset;
determining the prediction information of the application program according to the installation package feature set;
comparing the marking information of the application program with the prediction information to obtain a loss function of the base classifier;
training the base classifier using the loss function.
In the above scheme, the determining the prediction information of the application program according to the installation package feature set includes:
obtaining scores of the features in the installation package feature set, and sorting the scored features according to the scoring results;
performing feature selection according to the sorting result to determine a target feature set of the training sample subset;
and determining the prediction information of the application program in the training sample subset according to the target feature set of the training sample subset.
An embodiment of the present application provides an application program identification apparatus, the apparatus includes:
a feature extraction unit, configured to perform feature extraction on the acquired installation package of the application program to be identified, so as to obtain an installation package feature set of the application program to be identified;
a scoring unit, configured to obtain scores of the features in the installation package feature set and sort the scored features according to the scoring results;
a feature selection unit, configured to perform feature selection according to the sorting result so as to determine a target feature set;
and an identification unit, configured to classify the target feature set multiple times and determine the security attribute category of the application program to be identified according to the results of the multiple classifications.
The embodiment of the application provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the program to realize the application program identification method.
An embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the application program identification method as described above.
In the embodiment of the application, optimization is performed in three key stages of feature extraction, feature screening and classification detection, so that the problem of low detection rate of the untrusted application program can be solved, and the identification rate of the security attribute category of the application program is improved.
Drawings
Fig. 1A is a first schematic flowchart of an implementation of the application program identification method according to an embodiment of the present application;
Fig. 1B is a second schematic flowchart of an implementation of the application program identification method according to an embodiment of the present application;
Fig. 1C is a third schematic flowchart of an implementation of the application program identification method according to an embodiment of the present application;
Fig. 1D is a fourth schematic flowchart of an implementation of the application program identification method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a feature extraction model according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of an implementation of the feature screening method according to an embodiment of the present application;
Fig. 4 is a schematic flowchart of an implementation of a software detection method according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a static detection model architecture according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an application program identification apparatus according to an embodiment of the present application;
Fig. 7 is a schematic diagram of the hardware entity of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail with reference to the accompanying drawings, the described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only for the convenience of description of the present application, and have no specific meaning by themselves. Thus, "module", "component" or "unit" may be used mixedly.
At present, research on dynamic and static detection of Android malware has made some progress, and more and more researchers use feature screening algorithms and machine learning algorithms to improve classification detection, but the detection effect is still not ideal. The embodiments of the present application address the following shortcomings of the related art: (1) The features extracted in current research are limited, and research on application category information is lacking. (2) Classification detection does not account for the imbalance of Android software samples; the selected classification algorithms assume balanced samples by default, that is, they ignore the effect that the small number of malicious samples has on the classification detection result. (3) Feature screening algorithms suffer from high time complexity or a complicated screening process. To this end, an application program identification method, apparatus, device, and storage medium are provided.
An embodiment of the present application provides an application program identification method, which is applied to a computer device. The functions of the method can be implemented by a processor in a server calling program code; the program code can of course be stored in a computer storage medium, so the server comprises at least a processor and a storage medium. Fig. 1A is a first schematic flowchart of an implementation of the application program identification method according to an embodiment of the present application. As shown in Fig. 1A, the method includes:
Step S101: performing feature extraction on the acquired installation package of the application program to be identified to obtain an installation package feature set of the application program to be identified;
Here, for the Android system, information such as the permissions, intents, and number of components declared when an application is installed can be extracted. This information usually resides in the installation package of the application, so feature extraction is divided into three stages: decompiling the installation package, parsing the corresponding Extensible Markup Language (XML) file, and constructing the feature set. The manifest file used at installation is obtained by decompiling the installation package, the relevant information is extracted from the manifest file, each type of information forms a different type of feature, and the feature set is built from these feature types.
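The parsing and feature-set construction stages can be sketched in Python as follows. The sketch assumes the XML file obtained by decompiling the installation package (on Android, AndroidManifest.xml) is already available as plain text; the decompilation stage itself is not shown. The function name `extract_manifest_features` and the sample manifest are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace that Android manifests use for attributes such as android:name.
ANDROID_NS = "http://schemas.android.com/apk/res/android"
NAME_ATTR = f"{{{ANDROID_NS}}}name"
COMPONENT_TAGS = ("activity", "service", "receiver", "provider")


def extract_manifest_features(manifest_xml):
    """Collect permissions, intent actions, and component counts."""
    root = ET.fromstring(manifest_xml)
    permissions = sorted(
        {el.get(NAME_ATTR) for el in root.iter("uses-permission") if el.get(NAME_ATTR)}
    )
    intents = sorted(
        {el.get(NAME_ATTR) for el in root.iter("action") if el.get(NAME_ATTR)}
    )
    counts = Counter(el.tag for el in root.iter() if el.tag in COMPONENT_TAGS)
    return {
        "permissions": permissions,
        "intents": intents,
        "component_counts": {tag: counts.get(tag, 0) for tag in COMPONENT_TAGS},
    }


# Hypothetical decoded manifest used only for illustration.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.demo">
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <application>
    <activity android:name=".MainActivity">
      <intent-filter>
        <action android:name="android.intent.action.MAIN"/>
      </intent-filter>
    </activity>
    <service android:name=".SyncService"/>
  </application>
</manifest>"""

features = extract_manifest_features(SAMPLE_MANIFEST)
```

The returned dictionary groups the information by feature type; a downstream step could, for instance, turn each permission and intent into a binary feature and each component count into a numeric one.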
Of course, the embodiments do not limit the type of system to which the application program belongs; it may be the Android system or another type of system.
In the embodiments of the present application, the computer device may be any of various types of device with information processing capability, such as a mobile phone, a Personal Digital Assistant (PDA), a navigator, a digital phone, a video phone, a smart watch, a smart bracelet, a wearable device, a tablet computer, or an all-in-one machine. In implementation, the server may be a mobile electronic device such as a mobile phone, a tablet computer, or a notebook computer, a fixed terminal such as a personal computer, a server cluster, or another computing device with information processing capability.
Step S102, obtaining scores of the features in the installation package feature set, and sorting the scored features according to scoring results;
Step S103: performing feature selection according to the sorting result to determine a target feature set;
Here, steps S102 and S103 together perform feature selection, a key step in data preprocessing that aims to eliminate redundant features and retain highly relevant ones, thereby improving the accuracy and efficiency of subsequent classification detection.
At present, the related art offers many methods for scoring features. Each extracted feature can be scored separately and the scores used to select features, which reduces redundant features and improves the classification effect.
Step S104: classifying the target feature set multiple times, and determining the security attribute category of the application program to be identified according to the results of the multiple classifications.
After the scores of the features in the installation package feature set are obtained, the scored features are sorted, feature selection is performed according to the sorting result, and a target feature set is determined. Finally, the target feature set is classified multiple times, with each classification producing its own result, and the security attribute category of the application program to be identified is determined by combining these results.
In the embodiments of the present application, the security attribute category of the application program to be identified indicates whether it is a trusted application or an untrusted (that is, abnormal) application. The application program identification method can help a user identify the security attribute category of an application before installing it. When the identification result shows that the application is trusted, installation continues; when the result shows that the application is untrusted, the user is warned that the application poses a risk.
In the embodiments of the present application, feature extraction is performed on the acquired installation package of the application program to be identified to obtain its installation package feature set; scores of the features in the installation package feature set are obtained, and the scored features are sorted according to the scoring results; feature selection is performed according to the sorting result to determine a target feature set; and the target feature set is classified multiple times, with the security attribute category of the application program to be identified determined from the results of the multiple classifications. In this way, the problem of a low detection rate for untrusted applications can be avoided, and the identification rate of the security attribute categories of application programs is improved.
Based on the foregoing embodiments, an embodiment of the present application further provides an application program identification method, and fig. 1B is a schematic diagram of an implementation flow of the application program identification method according to the embodiment of the present application, as shown in fig. 1B, the method includes:
Step S111: performing feature extraction on the acquired installation package of the application program to be identified to obtain an installation package feature set of the application program to be identified;
In the embodiments of the present application, a static detection method is used to identify the security attribute category of the application program to be identified. Information such as the permissions, intents, and number of components declared at installation is extracted by reverse-analyzing the Android Package (APK) file, yielding the installation package feature set of the application program to be identified. The installation package feature set comprises a plurality of features, each with different attributes.
Step S112: obtaining scores of the features in the installation package feature set, and sorting the scored features according to the scoring results;
Here, a score may be obtained for each feature in the installation package feature set, and all the features are then sorted from the highest score to the lowest.
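As an illustration of scoring and sorting, the Python sketch below uses information gain, one of the feature screening algorithms named in the Background. The choice of information gain here, the function names, and the toy data are assumptions for illustration; this step of the application does not fix a particular scoring method.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * math.log2(p)
    return result


def information_gain(column, labels):
    """Reduction in label entropy obtained by splitting on one feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        gain -= len(subset) / n * entropy(subset)
    return gain


def rank_features(rows, labels, names):
    """Score every feature column and sort from highest to lowest score."""
    scores = {
        name: information_gain([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy binary feature matrix with columns f1, f2, f3; label 1 = untrusted.
rows = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
]
labels = [1, 1, 0, 0]
ranking = rank_features(rows, labels, ["f1", "f2", "f3"])
```

In this toy data f2 separates the two classes perfectly, f1 only partly, and f3 not at all, so the features come out in the order f2, f1, f3.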
Step S113: performing feature selection multiple times according to the sorting result to form a plurality of feature subsets comprising different numbers of features;
In this embodiment of the application, feature selection is performed multiple times on the sorting result; each selection yields one feature subset, and the number of features differs from subset to subset. The feature subsets may share features or may be disjoint.
For example, the extracted installation package feature set is [f1, f2, f3]. The scores of the features in the installation package feature set are obtained and the scored features are sorted, giving the sorting result [f2, f1, f3]. Feature selection is performed three times on the sorting result: the first selection takes the top-ranked feature, forming the feature subset [f2]; the second takes the two top-ranked features, forming [f2, f1]; the third takes the three top-ranked features, forming [f2, f1, f3].
As another example, the extracted installation package feature set is [f1, f2, f3, f4]. The scores of the features in the installation package feature set are obtained and the scored features are sorted, giving the sorting result [f2, f4, f1, f3]. Feature selection is performed twice on the sorting result: the first selected feature subset is [f2], and the second selected feature subset is [f4, f1, f3].
Step S114, training a classifier by using the feature subset, and determining a target feature set according to a training result;
in this embodiment of the application, the training of the classifier by using the feature subsets and the determination of the target feature set according to the training result may be that each feature subset of the plurality of feature subsets is used to train a classifier, and then the feature subset corresponding to the classifier with the best classification effect in each classifier is determined as the target feature set.
Step S115: classifying the target feature set multiple times, and determining the security attribute category of the application program to be identified according to the results of the multiple classifications.
Here, the target feature set may be applied to a plurality of different classifiers, which classify the target feature set respectively, each classifier obtains a classification result, and the security attribute category of the application to be identified is determined by using a majority voting method according to a plurality of classification results of the plurality of classifiers. Or, the security attribute category of the application program to be identified may also be determined according to a plurality of classification results of a plurality of classifiers and according to the weight of each classifier.
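Both ways of combining the classification results can be sketched in Python. The function names, the labels, and the weights are hypothetical; how the classifier weights would be obtained is not stated in the text.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers.

    On a tie, the label that was counted first wins; a real system would
    need an explicit tie-breaking rule.
    """
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Return the label with the largest total classifier weight."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


# Hypothetical outputs of three classifiers for one application.
votes = ["untrusted", "trusted", "untrusted"]
by_count = majority_vote(votes)
by_weight = weighted_vote(votes, weights=[0.2, 0.9, 0.3])
```

With these illustrative numbers the two rules disagree: two of three classifiers say "untrusted", but the single "trusted" vote carries more weight (0.9 against 0.5), which shows why the choice of combination rule matters.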
In some embodiments, the step S114 of training a classifier by using the feature subset, and determining a target feature set according to a training result may be implemented by:
Step S1141: the feature subsets correspond to classifiers one to one, and each classifier is trained with its corresponding feature subset;
In the embodiments of the present application, the feature data of a feature subset can be divided into several parts, with one part used as the test set and the remaining parts as the training set. The corresponding classifier is trained on the training set and then tested on the test set. Each part serves as the test set in turn (that is, k-fold cross-validation), and the average of the test results over all test sets may be used as the score of the classifier.
Step S1142: determining the harmonic mean of the training results for each feature subset;
Here, one classifier is trained with each feature subset, and the harmonic mean of that classifier's test results is used as its score, so that each classifier has one score. The feature subset corresponding to the classifier with the highest score is then determined as the target feature set.
Step S1143: determining the target feature set according to the magnitudes of the harmonic mean values.
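Steps S1141 to S1143 can be sketched in Python as follows. The text does not say which quantities enter the harmonic mean; this sketch assumes precision and recall, which gives the F1 score. The 1-nearest-neighbour classifier is only a stand-in so that the example runs without third-party libraries, and all names and data are hypothetical.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n, k):
    """Yield (train, test) index lists; each fold is the test set once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for held_out, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train, test


def nearest_neighbour(x_train, y_train, x_test):
    """Stand-in classifier: 1-nearest neighbour under Hamming distance."""
    predictions = []
    for x in x_test:
        best = min(
            range(len(x_train)),
            key=lambda i: sum(a != b for a, b in zip(x_train[i], x)),
        )
        predictions.append(y_train[best])
    return predictions


def cross_validated_f1(rows, labels, columns, fit_predict, k=2):
    """Average F1 over k folds, using only the given feature columns."""
    scores = []
    for train, test in k_fold_indices(len(rows), k):
        x_train = [[rows[i][c] for c in columns] for i in train]
        y_train = [labels[i] for i in train]
        x_test = [[rows[i][c] for c in columns] for i in test]
        y_pred = fit_predict(x_train, y_train, x_test)
        scores.append(f1_score([labels[i] for i in test], y_pred))
    return sum(scores) / len(scores)


# Toy data: columns are f1, f2, f3; label 1 = untrusted.
rows = [
    [1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1],
    [1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 0, 1],
]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
# Candidate subsets as column indices, following the ranking [f2, f1, f3].
candidates = [[1], [1, 0], [1, 0, 2]]
scored = [
    (cols, cross_validated_f1(rows, labels, cols, nearest_neighbour))
    for cols in candidates
]
best_columns, best_score = max(scored, key=lambda item: item[1])
```

Because `max` returns the first maximal entry, ties are resolved in favour of the smaller subset when the candidates are listed from smallest to largest, which is one reasonable design choice among several.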
In the embodiments of the present application, feature extraction is performed on the acquired installation package of the application program to be identified to obtain its installation package feature set; scores of the features in the installation package feature set are obtained, and the scored features are sorted according to the scoring results; feature selection is performed multiple times according to the sorting result to form a plurality of feature subsets comprising different numbers of features; a classifier is trained with each feature subset, and a target feature set is determined according to the training results; and the target feature set is classified multiple times, with the security attribute category of the application program to be identified determined from the results of the multiple classifications. In this way, the problem of a low detection rate for untrusted applications can be avoided, and the identification rate of the security attribute categories of application programs is improved.
Based on the foregoing embodiments, an embodiment of the present application further provides an application program identification method, including:
Step S121: performing feature extraction on the acquired installation package of the application program to be identified to obtain an installation package feature set of the application program to be identified;
Step S122: obtaining scores of the features in the installation package feature set, and sorting the scored features according to the scoring results;
Step S123: selecting the top-ranked features multiple times from the sorting result in descending order, wherein each selection starts from the head of the sorting result, so as to form a plurality of feature subsets comprising different numbers of features;
In the embodiments of the present application, the scored features may be arranged in descending order, that is, the higher the score, the earlier the feature appears. Feature selection is then performed on the sorting result multiple times; each selection takes the top-ranked features, always starting from the first feature. There are as many feature subsets as there are selections. Consequently, the number of features differs from subset to subset and the subsets overlap, since each smaller subset is contained in every larger one. For example, the result after sorting in descending order is [f2, f1, f3], and feature selection is performed three times on the sorting result: the first selection takes the top-ranked feature, forming the feature subset [f2]; the second takes the two top-ranked features, forming [f2, f1]; the third takes the three top-ranked features, forming [f2, f1, f3]. Three selections thus produce three feature subsets of different sizes that share features with one another.
Step S124, training a classifier by using the feature subset, and determining a target feature set according to a training result;
and S125, classifying the target feature set for multiple times, and determining the security attribute category of the application program to be identified according to the result of the multiple classification.
In some embodiments, the step S124 of training a classifier by using the feature subset, and determining a target feature set according to a training result may be implemented by:
Step S1241, the feature subsets correspond to classifiers one to one, and the corresponding classifiers are trained with the feature subsets;
Step S1242, determining harmonic mean values of the training results of different feature subsets;
Step S1243, determining a target feature set according to the magnitudes of the harmonic mean values.
In the embodiment of the application, feature extraction is performed on the obtained installation package of the application program to be identified to obtain an installation package feature set of the application program to be identified; scores of the features in the installation package feature set are obtained, and the scored features are sorted according to the scoring results; the top-ranked features are selected multiple times according to the descending order of the sorting result, each selection starting from the head of the sorting result, to form a plurality of feature subsets comprising different numbers of features; classifiers are trained with the feature subsets, and a target feature set is determined according to the training results; and the target feature set is classified multiple times, and the security attribute category of the application program to be identified is determined according to the results of the multiple classifications. In this way, the problem of a low detection rate for untrusted application programs can be avoided, and the identification rate of the security attribute category of the application program is improved.
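The nested selection of step S123 can be illustrated with a short sketch. The score values below are made up for illustration and are chosen only so that the descending order is [f2, f1, f3], as in the example of this embodiment; the function names are hypothetical.

```python
def rank_features(scores):
    """Return feature names sorted by score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)

def prefix_subsets(ranked):
    """Return the nested subsets top-1, top-2, ..., top-n of a ranking.

    Every selection starts from the head of the ranking, so each subset
    contains the previous one.
    """
    return [ranked[:k] for k in range(1, len(ranked) + 1)]

# Illustrative scores only; they are not taken from the application.
scores = {"f1": 0.4, "f2": 0.9, "f3": 0.1}
subsets = prefix_subsets(rank_features(scores))
```

With n ranked features this yields n candidate subsets, so only n classifiers need to be trained instead of one per possible feature combination.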
Based on the foregoing embodiment, an embodiment of the present application further provides an application program identification method, and fig. 1C is a schematic flow chart illustrating an implementation of the application program identification method according to the embodiment of the present application, as shown in fig. 1C, the method includes:
Step S131, performing feature extraction on the obtained installation package of the application program to be identified to obtain an installation package feature set of the application program to be identified;
Step S132, scoring the features in the installation package feature set by using different feature scoring methods;
At present there are many feature scoring methods, which differ in parameter indexes such as scoring principle, scoring effect and scoring time. For example, some feature scoring methods are efficient to execute and can quickly score all features, but they cannot guarantee that the feature subset with the optimal classification effect is screened out, nor can they determine a small-scale feature subset. Other feature scoring methods can select a feature subset with a near-optimal classification effect, but the time complexity of feature screening increases exponentially as the feature scale increases. Therefore, in the embodiment of the application, a plurality of different feature scoring methods are used to score the features in the installation package feature set.
Step S133, ranking the scores corresponding to each feature scoring method to obtain corresponding ranking results;
Here, the scores corresponding to each feature scoring method are sorted to obtain a corresponding feature sorting result, and the sorting results corresponding to the different feature scoring methods are independent of each other.
Step S134, for each sorting result, selecting a set number of top-ranked features multiple times according to the descending order of the sorting result;
Step S135, taking the intersection of the set number of features corresponding to each sorting result to form a plurality of feature subsets comprising different numbers of features;
Here, each sorting result may be arranged in descending order and feature selection may be performed on it multiple times, each time selecting a set number of top-ranked features. For each set number, the intersection of the features selected from the different sorting results is taken, which forms a plurality of feature subsets comprising different numbers of features.
For example, two feature scoring methods are used to score the installation package feature set. The descending result corresponding to the first scoring method is [f1, f2, f3], and the descending result corresponding to the second scoring method is [f2, f1, f3]. Feature selection is performed three times on the descending result of the first scoring method: the first time, the one top-ranked feature is selected, giving [f1]; the second time, the two top-ranked features are selected, giving [f1, f2]; the third time, the three top-ranked features are selected, giving [f1, f2, f3]. Similarly, feature selection is performed three times on the descending result of the second scoring method: the first time, the one top-ranked feature is selected, giving [f2]; the second time, the two top-ranked features are selected, giving [f2, f1]; the third time, the three top-ranked features are selected, giving [f2, f1, f3]. Then, the intersection of the selections with the same set number from the two scoring methods is taken to obtain the feature subsets [f1, f2] and [f1, f2, f3]; the intersection of [f1] and [f2] is empty and therefore yields no feature subset.
Step S136, training a classifier by using the feature subset, and determining a target feature set according to a training result;
Step S137, classifying the target feature set multiple times, and determining the security attribute category of the application program to be identified according to the results of the multiple classifications.
In some embodiments, the step S136 of training a classifier by using the feature subset, and determining a target feature set according to a training result may be implemented by:
Step S1361, the feature subsets correspond to classifiers one to one, and the corresponding classifiers are trained with the feature subsets;
Step S1362, determining harmonic mean values of the training results of different feature subsets;
Step S1363, determining a target feature set according to the magnitudes of the harmonic mean values.
In the embodiment of the application, feature extraction is performed on the obtained installation package of the application program to be identified to obtain an installation package feature set of the application program to be identified; the features in the installation package feature set are scored by using different feature scoring methods; the scores corresponding to each feature scoring method are sorted to obtain a corresponding sorting result; for each sorting result, a set number of top-ranked features are selected multiple times according to the descending order of the sorting result; the intersection of the set number of features corresponding to each sorting result is taken to form a plurality of feature subsets comprising different numbers of features; classifiers are trained with the feature subsets, and a target feature set is determined according to the training results; and the target feature set is classified multiple times, and the security attribute category of the application program to be identified is determined according to the results of the multiple classifications. In this way, the problem of a low detection rate for untrusted application programs can be avoided, and the identification rate of the security attribute category of the application program is improved.
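Steps S134 and S135 can be sketched as follows for any number of scoring methods. This is an illustrative sketch: the function name is hypothetical, and skipping empty or repeated intersections is an assumption made so that the output matches the two-method example of this embodiment.

```python
def top_k_intersections(rankings):
    """Build feature subsets by intersecting the top-k features of every ranking.

    rankings: list of descending rankings, one per feature scoring method.
    """
    shortest = min(len(ranking) for ranking in rankings)
    subsets = []
    for k in range(1, shortest + 1):
        common = set(rankings[0][:k])
        for ranking in rankings[1:]:
            common &= set(ranking[:k])
        # Keep the order of the first ranking; skip empty and repeated results.
        subset = [f for f in rankings[0][:k] if f in common]
        if subset and subset not in subsets:
            subsets.append(subset)
    return subsets

example = top_k_intersections([["f1", "f2", "f3"], ["f2", "f1", "f3"]])
```

For the two rankings of the example, the top-1 selections [f1] and [f2] share no feature, so only the top-2 and top-3 intersections produce feature subsets.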
Based on the foregoing embodiment, an embodiment of the present application further provides an application program identification method, and fig. 1D is a schematic diagram of an implementation flow of the application program identification method according to the embodiment of the present application, as shown in fig. 1D, the method includes:
Step S141, performing feature extraction on the obtained installation package of the application program to be identified to obtain an installation package feature set of the application program to be identified;
Step S142, obtaining scores of the features in the installation package feature set, and sorting the scored features according to the scoring results;
Step S143, performing feature selection according to the sorting result to determine a target feature set;
Step S144, classifying the target feature set multiple times by respectively using a plurality of base classifiers in the trained integrated classifier;
Step S145, determining the security attribute category of the application program to be identified according to the number of occurrences of each security attribute category in the results of the multiple classifications.
For example, the trained integrated classifier includes five trained base classifiers, and each base classifier is used to classify the target feature set, giving five classification results. If four base classifiers classify the application program to be identified as an abnormal application program and one base classifier classifies it as a trusted application program, the security attribute category of the application program to be identified is determined by majority voting to be the abnormal application program.
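The majority vote in this example can be sketched as below. The base classifiers are modelled as plain callables that map a feature vector to a label, which is an assumption made for illustration; the text also does not say how ties are resolved, and this sketch simply returns the label that reached the top count first.

```python
from collections import Counter

def classify_with_ensemble(base_classifiers, target_features):
    """Run every base classifier once and return (winning label, vote counts)."""
    votes = Counter(classify(target_features) for classify in base_classifiers)
    label, _ = votes.most_common(1)[0]
    return label, dict(votes)

# Five stand-in base classifiers reproducing the 4-to-1 vote of the example.
stand_ins = [lambda x: "abnormal"] * 4 + [lambda x: "trusted"]
result = classify_with_ensemble(stand_ins, [1, 0, 3])
```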
In the embodiment of the application, feature extraction is performed on the obtained installation package of the application program to be identified to obtain an installation package feature set of the application program to be identified; scores of the features in the installation package feature set are obtained, and the scored features are sorted according to the scoring results; feature selection is performed according to the sorting result to determine a target feature set; the target feature set is classified multiple times by respectively using a plurality of base classifiers in the trained integrated classifier; and the security attribute category of the application program to be identified is determined according to the number of occurrences of each security attribute category in the results of the multiple classifications. In this way, the problem of a low detection rate for untrusted application programs can be avoided, and the identification rate of the security attribute category of the application program is improved.
Based on the foregoing embodiments, an embodiment of the present application further provides an application program identification method, including:
Step S151, performing feature extraction on the obtained installation package of the application program to be identified to obtain an installation package feature set of the application program to be identified;
Step S152, obtaining scores of the features in the installation package feature set, and sorting the scored features according to the scoring results;
Step S153, performing feature selection according to the sorting result to determine a target feature set;
Step S154, determining the application category of the application program to be identified;
In the embodiment of the application, before classification detection, the application category of the application program to be identified may be determined, the integrated classifier corresponding to that category may be determined according to the application category, and the application program may then be identified by using the integrated classifier of the same category.
Here, an integrated classifier of the same category means that the application programs in the training samples and test samples corresponding to the integrated classifier all belong to a single application category, namely the application category of the application program to be identified.
For example, if the application category of the application to be identified is entertainment category, the applications in the sample set of the corresponding integrated classifier are all applications in entertainment category. And if the application class of the application program to be identified is the tool class, the application programs in the sample set of the corresponding integrated classifier are all the application programs in the tool class.
Step S155, determining a corresponding integrated classifier according to the application category of the application program to be identified;
Step S156, classifying the target feature set multiple times by respectively using a plurality of base classifiers in the integrated classifier;
Step S157, determining the security attribute category of the application program to be identified according to the results of the multiple classifications.
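Steps S154 to S157 amount to routing the sample to a per-category integrated classifier before voting. A minimal sketch under stated assumptions: `ensembles` is a hypothetical dictionary from application category to a list of base classifiers (callables), and the results are combined by majority voting, which the text names for the other embodiments but does not fix for this one.

```python
from collections import Counter

def identify_by_category(app_category, target_features, ensembles):
    """Classify with the integrated classifier trained for the sample's own category.

    ensembles: hypothetical mapping, application category -> list of callables,
    each mapping a feature vector to a security attribute label.
    """
    if app_category not in ensembles:
        raise KeyError(f"no integrated classifier trained for category {app_category!r}")
    votes = [classify(target_features) for classify in ensembles[app_category]]
    return Counter(votes).most_common(1)[0][0]
```

Raising an error for an unknown category is a design choice of this sketch; falling back to a classifier trained on all categories would be another option that the text does not discuss.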
In some embodiments, the types of features in the installation package feature set comprise at least one of:
installation package permission type; installation package intent type; installation package component type; number of installation package components.
In some embodiments, the method further comprises:
Step S11a, acquiring application programs with label information as training samples;
Step S12a, determining a plurality of training sample subsets, wherein the number of application programs with the same label information is the same in each training sample subset;
Here, the number of training samples of abnormal application programs is generally small. To achieve a better training effect, in the present application the training samples are split into a plurality of training sample subsets such that the number of application programs having the same label information is the same in each training sample subset.
Step S13a, the training feature subsets correspond to the base classifiers one by one, and the corresponding base classifiers are trained by utilizing the training feature subsets;
Step S14a, forming an integrated classifier from the plurality of base classifiers.
Here, steps S11a to S14a are the training process of the integrated classifier: a plurality of base classifiers are trained respectively with a plurality of balanced sample data sets, and the base classifiers are then combined into one integrated classifier by majority voting.
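One way to obtain the balanced training sample subsets of step S12a is sketched below. The splitting strategy is an assumption, since the text only requires equal per-label counts in every subset: the majority-label samples are cut into disjoint chunks of the minority-class size, every chunk is paired with all minority-label samples, and leftover majority samples are dropped.

```python
def balanced_subsets(samples, labels, minority_label):
    """Split labelled samples into subsets with equal per-label counts.

    Returns a list of subsets; each subset is a list of (sample, label) pairs
    holding all minority samples plus an equally sized chunk of the others.
    """
    minority = [(s, y) for s, y in zip(samples, labels) if y == minority_label]
    majority = [(s, y) for s, y in zip(samples, labels) if y != minority_label]
    size = len(minority)
    if size == 0:
        raise ValueError("no sample carries the minority label")
    subsets = []
    # Step through the majority samples in disjoint chunks of the minority size.
    for start in range(0, len(majority) - size + 1, size):
        subsets.append(majority[start:start + size] + minority)
    return subsets
```

Each subset can then be used to train one base classifier, so every base classifier sees all of the scarce abnormal samples but a different slice of the benign ones.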
In the embodiment of the present application, the application categories to which the application programs in the training samples belong are the same.
For example, the application category to which the application belongs may be an entertainment category, a tool category, a communication category, and the like.
In some embodiments, the technical feature "training the corresponding base classifier using the training feature subset" in step S13a may be implemented by:
Step S11b, performing feature extraction on a training sample subset to determine the installation package feature set of each application program in the training sample subset;
Step S12b, determining prediction information of the application program according to the installation package feature set;
Step S13b, comparing the label information of the application program with the prediction information to obtain a loss function of the base classifier;
Step S14b, training the base classifier by using the loss function.
In some embodiments, the step S12b of determining the prediction information of the application according to the installation package feature set includes:
Step S121b, obtaining scores of the features in the installation package feature set, and sorting the scored features according to the scoring results;
Step S122b, performing feature selection according to the sorting result to determine a target feature set of the training sample subset;
Step S123b, determining prediction information of the application programs in the training sample subset according to the target feature set of the training sample subset.
Based on the foregoing embodiments, to address the low classification precision and low detection rate in Android malware detection research, the embodiment of the application provides a static Android malware detection model based on application classification. The model performs classification detection within a single application category; the application classification method itself is not involved. The embodiment of the application comprises four parts:
In the first part, the application classification work of mainstream Android application markets at home and abroad is analyzed, the necessity of application classification is studied, and the Android software samples used in the application are downloaded with their category information retained.
In the second part, the Android system architecture and the security mechanism and model of the Android system are analyzed; on this basis, the advantages and disadvantages of static and dynamic detection of Android malware are summarized, and, in light of current research, static attributes such as the permissions, intents and component usage declared at Android software installation are extracted to construct an Android software sample data set.
In the third part, the advantages and shortcomings of the two commonly used types of feature screening algorithms, Filter and Wrapper, are analyzed; drawing on the feature screening experience of current Android software detection methods, a simple and efficient IG-ReliefF hybrid feature screening algorithm is proposed to complete feature screening.
In the fourth part, to improve the detection rate of classification detection when the Android software sample data is imbalanced, the embodiment of the application combines the Bagging algorithm with a Support Vector Machine (SVM), that is, the Bagging-SVM algorithm is applied to the construction of the classifier, and on this basis the static Android malware detection model based on application classification is implemented.
The first part can be implemented as follows:
Static detection of Android malware is widely used because it has a small processing load and is easy to scale up. Machine learning algorithms can markedly improve the classification detection effect, so static Android malware detection combined with machine learning is a current research hotspot. Researchers usually improve the machine learning algorithm to improve the classification detection effect. However, the classification detection effect depends heavily on the selected features, so without additional key features the gain from improving the algorithm alone is limited, whereas adding more key features can markedly improve the classification detection effect.
The category information of an application program is key information. Application programs in the same category have similar functions, and the permissions requested, the components called and the methods executed to implement those functions are consistent; within the same application category, the permissions requested and the components called in order to implement malicious functions are therefore very likely to appear anomalous. In addition, using application category information in malware detection can reduce the false alarm rate of Android software classification detection. For example, a wallpaper application that requests permission to read the address book indicates a possible risk of privacy disclosure, whereas for a communication or social application the same permission is a normal permission needed to implement its communication and social functions.
Mainstream Android software distribution platforms, including Google Play, the 360 app store, the Wandoujia app store and the app stores of the major mobile phone manufacturers, all classify Android mobile phone software. When submitting software, a developer submits the application category to the app store; the app store reviews the qualification and category of the software according to the attributes of the application and finally releases it to users. According to the actual needs of users, Android software is classified by category into security, communication and social, video and audio, theme and wallpaper, game and entertainment, and so on. The Wandoujia application market further subdivides application software on this basis; for example, system tools can be further divided into wireless internet access (Wi-Fi), browser, input method, power saving, security, and so on.
In the embodiment of the application, software with a large number of user downloads and high ratings is collected from the application market and divided according to the category information of the application market to form single-category software sample sets. The software sample sets may additionally be double-scanned to ensure that the software is benign. The malware samples come from the Android Malware Genome Project; because there are few malware samples, their application categories are assigned manually by running them on an emulator.
The second part can be implemented as follows:
Static detection is lightweight and easy to scale up, so static detection is adopted in the embodiment of the application. For the collected software samples, the permission, intent and component-number information declared at Android software installation is extracted by reverse-analyzing the APK file. Feature extraction is divided into three stages: decompiling, parsing the XML file, and constructing the feature vector. Fig. 2 is a schematic diagram of a feature extraction model according to an embodiment of the present application. As shown in fig. 2, the feature extraction model includes three parts: (1) In the decompilation stage, an APK tool developed by Google may be used to complete decompilation and obtain the manifest file AndroidManifest.xml. (2) The AndroidManifest.xml file contains the permissions, intents, components and other information that an Android application program must declare at installation. In the second stage, the xml.etree.ElementTree package in Python (a computer programming language) is called to parse the XML file: the uses-permission tags are extracted to obtain the system permissions requested by the application program, the action and category tags are extracted to obtain the intent-related attributes, and the activity, receiver, service and provider tags are extracted to obtain component-related information. (3) After the AndroidManifest.xml file is parsed, the corresponding information needs to be quantized, and a feature vector is constructed from the quantized information. Since the permission information and the intent information are scalars, the embodiment of the present application quantizes them as "0" and "1", where "0" indicates that the intent or permission is not included and "1" indicates that it is included.
For the component information, the embodiment of the present application takes the number of components contained in each APK file as the quantization result.
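Stages (2) and (3) can be sketched with xml.etree.ElementTree as follows. The sketch assumes the manifest has already been decoded to plain-text XML by the decompilation stage; the function names, the vocabularies passed to `build_feature_vector`, the per-tag component counts and the demo manifest are all illustrative assumptions rather than details given in the application.

```python
import xml.etree.ElementTree as ET

# Attribute names such as android:name are stored under this XML namespace.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"
COMPONENT_TAGS = ("activity", "receiver", "service", "provider")

def extract_manifest_features(manifest_xml):
    """Parse a decoded AndroidManifest.xml string into permissions, intents and component counts."""
    root = ET.fromstring(manifest_xml)

    def names(tag):
        # Collect the android:name attribute of every element with this tag.
        return {el.get(ANDROID_NS + "name") for el in root.iter(tag)} - {None}

    return {
        "permissions": names("uses-permission"),
        "intents": names("action") | names("category"),
        "component_counts": {tag: sum(1 for _ in root.iter(tag)) for tag in COMPONENT_TAGS},
    }

def build_feature_vector(features, permission_vocab, intent_vocab):
    """Quantize permissions and intents as 0/1 and append the component counts."""
    vector = [1 if p in features["permissions"] else 0 for p in permission_vocab]
    vector += [1 if i in features["intents"] else 0 for i in intent_vocab]
    vector += [features["component_counts"][tag] for tag in COMPONENT_TAGS]
    return vector

DEMO_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.demo">
  <uses-permission android:name="android.permission.READ_CONTACTS"/>
  <application>
    <activity android:name=".Main">
      <intent-filter>
        <action android:name="android.intent.action.MAIN"/>
        <category android:name="android.intent.category.LAUNCHER"/>
      </intent-filter>
    </activity>
    <service android:name=".Sync"/>
  </application>
</manifest>"""

demo_features = extract_manifest_features(DEMO_MANIFEST)
```

The vocabularies fix the column order of the feature vector, so the same permission or intent always lands in the same position across samples.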
The third part can be implemented as follows:
Feature screening is a key step in data preprocessing. It aims to eliminate feature redundancy and screen out highly relevant features, thereby improving the precision and efficiency of subsequent classification detection.
Filter-type feature screening algorithms are efficient to execute and can quickly score all features. Their limitation is that they cannot guarantee that the feature subset with the optimal classification effect is screened out, nor can they determine a small-scale feature subset. Wrapper-type feature screening algorithms can screen out a feature subset with a near-optimal classification effect by means of strategies such as heuristic search and random search, but the time complexity of feature screening grows exponentially as the feature scale increases. To address these inherent problems of the Filter and Wrapper algorithms in feature screening, the embodiment of the application proposes a new feature screening algorithm that mixes the Filter and Wrapper approaches, namely the IG-ReliefF hybrid feature screening algorithm (IG-ReliefF algorithm for short). According to the scoring results of the IG algorithm and the ReliefF algorithm, the features scored highly by the two algorithms are searched first, thereby achieving rapid screening.
To reduce the feature dimension as far as possible and improve the precision of subsequent classification detection, the IG-ReliefF hybrid screening algorithm is adopted in the feature screening stage. The IG algorithm and the ReliefF algorithm each screen the optimal feature subset according to an evaluation index of the features, but their evaluation criteria differ. For a specific feature, the score of the IG algorithm is the difference between the information entropy of the classification system and its conditional entropy given the feature, and mainly considers the influence of the feature on the information entropy of the classification system. The score of the ReliefF algorithm is the difference between the distance of the feature to different-class neighbor samples and its distance to same-class neighbor samples, and reflects the influence of the correlation between features on the classification result. The IG-ReliefF algorithm combines the advantages of the IG algorithm and the ReliefF algorithm and performs feature screening according to the classification result of the classifier.
The feature screening algorithm in the embodiment of the application can be divided into three stages, namely a preliminary scoring stage, a feature searching stage and a feature subset evaluation stage.
(1) Preliminary scoring stage: in the preliminary scoring stage, all features are scored with the IG and ReliefF feature screening algorithms respectively, and the scoring results serve as the basis for the subsequent feature search.
The IG algorithm uses information gain to quantify the score of a feature. For a data set S extracted from Android software samples (the data set S is the set of feature vectors obtained by performing feature extraction on a plurality of software samples), suppose a feature f in S (i.e., one column of the feature vectors extracted by the feature extraction model shown in fig. 2) takes n values f_1, f_2, ..., f_n (for example, if the number of a certain component is 5 in sample 1, 1 in sample 2 and 10 in sample 3, then the component-number feature takes 3 values: 5, 1 and 10), and the category label C takes 2 values, "0" and "1", representing a benign sample (i.e., a trusted application program) and a malicious sample (i.e., an untrusted application program) respectively. For each category c_i (i denotes the index of the category; e.g., a benign sample is c_0 and a malicious sample is c_1) with probability of occurrence p(C = c_i), the information entropy of the whole classification detection system can be represented by formula (1):
H(C) = -\sum_{i} p(C = c_i) \log p(C = c_i)    (1)
Given f = f_j (f_j is any one of the values f_1 to f_n), the conditional probability of category c_i is denoted P(c_i | f_j), and the conditional entropy given the feature f can be represented by formula (2), where P(f = f_j) is the probability that the feature f takes the value f_j:
H(C|f) = -\sum_{j=1}^{n} P(f = f_j) \sum_{i} P(c_i | f_j) \log P(c_i | f_j)    (2)
Therefore, the information gain IG(C|f) of the feature f can be represented by formula (3):
IG(C|f) = H(C) - H(C|f)    (3)
That is, the score of a feature f in the sample is IG(C|f).
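The information gain score of formula (3) can be computed directly from one feature column and the label column. The following is a minimal sketch; the base-2 logarithm is an assumption (the base only changes the unit of the score, not the ranking of features), and the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy H(C) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(C|f) = H(C) - H(C|f) for one feature column."""
    total = len(labels)
    conditional = 0.0
    for value, count in Counter(feature_values).items():
        # Labels of the samples on which the feature takes this value.
        subset = [y for v, y in zip(feature_values, labels) if v == value]
        conditional += (count / total) * entropy(subset)
    return entropy(labels) - conditional
```

A feature that splits the samples exactly by class removes all uncertainty and scores H(C); a feature whose values are spread evenly over both classes scores 0.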
The ReliefF algorithm is a supervised feature screening algorithm. For samples of known class, it evaluates the effectiveness of a feature by calculating the difference of the feature between same-class neighbor samples and different-class neighbor samples, and computes the weight of the feature through repeated iterative learning. The larger the weight, the better the discriminating ability of the feature; conversely, the smaller the weight, the worse the discriminating ability.
When features are evaluated with the ReliefF algorithm, let the data set be S = {s_1, s_2, …, s_m}, where m is the number of samples in the data set and each s_i represents the feature vector of an Android software sample. Each sample contains p features, s_i = {s_{i1}, s_{i2}, …, s_{ip}}, 1 ≤ i ≤ m, and the values of all features are numerical. The class label of sample s_i is c_i ∈ C, where C = {0, 1} is the set of class labels. The difference between two samples s_i and s_j (1 ≤ i ≠ j ≤ m) on the feature f (f is any one of the p features) can be represented by formula (4):
diff(f, s_i, s_j) = \frac{|s_{if} - s_{jf}|}{\max_f - \min_f}    (4)
where \max_f and \min_f are respectively the maximum and minimum values of the feature f in the sample set.
To calculate feature scores with the ReliefF algorithm, a sample s_i (1 ≤ i ≤ m) is first randomly selected from the sample set, and the samples nearest to it are selected from each of the two classes: neighbors in the same class as s_i are called NearHit, and neighbors in a different class are called MissHit. To avoid the randomness of a single sampling, r (r > 1) iterations are performed. Let R be the class of the randomly selected sample s_i, let C denote any class other than R, and let k denote the number of neighbors in each iteration; the weight update formula (5) is:
Figure BDA0002179809140000212
where p (C) represents the probability that a sample of class C appears in the sample population.
The value of W(f) after r iterations is the score given to the feature f by the ReliefF algorithm.
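The ReliefF scoring described above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the weight update of formula (5) itself: it follows the commonly published form of ReliefF, in which hit differences lower the weight, miss differences raise it, and misses are scaled by p(C) / (1 - p(R)). Neighbors are ranked by the sum of the per-feature differences of equation (4), and the parameter names `r`, `k` and `seed` are illustrative.

```python
import random


def diff(f, a, b, f_min, f_max):
    """Normalised difference of samples a and b on feature index f (equation 4)."""
    span = f_max[f] - f_min[f]
    return abs(a[f] - b[f]) / span if span else 0.0


def relieff(samples, labels, r=20, k=3, seed=0):
    """Return one weight W(f) per feature after r random draws."""
    rng = random.Random(seed)
    p = len(samples[0])
    f_min = [min(s[f] for s in samples) for f in range(p)]
    f_max = [max(s[f] for s in samples) for f in range(p)]
    prior = {c: labels.count(c) / len(labels) for c in set(labels)}
    weights = [0.0] * p

    def distance(a, b):
        # Assumed neighbour metric: sum of per-feature differences.
        return sum(diff(f, a, b, f_min, f_max) for f in range(p))

    for _ in range(r):
        i = rng.randrange(len(samples))
        s_i, cls = samples[i], labels[i]
        for c in prior:
            pool = [samples[j] for j in range(len(samples))
                    if j != i and labels[j] == c]
            neighbours = sorted(pool, key=lambda s: distance(s_i, s))[:k]
            if not neighbours:
                continue
            # Same-class neighbours lower the weight, other classes raise it.
            scale = -1.0 if c == cls else prior[c] / (1.0 - prior[cls])
            for f in range(p):
                total = sum(diff(f, s_i, n, f_min, f_max) for n in neighbours)
                weights[f] += scale * total / (r * len(neighbours))
    return weights
```

A feature that differs little among same-class neighbors but strongly across classes ends with a positive weight, which matches the statement that a larger weight indicates better discrimination ability.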
(2) Feature search stage: let the scores given by the IG algorithm to the p features be G = {g_1, g_2, …, g_p}, and let the scores given by the ReliefF algorithm to the p features be R = {r_1, r_2, …, r_p}. The n features with the highest scores are screened out of G and out of R according to the scoring results, giving G_n and R_n respectively. The search formula (6) of the feature subset FS is then:
FS = G_n ∪ R_n (6)
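A short Python sketch of formula (6), assuming features are referred to by their column index and that `g_scores` and `r_scores` are the score lists G and R; the helper names are illustrative.

```python
def top_n(scores, n):
    """Indices of the n highest-scoring features."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:n])


def search_feature_subset(g_scores, r_scores, n):
    """FS = G_n ∪ R_n: union of the top-n features under each scoring method."""
    return top_n(g_scores, n) | top_n(r_scores, n)
```

Because the two rankings may disagree, FS contains between n and 2n features for a given n.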
(3) Feature subset evaluation stage: in this stage, a data set is constructed from the feature subset generated in the feature search stage and divided into training and test sets by 5-fold cross-validation. That is, the data set is randomly divided into 5 parts, each part is used in turn as the test set while the remaining 4 parts are used as the training set, classification detection is carried out 5 times, and the average of the 5 classification detection results is used as the score of the feature subset, so that the feature subset with the highest score is screened out. The classification detection algorithm and the evaluation indexes of the detection result used in the feature screening process are described as follows:
The SVM classifier based on the linear kernel function, namely the Linear Support Vector Machine (LSVM), has a short training time and high classification detection efficiency. In addition, since the subsequent classification algorithm adopts an integrated classifier based on the LSVM algorithm, the LSVM classifier is selected to complete the feature screening.
In the Android malicious software detection, evaluation indexes such as accuracy, recall rate (detection rate) and classification precision can be obtained through a confusion matrix. In the Android malware detection, the confusion matrix is shown in table 1:
TABLE 1 confusion matrix
Figure BDA0002179809140000221
Wherein the four basic index quantities are respectively defined as: TP represents the number of benign software correctly identified. FP represents the amount of malware that was misidentified as benign. TN represents the number of malware correctly identified. FN indicates the number of benign software misidentifications.
Based on the confusion matrix, the following four metrics are defined:
Index 1, the accuracy rate, is the ratio of the number of correctly identified malware to the number of software identified as malware, and can be expressed by equation (7):
Figure BDA0002179809140000222
Index 2, the detection rate (also called the recall rate), is the ratio of the number of correctly identified malware to the number of actual malware, and can be expressed by equation (8):
Figure BDA0002179809140000223
Index 3, the classification precision, is the ratio of the number of correctly identified software to the number of all sample software, and can be expressed by equation (9):
$$\text{classification precision} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (9)$$
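Index 4, F1, is the harmonic mean of the accuracy rate (index 1) and the detection rate (index 2), and can be expressed by equation (10):
$$F1 = \frac{2 \times \text{accuracy rate} \times \text{detection rate}}{\text{accuracy rate} + \text{detection rate}} \qquad (10)$$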
Index 3 is a commonly used evaluation index, but it does not take data imbalance into account and therefore cannot evaluate the classification result objectively. In an Android malware sample set, malicious samples are few because they are difficult to collect, so the detection rate can be low even when the classification precision is high. In comparison, the index F1 considers both the accuracy rate and the detection rate and is suitable for evaluating the classification effect under data imbalance. Therefore, in the feature screening stage of the embodiment of the present application, F1 is used as the evaluation index of classification detection.
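The four indexes can be computed from the four confusion-matrix counts as in the following Python sketch. To avoid depending on which class is treated as positive, the counts are passed under descriptive, hypothetical names; `precision` corresponds to the accuracy rate of index 1, `detection_rate` to index 2, and `accuracy` to the classification precision of index 3.

```python
def detection_metrics(mal_hit, mal_missed, ben_hit, ben_flagged):
    """Evaluation indexes 1 to 4 for malware detection.

    mal_hit:     malware correctly identified as malware
    mal_missed:  malware misidentified as benign
    ben_hit:     benign software correctly identified as benign
    ben_flagged: benign software misidentified as malware
    """
    flagged = mal_hit + ben_flagged          # everything reported as malware
    actual = mal_hit + mal_missed            # all real malware
    total = mal_hit + mal_missed + ben_hit + ben_flagged
    precision = mal_hit / flagged if flagged else 0.0
    detection_rate = mal_hit / actual if actual else 0.0
    accuracy = (mal_hit + ben_hit) / total if total else 0.0
    denom = precision + detection_rate
    f1 = 2 * precision * detection_rate / denom if denom else 0.0
    return {"precision": precision, "detection_rate": detection_rate,
            "accuracy": accuracy, "f1": f1}
```

As an illustration with made-up counts, for 950 benign and 50 malicious samples, a classifier that finds only 10 of the malicious samples and wrongly flags 5 benign ones still reaches an index 3 value of 0.955, while its detection rate is 0.2 and its F1 is about 0.31. This is the imbalance effect that motivates using F1 in the feature screening stage.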
Fig. 3 is a schematic view of an implementation flow of the feature screening method in the embodiment of the present application, and as shown in fig. 3, the feature screening algorithm in the embodiment of the present application is divided into three stages: a preliminary scoring phase including steps S301 to S302, a feature search phase including steps S303 to S304, and a feature subset evaluation phase including steps S305 to S308.
Step S301, searching for the n (initial value 1) features with the highest scores under the IG algorithm;
Step S302, searching for the n (initial value 1) features with the highest scores under the ReliefF algorithm;
In the embodiment of the application, all the features are scored by the two feature screening algorithms IG and ReliefF, and the scoring results are stored as G = {g_1, g_2, …, g_p} and R = {r_1, r_2, …, r_p}, respectively.
Step S303, taking the union of the feature subsets searched by the two algorithms;
Step S304, constructing a data set from the searched features;
In the embodiment of the application, the n highest-scoring features in G and in R are searched first, giving G_n and R_n respectively, and the searched feature subset FS is determined from G_n and R_n.
Step S305, training a classifier to obtain a classification result;
Step S306, recording the searched feature subset and the classification result;
Step S307, determining whether all the features have been searched;
Here, when it is determined in step S307 that all the features have been searched, step S308 is performed. If not all the features have been searched, n is increased by 1 and the search continues with the n+1 highest-scoring features.
Step S308, acquiring the feature subset with the highest score.
In the embodiment of the application, an FS is used for constructing a data set, the data set is divided into a training set and a testing set, an LSVM classifier is trained, the LSVM classifier is used for completing classification detection of the testing set, an F1 value is calculated according to a detection result and serves as a score of the FS, and a feature subset with the highest score is screened out through increasing n, so that classification detection is completed.
That is to say, the screening process of the feature screening algorithm in the embodiment of the present application can be implemented by the following steps:
Step S311, scoring the p features in the data set S by using the IG and ReliefF algorithms and storing the scoring results as G and R, where SF (initially the empty set) represents the feature subset with the highest current score and Fbest (initially 0) represents the highest current score;
Step S312, calculating the feature subset FS by formula (6) according to the value of n (initial value 1), and forming a screened data set D based on FS;
Step S313, randomly and evenly dividing the data in D into 5 parts, of which 4 parts are used as the training set of the classifier and the remaining 1 part is used as the test set. The training and testing processes are executed, the classification result of the test set is stored, and F1 is calculated as the score of FS;
Step S314, rotating the part used as the test set in step S313 so that training and testing are executed 5 times in total, and calculating the average value of F1. If the average value is larger than Fbest, the average value is assigned to Fbest and the elements of FS are assigned to SF; otherwise, the assignment of Fbest and SF is skipped;
Step S315, increasing n by 1 and repeating steps S312 to S314; the iteration stops when n reaches the maximum iteration number p, and D is output.
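Steps S311 to S315 can be sketched in Python as follows. The sketch assumes the scores G and R are already available as lists, and it delegates training and testing to a hypothetical callback `f1_of(features, train_idx, test_idx)` that returns the F1 value of one fold (for example by training a linear SVM). It returns SF and Fbest rather than a data set.

```python
import random


def five_fold_indices(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs; each part is the test set once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    for t in range(5):
        train = [j for u, fold in enumerate(folds) if u != t for j in fold]
        yield train, folds[t]


def screen_features(g_scores, r_scores, n_samples, f1_of):
    """Grow n from 1 to p and keep the subset FS with the best mean F1."""
    p = len(g_scores)
    g_order = sorted(range(p), key=lambda i: g_scores[i], reverse=True)
    r_order = sorted(range(p), key=lambda i: r_scores[i], reverse=True)
    sf, f_best = set(), 0.0                            # S311
    for n in range(1, p + 1):                          # S315
        fs = set(g_order[:n]) | set(r_order[:n])       # S312, formula (6)
        folds = list(five_fold_indices(n_samples))     # S313
        mean_f1 = sum(f1_of(fs, tr, te) for tr, te in folds) / 5  # S314
        if mean_f1 > f_best:
            sf, f_best = fs, mean_f1
    return sf, f_best
```

Keeping the classifier behind a callback is a design choice of this sketch; it lets the same loop be exercised with a stub before a real classifier is plugged in.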
The fourth part can be implemented as follows:
In the Android malicious software detection process, using a single SVM algorithm for classification detection has two problems. First, because Android malicious software samples are difficult to collect, the ratio of benign software to malicious software in the data set is unbalanced, the classification result of the trained SVM classifier is biased toward benign software, and the detection rate of Android malicious software is low. Second, although the SVM algorithm classifies small sample sets well, it is sensitive to individual samples, particularly those close to the classification boundary; if wrongly classified samples exist on the classification boundary, the accuracy of the classifier is seriously affected.
Therefore, bootstrap sampling is performed on the benign data set and the malicious data set separately to construct a balanced data set, which reduces the influence of unbalanced data on the classification result. Bootstrap sampling is an important method in non-parametric statistics for estimating the variability of a statistic, and it can also be used for interval estimation. Since the Bagging algorithm (a method for improving the accuracy of a learning algorithm) can remarkably improve the classification performance of algorithms whose classification effect is unstable, an SVM-based integrated classifier is constructed through the Bagging algorithm to improve the stability and detection precision of the classifier.
The embodiment of the application provides an Android malicious software detection algorithm based on Bagging-SVM. First, k balanced data sets are constructed using bootstrap sampling. Then, k SVM base classifiers C_i are trained on the balanced data sets. Finally, the k SVM base classifiers are combined into an SVM integrated classifier C* by majority voting. During classification detection, the output of the SVM integrated classifier C* is used as the classification result of the test sample.
In order to improve the execution efficiency of classification detection, the embodiment of the application adopts a linear support vector machine to design a classification detection process based on a Bagging-SVM algorithm, and the process design is as follows:
Step S401, randomly extracting 80% of the data from D to form a training set D_train and forming the remaining 20% of the data into a test set D_test, and splitting D_train by data type into a benign data set D_b and a malicious data set D_m, i.e., D_train = {D_b, D_m};
Step S402, letting m and b denote the numbers of samples in D_m and D_b, respectively. To implement bootstrap sampling, a random number generator is used to generate two sets of m positive integers, in which repeated values are allowed: IM_i (i = 1, 2, ..., m) and IB_i (i = 1, 2, ..., m), where 0 < IM_i, IB_i ≤ m;
Step S403, taking IM_i as the serial numbers of the malicious samples to be extracted and IB_i as the serial numbers of the benign samples to be extracted, executing the extraction process to obtain m benign data and m malicious data, and combining the benign data and the malicious data to form a new data set D_i to be trained;
Step S404, the samples of D_i are represented as (x_i, y_i) (i = 1, 2, ..., m), where x_i = (x_i1, x_i2, ..., x_ip)^T is the attribute set corresponding to the i-th training sample and y_i ∈ {0,1} represents the class of the Android sample; the SVM model can be represented by formula (11):
Figure BDA0002179809140000261
where α is a penalty term. The Lagrange multipliers λ_i are obtained by solving the model with a quadratic programming method, and the parameters w and C of the LSVM-based classifier C can then be solved by equations (12) and (13):
Figure BDA0002179809140000262
Figure BDA0002179809140000263
correspondingly, the base classifier C can be represented by equation (14):
Figure BDA0002179809140000264
Step S405, repeatedly executing steps S402 to S404 k times to obtain k LSVM base classifiers C_i, and combining the k base classifiers into an SVM integrated classifier C*.
Step S406, inputting x into k basis classifiers in the SVM integrated classifier C for each test sample x in the D _ test, and calculating voting results of the k basis classifiers. Wherein, the voting formula (15) is:
$$C^*(x) = \mathrm{vote}(C_1(x), C_2(x), \ldots, C_k(x)) = \delta\Big(\sum_{i} \mathrm{sign}(C_i(x) = y) > 0\Big) \qquad (15)$$
When the condition C_i(x) = y holds, sign(C_i(x) = y) = 1; otherwise, sign(C_i(x) = y) = -1. When the condition Σ_i sign(C_i(x) = y) > 0 holds, δ(Σ_i sign(C_i(x) = y) > 0) = 1, representing malware; otherwise, δ(Σ_i sign(C_i(x) = y) > 0) = 0, representing benign software.
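Steps S402 to S406 can be sketched in Python as follows. Several points are assumptions of the sketch: benign indices are drawn over the whole benign set rather than only up to m, a vote of 1 is taken to mean malware, a tie is resolved as benign, and `train_centroid` is a nearest-centroid stand-in for the base learner, not the LSVM of formulas (11) to (14). Any function that maps a labelled data set to a classifier can be passed as `train_base`.

```python
import random


def balanced_bootstrap(malicious, benign, rng):
    """Draw m malicious and m benign samples with replacement (S402, S403)."""
    m = len(malicious)
    picked_mal = [malicious[rng.randrange(m)] for _ in range(m)]
    picked_ben = [benign[rng.randrange(len(benign))] for _ in range(m)]
    return [(x, 1) for x in picked_mal] + [(x, 0) for x in picked_ben]


def train_ensemble(malicious, benign, train_base, k=5, seed=0):
    """Train k base classifiers, each on its own balanced data set (S405)."""
    rng = random.Random(seed)
    return [train_base(balanced_bootstrap(malicious, benign, rng))
            for _ in range(k)]


def vote(classifiers, x):
    """Majority vote: +1 per malware vote, -1 otherwise; malware if sum > 0."""
    total = sum(1 if clf(x) == 1 else -1 for clf in classifiers)
    return 1 if total > 0 else 0


def train_centroid(dataset):
    """Stand-in base learner (not an SVM): assign the nearest class centroid."""
    def centroid(label):
        rows = [x for x, y in dataset if y == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    c0, c1 = centroid(0), centroid(1)

    def classify(x):
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        return 1 if d1 < d0 else 0

    return classify
```

Because each base classifier sees equal numbers of benign and malicious samples, no single classifier is trained on the original class imbalance, and the vote smooths out the sensitivity of any one classifier to its particular resample.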
Fig. 4 is a schematic view of an implementation flow of a software detection method according to an embodiment of the present application, and as shown in fig. 4, the method includes:
Step S411, randomly splitting the data set into a test data set and a training data set;
Step S412, setting the initial value of the number k of SVM classifiers to 1;
Step S413, determining whether the software in the training data set is malicious software;
If it is malware, step S414 is executed; if not, step S415 is executed.
Step S414, adding the software to a malicious data set and performing bootstrap sampling;
Step S415, adding the software to a benign data set and performing bootstrap sampling;
Step S416, constructing a balanced data set according to the sampling results and training an SVM classifier;
Step S417, determining whether k satisfies the stop condition;
Here, when k satisfies the stop condition, step S418 is performed. When k does not satisfy the stop condition, k is increased by 1 and the flow returns to step S412.
Step S418, performing majority voting on the obtained k SVM classifiers to form an SVM integrated classifier;
Step S419, performing classification detection on the test data set with the integrated classifier.
To address the low detection accuracy and low detection rate of existing classification detection models, the embodiment of the application collects software samples from different software categories in an Android application market, extracts static features of the Android software samples, and provides an IG-ReliefF feature screening algorithm and a Bagging-SVM classification detection algorithm. On this basis, an Android malicious software static detection model based on application classification is provided. Fig. 5 is a framework diagram of the software static detection model in the embodiment of the application. As shown in fig. 5, the detection model comprises four stages: application classification 501, feature extraction 502, feature screening 503, and classification detection 504. In the application classification 501 stage, software samples are divided into different sample sets according to the software categories in the application market, and the subsequent three stages are completed within a specific software category. In the feature extraction 502 stage, static features with good discrimination are extracted as the basis for classification detection. In the feature screening 503 stage, the classification detection effect of the model is improved by optimizing the extracted data set. In the classification detection 504 stage, a machine learning classifier is trained and the classification detection of Android malicious software is completed.
Compared with conventional dynamic and static detection methods, the method has higher detection precision: through the four stages of application classification, feature extraction, feature screening and classification detection, the classification precision of the final result is improved by 8 percent. The detection rate is also higher than that of existing methods.
Based on the foregoing embodiments, the present application provides an application program identification apparatus, where the apparatus includes units, modules included in the units, and components included in the modules, and may be implemented by a processor in a computer device; of course, the implementation can also be realized through a specific logic circuit; in the implementation process, the processor may be a Central Processing Unit (CPU), a Microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
Fig. 6 is a schematic diagram illustrating a structure of an application program identification apparatus according to an embodiment of the present application, and as shown in fig. 6, the apparatus 600 includes:
the feature extraction unit 601 is configured to perform feature extraction on the acquired installation package of the application program to be identified to obtain an installation package feature set of the application program to be identified;
a scoring unit 602, configured to obtain a score of a feature in the installation package feature set, and sort the scored feature according to a scoring result;
a feature selection unit 603, configured to perform feature selection according to the sorting result to determine a target feature set;
the identifying unit 604 is configured to classify the target feature set multiple times, and determine a security attribute category of the application to be identified according to a result of the multiple classification.
In some embodiments, the feature selecting unit 603 includes:
a feature selection module for selecting features a plurality of times according to the sorting result to form a plurality of feature subsets including different numbers of features;
and the characteristic subset training module is used for training a classifier by utilizing the characteristic subset and determining a target characteristic set according to a training result.
In some embodiments, the feature selection module comprises:
and the first feature selection part is used for selecting the top-ranked features for multiple times according to the descending order of the ranking result, and each time the selection is started from the first position of the ranking result.
In some embodiments, the scoring unit 602 includes:
the grading module is used for grading the features in the feature set of the installation package by using different feature grading methods;
and the sorting module is used for sorting the scores corresponding to each feature scoring method to obtain a corresponding sorting result.
In some embodiments, the feature selection module comprises:
the second characteristic selection module is used for selecting the characteristics with the set number in the front of the sequencing result for multiple times according to the descending sequence of the sequencing result;
and the processing module is used for taking intersection of the set number of features corresponding to each sequencing result.
In some embodiments, the feature subset training module comprises:
a feature subset training unit configured to train a corresponding classifier using the feature subset when the feature subset corresponds to the classifier one by one;
a first determining component for determining a harmonic mean of the training results of the different feature subsets;
and the second determining component is used for determining a target feature set according to the sizes of the harmonic mean values.
In some embodiments, the identifying unit 604 includes:
the first classification module is used for classifying the target feature set for multiple times by utilizing a plurality of base classifiers in the trained integrated classifier;
and the first determining module is used for determining the security attribute category of the application program to be identified according to the number of different security attribute categories in the multi-time classification result.
In some embodiments, the identifying unit 604 includes:
the second determination module is used for determining the application category of the application program to be identified;
the third determining module is used for determining a corresponding integrated classifier according to the application category of the application program to be identified;
the second classification module is used for classifying the target feature set for multiple times by utilizing a plurality of base classifiers in the integrated classifier;
and the fourth determining module is used for determining the security attribute category of the application program to be identified according to the result of the multiple classification.
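The per-category identification performed by these modules can be sketched in Python as follows. The dictionary `ensembles`, which maps an application category to a list of base classifiers, and the function name `identify` are hypothetical structures chosen for illustration; the final category is taken as the one returned most often by the base classifiers.

```python
from collections import Counter


def identify(app_category, target_features, ensembles):
    """Select the ensemble for the application category and take a vote.

    ensembles: dict mapping an application category to a list of callables,
               each returning a security attribute category (hypothetical).
    """
    base_classifiers = ensembles[app_category]
    results = [clf(target_features) for clf in base_classifiers]
    # The security attribute category returned most often wins.
    return Counter(results).most_common(1)[0][0]
```

For example, an application in a "game" category would be classified only by base classifiers trained on game samples, consistent with completing the later stages within a specific software category.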
In some embodiments, the types of the features in the installation package feature set comprise at least one of:
installation package permission type; installation package intent type; installation package component type; number of installation package components.
In some embodiments, the apparatus further comprises:
the acquisition unit is used for acquiring the application program with the label information as a training sample;
the determining unit is used for determining a plurality of training sample subsets, wherein the number of application programs with the same label information is the same in each training sample subset;
the training unit is used for enabling the training feature subsets to correspond to the base classifiers one by one and training the corresponding base classifiers by utilizing the training feature subsets;
a combination unit for combining the plurality of base classifiers into an integrated classifier.
In some embodiments, the application categories to which the applications in the training sample belong are the same.
In some embodiments, the training unit comprises:
the first training determination module is used for performing feature extraction on a training sample subset to determine an installation package feature set of an application program in the training sample subset;
the second training determination module is used for determining the prediction information of the application program according to the installation package characteristic set;
the comparison module is used for comparing the marking information of the application program with the prediction information to obtain a loss function of the base classifier;
and the base classifier training module is used for training the base classifier by utilizing the loss function.
In some embodiments, the second training determination module comprises:
the training and sequencing component is used for acquiring the scores of the features in the feature set of the installation package and sequencing the scored features according to the scoring results;
the first training determination component is used for carrying out feature selection according to the sequencing result so as to determine a target feature set of the training sample subset;
and the second training determination component is used for determining the prediction information of the application program in the training sample subset according to the target feature set of the training sample subset.
The above description of the apparatus embodiments, similar to the above description of the method embodiments, has similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
Correspondingly, the embodiment of the present application provides a computer device, which includes a memory and a processor, where the memory stores a computer program that can run on the processor, and the processor executes the computer program to implement the application program identification method provided in the above embodiment.
Correspondingly, the embodiment of the application provides a computer readable storage medium, on which a computer program is stored, and the computer program realizes the application program identification method when being executed by a processor.
Here, it should be noted that: the above description of the storage medium and device embodiments is similar to the description of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be noted that fig. 7 is a schematic diagram of a hardware entity of a computer device according to an embodiment of the present application, and as shown in fig. 7, the hardware entity of the computer device 700 includes: a processor 701, a communication interface 702, and a memory 703, wherein
The processor 701 generally controls the overall operation of the computer device 700.
The communication interface 702 may enable the computer device to communicate with other terminals or servers via a network.
The Memory 703 is configured to store instructions and applications executable by the processor 701, and may also buffer data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the processor 701 and modules in the computer device 700, and may be implemented by a FLASH Memory (FLASH) or a Random Access Memory (RAM).
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing module, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit. Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. An application identification method, the method comprising:
performing feature extraction on the acquired installation package of the application program to be identified to obtain an installation package feature set of the application program to be identified;
obtaining scores of the features in the feature set of the installation package, and sequencing the scored features according to scoring results;
performing feature selection according to the sequencing result to determine a target feature set;
and classifying the target feature set for multiple times, and determining the security attribute category of the application program to be identified according to the result of the multiple classification.
2. The method of claim 1, wherein the performing feature selection according to the ranking result to determine a target feature set comprises:
selecting the features for multiple times according to the sorting result to form multiple feature subsets comprising different numbers of features;
and training a classifier by using the feature subset, and determining a target feature set according to a training result.
3. The method of claim 2, wherein the plurality of feature selections based on the ranking result comprises:
and selecting the characteristics ranked in the front according to the descending order of the ranking result for multiple times, wherein each selection is started from the first position of the ranking result.
4. The method according to claim 2, wherein the obtaining scores of the features in the feature set of the installation package and sorting the scored features according to the scoring results comprises:
grading the features in the feature set of the installation package by using different feature grading methods;
and sorting the scores corresponding to each feature scoring method to obtain a corresponding sorting result.
5. The method of claim 4, wherein the selecting features a plurality of times according to the ranking result comprises:
selecting the characteristics with the set number in the front of the sorting results for multiple times according to the descending order of the sorting results;
and taking intersection of the set number of features corresponding to each sequencing result.
6. The method according to any one of claims 2 to 5, wherein the training a classifier by using the feature subset to determine a target feature set according to a training result comprises:
the feature subsets correspond to the classifiers one by one, and the corresponding classifiers are trained by utilizing the feature subsets;
determining harmonic mean values of different feature subset training results;
and determining a target feature set according to the sizes of the harmonic mean values.
7. The method according to claim 1, wherein the classifying the target feature set a plurality of times and determining the security attribute category of the application program to be identified according to the results of the plurality of classifications comprises:
classifying the target feature set a plurality of times, once with each of a plurality of base classifiers in a trained integrated classifier;
and determining the security attribute category of the application program to be identified according to the number of occurrences of each security attribute category in the classification results.
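The counting step of claim 7 can be sketched as a majority vote, assuming the most frequent category wins. The base classifiers here are hypothetical rule functions standing in for trained models, and the category names `"malicious"` and `"benign"` are illustrative labels that the claim does not name.

```python
from collections import Counter


def vote(base_classifiers, target_features):
    """Classify once per base classifier and return the most frequent category."""
    predictions = [clf(target_features) for clf in base_classifiers]
    counts = Counter(predictions)
    # most_common(1) returns the category with the highest count; ties fall
    # to the category seen first, a choice the claim does not specify.
    return counts.most_common(1)[0][0]


# Hypothetical stand-ins for trained base classifiers.
base_classifiers = [
    lambda f: "malicious" if "SEND_SMS" in f else "benign",
    lambda f: "malicious" if "READ_CONTACTS" in f else "benign",
    lambda f: "malicious" if len(f) > 5 else "benign",
]
label = vote(base_classifiers, {"SEND_SMS", "READ_CONTACTS"})
```

Using an odd number of base classifiers avoids ties when there are only two categories.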
8. The method according to claim 1, wherein the classifying the target feature set a plurality of times and determining the security attribute category of the application program to be identified according to the results of the plurality of classifications comprises:
determining an application category of the application program to be identified;
determining a corresponding integrated classifier according to the application category of the application program to be identified;
classifying the target feature set a plurality of times, once with each of a plurality of base classifiers in the integrated classifier;
and determining the security attribute category of the application program to be identified according to the results of the plurality of classifications.
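A sketch of the category-specific lookup in claim 8, assuming one integrated classifier is kept per application category and that the final decision is a majority vote. The registry `ENSEMBLES`, the application categories, and the rule functions are all hypothetical.

```python
from collections import Counter

# Hypothetical registry: one integrated classifier (a list of base
# classifiers) per application category.
ENSEMBLES = {
    "game": [
        lambda f: "malicious" if "SEND_SMS" in f else "benign",
        lambda f: "malicious" if "READ_CONTACTS" in f else "benign",
        lambda f: "benign",
    ],
    "messaging": [
        # Sending SMS is normal for a messaging app, so these illustrative
        # rules do not treat it as suspicious.
        lambda f: "benign",
        lambda f: "malicious" if "INSTALL_PACKAGES" in f else "benign",
        lambda f: "benign",
    ],
}


def identify(app_category, target_features):
    """Pick the ensemble for the category, classify with each base
    classifier, and return the most frequent category."""
    ensemble = ENSEMBLES[app_category]
    predictions = [clf(target_features) for clf in ensemble]
    return Counter(predictions).most_common(1)[0][0]
```

With these rules, the same feature set `{"SEND_SMS", "READ_CONTACTS"}` is judged differently depending on the application category, which is the motivation for selecting the classifier by category.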
9. The method of claim 1, wherein the types of features in the installation package feature set comprise at least one of:
installation package permission type; installation package intent type; installation package component type; number of installation package components.
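The four feature types of claim 9 can be illustrated with a sketch that assumes the installation package is an Android APK whose manifest has already been decoded to plain XML text (inside an APK the manifest is stored in a binary form, and decoding it is outside this sketch). The function name and the sample manifest are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace that ElementTree prefixes to attributes written as android:name.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"
COMPONENT_TAGS = ("activity", "service", "receiver", "provider")


def manifest_features(manifest_xml):
    """Extract permission, intent, and component features from manifest text."""
    root = ET.fromstring(manifest_xml)
    permissions = {e.get(ANDROID_NS + "name") for e in root.iter("uses-permission")}
    intents = {e.get(ANDROID_NS + "name") for e in root.iter("action")}
    components = [e.tag for e in root.iter() if e.tag in COMPONENT_TAGS]
    return {
        "permissions": permissions,
        "intents": intents,
        "component_types": set(components),
        "component_count": len(components),
    }


SAMPLE = """<manifest xmlns:android="http://schemas.android.com/apk/res/android">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <application>
    <activity android:name=".Main">
      <intent-filter>
        <action android:name="android.intent.action.MAIN"/>
      </intent-filter>
    </activity>
    <service android:name=".Sync"/>
  </application>
</manifest>"""
features = manifest_features(SAMPLE)
```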
10. The method according to claim 7 or 8, characterized in that the method further comprises:
acquiring application programs having marking information as training samples;
determining a plurality of training sample subsets, wherein the number of application programs with the same marking information in each training sample subset is the same;
training the base classifiers by using the training feature subsets, the training feature subsets corresponding to the base classifiers one to one;
and forming a plurality of base classifiers into an integrated classifier.
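A minimal sketch of the subset construction in claim 10. It reads the claim as requiring that, within each training sample subset, every marking (label) is represented by the same number of application programs; that reading, the random sampling strategy, and all names and parameters are assumptions.

```python
import random
from collections import defaultdict


def balanced_subsets(samples, n_subsets, per_label, seed=0):
    """samples: list of (app_id, label) pairs.

    Each returned subset contains `per_label` randomly drawn applications
    for every label, so the labels are balanced within the subset.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for app_id, label in samples:
        by_label[label].append((app_id, label))
    subsets = []
    for _ in range(n_subsets):
        subset = []
        for label in sorted(by_label):
            # Sample without replacement inside one subset; different
            # subsets may share applications.
            subset.extend(rng.sample(by_label[label], per_label))
        subsets.append(subset)
    return subsets


# Hypothetical labelled training samples: six benign, three malicious.
samples = [(f"benign_{i}", "benign") for i in range(6)]
samples += [(f"malware_{i}", "malicious") for i in range(3)]
subsets = balanced_subsets(samples, n_subsets=3, per_label=2)
```

Each subset would then be used to train one base classifier, and the trained base classifiers together form the integrated classifier.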
11. The method of claim 10, wherein the application categories to which the applications in the training samples belong are the same.
12. The method of claim 10, wherein training the corresponding base classifier using the training feature subset comprises:
performing feature extraction on a training sample subset to determine an installation package feature set of an application program in the training sample subset;
determining the prediction information of the application program according to the installation package feature set;
comparing the marking information of the application program with the prediction information to obtain a loss function of the base classifier;
training the base classifier using the loss function.
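The compare-and-train loop of claim 12 can be sketched with a small logistic-regression base classifier trained by gradient descent on the cross-entropy loss. The claim names neither the classifier nor the loss function, so both are assumptions, as are the learning rate, the epoch count, and the toy data.

```python
import math


def predict(weights, bias, x):
    """Predicted probability that the application has marking 1."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, bias, data):
    """Mean cross-entropy between the markings and the predictions."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = min(max(predict(weights, bias, x), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def train(data, n_features, lr=0.5, epochs=200):
    """Batch gradient descent on the cross-entropy loss."""
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in data:
            err = predict(weights, bias, x) - y  # prediction minus marking
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
    return weights, bias


# Hypothetical binary features [has SEND_SMS, has INTERNET]; marking 1 = malicious.
data = [([1, 1], 1), ([1, 0], 1), ([0, 1], 0), ([0, 0], 0)]
weights, bias = train(data, n_features=2)
```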
13. The method of claim 12, wherein determining the predictive information for the application based on the installation package feature set comprises:
obtaining scores of the features in the installation package feature set, and sorting the scored features according to the scoring results;
performing feature selection according to the ranking result to determine a target feature set of the training sample subset;
and determining the prediction information of the application program in the training sample subset according to the target feature set of the training sample subset.
14. An application program identification apparatus, the apparatus comprising:
a feature extraction unit, configured to perform feature extraction on an acquired installation package of an application program to be identified, so as to obtain an installation package feature set of the application program to be identified;
a scoring unit, configured to obtain scores of the features in the installation package feature set and to sort the scored features according to the scoring results;
a feature selection unit, configured to perform feature selection according to the ranking result so as to determine a target feature set;
and an identification unit, configured to classify the target feature set a plurality of times and to determine the security attribute category of the application program to be identified according to the results of the plurality of classifications.
15. A computer device comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor implements the application program identification method of any one of claims 1 to 13 when executing the program.
16. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the application program identification method of any one of claims 1 to 13.
CN201910792044.5A 2019-08-26 2019-08-26 Application program identification method and device, equipment and storage medium Pending CN112434291A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910792044.5A CN112434291A (en) 2019-08-26 2019-08-26 Application program identification method and device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910792044.5A CN112434291A (en) 2019-08-26 2019-08-26 Application program identification method and device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112434291A true CN112434291A (en) 2021-03-02

Family

ID=74690267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910792044.5A Pending CN112434291A (en) 2019-08-26 2019-08-26 Application program identification method and device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112434291A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140090061A1 (en) * 2012-09-26 2014-03-27 Northrop Grumman Systems Corporation System and method for automated machine-learning, zero-day malware detection
CN107256357A (en) * 2017-04-18 2017-10-17 北京交通大学 The detection of Android malicious application based on deep learning and analysis method
CN108717511A (en) * 2018-05-14 2018-10-30 中国科学院信息工程研究所 A kind of Android applications Threat assessment models method for building up, appraisal procedure and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113691492A (en) * 2021-06-11 2021-11-23 杭州安恒信息安全技术有限公司 Method, system, device and readable storage medium for determining illegal application program
CN113691492B (en) * 2021-06-11 2023-04-07 杭州安恒信息安全技术有限公司 Method, system, device and readable storage medium for determining illegal application program

Similar Documents

Publication Publication Date Title
Chou et al. Large‐scale plant protein subcellular location prediction
Hassan et al. Evaluation of computational techniques for predicting non-synonymous single nucleotide variants pathogenicity
CN110704840A (en) Convolutional neural network CNN-based malicious software detection method
Qiu et al. Predicting co-complexed protein pairs from heterogeneous data
Richardson et al. MetaCurator: A hidden Markov model‐based toolkit for extracting and curating sequences from taxonomically‐informative genetic markers
US20110295902A1 (en) Taxonomic classification of metagenomic sequences
CN112464232B (en) Android system malicious software detection method based on mixed feature combination classification
CN109063478A (en) Method for detecting virus, device, equipment and the medium of transplantable executable file
EP3293664B1 (en) Software analysis system, software analysis method, and software analysis program
Pei et al. CLADES: A classification‐based machine learning method for species delimitation from population genetic data
CN107615240A (en) For analyzing the scheme based on biological sequence of binary file
Rifaioglu et al. Large‐scale automated function prediction of protein sequences and an experimental case study validation on PTEN transcript variants
Meng et al. SecProMTB: support vector machine‐based classifier for secretory proteins using imbalanced data sets applied to Mycobacterium tuberculosis
Amengual-Rigo et al. NetCleave: an open-source algorithm for predicting C-terminal antigen processing for MHC-I and MHC-II
CN112434291A (en) Application program identification method and device, equipment and storage medium
CN1871595A (en) Methods of processing biological data
Liu et al. Are dropout imputation methods for scRNA-seq effective for scATAC-seq data?
JP6356015B2 (en) Gene expression information analyzing apparatus, gene expression information analyzing method, and program
Vallat et al. Building and assessing atomic models of proteins from structural templates: Learning and benchmarks
CN115952078A (en) Test case sequencing method, device and system and storage medium
CN110990834A (en) Static detection method, system and medium for android malicious software
Kumar et al. Android malware prediction using extreme learning machine with different kernel functions
WO2022157867A1 (en) Generating device, generating method, and generating program
Ghanbari Sorkhi et al. Predicting drug-target interaction based on bilateral local models using a decision tree-based hybrid support vector machine
Jian et al. $\ell_2$ Multiple Kernel Fuzzy SVM-Based Data Fusion for Improving Peptide Identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210302
