CN110647747B - False mobile application detection method based on multi-dimensional similarity - Google Patents

False mobile application detection method based on multi-dimensional similarity

Info

Publication number
CN110647747B
Authority
CN
China
Prior art keywords
mobile application
similarity
false
mobile
application
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910835333.9A
Other languages
Chinese (zh)
Other versions
CN110647747A (en
Inventor
王俊峰
吴鹏
刘�东
李凡
周凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201910835333.9A
Publication of CN110647747A
Application granted
Publication of CN110647747B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2127 Bluffing

Abstract

The invention discloses a false mobile application detection method based on multi-dimensional similarity. The method first filters the mobile applications to be detected using developer signatures and the characteristics of public libraries. For the filtered samples, it provides a dedicated algorithm for each of three dimensions (the mobile application as a whole, the mobile application resources, and the mobile application code) to identify suspicious false applications, and it fuses the three algorithms with a joint strategy to balance detection accuracy against time consumption. Finally, it uses VirusTotal, a platform that integrates multiple antivirus engines, to check whether each suspicious false application contains malicious behavior. The invention requires neither manual labeling in advance nor a training process for a detection model, and each individual algorithm detects one particular pattern of false application efficiently; fusing the algorithms with the joint strategy improves the accuracy of false application identification and reduces the time overhead.

Description

False mobile application detection method based on multi-dimensional similarity
Technical Field
The invention relates to the technical field of false mobile application analysis, in particular to a false mobile application detection method based on multi-dimensional similarity.
Background
Fake mobile applications are fraudulent mobile applications created by impersonating or repackaging a legitimate mobile application (an official version). They typically carry a specific intent or profit motive, such as an advertising plug-in or malicious code. Current fake mobile applications follow three main patterns: making minor changes to the original, such as modifying developer information; replacing the interface of the original application, such as localization (for example, Chinese localization); and adding more code to the legitimate application, such as an advertisement library. Fake mobile applications propagate through official or third-party application markets and induce users to download them, and they have become an important security threat to mobile applications. According to the mobile threat report released by McAfee in 2019, 65,000 new fake applications were found in December 2018 alone, about 6.5 times the number in January, and the trend continued in 2019. Users are mainly tricked into installing malicious applications through bait strategies such as phishing; for example, more than 8 million fake applications (including adware) have already been downloaded from the Google application store.
According to the report 'Review of China's Internet Network Security Situation in 2018' issued by the National Internet Emergency Center, a large number of fake mobile applications were used in 2018 to steal user information and carry out phishing; taking 'loan apps' alone as an example, sampling monitoring found more than 1.5 million instances of victimized users. Meanwhile, the number of counterfeit apps whose icons or names resemble the original software is rising; for example, the number of counterfeit app samples among financial-industry mobile applications increased nearly 3.5-fold, a new high in recent years.
Currently, there is relatively little research that directly targets the detection of false applications, and existing work depends heavily on expert knowledge and manual experience. Mobile application repackaging detection and malware detection are the techniques most closely related to it.
In terms of repackaging detection, DroidMOSS proposes using hash values of instruction sequence fragments of the mobile application as fingerprints for repackaging detection (the Second ACM Conference on Data and Application Security and Privacy, New York, NY, Feb. 2012, pp. 317-326); Hanna et al. use hash feature vectors of opcode k-grams to detect repackaged applications (Detection of Intrusions and Malware, and Vulnerability Assessment, Berlin, Heidelberg, Jul. 2013, pp. 62-81); DNADroid uses the program dependence graphs of mobile applications as features (Computer Security - ESORICS 2012, Berlin, Heidelberg, Sep. 2012, pp. 37-54); AnDarwin builds on DNADroid by converting the program dependence graph into semantic vectors to improve detection efficiency (Computer Security - ESORICS 2013, Berlin, Heidelberg, Sep. 2013, pp. 182-199). However, generating program dependence graphs and comparing them is time-consuming, which makes these methods difficult to apply at large scale.
ViewDroid uses the user interface (UI) of the mobile application as the criterion for judging repackaged applications, on the basic assumption that repackaging does not usually modify the appearance (the 2014 ACM Conference on Security and Privacy in Wireless and Mobile Networks, Oxford, United Kingdom, Jul. 2014, pp. 25-36); Zheng et al. use UIs combined with the sensitive APIs they trigger as features (the Second ACM Workshop on Security and Privacy in Smartphones and Mobile Devices, Raleigh, North Carolina, Oct. 2012, pp. 93-104). Such methods have difficulty discovering different kinds of fake mobile applications that contain the same malicious code.
In terms of malicious mobile application detection, Venugopal et al. were early to propose detecting malicious code with a signature-based method, which mainly extracts hash codes of applications and builds hash tables for malicious feature matching (Mobile Information Systems, 2008); Ghorbanian et al. designed and implemented a signature-based hybrid intrusion detection system (the IEEE Conference on Business Engineering and Industrial Applications Colloquium (BEIAC), 2013: 827-831); Wang et al. proposed an improved algorithm based on multi-level signature matching to detect Android malicious code (the 2015 5th International Conference on Information Science and Technology (ICIST), 2015: 93-98). Signature-based techniques work well for known malware but are less capable of detecting new types of malware. Sarma et al. systematically analyzed application permissions and designed corresponding detection methods (the 17th ACM Symposium on Access Control Models and Technologies, 2012: 13-22); Qiao et al. proposed a malicious code detection method that combines permissions and API function call features (the 2016 5th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI), 2016: 566-571); Du et al. proposed a malware detection method based on community structure, which automatically divides the function call graph into community structures used as malicious features (IEEE Access, vol. 5, pp. 17478-17486, 2017); Chen et al. proposed a method for rapidly detecting malicious applications based on a combination of the application UI structure and the function control flow graph (24th USENIX Security Symposium (USENIX Security 15), Washington, D.C., Aug. 2015, pp. 659-674); Sun et al. proposed an Internet of Things malware detection method for cloud computing environments that uses a reversible sketch structure as the malware feature signature (Software: Practice and Experience, vol. 47, no. 3, pp. 421-441, 2017); Shen proposed a method to prevent malware from spreading in Internet of Things networks, implemented mainly through cloud and fog computing intrusion detection systems (IDS) (IEEE Internet of Things Journal, vol. 5, no. 2, pp. 1043-1054, 2018).
In summary, fake mobile applications are cheap to produce and have a low technical threshold, yet they cause great harm and bring considerable benefit to attackers, so they are gradually becoming a main mode of attack; methods for discovering fake applications still need in-depth research.
Disclosure of Invention
The invention aims to provide a false mobile application detection method based on multi-dimensional similarity, which can quickly find false mobile applications from a propagation channel (application market), thereby reducing the harm of the false mobile applications.
In order to solve the technical problems, the invention adopts the technical scheme that:
a false mobile application detection method based on multi-dimensional similarity comprises the following steps:
step 1: filtering each mobile application in the mobile application set (for example, all applications in an application market) by using the signature of the mobile application and the characteristics of public libraries, so as to eliminate interference factors;
step 2: denoting the mobile application set filtered in step 1 as Set0, converting each mobile application in Set0 as a whole into a content-sensitive hash value, which ensures that similar content is mapped to similar hash values, then calculating mobile application similarity using the header index of the hash value, and dividing Set0 into a normal mobile application set and a suspicious false mobile application set according to a judgment condition;
step 3: judging the detection result of step 2 using the termination condition of the joint strategy; if the condition is met, terminating execution and jumping to step 7; otherwise, executing step 4;
the joint strategy calculation formula is as follows:
Figure GDA0002867285100000041
in the formula, Num(S_x) represents the number of mobile applications judged normal by the current algorithm, and Num(NS_x) represents the number of mobile applications judged suspicious by the current algorithm;
step 4: analyzing the mobile application set judged normal in step 2 (denoted Set1), calculating the similarity of the mobile applications in Set1 with the MinHash algorithm from the features of their resource files, and dividing Set1 into a normal application set and a suspicious false application set according to a judgment condition;
step 5: judging the detection result of step 4 using the termination condition of the joint strategy; if the condition is met, terminating execution and jumping to step 7; otherwise, executing step 6;
the joint strategy calculation formula is as follows:
Figure GDA0002867285100000051
in the formula, Num(S_x) represents the number of mobile applications judged normal by the current algorithm, and Num(NS_x) represents the number of mobile applications judged suspicious by the current algorithm;
step 6: further analyzing the mobile application set judged normal in step 4 (denoted Set2), extracting the function call graph of the code of each mobile application in Set2, extracting Motif structures from the function call graph, calculating mobile application similarity with a Motif structure comparison algorithm, and dividing Set2 into a normal application set and a suspicious false application set according to a judgment condition;
step 7: for the mobile applications judged suspicious by the multi-dimensional similarity algorithm, verifying with a virus scanning engine whether they contain malicious behavior;
step 8: storing the features of the mobile applications judged false in steps 2, 4 and 6 in a database to form a false mobile application feature library;
step 9: to judge whether a single mobile application is false, extracting its features using the methods of steps 2, 4 and 6 and comparing them for similarity with the false mobile application feature library of step 8; if the features of the application are in the feature library, the application is judged to be a false mobile application, otherwise it is not; a mobile application judged to be false is further checked for malicious behavior as in step 7.
Further, in step 2, the mobile application similarity algorithm based on the content-sensitive hash value is as follows:
1) constructing the header of the content-sensitive hash value: the header has a total length of 5 bytes and is composed of the mobile application file length and the content distribution coefficient of the mobile application. Because the file length is closely tied to the size of the mobile application and varies widely, it must be normalized: the logarithm of the original file length is taken first, and the result is then reduced modulo 256 (the number of distinct values one byte can represent) to obtain the normalized file length. The content distribution coefficient is calculated as in formula (1):
Figure GDA0002867285100000061
The parameters q1, q2 and q3 in formula (1) are derived from classification statistics of the Pearson hash values of fixed-length strings in the mobile application, where q2 is the median of the overall distribution, q1 is the lower quartile and q3 is the upper quartile;
2) calculating the similarity of the header index of the content-sensitive hash value: let any two mobile applications to be compared be denoted by g and h; the similarity based on the content-sensitive hash header index is calculated as in formula (2):
Figure GDA0002867285100000062
where cdiff is the result of comparing the cyclic redundancy check codes: its value is 0 if the two codes are the same and 1 if they differ;
ldiff, [Figure GDA0002867285100000063] and [Figure GDA0002867285100000064] are calculated from formula (3):
Υ(g, h, r) = min((g - h) mod r, (h - g) mod r)   (3)
where r is the length of the circular queue and takes the value 16 or 256, and Υ(g, h, r) represents the distance between g and h in a circular queue of size r;
when calculating ldiff, r in formula (3) is 256; if the result Υ(g, h, r) > 1, then ldiff = Υ(g, h, r) × 12, otherwise ldiff = Υ(g, h, r);
when calculating [Figure GDA0002867285100000065], r in formula (3) is 16; if the result Υ(g, h, r) > 1, then [Figure GDA0002867285100000066] or [Figure GDA0002867285100000067]; otherwise [Figure GDA0002867285100000068] or [Figure GDA0002867285100000069].
Further, the MinHash-based algorithm adopted in step 4 is as follows:
1) extracting the resource file information of the mobile application by means of decompilation and vectorizing its features, that is, expressing the resource features of a single mobile application as S = {a_1, a_2, …, a_m};
2) representing the resource features of each mobile application in the set to be compared in the manner of step 1); if the set contains n mobile applications, then D = {S_1, S_2, ..., S_n}; after normalizing all the mobile application feature vectors to be compared, the resource feature vectors of the set are uniformly expressed as U = (S_1 ∪ S_2 ∪ ... ∪ S_n); defining the length of U as x_u = Len(U), any element a_i in U belongs to some element of D, and the feature vectors are converted into the corresponding feature matrix:
Figure GDA0002867285100000071
wherein the row elements are [Figure GDA0002867285100000072] and the column elements are S_1, S_2, ..., S_n; each entry of the matrix is 0 or 1: if the element of row i belongs to S_j (column j), the entry is 1, otherwise it is 0;
3) constructing a feature signature matrix equivalent to the feature matrix, wherein the number of rows of the feature signature matrix is a fixed value k, k being the median of the lengths of all resource feature vectors; the feature signature matrix is calculated as in formula (4);
Figure GDA0002867285100000073
wherein h_i(u) is a set of independent random functions, each defined as in formula (5)
Figure GDA0002867285100000074
In formula (5), a and b are both random numbers in the range [1, c), and c is a Mersenne number whose size is determined by the size of the random function set;
4) calculating the similarity between different columns of the feature signature matrix using the standardized Euclidean distance, which is equivalent to the similarity of the mobile application resources, as in formula (6);
Figure GDA0002867285100000081
Further, the Motif structure algorithm based on code structure feature similarity in step 6 is as follows:
1) decompiling each mobile application to be processed and obtaining the function call graph G = (V, E) of its code, where V is the set of vertices (the different functions) and E is the set of edges (the call relations between functions);
2) for the function call graph obtained in step 1), taking Motif structures of three vertices as the basic structure and extracting the corresponding Motif structures, including the structure types and their frequency distribution, to form the feature representation {Motif type, Motif frequency distribution} of the mobile application code, where the Motif type is denoted MT and the Motif frequency distribution is denoted MFD;
3) calculating the similarity of the Motif types using the Jaccard distance, as in formula (7)
Figure GDA0002867285100000082
If the result is close to 1, the similarity of the frequency distributions is further calculated, as in formula (8);
Figure GDA0002867285100000083
4) calculating the similarity of the mobile application codes by using a formula (9)
Figure GDA0002867285100000084
Further, in step 7, malicious behavior is determined using VirusTotal, which is maintained by Google.
Further, the judgment condition in step 2 is: if the similarity calculation result is less than or equal to 50, the mobile application is determined to be a suspicious false mobile application; otherwise it is determined to be a normal mobile application.
Further, in step 3, the termination condition is satisfied if the joint strategy calculation result is > 98%.
Further, the judgment condition in step 4 is: if the similarity calculation result is less than or equal to 0.5, the mobile application is determined to be a suspicious false mobile application; otherwise it is determined to be a normal mobile application.
Further, in step 5, the termination condition is satisfied if the joint strategy calculation result is > 98%.
Further, the judgment condition in step 6 is: if the similarity calculation result is less than or equal to 1.0, the mobile application is determined to be a suspicious false mobile application; otherwise it is determined to be a normal mobile application.
Compared with the prior art, the invention has the beneficial effects that:
1) based on the fact that false applications propagate through application markets, the invention detects them at the propagation channel, so that the harm of false applications can be reduced at the source;
2) according to the patterns of false applications, the invention detects false mobile applications by similarity in three different dimensions (the application as a whole, the application resources and the application code), thereby achieving comprehensive detection of false mobile applications;
3) the invention provides three targeted algorithms, one for each of these three aspects of an application, and on this basis combines them into a whole with a joint strategy, balancing false application detection accuracy against time consumption;
4) the method takes the application itself as its object and needs no separate, explicitly labeled training samples; combined with the balance of accuracy and time consumption in 3), it can supplement the security detection mechanisms of application markets (application stores) and reduce the harm of false applications.
Drawings
FIG. 1 is a general flowchart of false application detection based on multi-dimensional similarity.
FIG. 2 is the framework for computing mobile application resource similarity.
FIG. 3 is the framework for computing mobile application code similarity.
Fig. 4 shows the results of the detection on the reference data set based on the multidimensional similarity algorithm.
Fig. 5 shows the results of the detection in the real environment dataset based on the multidimensional similarity algorithm.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. The invention relates to a false mobile application detection method based on multidimensional similarity, which comprises the following steps:
Step 1: Filter the mobile applications using the signature information of the mobile application developer and the characteristics of public libraries (including the development libraries provided by Android or iOS and the public libraries provided by third-party developers), eliminating interference factors for the subsequent similarity calculations.
Step 2: For the mobile applications filtered in step 1, convert each application as a whole into a content-sensitive hash value so that similar content is mapped to similar hash values; on this basis, calculate mobile application similarity using the hash value header index and divide the result into normal applications and suspicious false applications. This specifically comprises: A) calculating the content-sensitive hash header features, and B) calculating the similarity of the content-sensitive hash header indexes.
A) The header of the content-sensitive hash value has a total length of 5 bytes and is composed of a cyclic redundancy check code, the mobile application file length and the content distribution coefficient of the mobile application;
the mobile application file length is normalized: the logarithm of the original file length is taken first, and the result is then reduced modulo 256 to obtain the normalized file length; the content distribution coefficient is calculated as in formula (1):
Figure GDA0002867285100000101
The parameters q1, q2 and q3 in formula (1) are derived from classification statistics of the Pearson hash values of fixed-length strings in the mobile application, where q2 is the median of the overall distribution, q1 is the lower quartile and q3 is the upper quartile;
B) Let any two mobile applications to be compared be denoted by g and h; the similarity based on the content-sensitive hash header index is calculated as in formula (2):
Figure GDA0002867285100000102
where cdiff is the result of comparing the cyclic redundancy check codes: its value is 0 if the two codes are the same and 1 if they differ;
ldiff, [Figure GDA0002867285100000111] and [Figure GDA0002867285100000112] are calculated from formula (3):
Υ(g, h, r) = min((g - h) mod r, (h - g) mod r)   (3)
where r is the length of the circular queue and takes the value 16 or 256, and Υ(g, h, r) represents the distance between g and h in a circular queue of size r;
when calculating ldiff, r in formula (3) is 256; if the result Υ(g, h, r) > 1, then ldiff = Υ(g, h, r) × 12, otherwise ldiff = Υ(g, h, r);
when calculating [Figure GDA0002867285100000113], r in formula (3) is 16; if the result Υ(g, h, r) > 1, then [Figure GDA0002867285100000114] or [Figure GDA0002867285100000115]; otherwise [Figure GDA0002867285100000116] or [Figure GDA0002867285100000117].
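The circular-queue distance of formula (3), the rule for ldiff and the length normalization of step A) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, and the logarithm base (natural) and the truncation to an integer in `normalized_length` are assumptions, since the text only says that the length is logarithmized and reduced modulo 256.

```python
import math


def circular_distance(g: int, h: int, r: int) -> int:
    """Formula (3): shortest distance between g and h on a circular queue of size r."""
    return min((g - h) % r, (h - g) % r)


def normalized_length(file_size: int) -> int:
    """Normalize a file length (in bytes, must be > 0) into one byte.

    Assumption: natural logarithm, truncated to an integer before the modulo.
    """
    return int(math.log(file_size)) % 256


def ldiff(len_g: int, len_h: int) -> int:
    """Length difference term: distances greater than 1 are weighted by 12."""
    d = circular_distance(len_g, len_h, 256)
    return d * 12 if d > 1 else d


print(circular_distance(250, 3, 256))  # wraps around the queue instead of giving 247
print(ldiff(10, 11), ldiff(250, 3))
```

Because the distance wraps around the queue, two normalized lengths near opposite ends of the byte range (such as 250 and 3) are treated as close rather than far apart.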
Step 3: Judge the detection result of step 2 using the termination condition of the joint strategy, which is calculated as follows:
Figure GDA0002867285100000118
In the formula, Num(S_x) represents the number of mobile applications judged normal by the current algorithm, and Num(NS_x) represents the number judged suspicious by the current algorithm. If [Figure GDA0002867285100000119] is larger than the set threshold, jump to step 7; otherwise, execute step 4.
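The control flow of the joint strategy (run one algorithm, test the termination condition, and only then fall through to the next, more expensive algorithm) can be sketched as a small cascade. The joint strategy formula itself is not reproduced in this text, so the score used below, the share of applications judged normal, is only an assumed stand-in and is passed in as a replaceable parameter; `run_cascade`, `normal_ratio` and the stage signature are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

# A stage takes the applications still under test and returns (normal, suspicious).
Stage = Callable[[List[str]], Tuple[List[str], List[str]]]


def normal_ratio(num_normal: int, num_suspicious: int) -> float:
    """Assumed stand-in for the joint strategy score: share of apps judged normal."""
    total = num_normal + num_suspicious
    return num_normal / total if total else 0.0


def run_cascade(apps: Sequence[str], stages: Sequence[Stage],
                score: Callable[[int, int], float] = normal_ratio,
                threshold: float = 0.98) -> Tuple[List[str], List[str]]:
    """Apply the stages in order; stop early once the score exceeds the threshold."""
    remaining, suspicious = list(apps), []
    for stage in stages:
        normal, flagged = stage(remaining)
        suspicious.extend(flagged)
        remaining = normal  # only apps judged normal go on to the next stage
        if score(len(normal), len(flagged)) > threshold:
            break  # termination condition met: skip the remaining stages
    return remaining, suspicious
```

The suspicious list returned by the cascade corresponds to the applications that would then be passed to the malicious behavior check of step 7.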
Step 4: Under the guidance of the joint strategy, further analyze the applications judged normal in step 2: calculate application similarity with the MinHash algorithm from the features of the resource files, and divide the result into suspicious mobile applications and normal mobile applications. This mainly comprises: A) extracting the resource files, B) representing the resource file features, C) constructing the resource file signature feature matrix and D) calculating the similarity.
A) Extract the resource file information of the mobile application by means of decompilation and vectorize its features, that is, express the resource features of a single mobile application as S = {a_1, a_2, …, a_m};
B) Represent the resource features of each mobile application in the set to be compared in the manner of step A); if the set contains n mobile applications, then D = {S_1, S_2, ..., S_n}; after normalizing all the mobile application feature vectors to be compared, the resource feature vectors of the set are uniformly expressed as U = (S_1 ∪ S_2 ∪ ... ∪ S_n); defining the length of U as x_u = Len(U), any element a_i in U belongs to some element of D, and the feature vectors are converted into the corresponding feature matrix:
Figure GDA0002867285100000121
wherein the row elements are [Figure GDA0002867285100000122] and the column elements are S_1, S_2, ..., S_n; each entry of the matrix is 0 or 1: if the element of row i belongs to S_j (column j), the entry is 1, otherwise it is 0;
C) Construct a feature signature matrix equivalent to the feature matrix, wherein the number of rows of the feature signature matrix is a fixed value k, k being the median of the lengths of all resource feature vectors; the feature signature matrix is calculated as in formula (4);
Figure GDA0002867285100000123
wherein h_i(u) is a set of independent random functions, each defined as in formula (5)
Figure GDA0002867285100000124
In formula (5), a and b are both random numbers in the range [1, c), and c is a Mersenne number whose size is determined by the size of the random function set;
D) Calculate the similarity between different columns of the feature signature matrix using the standardized Euclidean distance, which is equivalent to the similarity of the mobile application resources, as in formula (6);
Figure GDA0002867285100000131
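Steps B) to D) can be sketched in Python as follows. Formulas (4) to (6) are not reproduced in this text, so several details are assumptions rather than the patented definitions: the hash family is taken to be the usual linear form (a·u + b) mod c applied to row indices, c is fixed to the Mersenne prime 2^31 − 1, and the standardized Euclidean distance divides each squared difference by the variance of that signature row across all applications. The function names are hypothetical, and every feature set is assumed to be non-empty.

```python
import math
import random
import statistics

MERSENNE_PRIME = (1 << 31) - 1  # assumed choice of c; the text only says "Mersenne number"


def minhash_signatures(feature_sets, seed=0):
    """Return one MinHash signature (a list of k minima) per resource feature set."""
    universe = sorted(set().union(*feature_sets))  # U = S_1 ∪ ... ∪ S_n
    row_of = {feature: row for row, feature in enumerate(universe)}
    # k is the median of the feature vector lengths, as stated in step C).
    k = max(1, int(statistics.median(len(s) for s in feature_sets)))
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(1, MERSENNE_PRIME))
              for _ in range(k)]
    # Assumed hash family: h_i(u) = (a_i * u + b_i) mod c over the row indices of U.
    return [[min((a * row_of[f] + b) % MERSENNE_PRIME for f in s) for a, b in coeffs]
            for s in feature_sets]


def standardized_euclidean(sig_x, sig_y, all_sigs):
    """Euclidean distance with each component scaled by that row's variance."""
    total = 0.0
    for i, (x, y) in enumerate(zip(sig_x, sig_y)):
        variance = statistics.pvariance([sig[i] for sig in all_sigs])
        if variance > 0:  # rows that are identical for every app carry no information
            total += (x - y) ** 2 / variance
    return math.sqrt(total)
```

Working on the k-row signatures instead of the full 0/1 feature matrix keeps the comparison cost independent of the size of U, which is the usual motivation for MinHash.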
Step 5: Judge the detection result of step 4 using the termination condition of the joint strategy, which is calculated as follows:
Figure GDA0002867285100000132
In the formula, Num(S_x) represents the number of mobile applications judged normal by the current algorithm, and Num(NS_x) represents the number judged suspicious by the current algorithm. If [Figure GDA0002867285100000133] is larger than the set threshold, jump to step 7; otherwise, execute step 6.
Step 6: Under the guidance of the joint strategy, further analyze the applications judged normal in step 4 by A) extracting the function call graph of the mobile application code, B) extracting the Motif structures of the function call graph, and C) calculating mobile application similarity with a Motif structure comparison algorithm; the result is divided into suspicious mobile applications and normal mobile applications.
A) Decompile each mobile application to be processed and obtain the function call graph G = (V, E) of its code, where V is the set of vertices (the different functions) and E is the set of edges (the call relations between functions);
B) For the function call graph obtained in step A), take Motif structures of three vertices as the basic structure and extract the corresponding Motif structures, including the structure types and their frequency distribution, to form the feature representation {Motif type, Motif frequency distribution} of the mobile application code, where the Motif type is denoted MT and the Motif frequency distribution is denoted MFD;
C) Then calculate the similarity of the Motif types using the Jaccard distance, as in formula (7)
Figure GDA0002867285100000134
If the result is close to 1, further calculating the similarity of the frequency distributions, as in formula (8);
Figure GDA0002867285100000141
Finally, calculating the similarity of the mobile application codes using formula (9)
Figure GDA0002867285100000142
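Steps A) to C) can be illustrated with a small Python sketch that enumerates three-vertex Motif structures in a directed call graph, counts them, and compares the Motif type sets with the Jaccard similarity. Formulas (7) to (9) are not reproduced in this text, so only the standard Jaccard definition is shown; the brute-force enumeration over all vertex triples, the canonical-code representation of a Motif type, and all function names are assumptions that suit only small graphs.

```python
from collections import Counter
from itertools import combinations, permutations


def canonical_code(triple, edges):
    """Label of the isomorphism class of the subgraph induced on three vertices."""
    best = None
    for p in permutations(triple):
        code = tuple(
            (p[i], p[j]) in edges
            for i in range(3) for j in range(3) if i != j
        )
        if best is None or code > best:
            best = code
    return best


def is_connected(triple, edges):
    """Three vertices are weakly connected if at least two pairs are linked."""
    a, b, c = triple
    linked = lambda x, y: (x, y) in edges or (y, x) in edges
    return linked(a, b) + linked(a, c) + linked(b, c) >= 2


def motif_features(call_edges):
    """Return (MT, MFD): the set of Motif types and their frequency counts."""
    edges = {(u, v) for u, v in call_edges if u != v}
    vertices = sorted({v for edge in edges for v in edge})
    mfd = Counter()
    for triple in combinations(vertices, 3):
        if is_connected(triple, edges):
            mfd[canonical_code(triple, edges)] += 1
    return set(mfd), mfd


def jaccard(mt_a, mt_b):
    """Jaccard similarity of two Motif type sets."""
    union = mt_a | mt_b
    return len(mt_a & mt_b) / len(union) if union else 1.0


# Usage: two call chains share a Motif type; a fan-out does not.
mt_chain, mfd_chain = motif_features([("a", "b"), ("b", "c")])
mt_chain2, _ = motif_features([("x", "y"), ("y", "z")])
mt_fan, _ = motif_features([("a", "b"), ("a", "c")])
```

Because the canonical code is invariant under relabeling of the three vertices, renaming functions (as an obfuscator might) does not change the extracted Motif types.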
Step 7: The mobile applications judged suspicious by the multidimensional similarity algorithm are further checked with multiple antivirus engines to verify whether they contain malicious behavior. VirusTotal, which integrates more than 60 mainstream domestic and foreign antivirus engines, is mainly used for this judgment. A false mobile application is judged to contain malicious behavior under the following rule: in the results returned by VirusTotal, at least 10 antivirus engines judge it to be a malicious mobile application, and these engines must include well-known antivirus engines such as Symantec, Norton, McAfee, Kaspersky, and 360.
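The step 7 rule can be sketched as a predicate over per-engine verdicts. The input format (a dict mapping engine name to a boolean), the function name, and the engine labels are hypothetical, and retrieving the scan report from VirusTotal is out of scope here. The text does not state whether every well-known engine must agree or only some; this sketch assumes at least one.

```python
# Hypothetical engine labels; real report keys may be spelled differently.
WELL_KNOWN_ENGINES = frozenset({"Symantec", "McAfee", "Kaspersky", "Qihoo-360"})


def contains_malicious_behavior(verdicts, min_detections=10,
                                well_known=WELL_KNOWN_ENGINES):
    """verdicts: dict of engine name -> True if that engine flagged the app."""
    flagged = {engine for engine, is_malicious in verdicts.items() if is_malicious}
    # Rule: at least 10 detections, and (assumed) at least one well-known engine.
    return len(flagged) >= min_detections and bool(flagged & well_known)


# Usage: 9 minor engines plus one well-known engine gives 10 detections.
report = {f"engine_{i}": True for i in range(9)}
report["Kaspersky"] = True
```

With this report the predicate is true; removing the Kaspersky verdict leaves 9 detections and no well-known engine, so the predicate becomes false.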
The test results of the multidimensional-similarity false mobile application detection algorithm provided by the invention on the authoritative data sets and in the real environment are shown in Figure 4 and Figure 5, respectively. The accuracy exceeds 99% on both authoritative data sets and reaches 97.43% in the real environment. Using this method, the invention found that some well-known applications (ranked by application market download volume) still have false versions in current mainstream application markets. Under the control of the joint strategy, the invention balances detection accuracy against time consumption and can serve as an effective supplement (for false application detection) to application market security policies.

Claims (10)

1. A false mobile application detection method based on multi-dimensional similarity is characterized by comprising the following steps:
Step 1: filtering each mobile application in the mobile application set using the signature of the mobile application and the characteristics of public libraries, to eliminate interference factors;
Step 2: recording the mobile application set filtered in step 1 as Set0, converting each mobile application in Set0 as a whole into a content-sensitive hash value so that similar content is mapped to similar hash values, then calculating mobile application similarity using the header index of the hash value, and dividing the mobile application set Set0 to be processed into a normal mobile application set and a suspicious false mobile application set according to a judgment condition;
Step 3: judging the detection result of step 2 using the termination condition in the joint strategy; if the condition is met, terminating execution and skipping to step 7; otherwise, executing step 4;
the joint strategy calculation formula is as follows:
Figure FDA0002867285090000011
in the formula, Num(S_x) represents the number of mobile applications judged normal by the current algorithm, and Num(NS_x) represents the number of mobile applications judged suspicious by the current algorithm;
Step 4: recording the mobile application set judged normal in step 2 as Set1, analyzing Set1 by calculating the similarity of the mobile applications in Set1 from the features of their resource files using a Minhash algorithm, and dividing the mobile application set Set1 into a normal application set and a suspicious false application set according to a judgment condition;
Step 5: judging the detection result of step 4 using the termination condition in the joint strategy; if the condition is met, terminating execution and skipping to step 7; otherwise, executing step 6;
Step 6: recording the mobile application set judged normal in step 4 as Set2, analyzing Set2 by extracting the function call graph of the mobile application codes in Set2, extracting Motif structures from the function call graph, and calculating the similarity of the mobile applications using a Motif structure comparison algorithm, and dividing the mobile application set Set2 into a normal application set and a suspicious false application set according to a judgment condition;
Step 7: for the mobile applications judged suspicious by the multidimensional similarity algorithm, verifying whether malicious behaviors are contained using a virus scanning engine;
Step 8: storing the features of the mobile applications judged false in steps 2, 4 and 6 into a database to form a false mobile application feature library;
Step 9: to judge whether a single mobile application is false, extracting its features using the methods of steps 2, 4 and 6 and comparing them for similarity with the false mobile application feature library of step 8; if the features of the mobile application to be judged are in the feature library, judging it to be a false mobile application; otherwise, the mobile application is not false; for mobile applications judged false, further performing the maliciousness judgment of step 7.
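The staged flow of steps 2 to 6 in claim 1 can be sketched in Python as a cascade in which each stage re-examines only the applications still judged normal, and a termination check runs after each stage. This is a simplified illustration, not the claimed method: the per-application scoring functions stand in for the pairwise similarity computations, the termination check is passed in as a callable because its formula is not reproduced in this text, and all names are hypothetical. The "score less than or equal to threshold means suspicious" convention follows claims 6, 8 and 10.

```python
def run_cascade(apps, stages, should_stop):
    """Apply analysis stages in order, cheapest first.

    apps: iterable of application records.
    stages: list of (score_fn, threshold); score <= threshold marks an app suspicious.
    should_stop: callable(num_normal, num_suspicious) -> bool, checked after each stage.
    Returns (normal_apps, suspicious_apps).
    """
    normal, suspicious = list(apps), []
    for score_fn, threshold in stages:
        still_normal = []
        for app in normal:
            if score_fn(app) <= threshold:
                suspicious.append(app)
            else:
                still_normal.append(app)
        normal = still_normal
        if should_stop(len(normal), len(suspicious)):
            break  # skip the remaining, more expensive stages
    return normal, suspicious


# Usage with precomputed, made-up scores for two stages.
demo_apps = [
    {"name": "app1", "hash_score": 10, "resource_score": 0.9},
    {"name": "app2", "hash_score": 80, "resource_score": 0.2},
    {"name": "app3", "hash_score": 90, "resource_score": 0.9},
]
demo_stages = [
    (lambda app: app["hash_score"], 50),
    (lambda app: app["resource_score"], 0.5),
]
```

The design point is the ordering: the header-index comparison is cheap, so it runs on every application, while the more expensive resource and code analyses run only on what remains.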
2. The false mobile application detection method based on multidimensional similarity as claimed in claim 1, wherein the mobile application similarity algorithm based on content sensitive hash value in step 2 is:
1) constructing the header of the content-sensitive hash value: the header has a total length of 5 bytes and consists of a cyclic redundancy check code, the length of the mobile application file, and the content distribution coefficient of the mobile application;
the length of the mobile application file is normalized by first taking the logarithm of the original file length and then taking the result modulo 256, which gives the normalized file length; and the content distribution coefficient is calculated as in formula (1):
Figure FDA0002867285090000021
the parameters q1, q2 and q3 in formula (1) are derived from classification statistics of the Pearson hash values of fixed-length strings in the mobile application, wherein q2 is the median of the overall distribution, q1 is the lower quartile of the overall distribution, and q3 is the upper quartile of the overall distribution;
2) calculating the similarity of the content-sensitive hash header index: let any two mobile applications to be compared be denoted by the symbols g and h, respectively; the similarity based on the content-sensitive hash header index is calculated as in formula (2):
Figure FDA0002867285090000031
wherein cdiff is the comparison result of the cyclic redundancy check codes: its value is 0 if the two codes are the same and 1 if they are different;
ldiff,
Figure FDA0002867285090000032
and
Figure FDA0002867285090000033
are calculated from formula (3):
γ(g,h,r)=Min((g-h)mod r,(h-g)mod r) (3)
wherein r is the length of the circular queue and takes the value of 16 or 256, and γ (g, h, r) represents the distance difference between g and h in the circular queue with the size of r;
when calculating ldiff, r in formula (3) takes 256; if the result γ(g, h, r) > 1, then ldiff = γ(g, h, r) × 12, otherwise ldiff = γ(g, h, r);
when calculating
Figure FDA0002867285090000034
in formula (3), r takes 16, and if the result γ(g, h, r) > 1, then
Figure FDA0002867285090000035
Or
Figure FDA0002867285090000036
Otherwise
Figure FDA0002867285090000037
Or
Figure FDA0002867285090000038
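Formula (3) of claim 2 is given in full in the text and translates directly into code. The Python sketch below implements γ(g, h, r) and the ldiff rule for r = 256. The function names are hypothetical, and reading the ldiff rule as "γ × 12 when γ > 1" is an interpretation of the claim wording; the corresponding rules for the two quantities computed with r = 16 are given only as figures and are not sketched.

```python
def circular_distance(g, h, r):
    """Formula (3): distance between g and h on a circular queue of size r."""
    return min((g - h) % r, (h - g) % r)


def length_diff(len_g, len_h):
    """ldiff for two normalized file lengths (r = 256).

    Assumed reading: a distance above 1 is penalized by a factor of 12.
    """
    d = circular_distance(len_g, len_h, 256)
    return d * 12 if d > 1 else d
```

Because the queue wraps around, `circular_distance(250, 3, 256)` is 9 rather than 247, and `circular_distance(0, 15, 16)` is 1.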
3. The method for detecting false mobile applications based on multidimensional similarity as claimed in claim 1, wherein the Minhash algorithm is adopted in step 4 as follows:
1) extracting the resource file information of the mobile application by means of decompilation and vectorizing its features, namely expressing the resource features of a single mobile application as S = {a_1, a_2, …, a_m};
2) representing the resource features of each mobile application in the set to be compared by means of step 1); if there are n mobile applications in the set to be compared, the set is represented as D = {S_1, S_2, ..., S_n}; all mobile application feature vectors to be compared are normalized, and the resource feature vectors of the set are then uniformly represented as U = (S_1 ∪ S_2 ∪ ... ∪ S_n); the length of U is defined as x_u = Len(U), so that any element a_i in U belongs to some element of D; and the feature vectors are converted into the corresponding feature matrix:
Figure FDA0002867285090000041
wherein the row elements are
Figure FDA0002867285090000042
the column elements are S_1, S_2, ..., S_n, and each element of the matrix takes the value 0 or 1 according to the following rule: if the row element of row i belongs to S_j, the element in row i and column j is 1; otherwise it is 0;
3) constructing a feature signature matrix equivalent to the feature matrix, wherein the number of rows of the feature signature matrix is a fixed value k, k is the median of all resource feature vector lengths, and the feature signature matrix is calculated as in formula (4);
Figure FDA0002867285090000043
wherein h_i(u) is a set of independent random functions, defined as in formula (5)
Figure FDA0002867285090000044
In formula (5), a and b are both random numbers in the range [1, c), and c is a Mersenne prime whose size is chosen according to the size of the random function set;
4) calculating the similarity between different columns of the feature signature matrix using the standardized Euclidean distance, which is equivalent to the similarity of the mobile application resources, as in formula (6);
Figure FDA0002867285090000045
4. The method for detecting false mobile applications based on multidimensional similarity as claimed in claim 1, wherein the Motif structure algorithm based on the similarity of code structure features in step 6 is as follows:
1) decompiling each mobile application to be processed and obtaining the function call graph G = (V, E) of its code, wherein V represents the vertices, namely the different functions, and E represents the edges, namely the call relations between the different functions;
2) for the function call graph obtained in step 1), taking three-vertex Motif structures as the basic structure and extracting the corresponding Motif structures, including the structure types and the corresponding frequency distribution, to form the feature representation {Motif type, Motif frequency distribution} of the mobile application code, wherein the Motif type is denoted MT and the Motif frequency distribution is denoted MFD;
3) then calculating the similarity of the Motif types using the Jaccard similarity, as in formula (7)
Figure FDA0002867285090000051
If the result is close to 1, further calculating the similarity of the frequency distributions, as in formula (8);
Figure FDA0002867285090000052
4) calculating the similarity of the mobile application codes by using a formula (9)
Figure FDA0002867285090000053
5. The multi-dimensional similarity-based false mobile application detection method according to claim 1, wherein in step 7, malicious behavior determination is performed by using VirusTotal maintained by Google.
6. The method for detecting false mobile application based on multi-dimensional similarity as claimed in claim 1, wherein the judgment condition in step 2 is: if the similarity calculation result is less than or equal to 50, the mobile application is judged to be a suspicious false mobile application; otherwise, it is judged to be a normal mobile application.
7. The method according to claim 1, wherein in step 3, the condition is satisfied if the joint strategy calculation result is greater than 98%.
8. The method for detecting false mobile application based on multidimensional similarity as claimed in claim 1, wherein the judgment condition in step 4 is: if the similarity calculation result is less than or equal to 0.5, the mobile application is judged to be a suspicious false mobile application; otherwise, it is judged to be a normal mobile application.
9. The method according to claim 1, wherein in step 5, the condition is satisfied if the joint strategy calculation result is greater than 98%.
10. The method for detecting false mobile application based on multidimensional similarity as claimed in claim 1, wherein the judgment condition in step 6 is: if the similarity calculation result is less than or equal to 1.0, the mobile application is judged to be a suspicious false mobile application; otherwise, it is judged to be a normal mobile application.
CN201910835333.9A 2019-09-05 2019-09-05 False mobile application detection method based on multi-dimensional similarity Active CN110647747B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910835333.9A CN110647747B (en) 2019-09-05 2019-09-05 False mobile application detection method based on multi-dimensional similarity

Publications (2)

Publication Number Publication Date
CN110647747A CN110647747A (en) 2020-01-03
CN110647747B true CN110647747B (en) 2021-02-09

Family

ID=69010123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910835333.9A Active CN110647747B (en) 2019-09-05 2019-09-05 False mobile application detection method based on multi-dimensional similarity

Country Status (1)

Country Link
CN (1) CN110647747B (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant