CN111417121B

CN111417121B - Multi-malware hybrid detection method, system and device with privacy protection function

Info

Publication number: CN111417121B
Application number: CN202010097900.8A
Authority: CN
Inventors: 王静雯; 闫峥; 于熙洵; 彭立; 魏文涛
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2020-02-17
Filing date: 2020-02-17
Publication date: 2022-04-12
Anticipated expiration: 2040-02-17
Also published as: CN111417121A

Abstract

The invention belongs to the technical field of malware detection and discloses a privacy-preserving hybrid detection method, system and device for multiple malware. A third party generates a public/private key pair using a homomorphic encryption algorithm and publishes the public key. The client collects behavior data on how the user group uses software, performs a preliminary calculation, blinds the result with a self-generated random number, encrypts it with the third party's public key, and uploads it to the server. The server applies a reputation evaluation algorithm to the encrypted data uploaded by the user group, uses the homomorphic property and interactive decryption with the third party to calculate the reputation value of each piece of software, and determines the detection order from those values. During detection, the server retrieves from the client, in that order, the API usage frequency data obtained by decompiling each software APK and statically detects the software using a static learning model. If the result is non-malicious, the server performs real-time detection using the homomorphic property and a dynamic learning model, based on the encrypted system-call data collected by the client and the client's public key.

Description

Multi-malware hybrid detection method, system and device with privacy protection function

Technical Field

The invention belongs to the technical field of malicious software detection, and particularly relates to a multi-malicious software hybrid detection method, system and device with privacy protection.

Background

Currently, the closest prior art is as follows. Mobile malware is application software that runs on a mobile terminal with mobile communication functions, such as a smartphone, and exhibits malicious behaviors such as eavesdropping on user calls, stealing user information, destroying user data, using payment services without permission, sending spam, pushing advertisements or fraudulent information, interfering with the operation of the mobile terminal, and damaging Internet security. With the widespread adoption of smartphones and the rapid growth of the mobile application industry, mobile malware has flooded the application market. It often introduces security vulnerabilities into mobile devices, causing economic loss, privacy disclosure and other problems for users. Establishing mechanisms that discover and block malware to protect users from attack has therefore become a focus for many security researchers. Because Android is currently the most widely used mobile system, this work focuses on malware detection for Android. In practice, different software has different popularity, so the scope of impact and the harm caused by malicious behavior also differ. An optimized scheme should therefore introduce a privacy-preserving mechanism for evaluating software influence and detect widely used software first, so that threats are discovered and defended against early and serious damage is reduced. At the same time, malware detection is computationally expensive, so the related services are often provided by a cloud server, and the data that users upload to the cloud server may imply private information, which raises privacy concerns.

Android malware detection schemes are roughly divided into three categories: static detection, dynamic detection, and hybrid detection. The existing scheme is briefly described as follows:

(1) Static detection identifies malware by looking for malicious features and malicious code segments without executing the application. Existing schemes usually decompile the application's APK file to obtain information such as permissions and code, and search it for malicious features. Sujithra and Padmavathi [1] propose an Android malware detection scheme that applies machine-learning classification and optimization methods to permission information. The scheme decompiles the APK file of an Android application, takes the permission-related information in the configuration file as features, performs feature selection, and trains a classifier with a machine-learning classification algorithm; the classifier separates normal software from malware and is then used for detection. However, if the APK file was produced with obfuscation techniques, the method may not obtain the correct configuration information, and detection fails. Kapse et al. [2] decompile the application to obtain permission, component and API-call information from the configuration file. Weights are assigned according to the malicious behavior of malware, with the permissions and APIs most common in malware receiving the largest weights. A weight threshold is determined by analyzing malware and normal software, and software is judged malicious or not against this threshold. The scheme can withstand obfuscation attacks by malware but cannot detect application behavior that is defined at runtime. Arp et al. [3] propose a static analysis scheme that draws a variety of features from configuration files and software code. The scheme decompiles the APK file to obtain permissions, API calls, network addresses and similar information from the configuration file and the code, maps this information into a vector space, and trains a detection model with machine learning. However, it cannot detect malware that uses obfuscation or dynamic code loading. The API-level static detection method [4] decompiles the APK files of malware and normal software samples to obtain the APIs, packages and parameter information in the code as features, and then trains a classifier with the KNN algorithm to detect whether software is malicious. Because it is based on KNN, its computational cost is high. Wu et al. [5] designed the static detection system DroidMat, which decompiles the APK file to extract information such as requested permissions and intents from the manifest file, obtains the API calls of each component, and uses the K-means algorithm for enhanced modeling. The KNN algorithm is then used to classify the software. This approach likewise fails to detect dynamically loaded malicious code. In summary, static detection is simple and efficient in the data extraction stage, but it is easily deceived by obfuscation and cannot detect application behavior defined at runtime.

(2) Dynamic detection executes the application in an isolated environment to capture its dynamic behavior and detects malware from that behavior. Burguera et al. [6] propose a scheme that analyzes the dynamic behavior of software by collecting system calls from multiple real users through crowdsourcing and a central server. The Linux system calls of an application are collected and sent to the central server, which detects the corresponding software with a machine-learning clustering algorithm. However, the system calls collected from users imply how they use the software; this is behavioral privacy information, and the scheme does not protect it. Shabtai et al. [7] propose a behavior-based Android malware detection system. The system acquires data by continuously monitoring device status information such as power consumption and CPU usage, and then distinguishes normal software from malware with machine-learning algorithms. However, the scheme remains theoretical, and no real data set was used for testing. Zhao et al. [8] designed AntiMalDroid, an SVM-based detection framework that uses machine learning for detection. The framework is divided into a training phase and a detection phase. In the training phase, software samples known to be malicious or benign are executed, their behaviors and characteristics are monitored and used as features, and the SVM algorithm trains the detection model. In the detection phase, the model obtained in training is used to detect the software under test. The scheme is time-consuming during detection. Dini et al. [9] designed a multi-layer Android malware detector that monitors the Android system at both the kernel level and the user level and uses machine-learning techniques to distinguish normal from malicious behavior. At the kernel level, it evaluates system calls, running processes, CPU usage and similar information. At the user level, it evaluates keystrokes, dialed numbers, and SMS, Bluetooth and Wi-Fi traffic sent and received. However, the scheme is complex and unsuitable for real-time detection in real scenarios. In summary, dynamic detection is more accurate than static detection but consumes a large amount of computing resources.

(3) Hybrid detection combines static detection and dynamic detection and balances the advantages and disadvantages of the two. Martinelli et al. [10] propose a hybrid detection architecture that uses data such as opcodes, text information, system calls and administrator privileges. The scheme first decompiles the software under test to obtain its opcodes and then uses a machine-learning SVM classifier to divide applications into normal and malicious. For software whose result is normal, a dynamic detection method acquires the text information, system calls and administrator permissions of the software as features, and a classifier together with a security policy detects whether the software is malicious. However, the scheme does not rank software by influence before detection so that high-influence software is detected first and the damage caused by malicious behavior is minimized. In addition, the system calls used during real-time detection imply how the user uses the application, which involves privacy, so privacy protection needs to be considered. Yuan et al. [11] propose a method that extracts features from static and dynamic analysis and performs detection with deep learning. It uses permissions, sensitive APIs and dynamic behavior as features: permission and sensitive-API information is extracted by decompiling the application's APK file, and dynamic behavior information is extracted in a sandbox, a secure and isolated execution environment. Deep learning is then used as the detection algorithm. The scheme has no real-time detection process, and the deep learning model is relatively complex. Bläsing et al. [12] propose AASandbox, which tests software through a combination of static and dynamic analysis. The static analysis part works on the decompiled dex file. Dynamic analysis is then completed by executing the application in an isolated sandbox and analyzing its interaction with the underlying system. However, this scheme does not detect applications while they run in real time, nor does it consider priority ordering or privacy protection. In summary, existing schemes consider neither ordering software for detection according to its influence nor privacy protection during detection.

In summary, the problems of the prior art are as follows:

(1) The prior art lacks a scheme that can effectively detect whether an application is malicious both before installation and use and during real-time operation.

(2) The prior art does not consider detection priority. Popular, widely used software should be checked for malicious behavior first in order to prevent damage.

(3) In the prior art, privacy protection of a user in a cloud server malicious software detection process is not considered.

The difficulty of solving the technical problems is as follows:

The difficulty lies in determining an effective scheme that detects an application both before installation and during real-time operation, determining a detection priority order that reduces the damage caused by malicious applications, and protecting the user's privacy throughout the detection process.

The significance of solving the technical problems is as follows:

(1) Detecting the application both before installation and during real-time operation ensures that detection is complete and effective and prevents damage from malicious applications.

(2) The privacy information of the user is protected in the detection process, and the privacy of the user can be prevented from being invaded.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a multi-malicious software hybrid detection method, a system and a device with privacy protection.

The invention is realized as a privacy-preserving hybrid detection method for multiple malware, comprising the following steps:

First, a third party generates a public/private key pair using a homomorphic encryption key generation algorithm and publishes the public key to all clients and the server;

Second, the client collects behavior data on the user group's use of different software, performs a simple preliminary calculation, adds a self-generated random number to the result, encrypts it with the third party's public key, and uploads the result to the server;

Third, using a reputation evaluation algorithm, the server calculates the reputation values of the different software from the encrypted data uploaded by the user group, relying on the homomorphic addition property and interactive decryption with the third party, without obtaining any client's private data; it then sorts the software by reputation value to determine the detection order;

Fourth, during detection, the server interacts with the client in that order to retrieve, for each piece of software, the API usage frequency data obtained by decompiling its APK, and statically detects the software in turn using a static learning model. For software whose static detection result is non-malicious, the server performs real-time detection using the homomorphic addition property and a dynamic learning model, based on the encrypted system-call data collected by the client and the public key that the client generates with a homomorphic encryption key generation algorithm and publishes.
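The ordering and gating logic of the third and fourth steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `detection_order` and `hybrid_detect`, the package names, and the two detector callbacks are hypothetical stand-ins for the static and real-time models, and the reputation values are assumed to have been computed already.

```python
from typing import Callable, Dict, List


def detection_order(reputation: Dict[str, float]) -> List[str]:
    """Return software names sorted by descending reputation value."""
    return sorted(reputation, key=lambda app: reputation[app], reverse=True)


def hybrid_detect(
    order: List[str],
    static_is_malicious: Callable[[str], bool],
    realtime_is_malicious: Callable[[str], bool],
) -> Dict[str, str]:
    """Run static detection first; only non-malicious software reaches real-time detection."""
    verdicts: Dict[str, str] = {}
    for app in order:
        if static_is_malicious(app):
            verdicts[app] = "malicious (static)"
        elif realtime_is_malicious(app):
            verdicts[app] = "malicious (real-time)"
        else:
            verdicts[app] = "non-malicious"
    return verdicts


# Hypothetical reputation values and detector outcomes, for illustration only.
reputation = {"app.notes": 0.2, "app.chat": 0.9, "app.game": 0.5}
order = detection_order(reputation)
verdicts = hybrid_detect(
    order,
    static_is_malicious=lambda app: app == "app.game",
    realtime_is_malicious=lambda app: app == "app.notes",
)
```

Because the static check gates the real-time check, software rejected before installation never incurs the cost of runtime monitoring.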

Further, the reputation evaluation of the method quantifies the popularity of software from how users use it and expresses that popularity as a reputation value; a semi-trusted third party is introduced, and privacy protection is achieved through homomorphic encryption.

The reputation evaluation further comprises:

First, the third party generates a public/private key pair (PK_p, SK_p) using the homomorphic encryption key generation algorithm KeyGen and publishes the public key PK_p to all clients and the server;

Second, the client collects the number of uses, the duration and the frequency of use of the software in a given time window and, according to the formulas, preliminarily calculates the recommendation reputation value s_k of the software, the aggregated reputation value, and their product;

Third, the client uses the public key PK_p provided by the third party and a self-generated random number r_k to encrypt the data, obtaining HE(s_k + r_k) and a second ciphertext, and sends the encrypted data together with r_k to the server;

Fourth, after obtaining the data from all clients, the server sums all the random numbers, uses the homomorphic property to calculate the two aggregated ciphertexts, and sends them to the third party for decryption;

Fifth, the third party decrypts the received encrypted data with its private key SK_p and sends the two decrypted values to the server;

Sixth, after receiving the data from the third party, the server subtracts the sum of the random numbers, which it already knows, to obtain the two aggregated values; it then calculates the reputation value R(i) of the application using the formula. The server thus obtains the reputation value of the software without learning the data sent by any individual client;

wherein,

S^k is calculated as follows:

wherein,

the initial value of γ is 0, and γ = γ + 1 whenever y < 0; thr = 3, δ = 0.05, and μ = 0.1.
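The blinded aggregation between client, server and third party can be sketched as follows. This is a toy illustration under several assumptions that do not come from the source: Paillier is used as the additively homomorphic scheme (the source only requires a homomorphic addition property), the second ciphertext is assumed to be the product blinded with the same r_k, all values are fixed-point integers, and the helper names are hypothetical. The key size is far too small for real use, and the sketch stops at the two sums because the formula for R(i) is not reproduced in this text.

```python
import math
import random

P, Q = 2**31 - 1, 2**61 - 1  # two Mersenne primes; toy key size, NOT secure


def keygen():
    """Third party: return (public modulus n, private key (lam, mu))."""
    n = P * Q
    lam = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
    return n, (lam, pow(lam, -1, n))  # mu = lam^-1 mod n because g = n + 1


def encrypt(n, m):
    """Paillier encryption with generator g = n + 1."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (1 + (m % n) * n) * pow(r, n, n2) % n2


def decrypt(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def client_upload(n, s_k, product_k):
    """Client: blind both values with r_k, encrypt, and send r_k alongside."""
    r_k = random.randrange(1, 2**32)
    return encrypt(n, s_k + r_k), encrypt(n, product_k + r_k), r_k


def server_aggregate(n, uploads):
    """Server: multiply ciphertexts (adds plaintexts) and sum the random numbers."""
    n2 = n * n
    c_s, c_p, r_sum = 1, 1, 0  # 1 is a trivial encryption of 0
    for enc_s, enc_p, r_k in uploads:
        c_s, c_p, r_sum = c_s * enc_s % n2, c_p * enc_p % n2, r_sum + r_k
    return c_s, c_p, r_sum


n, sk = keygen()
# Hypothetical per-client (s_k, product) pairs as fixed-point integers.
clients = [(800, 640), (500, 200), (900, 810)]
uploads = [client_upload(n, s, p) for s, p in clients]
c_s, c_p, r_sum = server_aggregate(n, uploads)
blinded_s, blinded_p = decrypt(n, sk, c_s), decrypt(n, sk, c_p)  # third party
sum_s, sum_p = blinded_s - r_sum, blinded_p - r_sum              # server unblinds
```

In this sketch the third party only ever sees sums that still contain the random numbers, and the server only ever sees ciphertexts and the unblinded totals, never an individual client's value.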

Further, the second step includes:

1) By monitoring, in a given time window, the number of times N_i(t) that the user uses the software, the usage duration UT_i(t) and the usage frequency FE_i(t), the usage behavior UB, the reflection behavior RB and the association behavior CB of the software are quantified as follows:

a) quantification of UB, expressed as the UB component of the personal trust value of the user for software i at time t, is formulated as follows:

b) the quantification of RB, expressed as the RB component of the user's personal trust value for software i at time t, is formulated as follows:

T_i(t)_RB = 2(d_t{N_i(t) + UT_i(t) + FE_i(t)});

wherein,

c) the quantification of CB, expressed as the CB component of the user's personal trust value for software i at time t, is formulated as follows:

wherein,

2) calculating the trust value T_i(t) of the user for software i from the quantized values of UB, RB and CB:

Wherein,

3) the reputation value R(i) of software i is related to the recommended trust value s_k given to software i by each user of the software and to the aggregated reputation value calculated from the users' usage experience; the calculation formula is as follows:

wherein,

Further, after an application is downloaded and before it is installed and used, the static detection of the method decompiles the APK file to obtain the dangerous-level permissions and the corresponding API information as features, and then detects malware with a machine-learning method. The detection process is as follows:

First, the server decompiles the APKs of existing normal software and known malware, obtains the number of occurrences of the system APIs corresponding to the dangerous permissions used, takes these frequencies as features, and runs a supervised machine-learning algorithm to train a classifier for static detection;

Second, the client decompiles the downloaded APK, obtains the number of occurrences in the file of the APIs corresponding to dangerous-level permissions, and uploads the data to the server for detection;

Third, after receiving the data from the client, the server detects the software with the classifier obtained by offline training, judges whether the software is malicious, and returns the result to the client.
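The three static-detection steps can be sketched as follows. This is only an illustration under stated assumptions: the list `DANGEROUS_APIS` is a small hypothetical feature set, decompilation is assumed to have already produced a list of invoked method names, the training vectors are made up, and a nearest-centroid classifier stands in for the unspecified supervised learning algorithm.

```python
from collections import Counter
from typing import Dict, List, Sequence

# Hypothetical feature set: APIs guarded by dangerous-level permissions.
DANGEROUS_APIS = ["sendTextMessage", "getDeviceId", "getLastKnownLocation", "startRecording"]


def api_frequencies(invoked_methods: Sequence[str]) -> List[int]:
    """Client side: count how often each dangerous API occurs in the decompiled code."""
    counts = Counter(invoked_methods)
    return [counts[api] for api in DANGEROUS_APIS]


def train_centroids(samples, labels) -> Dict[str, List[float]]:
    """Server side, offline: average the feature vectors of each class."""
    sums: Dict[str, List[float]] = {}
    totals = Counter(labels)
    for vec, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
    return {label: [v / totals[label] for v in acc] for label, acc in sums.items()}


def classify(model: Dict[str, List[float]], vec: Sequence[float]) -> str:
    """Server side, online: return the label of the nearest class centroid."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(model, key=lambda label: dist2(model[label]))


# Made-up training data: API frequency vectors of known normal and malicious samples.
train_x = [[0, 0, 1, 0], [0, 1, 2, 0], [5, 3, 0, 1], [7, 2, 1, 2]]
train_y = ["normal", "normal", "malicious", "malicious"]
model = train_centroids(train_x, train_y)

# Client uploads only the frequency vector, not the APK itself.
uploaded = api_frequencies(["sendTextMessage"] * 6 + ["getDeviceId"] * 2 + ["println"])
verdict = classify(model, uploaded)
```

Any supervised classifier could replace the centroid model; the point of the sketch is the split between offline training on the server and the lightweight frequency upload from the client.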

Further, the real-time detection of the method uses the system call sequence data produced while the software runs, and any malicious behavior found is reported to the client immediately. Because the collected system call information implies the behavioral privacy of the user, the data is protected with homomorphic encryption. The real-time detection specifically comprises the following steps:

First, preparation is divided into two stages: offline model training at the server and key generation at the client;

Offline training: the server runs the existing sample sets of normal software and malware in simulation, obtains their system call sequences, selects a feature set of sequences with a feature selection algorithm, and represents each sample as a feature vector over that feature set; it trains the model with the SVM algorithm and obtains the values ω and b of the decision function used for real-time detection; the decision function is formulated as:

f(x) = sign(ωx + b)

Key generation: the client generates a public/private key pair (PK_p, SK_p) using the homomorphic encryption key generation algorithm KeyGen and publishes the public key to the server;

Second, the client obtains the system call sequence produced while the software is used in a given time window, counts the number of occurrences x_i (i = 1, ..., n) of each feature in the feature set, encrypts each feature value x_i with the public key PK_p to obtain the encrypted feature vector [HE(x_1), HE(x_2), ..., HE(x_n)], and sends it to the server;

Third, after receiving the encrypted data from the client, the server encrypts the value b obtained in offline training with the public key PK_p to obtain HE(b), uses the homomorphic property to calculate HE(ωx + b) from the encrypted feature values, ω and HE(b), and then sends HE(ωx + b) to the client;

Fourth, the client decrypts the data with the private key SK_p to obtain ωx + b and determines from the decision function whether the software is malicious.
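The encrypted evaluation of ωx + b can be sketched as follows. The sketch rests on assumptions that are not in the source: Paillier is chosen as the additively homomorphic scheme, ω and b are converted to integers with a fixed-point factor `SCALE`, the feature set and trained values are invented, a positive score is taken to mean malicious, and all helper names are hypothetical. The key size is a toy value and offers no security.

```python
import math
import random
from collections import Counter

P, Q = 2**31 - 1, 2**61 - 1  # two Mersenne primes; toy key size, NOT secure
SCALE = 1000                 # fixed-point factor for omega and b (assumption)
FEATURES = ["open", "read", "sendto", "ptrace"]   # hypothetical feature set
OMEGA, B = [-0.2, -0.1, 0.5, 1.5], -1.0           # hypothetical trained SVM values


def keygen():
    """Client: return (public modulus n, private key (lam, mu))."""
    n = P * Q
    lam = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
    return n, (lam, pow(lam, -1, n))


def encrypt(n, m):
    """Paillier encryption with g = n + 1; negative m is encoded mod n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (1 + (m % n) * n) * pow(r, n, n2) % n2


def decrypt_signed(n, sk, c):
    """Decrypt and map the upper half of the plaintext range to negative values."""
    lam, mu = sk
    m = (pow(c, lam, n * n) - 1) // n * mu % n
    return m - n if m > n // 2 else m


def client_encrypt_trace(n, syscalls):
    """Client: count feature occurrences in the trace and encrypt each count."""
    counts = Counter(syscalls)
    return [encrypt(n, counts[name]) for name in FEATURES]


def server_score(n, enc_x, omega, b):
    """Server: compute HE(omega . x + b) without decrypting any x_i."""
    n2 = n * n
    acc = encrypt(n, round(b * SCALE))  # HE(b)
    for c, w in zip(enc_x, omega):
        # Raising a ciphertext to w multiplies its plaintext by w;
        # multiplying ciphertexts adds their plaintexts.
        acc = acc * pow(c, round(w * SCALE) % n, n2) % n2
    return acc


n, sk = keygen()
trace = ["open", "read", "read", "sendto", "sendto", "sendto", "ptrace"]
enc_score = server_score(n, client_encrypt_trace(n, trace), OMEGA, B)
scaled_score = decrypt_signed(n, sk, enc_score)  # client side
is_malicious = scaled_score > 0                  # sign of omega . x + b
```

Here the server never sees the system call counts, and the client only learns the scalar score, not ω itself.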

Another object of the present invention is to provide a program storage medium for receiving user input, the stored computer program causing an electronic device to execute steps comprising:

First, the client collects behavior data on the user group's use of different software, performs a simple preliminary calculation, and uploads the result to the server;

Second, using a reputation evaluation algorithm, the server calculates the reputation values of the different software from the uploaded data and sorts the software by reputation value to determine the detection order;

Third, during detection, the server interacts with the client in that order to retrieve, for each piece of software, the API usage frequency data obtained by decompiling its APK, and statically detects the software in turn using a static learning model; for software whose static detection result is non-malicious, the server performs real-time detection with a dynamic learning model based on the system-call-related data collected by the client.

Another object of the present invention is to provide a computer program product stored on a computer readable medium, comprising a computer readable program for providing a user input interface to implement the multi-malware hybrid detection method when executed on an electronic device.

Another object of the present invention is to provide a multi-malware hybrid detection system implementing the multi-malware hybrid detection method, including:

a reputation evaluation module, used for quantifying the popularity and influence of software, determining the detection order, and detecting software with high reputation values first;

a static detection module, used for decompiling and detecting the APK with the static detection method after the software is downloaded; if the software is malicious, the user is advised not to install it and it is deleted directly, and the remaining software detected as normal is installed;

a real-time detection module, used for monitoring the system call sequence of the software within a specified time window while the user uses it, collecting the data and detecting it in real time; if malware is detected, the user is notified.

Another object of the present invention is to provide a multi-malware hybrid detection apparatus equipped with the multi-malware hybrid detection system, the multi-malware hybrid detection apparatus including:

a client, installed on all user devices and responsible for collecting each device's data for priority evaluation and malware detection and sending it to the server;

a server, which calculates the reputation values of all software from the priority-evaluation data uploaded by the clients and evaluates the priority of the software, and which judges whether software is malicious from the uploaded detection data and returns the result to the client;

a third-party module, used for assisting the interactive computation between the client and the server and realizing privacy protection for the client during priority evaluation and malware detection.

In summary, the advantages and positive effects of the invention are as follows. The prior art lacks a scheme that can effectively detect whether software is malicious both before installation and use and during real-time operation in a real application scenario. Existing malware detection schemes fall into three broad categories: static detection, dynamic detection and hybrid detection. Static detection acquires its data conveniently and quickly but is vulnerable to obfuscation, and it cannot detect application behavior defined at runtime. Dynamic detection makes up for these shortcomings. However, it generally runs the program in an isolated environment to acquire data, so it is not real-time detection in the user's real usage scenario. Hybrid detection integrates static and dynamic detection and balances their advantages and disadvantages, but it still does not support real-time detection in a real scenario. The invention therefore provides a scheme that performs both static detection and real-time detection effectively in a real usage scenario.

Existing malware detection schemes do not take into account that software differs in priority in real scenarios. Different software has different degrees of influence: high-influence software is popular, widely downloaded and heavily used, so any malicious behavior on its part does great harm to users. Malware detection should therefore check high-influence software first in order to minimize the harm to users.

Existing malware detection schemes do not consider the privacy of user data. In practice, a user relies on a cloud service or other third-party service to aggregate and evaluate the maliciousness of various software and to outsource the malware detection computation. When interacting with the third-party service, however, the user uploads data that implies private information, which may lead to privacy violations. The invention therefore provides a privacy-preserving malware detection scheme, which relieves the user's concerns and prevents the user's privacy from being violated.

The invention provides a detection system that can check an application for malicious behavior both before installation and use and during real-time operation. Detection is prioritized: the influence of each piece of software is evaluated before detection, and high-influence software is detected first, which optimizes the overall detection process for multiple malware. Privacy protection is added to protect data that may reveal user privacy during both priority ranking and real-time detection.

Compared with the prior art, the invention has the following advantages:

(1) More comprehensive detection: before installation, some malware can be discovered by decompiling the APK file, but some applications load code while running in order to carry out malicious behavior. To address this, the invention performs real-time detection on the system call sequence acquired while the software runs, which makes detection more comprehensive.

(2) Reduced damage: the detection process can be time-consuming. The invention evaluates the influence of software before detection and detects high-influence software first, so that a problematic application is found and blocked early. This ordering reduces the damage caused by malicious activity.

(3) Privacy protection: with the continued development of technologies such as data mining, leaked data that implies user privacy can easily lead to privacy violations. The invention introduces a privacy protection algorithm to protect user data and user privacy.

(4) Effectiveness: detecting malware from the system API data obtained by decompiling the APK and detecting it from the acquired system call sequence are both effective malware detection schemes.

(5) Flexibility: the detection system can detect software both before installation and while it is in use, which reduces the risk that users run malware. In addition, pre-installation detection filters out part of the malware, which reduces the amount of software subject to real-time detection and thus the system overhead of runtime malware detection.

Table 3: Comparative analysis of existing work and the present invention

Note: × indicates not mentioned, and √ indicates solved.

[1] M. Sujithra and G. Padmavathi, “Enhanced Permission Based Malware Detection in Mobile Devices Using Optimized Random Forest Classifier with PSO-GA,” Research Journal of Applied Sciences, Engineering and Technology, vol. 12, no. 7, pp. 732-741, 2016.

[2] G. Kapse and A. Gupta, “Detection of Malware on Android based on Application Features,” International Journal of Computer Science and Information Technologies, vol. 6, no. 4, pp. 3561-3564, 2015.

[3] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, and K. Rieck, “Drebin: Effective and explainable detection of android malware in your pocket,” in Network and Distributed System Security Symposium, 2014, vol. 14, pp. 23-26.

[4] Y. Aafer, W. Du, and H. Yin, “DroidAPIMiner: Mining API-level features for robust malware detection in android,” in Security and Privacy in Communication Networks - 9th International ICST Conference, SecureComm 2013, Revised Selected Papers, 2013, vol. 127, pp. 86-103.

[5] D. Wu, C. Mao, T. Wei, H. Lee, and K. Wu, “Droidmat: Android malware detection through manifest and api calls tracing,” in 2012 Seventh Asia Joint Conference on Information Security, Tokyo, 2012, pp. 62-69.

[6] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, “Crowdroid: behavior-based malware detection system for Android,” in Proceedings of the 1st ACM workshop on Security and privacy in smartphones and mobile devices, 2011, pp. 15-26.

[7] A. Shabtai, U. Kanonov, Y. Elovici, C. Glezer, and Y. Weiss, “Andromaly: a behavioral malware detection framework for android devices,” Journal of Intelligent Information Systems, vol. 38, no. 1, pp. 161-190, 2012.

[8] M. Zhao, F. Ge, T. Zhang, and Z. Yuan, “AntiMalDroid: An efficient SVM-based malware detection framework for android,” Communications in Computer and Information Science, vol. 243, pp. 158-166, 2011.

[9] G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra, “MADAM: a multi-level anomaly detector for android malware,” in International Conference on Mathematical Methods, Models and Architectures for Computer Network Security, 2012, pp. 240-253.

[10] F. Martinelli, F. Mercaldo, and A. Saracino, “BRIDEMAID: a hybrid tool for accurate detection of Android malware,” in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, 2017, pp. 899-901.

[11] Z. Yuan, Y. Lu, Z. Wang, and Y. Xue, “Droid-sec: deep learning in android malware detection,” ACM SIGCOMM Computer Communication Review, vol. 44, no. 4, pp. 371-372, 2014.

[12] T. Bläsing, L. Batyuk, A. D. Schmidt, S. A. Camtepe, and S. Albayrak, “An android application sandbox system for suspicious software detection,” in Proceedings of the 5th IEEE International Conference Malicious Unwanted Software, 2010, pp. 55-62.

[13] Z. Yan, P. Zhang, and R. H. Deng, “TruBeRepec: a trust-behavior-based reputation and recommender system for mobile applications,” Personal and Ubiquitous Computing, vol. 16, no. 5, pp. 485-506, 2012.

Drawings

Fig. 1 is a flowchart of a multi-malware hybrid detection method according to an embodiment of the present invention.

Fig. 2 is a schematic structural diagram of a multi-malware hybrid detection system according to an embodiment of the present invention.

In the figure: 1. reputation evaluation module; 2. static detection module; 3. real-time detection module.

Fig. 3 is a schematic structural diagram of a multi-malware hybrid detection apparatus according to an embodiment of the present invention.

In the figure: 4. client; 5. server; 6. third-party module.

Fig. 4 is a schematic diagram of a multi-malware hybrid detection system provided in an embodiment of the present invention.

Fig. 5 is a flowchart of an implementation of a multi-malware hybrid detection method according to an embodiment of the present invention.

Fig. 6 is a flowchart of the reputation evaluation module interaction provided by an embodiment of the present invention.

Fig. 7 is a flowchart of real-time detection interaction provided by an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In view of the problems in the prior art, the present invention provides a multi-malware hybrid detection method, system, and apparatus with privacy protection, that is, a privacy-enhanced multi-malware hybrid detection system with priority evaluation. The invention is described in detail below with reference to the accompanying drawings.

As shown in fig. 1, a multi-malware hybrid detection method provided in an embodiment of the present invention includes the following steps:

S101: A third party generates a public/private key pair according to a homomorphic encryption key generation algorithm and publishes the public key to all clients and the server.

S102: The client collects behavior data on the user group's use of different software, performs a simple preliminary calculation, masks the data with a self-generated random number, encrypts it with the third party's public key, and uploads the result together with the random number to the server.

S103: The server applies the reputation evaluation algorithm to the encrypted data uploaded by the user group. Using the additive homomorphic property and interactive decryption with the third party, it computes the reputation values of the different software without obtaining the clients' private information, sorts the software by reputation value, and determines the detection order.

S104: During detection, the server interacts with the client in that order to retrieve, for each piece of software, the API usage frequency data obtained by decompiling its APK, and performs static detection on each piece of software in turn using the static learning model. For software whose static detection result is non-malicious, real-time detection is performed using the additive homomorphic property and the dynamic learning model, based on the encrypted system-call data collected by the client and the public key that the client generates with the homomorphic encryption key generation algorithm and publishes.

As shown in fig. 2, the multi-malware hybrid detection system provided in the embodiment of the present invention includes:

A reputation evaluation module 1, used for quantifying the popularity and influence of software, determining the detection order of the software, and preferentially detecting software with a high reputation value.

A static detection module 2, used for decompiling and examining the APK with a static detection method after the software is downloaded; if the software is malicious, the user is advised not to install it and to delete it directly, and the remaining software detected as normal is installed.

A real-time detection module 3, used for monitoring the system call sequence of the software within a specified time window while the user uses it, acquiring the data, and performing real-time detection; if malicious software is detected, feedback is given to the user.

As shown in Fig. 3, the multi-malware hybrid detection apparatus provided by the embodiment of the present invention includes:

a client 4, installed on every user device, which collects the data used for priority evaluation and malware detection on that device and sends it to the server;

a server 5, which calculates the reputation value of each piece of software from the priority-evaluation data uploaded by the clients and evaluates the priority of each, and which judges whether software is malicious from the uploaded detection data and returns the result to the client;

a third-party module 6, which assists the interactive computation between the client and the server, so that client privacy is protected during priority evaluation and malware detection.

The technical solution of the present invention is further described below with reference to the accompanying drawings.

Table 1: Abbreviations

Table 2: Symbols and definitions

1. Background knowledge

The invention relies on reputation evaluation, supervised learning, and homomorphic encryption, which are briefly introduced here.

1.1 Reputation evaluation

The invention uses a reputation evaluation method [13] to quantify the influence of software. Reputation is defined as the degree to which the public believes an application can complete a task as expected, and reputation evaluation quantifies it using the attributes that affect the reputation value. The scheme quantifies the reputation value of software from the public's usage of it: the higher the reputation value, the more the software is used by the public and the stronger its influence. The reputation evaluation method is described below.

T_i(t)_RB＝2(d_t{N_i(t)+UT_i(t)+FE_i(t)}) (2)

wherein,

wherein,

Wherein,

3) The reputation value $R(i)$ of software i is related to the recommendation trust value $s_k$ of each user of the software and to the aggregated reputation value calculated from the users' usage experience. The calculation formula is as follows:

wherein,

wherein,

$s_k$ is calculated as follows:

wherein,

After the above calculation, the reputation value $R(i)$ is obtained; it represents the popularity of the software and measures its influence.

1.2 Supervised learning

In the invention, the detection models are obtained with supervised learning, a machine learning technique. Supervised learning trains a function (model) from a training data set whose classes are known; the trained function can then predict the class of new, unlabeled data. Each sample in the training set consists of an input object (usually a feature vector) and an output value (the result associated with the input object, usually called a label). Depending on whether the output value is continuous or discrete, supervised learning is divided into regression and classification. The invention uses the classification method SVM (support vector machine), which is briefly described below.

Given a training data set:

$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$$

where $x_i$ is a feature vector and $y_i$ is the label of the output value, i.e. the class. Usually $y_i \in \{1, -1\}$, meaning that there are two classes: when $y_i = 1$, $x_i$ is a positive example; otherwise it is a negative example.

The goal of the learning algorithm is to find a separating hyperplane in the feature space that divides the instances into two classes. SVMs are either linear or nonlinear; the invention uses a linear SVM.

For a given training data set, the linear SVM learns the separating hyperplane by margin maximization, or equivalently by solving the corresponding convex quadratic programming problem ($\omega$ and $x$ are vectors, and $\omega \cdot x$ is their inner product):

$$\omega \cdot x + b = 0 \quad (8)$$

The corresponding classification decision function is:

$$f(x) = \operatorname{sign}(\omega \cdot x + b) \quad (9)$$
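As an illustration of how a linear SVM classifies a sample once ω and b are known, the following Python sketch evaluates ω·x + b and takes its sign. The weights, bias, and feature vectors are hypothetical values chosen for the example, and treating a score of exactly zero as the positive class is an assumed convention, not something the invention specifies.

```python
from typing import Sequence


def decision_value(w: Sequence[float], x: Sequence[float], b: float) -> float:
    """Signed score w.x + b of feature vector x relative to the hyperplane."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w: Sequence[float], x: Sequence[float], b: float) -> int:
    """Return +1 for the positive class and -1 for the negative class."""
    # Assumption: a score of exactly 0 is assigned to the positive class.
    return 1 if decision_value(w, x, b) >= 0 else -1


# Hypothetical model parameters (not taken from the patent).
w = [0.5, -1.0, 0.25]
b = -0.5

score_a = decision_value(w, [4.0, 1.0, 2.0], b)
score_b = decision_value(w, [0.0, 2.0, 0.0], b)
label_a = classify(w, [4.0, 1.0, 2.0], b)
label_b = classify(w, [0.0, 2.0, 0.0], b)
print(score_a, label_a)
print(score_b, label_b)
```

Only the inner product and one addition are needed at detection time, which is what later allows the same score to be computed over additively homomorphic ciphertexts.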

1.3 Homomorphic encryption

Homomorphic encryption is a class of encryption algorithms that support computation on ciphertexts: decrypting the result of a computation on ciphertexts gives the same result as performing the computation directly on the plaintexts. With this property, data processing can be delegated to a third party without revealing the data, which achieves privacy protection. The privacy-related data in the invention are therefore protected with the homomorphic encryption algorithm Paillier. The Paillier cryptosystem is as follows:

Key generation (KeyGen):

1) Randomly select two large primes $p$ and $q$ that satisfy $\gcd(pq, (p-1)(q-1)) = 1$.

2) Compute $n = pq$ and $\lambda = \operatorname{lcm}(p-1, q-1)$.

3) Select a random integer $g \in \mathbb{Z}_{n^2}^*$ such that $\mu = (L(g^{\lambda} \bmod n^2))^{-1} \bmod n$ exists, where $L(u) = (u-1)/n$.

The public key is $(n, g)$ and the private key is $(\lambda, \mu)$.

Encryption (Enc): for a message $m$ to be encrypted, randomly select an integer $r$ with $0 < r < n$ and $r \in \mathbb{Z}_n^*$, that is, $\gcd(r, n) = 1$. Then encrypt $m$ with the public key $(n, g)$ as $c = g^m \cdot r^n \bmod n^2$ to obtain the ciphertext $c$.

Decryption (Dec): decrypt the ciphertext $c$ with the private key $(\lambda, \mu)$: $m = L(c^{\lambda} \bmod n^2) \cdot \mu \bmod n$.

The invention realizes data privacy protection by utilizing the additive homomorphism property of the Paillier encryption algorithm, and the property is as follows:

HE(m₁)*HE(m₂)＝HE(m₁+m₂)

HE(m)^k＝HE(m*k)
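The key generation, encryption, and decryption steps, together with the two additive properties, can be checked with a small Python sketch. It is a toy under stated assumptions: the primes 1009 and 1013 are far too small for real use, the choice g = n + 1 is a common convention rather than something the text prescribes, and the function names are hypothetical. It needs Python 3.8 or later for the modular inverse via `pow(x, -1, n)`.

```python
import math
import secrets


def L(u: int, n: int) -> int:
    """The Paillier function L(u) = (u - 1) / n (exact integer division)."""
    return (u - 1) // n


def keygen(p: int, q: int):
    """Build a key pair ((n, g), (lam, mu)) from two primes p and q."""
    n = p * q
    if math.gcd(n, (p - 1) * (q - 1)) != 1:
        raise ValueError("p and q must satisfy gcd(pq, (p-1)(q-1)) = 1")
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1  # assumed choice of g; it always lies in Z*_{n^2}
    mu = pow(L(pow(g, lam, n * n), n), -1, n)
    return (n, g), (lam, mu)


def encrypt(pk, m: int) -> int:
    """c = g^m * r^n mod n^2 with a fresh random r coprime to n."""
    n, g = pk
    while True:
        r = secrets.randbelow(n - 1) + 1  # 0 < r < n
        if math.gcd(r, n) == 1:
            break
    return pow(g, m % n, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c: int) -> int:
    """m = L(c^lam mod n^2) * mu mod n."""
    n, _ = pk
    lam, mu = sk
    return L(pow(c, lam, n * n), n) * mu % n


pk, sk = keygen(1009, 1013)  # toy primes, for illustration only
n2 = pk[0] ** 2
c1, c2 = encrypt(pk, 12), encrypt(pk, 30)

sum_plain = decrypt(pk, sk, c1 * c2 % n2)      # HE(m1) * HE(m2) -> m1 + m2
scaled_plain = decrypt(pk, sk, pow(c1, 5, n2))  # HE(m)^k -> m * k
print(sum_plain, scaled_plain)
```

Because each encryption uses a fresh r, two encryptions of the same message differ as ciphertexts; the properties hold for the decrypted values.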

2. System structure and interaction process

The system comprises three entities, whose functions are shown in Fig. 4:

Client: installed on every user device; it collects the data used for priority evaluation and malware detection on that device and sends it to the server.

Server: calculates the reputation value of each piece of software from the priority-evaluation data uploaded by the clients and evaluates the priority of each; it also judges whether software is malicious from the uploaded detection data and returns the result to the client.

Third party: assists the interactive computation between the client and the server, so that client privacy is protected during priority evaluation and malware detection.

In this system, the client is installed on each user device, has the same permissions as the device, and is trusted. The server and the third party are semi-trusted: both are curious about the content of the data sent by the client and may try to snoop on the user's private information in it. Because of their respective interests, the three parties do not collude with one another, and each honestly carries out its functions and tasks. The entities communicate with each other over secure channels. Nevertheless, the data a client uploads to the server implies the user's behavioral privacy and needs protection; the semi-trusted third party is responsible for this privacy protection.

3. Scheme flow and specific design

The invention provides an Android malware hybrid detection system with priority ordering and privacy protection; the general flow is shown in Fig. 5. First, the client collects behavior data on the user group's use of different software, performs a simple preliminary calculation, and uploads the result to the server. The server then uses the reputation evaluation algorithm to calculate the reputation value of each piece of software from the uploaded data, sorts the software by reputation value, and determines the detection order. During detection, the server interacts with the client in that order to retrieve the API usage frequency data obtained by decompiling each APK and performs static detection on each piece of software in turn with the static learning model. Software whose static detection result is non-malicious then undergoes real-time detection based on the dynamic learning model and the system-call data collected by the client.

The scheme is divided into three functional modules: reputation evaluation, static detection, and real-time detection. The reputation evaluation module quantifies the popularity and influence of software and determines the detection order, so that software with a high reputation value is detected first. After software is downloaded, the static detection method decompiles and examines the APK; if the software is malicious, the user is advised not to install it and to delete it directly, and the remaining software detected as normal is installed. While the user uses the software, its system call sequence is monitored within a specified time window, and the collected data are used for real-time detection. If malicious software is detected, feedback is given to the user.
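The ordering logic of this flow can be sketched in a few lines of Python. This is only an illustration: the package names, reputation values, and the `static_is_malicious` callback are hypothetical stand-ins for the reputation evaluation and static detection modules.

```python
from typing import Callable, Dict, List, Tuple


def detection_order(reputation: Dict[str, float]) -> List[str]:
    """Software with the highest reputation value is examined first."""
    return sorted(reputation, key=reputation.get, reverse=True)


def run_pipeline(
    reputation: Dict[str, float],
    static_is_malicious: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Static detection in priority order; clean software moves on to
    real-time monitoring, malicious software is flagged for deletion."""
    flagged, to_monitor = [], []
    for app in detection_order(reputation):
        if static_is_malicious(app):
            flagged.append(app)
        else:
            to_monitor.append(app)
    return flagged, to_monitor


# Hypothetical reputation values and a stand-in static classifier.
reputation = {"app.a": 0.42, "app.b": 0.91, "app.c": 0.67}
flagged, to_monitor = run_pipeline(reputation, lambda app: app == "app.c")
print(detection_order(reputation), flagged, to_monitor)
```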

3.1 Reputation evaluation module

This module quantifies the popularity of software according to how users use it. Popularity is expressed as a reputation value: the higher the value, the more popular the software and the greater the damage if it behaves maliciously, so such software should be detected first. However, the data a user sends to the server implies how that user uses the software; it is private data and must be protected. A semi-trusted third party is therefore introduced, and privacy protection is achieved with homomorphic encryption. The interaction process of this module is shown in Fig. 6 and described in detail below:

First step (key generation): the third party generates a public/private key pair $(PK_p, SK_p)$ with the homomorphic encryption key generation algorithm KeyGen of Section 1.3 and publishes the public key $PK_p$ to all clients and the server.

Second step (data collection): the client collects the number of uses, duration, and frequency of the software within a given time window and, following formulas (1)-(5) and (7) in Section 1.1, performs a preliminary calculation of the recommendation trust value $s_k$, the aggregated reputation value

and their product

Third step (data upload): using the public key $PK_p$ provided by the third party and a self-generated random number $r_k$, the client encrypts the data to obtain $HE(s_k + r_k)$ and

and sends the encrypted data together with $r_k$ to the server.

Fourth step (preliminary calculation): after obtaining the data from all clients, the server sums all the random numbers to obtain

uses the homomorphic properties described in Section 1.3 to calculate

and

and sends these data to the third party for decryption.

Fifth step (data decryption): the third party decrypts the received encrypted data with its private key $SK_p$ to obtain the decrypted data

and

and sends them to the server.

Sixth step (final calculation): after receiving the data from the third party, the server subtracts the value it already knows

from them to obtain

and

The reputation value $R(i)$ of the application can then be calculated with equation (6) in Section 1.1. In this way the server obtains the reputation value of the software without learning the data sent by any individual client.
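The six steps can be simulated end to end for a single uploaded quantity. The Python sketch below is a toy under several assumptions: it uses tiny Paillier primes, fixes g = n + 1, converts real-valued trust values to integers with an assumed fixed-point factor `SCALE`, draws masks below 1000, and uses hypothetical function names. It shows only the masked sum of the $s_k$ values; the second uploaded quantity would be handled the same way.

```python
import math
import secrets

# Toy Paillier parameters held by the third party (illustration only).
P, Q = 1009, 1013
N, N2 = P * Q, (P * Q) ** 2
G = N + 1                                          # assumed choice of g
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private: lcm(p-1, q-1)
MU = pow((pow(G, LAM, N2) - 1) // N, -1, N)        # private

SCALE = 100  # assumed fixed-point factor, e.g. 0.37 -> 37


def he(m: int) -> int:
    """Encrypt m under the third party's public key (N, G)."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            return pow(G, m % N, N2) * pow(r, N, N2) % N2


def third_party_decrypt(c: int) -> int:
    """Fifth step: only the third party can decrypt."""
    return (pow(c, LAM, N2) - 1) // N * MU % N


def client_upload(s_k: float):
    """Third step: client k uploads HE(s_k + r_k) together with r_k."""
    r_k = secrets.randbelow(1000)
    return he(round(s_k * SCALE) + r_k), r_k


def server_aggregate(uploads):
    """Fourth step: multiply ciphertexts (adds plaintexts), sum the masks."""
    cipher_sum, mask_sum = 1, 0
    for c, r_k in uploads:
        cipher_sum = cipher_sum * c % N2
        mask_sum += r_k
    return cipher_sum, mask_sum


values = [0.37, 0.80, 0.55]  # hypothetical per-client s_k values
uploads = [client_upload(v) for v in values]
cipher_sum, mask_sum = server_aggregate(uploads)
masked_total = third_party_decrypt(cipher_sum)  # third party sees masked sum only
total = (masked_total - mask_sum) / SCALE       # sixth step: server removes masks
print(total)
```

In this sketch the third party only ever sees the sum of the masked values, and the server only ever sees ciphertexts and masks, so neither learns an individual client's value as long as the two do not collude.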

3.2 Static detection module

After an application is downloaded and before it is installed and used, this module decompiles the APK file to obtain the dangerous-level permissions and the corresponding API information as features, and then uses a machine learning method to detect malicious software. The detailed detection process is as follows:

First step (offline model training): the server decompiles the APKs of existing normal and malicious software whose classes are known, obtains the number of occurrences of the system APIs corresponding to the dangerous permissions used, and runs a supervised learning algorithm with these counts as features to train a classifier for static detection.

Second step (data collection): the client decompiles the downloaded APK, counts the occurrences of the APIs corresponding to dangerous-level permissions in the file, and uploads the data to the server for detection.

Third step (online detection): after receiving the data from the client, the server uses the classifier obtained by offline training to judge whether the software is malicious and returns the result to the client.
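The client-side feature extraction of the second step amounts to counting how often each tracked API appears in the decompiled code. A minimal Python sketch follows; the three tracked APIs are an assumed feature set chosen for illustration, and the list of call strings stands in for the output of a decompiler, which is not shown.

```python
from collections import Counter
from typing import Iterable, List

# Assumed feature set: APIs guarded by dangerous-level permissions.
TRACKED_APIS = [
    "android.telephony.SmsManager.sendTextMessage",
    "android.location.LocationManager.getLastKnownLocation",
    "android.hardware.Camera.open",
]


def api_frequency_vector(calls: Iterable[str],
                         tracked: List[str] = TRACKED_APIS) -> List[int]:
    """Count occurrences of each tracked API, in feature-set order."""
    counts = Counter(calls)
    return [counts[api] for api in tracked]


# Hypothetical API calls recovered from a decompiled APK.
decompiled_calls = [
    "android.telephony.SmsManager.sendTextMessage",
    "java.lang.String.format",
    "android.hardware.Camera.open",
    "android.telephony.SmsManager.sendTextMessage",
]

features = api_frequency_vector(decompiled_calls)
print(features)
```

The resulting vector is what the client would upload and what the offline-trained classifier would take as input.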

3.3 Real-time detection module

For software that has been installed and is in use, this module performs real-time detection on the system call sequence data generated at runtime; once malicious behavior is found, it is immediately reported to the client. Because the collected system call information implies the behavioral privacy of the user, the invention protects it with homomorphic encryption. The interaction flow of this module is shown in Fig. 7 and described in detail below:

First step (initialization phase): this step consists of two parts, key generation by the client and offline model training by the server.

Offline training: the server runs the existing sample sets of normal and malicious software in a simulated environment, obtains their system call sequences, selects a feature set from the sequences with a feature selection algorithm, and represents each sample as a feature vector over that feature set. It then trains the model with the SVM algorithm and obtains the decision function (the values of ω and b in equation (9)) for real-time detection.

Key generation: the client generates a public/private key pair $(PK_p, SK_p)$ with the homomorphic encryption key generation algorithm KeyGen of Section 1.3 and publishes the public key to the server.

Second step (real-time monitoring): the client acquires the system call sequence produced while the software is used within a given time window, counts the frequency of each feature in the feature set, $\{x_i \mid i = 1, \ldots, n\}$, encrypts each feature value $x_i$ with the public key $PK_p$ to obtain the encrypted feature vector $[HE(x_1), HE(x_2), \ldots, HE(x_n)]$, and sends it to the server.

Third step (real-time detection): after receiving the encrypted data from the client, the server encrypts the value b obtained in the offline training of the first step with the public key $PK_p$ to obtain $HE(b)$. By the homomorphic properties, $HE(\omega \cdot x + b)$ can be calculated as follows:

$$HE(\omega \cdot x + b) = HE(b) \cdot \prod_{i=1}^{n} HE(x_i)^{\omega_i}$$

The server then sends $HE(\omega \cdot x + b)$ to the client.

Fourth step (obtaining the result): the client decrypts the data with its private key $SK_p$ to obtain $\omega \cdot x + b$ and determines from equation (9) whether the software is malicious.

Through this process, real-time detection of malicious software is completed without the server obtaining any useful private data of the user.
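The encrypted scoring in the third and fourth steps can be sketched in Python as follows. Several points are assumptions made for the sketch rather than details from the invention: the Paillier primes are toy-sized, g is fixed to n + 1, the real-valued ω and b are converted to integers with a fixed-point factor `SCALE`, negative weights are handled by reducing the exponent modulo n, and decrypted residues above n/2 are read as negative numbers. The feature frequencies and model parameters are hypothetical.

```python
import math
import secrets

# Toy Paillier key pair generated by the client (illustration only).
P, Q = 1009, 1013
N, N2 = P * Q, (P * Q) ** 2
G = N + 1                                          # assumed choice of g
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private: lcm(p-1, q-1)
MU = pow((pow(G, LAM, N2) - 1) // N, -1, N)        # private
SCALE = 100  # assumed fixed-point factor for omega and b


def he(m: int) -> int:
    """Encrypt m under the client's public key (N, G)."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            return pow(G, m % N, N2) * pow(r, N, N2) % N2


def client_decrypt(c: int) -> int:
    """Decrypt and map residues above N/2 back to negative integers."""
    m = (pow(c, LAM, N2) - 1) // N * MU % N
    return m - N if m > N // 2 else m


def server_score(enc_x, omega, b):
    """Compute HE(omega.x + b) from the encrypted features alone."""
    acc = he(round(b * SCALE))                      # HE(b)
    for c, w in zip(enc_x, omega):
        # HE(x_i)^{w_i}; a negative weight becomes an exponent mod N.
        acc = acc * pow(c, round(w * SCALE) % N, N2) % N2
    return acc


x = [3, 0, 7]                        # hypothetical feature frequencies
omega, b = [0.5, -1.25, -0.3], 0.4   # hypothetical trained parameters

enc_x = [he(v) for v in x]           # second step: client encrypts each x_i
enc_score = server_score(enc_x, omega, b)        # third step: server side
score = client_decrypt(enc_score) / SCALE        # fourth step: client side
label = 1 if score >= 0 else -1
print(score, label)
```

The server never decrypts anything in this sketch; it only multiplies and exponentiates ciphertexts, and the client applies the sign test locally after decryption.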

The invention mainly comprises three parts: reputation evaluation, static detection, and real-time detection. A specific embodiment implemented in the Java language is given here. The whole implementation can use a client-server architecture. The client's tasks are carried out by Android client code, and two servers run the server code and the third-party code of the invention, respectively.

In the reputation evaluation stage, the third party's server code generates the keys according to the Paillier homomorphic encryption algorithm and also includes code that decrypts the data from the server and computes the difference. The client code collects and preprocesses the data, obtains the public key, and sends the data to the server according to the algorithm of the invention. The server code sums the client data, forwards it to the third party, and calculates the reputation value according to the algorithm of the invention.

In the static detection stage, the client code decompiles the APK, counts the system API occurrences, and sends the data to the server. The server code performs offline training and online detection according to the algorithm of the invention. The machine learning algorithm can be a supervised learning algorithm such as a support vector machine or a random forest, which can be implemented in code.

In the real-time detection stage, the client code generates a public/private key pair according to the Paillier algorithm, monitors system call data in real time, and encrypts and sends it to the server. The server code performs offline training and real-time detection.

All of the above can be implemented in code according to the algorithm details of the invention. Implementers may choose the programming language and architecture according to their needs.

It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer-executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a disk, CD- or DVD-ROM, on programmable memory such as read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field programmable gate arrays and programmable logic devices, or by software executed by various types of processors, or by a combination of hardware circuits and software, e.g., firmware.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A multi-malware hybrid detection method with privacy protection, characterized by comprising the following steps:

fourthly, during detection, the server interacts with the client in that order to retrieve, for each piece of software, the API usage frequency data obtained by decompiling its APK, and performs static detection on the software in turn according to a static learning model; for software whose static detection result is non-malicious, real-time detection is performed using the additive homomorphic property and a dynamic learning model, based on the encrypted system-call data collected by the client and the public key that the client generates according to a homomorphic encryption key generation algorithm and publishes;

the real-time detection of the multi-malware hybrid detection method uses the system call sequence data generated while the software runs, and once malicious behavior is found, it is immediately reported to the client; the collected system call information implies the behavioral privacy of the user of the software; the real-time detection specifically comprises the following steps:

secondly, the client acquires the system call sequence produced while the software is used within a given time window, counts the frequency of each feature in the feature set, $\{x_i \mid i = 1, \ldots, n\}$, encrypts each feature value $x_i$ with the public key $PK_p$ to obtain the encrypted feature vector $[HE(x_1), HE(x_2), \ldots, HE(x_n)]$, and sends it to the server;

then, the server sends $HE(\omega \cdot x + b)$ to the client;

2. The multi-malware hybrid detection method with privacy protection as claimed in claim 1, wherein reputation evaluation of the multi-malware hybrid detection method quantifies popularity of software according to usage of software by users, expressed in reputation values; a semi-trusted third party is introduced, and privacy protection is achieved by means of a homomorphic encryption technology.

3. The multi-malware hybrid detection method with privacy protection as recited in claim 2, further comprising:

and their product

and sends the encrypted data together with $r_k$ to the server;

uses the mentioned homomorphic properties to calculate

and

and sends these data to the third party for decryption;

and

and sends them to the server;

sixthly, after receiving the data from the third party, the server subtracts the value it already knows

from them to obtain

and

wherein,

$s_k$ is calculated as follows:

wherein $y = \rho - |R(i) - V_i^k(i)|$,

4. The multi-malware hybrid detection method with privacy protection as recited in claim 3, wherein the second step comprises:

T_i(t)_RB＝2(d_t{N_i(t)+UT_i(t)+FE_i(t)})；

wherein,

wherein,

Wherein,

Correlation, the calculation formula is as follows:

wherein,

5. The multi-malware hybrid detection method with privacy protection as claimed in claim 1, wherein the static detection of the multi-malware hybrid detection method is performed after an application is downloaded and before it is installed and used: the APK file is decompiled to obtain the dangerous-level permissions and the corresponding API information as features, and a machine learning method is then used to detect malicious software; the detection process is as follows:

6. A program storage medium storing a computer program for causing an electronic device to perform steps comprising:

thirdly, during detection, the server interacts with the client in that order to retrieve, for each piece of software, the API usage frequency data obtained by decompiling its APK, and performs static detection on the software in turn according to a static learning model; for software whose static detection result is non-malicious, real-time detection is performed according to a dynamic learning model and the system-call data collected by the client;

the real-time detection of the multi-malware hybrid detection method uses the system call sequence data generated while the software runs, and once malicious behavior is found, it is immediately reported to the client; the collected system call information implies the behavioral privacy of the user of the software; the real-time detection specifically comprises the following steps:

then, the server sends $HE(\omega \cdot x + b)$ to the client;

7. A computer program product stored on a computer readable medium, comprising a computer readable program for providing a user input interface to implement the multi-malware hybrid detection method of any one of claims 1-5 when executed on an electronic device.

8. A multi-malware hybrid detection system implementing the multi-malware hybrid detection method of any one of claims 1 to 5, comprising:

9. A multi-malware hybrid detection apparatus on which the multi-malware hybrid detection system with privacy protection of claim 8 is mounted, the multi-malware hybrid detection apparatus comprising: