CN113971282A

CN113971282A - AI model-based malicious application program detection method and equipment

Info

Publication number: CN113971282A
Application number: CN202010721509.0A
Authority: CN
Inventors: 潘宣辰; 郭辰; 张路
Original assignee: Wuhan Antiy Mobile Security Co ltd
Current assignee: Wuhan Antiy Mobile Security Co ltd
Priority date: 2020-07-24
Filing date: 2020-07-24
Publication date: 2022-01-25

Abstract

The embodiment of the invention provides a malicious application program detection method and equipment based on an AI model. The method comprises: analyzing a target file in an application program installation package to be detected, and extracting static information from the target file, wherein the static information comprises information of at least one of a behavior dimension, a permission dimension and a content dimension; processing the static information into a digital feature vector by means of feature transformation, wherein the digital feature vector consists of 0s and 1s; and inputting the digital feature vector into a trained AI model to obtain a maliciousness detection result for the application program to be detected, wherein the AI model is trained on input digital feature vectors and outputs the probability that the application program corresponding to a digital feature vector is a malicious application and/or a non-malicious application. The method addresses the problems of difficult rule extraction, low coverage, poor extensibility and easy bypassing in traditional malicious application detection, and offers higher accuracy and timeliness.

Description

AI model-based malicious application program detection method and equipment

Technical Field

The embodiment of the invention relates to the technical field of mobile network security, in particular to a malicious application program detection method and equipment based on an AI model.

Background

In the current era of rapid technological development, Android software has shown explosive growth. According to the 2019 review report on the global mobile application market released by App Annie, global app downloads in 2018 exceeded 194 billion, an increase of 35% over global application downloads in 2016. Unfortunately, such popularity also attracts malware developers, giving rise to rogue applications involving provisioning, bundled downloads, excessive permission acquisition, counterfeit applications, and the like. The prevalence of malicious applications gradually exposes users' personal privacy: according to a 2017 Q1 research report on the mobile phone security market, 89.6% of surveyed users had experienced personal privacy information disclosure, fraud calls, and the like. Information security has thus become a serious concern for many users.

At present, many security vendors have also invested in the field of mobile security. The basic principle of their antivirus software is to confirm intrusion behavior by matching the features of known malicious Trojans, and to defend actively by means such as firewalls and dynamic monitoring. Its drawback is that it depends on updates to the malicious feature library and is weak at learning to detect novel malicious applications.

However, new malicious applications emerge endlessly and differ in their malicious behavior. Because detection relies on a malicious feature library that is not updated in time, it is difficult to achieve the desired security protection.

Disclosure of Invention

Aiming at the problems in the prior art, the embodiment of the invention provides a malicious application detection method and equipment based on an AI model, which summarize the attack intentions and means of the Android platform malicious software on the basis of analyzing a large amount of Android platform mainstream malicious software, realize the malicious AI detection of the application through a deep learning algorithm, and provide a new direction for the malicious detection of the Android platform application.

In a first aspect, an embodiment of the present invention provides a malicious application detection method based on an AI model, including:

analyzing a target file in an application program installation package to be detected, and extracting static information from the target file, wherein the static information comprises information of at least one of a behavior dimension, a permission dimension and a content dimension;

processing the static information into a digital feature vector in a feature transformation mode, wherein the digital feature vector consists of 0 and 1;

inputting the digital feature vector into a trained AI model to obtain a malice detection result of the application program to be detected; the AI model is trained according to the input digital feature vector and outputs the probability that the application program corresponding to the digital feature vector is malicious application and/or non-malicious application.

Specifically, the behavior dimension includes behavior information of the application program during running;

the permission dimension comprises permission information required by the application program when a specific action is carried out;

the content dimension includes at least one of: the total file size of the application, the number of files contained by the application, the size of specific files in the application, the number of specific components in the application, and application reinforcement information.

Further, the processing the static information into a digital feature vector by a feature transformation manner specifically includes:

respectively carrying out feature transformation on numerical data in each dimension data based on each dimension data in the static information, and converting the numerical data into coded numbers consisting of 0 and 1 through one-hot coding;

respectively converting Boolean data in each dimension data into a digital 0 or 1;

and splicing all the converted numbers of the numerical data and the Boolean data into a digital feature vector consisting of 0 and 1 according to a preset sequence.

Further, the performing feature transformation on the numerical data in each dimension data and converting the numerical data into coded numbers consisting of 0 and 1 through one-hot coding specifically includes:

obtaining N numerical value subsections according to a first preset rule based on numerical value type data in each dimensional data, and sequencing the N numerical value subsections according to the numerical value size to obtain the sequencing order of each numerical value subsection, wherein N is an integer larger than 0;

matching each numerical data in each dimensional data with the N numerical segments according to a second preset rule to enable each numerical data to be matched with one numerical segment, and taking the sequencing order of the matched numerical segments as a feature transformation number of each numerical data;

the characteristic transformation number of each numerical type data is subjected to one-hot coding, so that a coded number consisting of 0 and 1 is obtained.

dividing the numerical value interval into N parts based on numerical value data in each dimension data by taking the maximum value and the minimum value in the numerical value data as the numerical value interval, obtaining N numerical value segments, and sequencing the N numerical value segments according to the numerical value to obtain the sequencing order of each numerical value segment, wherein N is an integer greater than 0;

for each numerical data in each dimension data, if the value of the numerical data is within the numerical range of a numerical segment S, taking the ranking order of the numerical segment S as the feature transformation number of the numerical data;

the characteristic transformation number of each numerical type data is subjected to one-hot encoding, thereby obtaining an encoded number composed of 0 and 1.
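The equal-width variant described in the three steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the 1-based ranks, the left-to-right position of the hot bit, and the rule that the maximum value falls into the last segment are all assumptions made for the example.

```python
def equal_width_rank(values, n_segments):
    """Return the 1-based segment rank of each value using N equal-width segments."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case (assumption): every value lands in the first segment.
        return [1 for _ in values]
    width = (hi - lo) / n_segments
    ranks = []
    for v in values:
        idx = int((v - lo) / width)       # 0-based segment index
        idx = min(idx, n_segments - 1)    # the maximum value goes to the last segment
        ranks.append(idx + 1)
    return ranks


def one_hot(rank, n_segments):
    """Encode a 1-based rank as a list of 0s and 1s containing exactly one 1."""
    return [1 if i == rank else 0 for i in range(1, n_segments + 1)]


values = [10, 11, 11, 15]
ranks = equal_width_rank(values, 2)          # feature transformation numbers
codes = [one_hot(r, 2) for r in ranks]       # coded numbers made of 0 and 1
```

With two segments over the interval [10, 15], the values 10, 11 and 11 share the first segment and 15 falls into the second.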

segmenting the numerical data according to the total number of the numerical data based on the numerical data in each dimension data to obtain N numerical segments, and sequencing the N numerical segments according to the numerical value to obtain the sequencing order of each numerical segment, wherein N is an integer greater than 0;

sorting the numerical data in each dimension data according to size to obtain a sorting number of each numerical data;

for each numerical data, taking the sorting order of the numerical segment corresponding to the sorting order of each numerical data as the feature transformation number of each numerical data;
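The count-based variant in the three steps above can be sketched in the same spirit. The sketch assumes that "segmenting according to the total number of data" means giving each segment roughly the same number of items; the function name and the floor-based mapping from sort position to segment are hypothetical choices.

```python
def equal_frequency_rank(values, n_segments):
    """Return the 1-based segment rank of each value, with roughly equal counts per segment."""
    # Indices of the data sorted by ascending value (the "sorting number" of each item).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, original_index in enumerate(order):
        # Sort position p out of M items maps to segment floor(p * N / M).
        ranks[original_index] = position * n_segments // len(values) + 1
    return ranks


values = [30, 5, 12, 7, 40, 21]
ranks = equal_frequency_rank(values, 3)
```

Because segments are assigned by sort position, equal values can end up in neighbouring segments; a real implementation would need a rule for such ties, which the text does not specify.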

based on numerical data in each dimension data, segmenting partial data of the numerical data according to the number of the data to obtain n numerical segments, wherein n is an integer larger than 0;

removing from the numerical data the part that has been segmented according to the number of data, taking the maximum value and the minimum value of the remaining data as a numerical interval, and dividing the numerical interval into N-n parts to obtain N-n numerical segments, wherein N is an integer greater than n;

sorting the N numerical value segments according to the numerical values to obtain the sorting order of each numerical value segment;

sorting the numerical data in each dimension data according to size, respectively matching with N numerical segments, and taking the sorting order of the matched numerical segments as the feature transformation number of the numerical data;
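The mixed variant above combines the two previous ideas. The text does not say which part of the data is segmented by count, so the sketch below assumes it is the smallest values; `mixed_rank`, its parameters, and the requirement that both parts be non-empty are hypothetical.

```python
def mixed_rank(values, n_by_count, n_count_segments, n_total_segments):
    """
    Assumption: the n_by_count smallest values are split by count into
    n_count_segments segments (ranks 1..n); the remaining values are split by
    value range into n_total_segments - n_count_segments equal-width segments
    (ranks n+1..N). Requires 0 < n_by_count < len(values).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    head, tail = order[:n_by_count], order[n_by_count:]
    ranks = [0] * len(values)

    # Part 1: equal-count segments over the smallest values.
    for pos, idx in enumerate(head):
        ranks[idx] = pos * n_count_segments // len(head) + 1

    # Part 2: equal-width segments over the min..max interval of the remaining values.
    n_width = n_total_segments - n_count_segments
    lo, hi = values[tail[0]], values[tail[-1]]
    width = (hi - lo) / n_width if hi > lo else 1.0
    for idx in tail:
        seg = min(int((values[idx] - lo) / width), n_width - 1)
        ranks[idx] = n_count_segments + seg + 1
    return ranks


values = [1, 2, 3, 4, 100, 200, 300, 500]
ranks = mixed_rank(values, n_by_count=4, n_count_segments=2, n_total_segments=4)
```

A split of this kind can be useful when most values are small and tightly clustered while a few are very large, since count-based segments keep the dense region from collapsing into a single segment.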

Further, the network structure of the AI model includes four parts: an input layer, a factorization machine layer, a hidden layer and an output layer; the AI model is obtained by training according to the following method:

acquiring an application program installation package sample and a malicious label of the application program installation package sample;

analyzing a second target file in the application program installation package sample, and extracting second static information in the second target file; the second static information comprises at least one dimension information of a behavior dimension, a permission dimension and a content dimension;

processing the second static information into a second digital feature vector in a feature transformation mode, wherein the second digital feature vector consists of 0 and 1;

and inputting the second digital feature vector and the malicious label of the application program installation package sample into the constructed AI model, and training the AI model so as to obtain the AI model meeting the expected requirement.

Further, the inputting the second digital feature vector and the malicious label of the application program installation package sample into the constructed AI model, and training the AI model specifically includes:

converting the malicious label of the application program installation package sample into the malicious label of the second digital feature vector;

and inputting the second digital feature vector and the malicious label of the second digital feature vector into the constructed AI model, and training the AI model.

Further, the converting the malicious label of the application installation package sample into the malicious label of the second digital feature vector specifically includes:

based on the malicious labels of all the application program installation package samples, taking the malicious label of any application program installation package sample as the malicious label of the second digital feature vector corresponding to any application program installation package sample;

and based on all the second digital feature vectors and the corresponding malicious labels thereof, performing duplicate removal on the data in which the second digital feature vectors and the corresponding malicious labels thereof are completely the same, and obtaining the duplicate-removed second digital feature vectors and the malicious labels of the second digital feature vectors.
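The label transfer and duplicate removal described above can be sketched as follows. The label values (1 for malicious, 0 for non-malicious) and the function name are assumptions for illustration; only pairs whose vector and label are both identical are treated as duplicates, as the text states.

```python
def deduplicate(samples):
    """Drop (vector, label) pairs identical to an earlier pair, keeping first-seen order."""
    seen = set()
    unique = []
    for vector, label in samples:
        key = (tuple(vector), label)   # lists are not hashable, so use a tuple as the key
        if key not in seen:
            seen.add(key)
            unique.append((vector, label))
    return unique


samples = [
    ([1, 0, 0, 1], 1),
    ([1, 0, 0, 1], 1),   # same vector and same label: removed
    ([1, 0, 0, 1], 0),   # same vector but different label: kept
    ([0, 1, 0, 0], 0),
]
unique = deduplicate(samples)
```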

In a second aspect, an embodiment of the present invention provides an electronic device, including:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein:

the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the AI model-based malicious application detection method according to the first aspect of the embodiments of the present invention and the method according to any optional embodiment of the AI model-based malicious application detection method.

In a third aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions for executing the AI model-based malicious application detection method according to the first aspect of the embodiment of the present invention and the method according to any optional embodiment of the AI model-based malicious application detection method.

The AI model-based malicious application detection method provided by the embodiment of the invention extracts the static information of a target file in an application installation package to be detected, and processes the static information into a digital feature vector in a feature transformation mode; and on the basis of the trained AI model, calculating the digital feature vector through the AI model to obtain the probability that the application program corresponding to the digital feature vector is malicious application and/or the probability of non-malicious application. The embodiment of the invention solves the problems of difficult rule extraction, low coverage, poor expansibility, easy bypass and the like when the malicious application program is detected based on the conventional manual extraction rule, and has higher accuracy and timeliness for malicious program detection.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

Fig. 1 is a schematic flowchart of a malicious application detection method based on an AI model according to an embodiment of the present invention;

Fig. 2 is a numerical segmentation diagram of an embodiment of the present invention;

Fig. 3 is a schematic diagram of a network structure of the AI model according to the embodiment of the present invention;

Fig. 4 is a schematic diagram of an AI model training process according to an embodiment of the invention;

Fig. 5 shows a malicious application detection apparatus based on an AI model according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of a frame of an electronic device according to an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Aiming at the problems in the prior art, the embodiment of the invention redesigns the algorithm based on the deep learning algorithm to obtain an Artificial Intelligence (AI) model; training an AI model based on static information extraction and data statistics characteristics of massive Android platform application programs, and finally obtaining the AI model for detecting the malicious application programs of the Android platform, wherein the training process comprises the following steps: extracting static information of sample data (application program installation package samples); performing feature transformation on the extracted static information to obtain a digital feature vector corresponding to the sample data; and training the AI model by taking the digital feature vector corresponding to the sample data and the malicious label corresponding to the sample data as input data of the AI model, and evaluating the effect through a model evaluation strategy to obtain the AI model meeting expected requirements.

When the application program is subjected to malicious detection, extracting static information of a target file in an application program installation package to be detected, and performing feature transformation on the extracted static information to obtain a digital feature vector corresponding to the application program installation package to be detected; and taking the digital characteristic vector corresponding to the application program installation package to be detected as input data of the AI model, and outputting the probability that the application program installation package to be detected is malicious application and/or the probability that the application program installation package to be detected is non-malicious application after the AI model is operated.

In the embodiment of the invention, an AI model training stage and a malicious application program detection stage both comprise static information extraction and feature transformation on the extracted static information, and the processing methods of the two are completely the same; in all the alternative schemes of the embodiment of the invention, the scheme adopted by the application program in the malicious detection stage is consistent with the scheme adopted in the AI model training stage so as to obtain the expected detection effect.

Hereinafter, the malicious application detection method based on the AI model according to the embodiment of the present invention is described in detail from the perspective of the malicious application detection stage.

Fig. 1 is a schematic flow chart of a malicious application detection method based on an AI model according to an embodiment of the present invention. The AI model-based malicious application detection method shown in fig. 1 includes:

101, analyzing a target file in an application program installation package to be detected, and extracting static information from the target file, wherein the static information comprises information of at least one of a behavior dimension, a permission dimension and a content dimension;

102, processing the static information into a digital feature vector in a feature transformation mode, wherein the digital feature vector consists of 0 and 1;

103, inputting the digital feature vector into the trained AI model to obtain a malice detection result of the application program to be detected; the AI model is trained according to the input digital feature vector and outputs the probability that the application program corresponding to the digital feature vector is malicious application and/or non-malicious application.

In step 101, when performing maliciousness detection on an application program, the embodiment of the invention first obtains the application program installation package to be detected, analyzes a target file in that package, and extracts static information from the target file. The static information comprises information of at least one of the behavior dimension, the permission dimension and the content dimension, and each dimension may comprise one or more different items of information. For example, the behavior dimension may include behavior 1, behavior 2, behavior 3, etc., the permission dimension may include permission 1, permission 2, permission 3, permission 4, etc., and the content dimension may include content 1, content 2, content 3, etc.

Step 102 performs feature transformation on the extracted static information. The embodiment of the invention converts static information of different data types into a uniform digital feature vector, so as to preserve the generalization capability of the AI model while retaining as much feature information as possible. The static information mainly comprises two data types: a numerical type, obtained by counting the number of keywords or components, and a Boolean type, which indicates whether a behavior or permission is present. Typically, the static information of the behavior dimension and the permission dimension is Boolean data and the static information of the content dimension is numerical data, although exceptions are not excluded.

103, inputting the digital feature vector consisting of 0 and 1 into a pre-trained AI model, and calculating to obtain a malice detection result of the application program to be detected; the malice detection result in the embodiment of the invention refers to the probability that the application program to be detected is a malicious application and/or the probability of a non-malicious application. The AI model of the embodiment of the invention can output the probability that the application program is malicious application, or output the probability that the application program is non-malicious application, or simultaneously output the probability that the application program is malicious application and the probability that the application program is non-malicious application.

Assuming that the AI model outputs both the probability of a non-malicious application and the probability of a malicious application, the application is judged non-malicious if the former is greater than the latter, and malicious otherwise. For example, for the output "[ 0.98654, 0.01346 ]", 0.98654 is the probability of a non-malicious application and 0.01346 is the probability of a malicious application, which indicates that the program is non-malicious, thereby completing the determination of the maliciousness of the application.
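The decision rule for the two-probability output can be sketched as below. The ordering of the list follows the example output in the text (non-malicious probability first, malicious probability second); the function name and the returned strings are hypothetical.

```python
def classify(probabilities):
    """probabilities = [p_non_malicious, p_malicious], as in the example output."""
    p_non_malicious, p_malicious = probabilities
    # Non-malicious only when its probability is strictly larger; otherwise malicious.
    return "non-malicious" if p_non_malicious > p_malicious else "malicious"


result = classify([0.98654, 0.01346])
```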

The malicious application program detection method based on the AI model extracts static information of a target file in an application program installation package to be detected, and processes the static information into a digital feature vector in a feature transformation mode; and on the basis of the trained AI model, calculating the digital feature vector through the AI model to obtain the probability that the application program corresponding to the digital feature vector is malicious application and/or the probability of non-malicious application. The embodiment of the invention solves the problems of difficult rule extraction, low coverage, poor expansibility, easy bypass and the like when the malicious application program is detected based on the conventional manual extraction rule, and has higher accuracy and timeliness for malicious program detection.

Based on the above embodiment, the analyzing 101 a target file in an application installation package to be detected, and extracting static information in the target file specifically includes:

101.1, analyzing information of each subfile in the application program installation package to be detected to obtain a target file;

101.2, extracting a target character string matched with a specific keyword from the target file; the target character string comprises at least one dimension information of a behavior dimension, a permission dimension and a content dimension;

101.3, classifying and summarizing the target character strings according to the behavior dimension, the permission dimension and the content dimension to obtain the static information of the application program to be detected.

In step 101.1 of the embodiment of the invention, a target file is obtained from the subfiles of the application program installation package to be detected, wherein the target file can be, for example, AndroidManifest.xml.

The embodiment of the invention extracts the static information in the target file by means of keyword matching. The specific keywords in step 101.2 are keywords related to behaviors, permissions and contents, and the target character strings matched by the specific keywords are character string information related to behaviors, permissions and contents. Matching against all the specific keywords set by the embodiment of the invention may yield information for all of the behavior, permission and content dimensions, or for only some of them.

Step 101.3 classifies and summarizes the character strings matched by the keywords: character string information of the behavior dimension is assigned to the behavior dimension, that of the content dimension to the content dimension, and that of the permission dimension to the permission dimension; all the information of the behavior, permission and content dimensions together forms the static information.

In an optional embodiment, the static information of the content dimension is numerical data, and the static information of the behavior dimension and the permission dimension is Boolean data. When the data are classified and summarized, the character strings of the content dimension can be converted into numerical values, and the character strings of the behavior dimension and the permission dimension can be converted into Boolean values.
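Steps 101.2 and 101.3 can be illustrated with a small keyword-matching sketch for the Boolean dimensions. The keyword lists below are hypothetical stand-ins chosen for the example (the keywords actually used are those of Tables 1 and 2, which are not reproduced here), and the target-file content is assumed to be available as plain text.

```python
# Hypothetical keyword table, grouped by dimension (not taken from Tables 1 and 2).
KEYWORDS = {
    "behavior": ["sendTextMessage", "openConnection"],
    "permission": ["android.permission.SEND_SMS", "android.permission.READ_CONTACTS"],
}


def extract_boolean_info(text):
    """For each dimension, record whether each keyword appears in the target-file text."""
    return {
        dimension: {keyword: (keyword in text) for keyword in keywords}
        for dimension, keywords in KEYWORDS.items()
    }


sample_text = (
    'uses-permission name="android.permission.SEND_SMS" '
    "SmsManager.sendTextMessage"
)
info = extract_boolean_info(sample_text)
```

Each dimension ends up as a mapping from keyword to a Boolean value, which is the form the later feature transformation step converts into 0 or 1.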

The application program reinforcement in the embodiment of the invention refers to reinforcement protection of the application program, and blocks bad behaviors by detecting the running state of the application program, so that malicious programs are prevented from damaging a computer by using vulnerabilities of the application program.

The static information of the embodiment of the invention comprises at least one dimension information of behavior dimension, authority dimension and content dimension; the specific method for extracting the dimension information includes, but is not limited to, the following descriptions, and the dimension information at least includes one of the descriptions:

In the behavior dimension, the behavior of the application program refers to actions such as accessing local files, sending short messages and connecting to the network that the program performs at run time; this behavior information is specifically represented in a class. Based on analysis of a large number of existing applications, the correspondence between some of the behaviors used in the embodiment of the present invention and the Android application method names is shown in Table 1.

TABLE 1

In the permission dimension, an application program must first declare the corresponding permission before performing an action such as reading a file. Some of these permissions are sensitive, so whether certain permissions are declared has reference value for judging the maliciousness of the application program. Preferably, the embodiment of the present invention obtains the permission declarations from the "<uses-permission name="xxx">" field of the AndroidManifest.xml file. Based on the analysis of a large number of existing applications, some of the Android application permission fields used in the embodiment of the present invention are shown in Table 2.
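As a hedged illustration of reading permission declarations, the sketch below pulls the declared names out of manifest text with a regular expression and turns a watch list into 0/1 flags. It assumes the manifest has already been decoded to plain XML text (inside an installation package it is normally stored in a binary form); the pattern, the function name and the `WATCHED` list are hypothetical and are not the fields of Table 2.

```python
import re

# Matches the name attribute of a uses-permission element, with or without a namespace prefix.
PERMISSION_PATTERN = re.compile(r'<uses-permission\b[^>]*?\bname\s*=\s*"([^"]+)"')

# Hypothetical watch list of permissions of interest.
WATCHED = ["android.permission.SEND_SMS", "android.permission.READ_CONTACTS"]


def declared_permissions(manifest_text):
    """Return the set of permission names declared in the manifest text."""
    return set(PERMISSION_PATTERN.findall(manifest_text))


manifest = """<manifest>
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="android.permission.INTERNET"/>
</manifest>"""

declared = declared_permissions(manifest)
flags = [1 if permission in declared else 0 for permission in WATCHED]
```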

TABLE 2

In the content dimension, the total file size of the application program, the size of the classes.dex file, the number of contained files, and whether the application program uses a reinforcement means are counted; in the AndroidManifest.xml file, the numbers of activity components, service components, receiver components and meta_data components are counted using the keywords "<activity>", "<service>", "<receiver>" and "<meta_data>", respectively.
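Component counting by keyword can be sketched as follows, again assuming the manifest is available as decoded plain text. The tag list follows the spelling used in the text above (actual Android manifests spell the last element "meta-data"); the function name and the choice to count opening tags only are assumptions.

```python
import re

# Tag names as spelled in the text; adjust to the real manifest spelling as needed.
COMPONENT_TAGS = ["activity", "service", "receiver", "meta_data"]


def count_components(manifest_text):
    """Count opening tags of each component type in the manifest text."""
    counts = {}
    for tag in COMPONENT_TAGS:
        # "<tag" must be followed by whitespace, "/" or ">" so that longer tag
        # names sharing the same prefix are not counted.
        pattern = r"<" + re.escape(tag) + r"(?=[\s/>])"
        counts[tag] = len(re.findall(pattern, manifest_text))
    return counts


manifest = (
    '<application>'
    '<activity android:name=".A"/>'
    '<activity android:name=".B"></activity>'
    '<service android:name=".S"/>'
    '</application>'
)
counts = count_components(manifest)
```

The resulting counts are numerical data of the content dimension and would go through segmentation and one-hot coding before entering the feature vector.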

The embodiment of the invention extracts the dimension information of the application program installation package, including the specific behaviors, the authority, the sensitive character strings, the certificates and the like of the application program based on the analysis of a large number of existing application programs, and the application program with the malicious intent has strong discrimination with the normal application program in the dimensions, thereby being an important basis for judging the malicious nature of the application program.

As described above, the static information extracted in the embodiment of the present invention mainly includes two types of data, one is data of a numeric type, and the other is data of a boolean type.

Based on any of the above optional embodiments, the processing the static information into a digital feature vector in step 102 by a feature transformation manner specifically includes:

102.1, respectively carrying out feature transformation on numerical data in each dimension data based on each dimension data in the static information, and converting the numerical data into coded numbers consisting of 0 and 1 through one-hot coding;

102.2, respectively converting the Boolean data in each dimension data into a digital 0 or 1;

and 102.3, splicing all the converted numbers of the numerical data and the Boolean data into a digital feature vector consisting of 0 and 1 according to a preset sequence.

Step 102.1 of the embodiment of the invention performs feature transformation on numerical data, mainly the static information of the content dimension. The content dimension information in the static information is converted into numerical data, each numerical data item undergoes feature transformation, and each item thereby obtains a feature transformation number. For example, assume that the content dimension includes 4 numerical data items: 10, 11, 11 and 15. One alternative is to group 10, 11 and 11 into one class with feature transformation number 1 and to put 15 into another class with feature transformation number 2, so that the numerical data 10, 11, 11 and 15 receive the feature transformation numbers 1, 1, 1 and 2, respectively. In this way the data are discretized, which improves the adaptability and generalization capability of the AI model. The above example is merely a hypothetical alternative, and the embodiment of the present invention does not exclude other alternatives chosen according to the needs of a particular application.

The feature transformation number is then one-hot coded, i.e. converted into a coded number consisting of 0 and 1. One-hot encoding, also known as one-bit-effective encoding, uses an N-bit status register to encode N states; each state has its own independent register bit, and only one bit is active at any time.

For example, six states are encoded:

the natural sequence code is: 000, 001, 010, 011, 100, 101;

the one-hot encoding is then: 000001, 000010, 000100, 001000, 010000, 100000.
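The six-state example can be reproduced with a short sketch. The helper names are hypothetical; states are numbered from 0, and the active bit is placed so that state 0 maps to the rightmost position, matching the listing above.

```python
def natural_code(state, n_states):
    """Binary code of the state index, padded to the minimum width for n_states."""
    width = max(1, (n_states - 1).bit_length())
    return format(state, "0{}b".format(width))


def one_hot_code(state, n_states):
    """N-bit code with exactly one active bit; state 0 uses the rightmost bit."""
    return format(1 << state, "0{}b".format(n_states))


natural = [natural_code(s, 6) for s in range(6)]
one_hot = [one_hot_code(s, 6) for s in range(6)]
```

Six states need only three bits in natural sequence code but six bits in one-hot code, in exchange for every code containing exactly one 1.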

It should be noted that, in the embodiment of the present invention, the step 102.1 is mainly performed by using information of content dimensions, and it is not excluded that numerical data of other dimensions may exist to perform feature transformation through the step 102.1.

Step 102.2 performs feature transformation on Boolean data, mainly the static information of the behavior dimension and the permission dimension (other dimensions are not excluded). If a character string of the application program is matched by a specific keyword, the corresponding value is 1; otherwise it is 0. The embodiment of the present invention does not limit the order of steps 102.1 and 102.2.

Step 102.3: splicing all the coded numbers obtained from the numerical data and the 0 and 1 values obtained from the Boolean data into a character string according to a preset order.
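The splicing step might be sketched as follows. The function name `splice` and the feature names in the example are hypothetical; the only point illustrated is that the same preset order must be used for every application so that each position in the resulting string always refers to the same feature.

```python
from typing import Dict, List


def splice(encoded_numeric: Dict[str, str],
           boolean_bits: Dict[str, int],
           preset_order: List[str]) -> str:
    """Concatenate one-hot codes and Boolean bits in a fixed feature order."""
    parts = []
    for name in preset_order:
        if name in encoded_numeric:
            parts.append(encoded_numeric[name])    # one-hot code, e.g. "0100"
        else:
            parts.append(str(boolean_bits[name]))  # single bit, "0" or "1"
    return "".join(parts)


vector = splice(
    encoded_numeric={"component_a_count": "0100"},
    boolean_bits={"sends_sms": 1, "reads_contacts": 0},
    preset_order=["sends_sms", "component_a_count", "reads_contacts"],
)
print(vector)  # 101000
```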

The feature transformation mode of step 102.1 in the embodiment of the present invention is flexible and various, and different feature transformation modes can be adopted based on the specific data of the extracted static information and the operation requirements of different scenes.

Based on any optional embodiment, after performing feature transformation on the numerical data in each dimension data in step 102.1, the numerical data is converted into a coded number composed of 0 and 1 through one-hot coding, which specifically includes:

FIG. 2 is a schematic diagram of numerical segmentation according to an embodiment of the present invention. The embodiment of the invention performs segmentation according to the numerical data of each dimension. Specifically, the numerical data in the embodiment of the present invention is mainly data of the content dimension, from which a plurality of different kinds of content information can be extracted, such as the number of A components and the number of B components; the content dimension therefore includes a plurality of sub-dimensions, such as the number of A components and the number of B components.

Preferably, each dimension data described in the embodiment of the present invention may refer to one sub-dimension within a dimension. Taking the number of A components as an example, assume that the total number of data items relating to the number of A components extracted in the embodiment of the invention is 1150, that is, 1150 numerical data are obtained. These 1150 numerical data are segmented according to a first preset rule. The first preset rule may be to segment according to the values of the 1150 numerical data, according to the number of data items, or according to a mixture of the values and the number of data items; the specific segmentation method may be determined as required and is not limited in this embodiment of the present invention. Similarly, the number of B components of the content dimension is segmented based on all the extracted data on the number of B components, and the same applies to other extracted information.

As shown in fig. 2, it is assumed that the embodiment of the present invention obtains N numerical value segments by segmenting 1150 numerical value type data in the example through a first preset rule, and the N numerical value segments are sorted by size, and the sorted order is 1, 2, …, N respectively.

After the numerical values are sorted in segments, the embodiment of the present invention "registers" all the numerical data according to a second preset rule, where the second preset rule is a matching method corresponding to the first preset rule, that is: if the first preset rule is segmented according to the numerical value, the second preset rule is matched according to the numerical value; if the first preset rule is segmented according to the number of data, the second preset rule is matched according to the serial number of the data; and if the first preset rule is segmented according to the numerical value and the data number, the second preset rule is matched according to the numerical value and the data number.

Referring to fig. 2, taking as an example a first preset rule that segments by value, the 1150 numerical data in the above example are each matched against the N numerical segments. If the values of 50 data fall within the range of segment 1, rank 1 is taken as the feature transformation number of those 50 data, that is, each of the 50 data obtains feature transformation number 1; if the values of 100 data fall within the range of segment 2, rank 2 is taken as the feature transformation number of those 100 data, that is, each of the 100 data obtains feature transformation number 2. By analogy, every numerical datum obtains a feature transformation number, which is not described in further detail.

After segment matching, each numerical datum has a feature transformation number, and in the embodiment of the invention each feature transformation number is one-hot encoded to obtain a coded number consisting of 0 and 1. The one-hot encoding process of the following embodiments is the same and will not be described further.

The embodiment of the invention performs characteristic transformation on the numerical data, can classify and discretize the numerical data with similar characteristics (for example, numerical values are different but very close), and can improve the generalization capability of the AI model during the training of the AI model. Preferably, the embodiment of the present invention provides three feature transformation modes, which are feature transformation modes according to numerical values, according to data numbers, and according to a mixture of numerical values and data numbers, respectively, as follows:

first feature transformation: the first preset rule is segmented according to the numerical value, and the second preset rule is matched according to the numerical value;

second feature transformation: the first preset rule is segmented according to the number of data, and the second preset rule is matched according to the serial number of the data;

the third feature transformation mode: the first preset rule is segmented according to the numerical value and the data number, and the second preset rule is matched according to the numerical value and the data number.

Each feature transformation is described in detail below.

Based on any optional embodiment, step 102.1, after performing feature transformation on the numerical data in each dimension data, the numerical data is converted into a coded number composed of 0 and 1 through one-hot coding, which specifically includes:

This embodiment is the first feature transformation method and is suitable for scenarios in which the values of the numerical data are continuously distributed, as described below by way of example. Assume the 1150 numerical data exemplified above have continuously distributed values with a maximum of 1000 and a minimum of 0. Taking N as 10, the numerical interval of width 1000 is divided evenly into 10 segments with the numerical ranges [0, 100], [101, 200], [201, 300], [301, 400], [401, 500], [501, 600], [601, 700], [701, 800], [801, 900], and [901, 1000]. The segments are sorted by numerical range, and the ranks of the numerical segments are 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10, respectively.

The 1150 numerical data are each matched against the 10 numerical segments. If a numerical datum a has the value 10 and therefore falls in the [0, 100] segment, rank 1 of the [0, 100] segment is taken as the feature transformation number of a. By analogy, every numerical datum is segment-matched and obtains a feature transformation number in the range 1 to 10, so that the 1150 numerical data are divided into 10 groups: one group with feature transformation number 1, another with feature transformation number 2, and so on up to feature transformation number 10.

It should be noted that the sorting of the N numerical value segments may be arranged in an ascending order or in a descending order, and the sorting manner during malicious application detection is consistent with the sorting manner during AI model training.

In addition, for a value at a boundary point of a numerical segment, it is only required to match the value to a smaller numerical segment or a larger numerical segment according to an agreed rule, and the embodiment of the present invention does not limit this.

In addition, the above example equally divides the value interval 1000 into 10 segments, and if a specific rule is used to perform non-average segmentation, the method also falls into the protection scope of the embodiment of the present invention.

Performing one-hot encoding on the numerical data with the obtained feature transformation numbers, for example, if the feature transformation number of the numerical data a is 1, and the total feature transformation numbers are 10, then the one-hot encoding is 1000000000; if the feature transformation number of the value type data b is 2, the one-hot code is 0100000000; and so on.
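The first feature transformation method, with the example values used above (minimum 0, maximum 1000, N = 10), could be sketched as follows. The function names are hypothetical, and the boundary rule (a value on a segment boundary goes to the smaller segment) is only one of the agreed rules the embodiment allows. The bit order of `one_hot` follows the example in this embodiment, where feature transformation number 1 sets the leftmost bit.

```python
import math


def equal_width_number(value: float, lo: float, hi: float, n: int) -> int:
    """Return the 1-based rank of the equal-width segment containing `value`."""
    width = (hi - lo) / n
    # Assumed boundary rule: a value on a segment boundary goes to the smaller segment.
    rank = math.ceil((value - lo) / width)
    # Clamp so that the minimum value and out-of-range values still get a valid rank.
    return min(max(rank, 1), n)


def one_hot(number: int, total: int) -> str:
    """One-hot code of length `total`; number 1 sets the leftmost bit."""
    return "".join("1" if i == number else "0" for i in range(1, total + 1))


a, b = 10, 150
print(one_hot(equal_width_number(a, 0, 1000, 10), 10))  # 1000000000
print(one_hot(equal_width_number(b, 0, 1000, 10), 10))  # 0100000000
```

The same minimum, maximum, N, and sorting direction would have to be reused at detection time so that the codes match those seen during training.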

This embodiment is the second feature transformation method and is suitable for scenarios in which the values of the numerical data are discretely distributed, as described below by way of example. Assume the 1150 numerical data exemplified above, with discretely distributed values. If the first feature transformation method were used, some numerical segments would have no matching data while others would have very many, which adversely affects the computation of the AI model.

In order to avoid the above situation, the second feature transformation method uses the number of data to perform segmentation and matching. Assuming the 1150 pieces of numerical data exemplified above, and taking N as 10, 10 numerical segments are obtained, and each numerical segment may match 115 pieces of numerical data; sequencing 1150 numerical data according to numerical values, matching the sequenced 1 st to 115 th numerical data with a numerical value segment 1 to obtain a feature transformation number 1, matching the sequenced 116 th to 230 th numerical data with a numerical value segment 2 to obtain a feature transformation number 2, and so on, matching the sequenced 1036 th to 1150 th numerical data with a numerical value segment 10 to obtain a feature transformation number 10. The one-hot encoding of the numerical data obtained with the feature transformation numbers is the same as the above embodiment, and is not described herein again.

It should be noted that, the above example equally divides the data number 1150 into 10 segments, and if a specific rule is used to perform non-average segmentation, the data number also falls into the protection scope of the embodiment of the present invention.
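The second feature transformation method can be illustrated with the following sketch, which sorts the data by value and assigns feature transformation numbers by rank position. The function name `equal_frequency_numbers` is hypothetical, and how equal values that straddle a segment boundary are treated is an assumption here (they are simply split by their position after sorting).

```python
import math
from collections import Counter
from typing import List, Sequence


def equal_frequency_numbers(values: Sequence[float], n: int) -> List[int]:
    """Assign each value a 1-based segment rank so that segments hold equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    per_segment = math.ceil(len(values) / n)
    numbers = [0] * len(values)
    for position, index in enumerate(order):
        # Positions 0..per_segment-1 go to segment 1, the next block to segment 2, ...
        numbers[index] = min(position // per_segment + 1, n)
    return numbers


# 1150 distinct values in scrambled order, standing in for the example data.
values = [(i * 37) % 1150 for i in range(1150)]
numbers = equal_frequency_numbers(values, 10)
counts = Counter(numbers)
print(all(counts[k] == 115 for k in range(1, 11)))  # True
```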

This embodiment is a third feature transformation method, and is suitable for a scenario where the values of the numerical data are distributed discretely or continuously, as will be described below by way of example. It should be noted that the text description sequence of the third feature transformation method is not used to limit the actual step sequence; in fact, all methods of segmenting by a mixture of numerical values and data numbers and matching by a mixture of numerical values and data numbers fall within the scope of the embodiments of the present invention.

Assume the 1150 numerical data exemplified above are sorted by value, where the first portion and the last portion of the sorted data are discrete and the middle portion is continuous. The first portion, i.e., x% of the total number of data, is taken as the first numerical segment with rank 1, and the last portion, i.e., y% of the total number of data, is taken as the Nth numerical segment with rank N. If N is 10, x is 10, and y is 20, then 1150 × 10% = 115 and 1150 × 20% = 230, so the first numerical segment matches 115 numerical data and the Nth numerical segment matches 230 numerical data. After the 1150 numerical data are sorted by value, the 1st to 115th numerical data are matched with the first numerical segment to obtain feature transformation number 1, and the 921st to 1150th numerical data are matched with the Nth numerical segment to obtain feature transformation number 10.

The minimum and maximum values among the 116th to 920th numerical data in the sorted 1150 numerical data define a numerical interval, which is divided evenly or unevenly into 10 - 2 = 8 numerical segments with ranks 2, 3, 4, 5, 6, 7, 8, and 9, respectively. Each of the 116th to 920th numerical data is matched against numerical segments 2 to 9; if the value of a numerical datum falls within the range of a numerical segment S, the rank of segment S is taken as the feature transformation number of that numerical datum.

The one-hot encoding of the numerical data obtained with the feature transformation numbers is the same as the above embodiment, and is not described herein again.
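A sketch of the third feature transformation method under the example parameters (N = 10, x = 10, y = 20) follows. The function name `hybrid_numbers` is hypothetical; the middle portion is divided evenly here, although the embodiment also permits uneven division, and the sketch assumes the middle portion is not empty.

```python
import math
from typing import List, Sequence


def hybrid_numbers(values: Sequence[float], n: int, x_pct: int, y_pct: int) -> List[int]:
    """First x% by count -> 1, last y% by count -> n, middle by value -> 2 .. n-1."""
    total = len(values)
    order = sorted(range(total), key=lambda i: values[i])
    head = total * x_pct // 100            # size of the first segment
    tail = total * y_pct // 100            # size of the Nth segment
    middle = order[head:total - tail]      # assumed non-empty
    lo, hi = values[middle[0]], values[middle[-1]]
    width = (hi - lo) / (n - 2) or 1.0     # avoid zero width if all middle values are equal
    numbers = [0] * total
    for position, index in enumerate(order):
        if position < head:
            numbers[index] = 1
        elif position >= total - tail:
            numbers[index] = n
        else:
            offset = math.ceil((values[index] - lo) / width)
            numbers[index] = 1 + min(max(offset, 1), n - 2)
    return numbers


numbers = hybrid_numbers(list(range(1150)), n=10, x_pct=10, y_pct=20)
print(numbers.count(1), numbers.count(10))  # 115 230
```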

It should be noted that, in the above example, the first and last portions of the sorted numerical data are segmented by number of data and the middle portion is segmented by value range; depending on actual application requirements, the first and last portions may instead be segmented by value range and the middle portion by number of data, and so on.

The embodiment of the invention performs feature transformation on the extracted static information and then performs one-hot encoding. Starting from the data itself, this allows the information contained in the data to be extracted to the greatest extent when the AI model is trained while ensuring the final effect of the AI model, and it maximizes detection accuracy when malicious applications are detected.

Based on any optional embodiment, the AI model network structure of the embodiment of the present invention includes, but is not limited to, a convolutional neural network, a recurrent neural network, a deep learning network, a machine learning network, and the like. Fig. 3 is a schematic diagram of a network structure of the AI model according to the embodiment of the present invention. As shown in fig. 3, the network structure of the AI model includes four parts: Input Layers, Factorization Machine Layers, Hidden Layers, and Output Layers;

the input layer is used for receiving the digital feature vectors and the application program malicious labels as input data; the factorization machine layer is used for extracting low-order features in the input data and calculating according to the low-order features; the hidden layer is used for extracting high-order features in the input data, calculating malicious features of the application program according to the high-order features, and separating the features of application programs with different maliciousness in a high-dimensional space; and the output layer is used for merging the calculation result of the factorization machine layer and the calculation result of the hidden layer and outputting the probability that the application program corresponding to the digital feature vector is a malicious application and/or a non-malicious application.

FIG. 4 is a schematic diagram of an AI model training process according to an embodiment of the invention. As shown in fig. 4, the AI model according to the embodiment of the present invention is obtained by training as follows:

Step 400: obtaining an application program installation package sample and a malicious label of the application program installation package sample;

Step 401: analyzing a second target file in the application program installation package sample, and extracting second static information in the second target file; the second static information comprises at least one dimension information of a behavior dimension, a permission dimension and a content dimension;

Step 402: processing the second static information into a second digital feature vector by a feature transformation manner, wherein the second digital feature vector is composed of 0 and 1;

Step 403: inputting the second digital feature vector and the malicious label of the application program installation package sample into the constructed AI model, and training the AI model to obtain the AI model meeting the expected requirements.

It should be noted that the second static information and the second digital feature vector of the embodiment of the present invention are only used for noun distinction from the static information and the digital feature vector mentioned above, where "second" has no practical meaning; the second static information and the static information have the same actual meaning, and the second digital feature vector and the digital feature vector have the same actual meaning.

In the embodiment of the present invention, the application installation package samples and malicious labels obtained in step 400 may be the result of manual analysis, that is, sample collection and maliciousness judgment are performed manually. In the embodiment of the invention, a large number of complete Android platform application programs are collected, rules for judging the maliciousness of application programs are extracted by manually analyzing the characteristics of malicious applications such as Trojans, and the collected Android platform application programs are then judged against these rules to obtain the malicious label corresponding to each application program installation package sample.

As described above, the AI model training phase and the malicious detection phase of the application program in the embodiment of the present invention both include static information extraction and feature transformation on the extracted static information, and the processing methods thereof are completely the same.

Step 401 of the AI model training method according to the embodiment of the present invention is completely the same as the processing method of step 101 and all the optional embodiments of step 101 of the AI model-based malicious application detection method; if there are multiple different optional embodiments, the effect of detecting the malicious application can be achieved by keeping the optional embodiments adopted in step 101 and step 401 consistent, and details are not described here.

Step 402 of the AI model training method according to the embodiment of the present invention is completely the same as the processing method of the malicious application detection method based on the AI model, step 102, and all the optional embodiments of step 102; if there are multiple different optional embodiments, the effect of detecting the malicious application can be achieved by keeping the optional embodiments adopted in step 102 and step 402 consistent, and details are not repeated here.

In step 403 of the AI model training method according to the embodiment of the present invention, the difference from step 103 of the AI model-based malicious application detection method is that the input data in step 403 includes a malicious label of an application installation package sample in addition to a digital feature vector after feature transformation, and the AI model is trained through the digital feature vector and the malicious label of the application installation package sample, so that the AI model has the capability of identifying malicious applications according to the digital feature vector.

The data received by the input layer (Input Layers) of the AI model in the embodiment of the invention are the digital feature vectors formed after feature transformation. The digital feature vectors comprise features extracted from dimensions such as authority, behavior, and content that represent the maliciousness of an application program (APP), and each digital feature vector is composed of 0 and 1.

The factorization machine layer (Factorization Machines) of the AI model in the embodiment of the invention is used for extracting low-order features in the data. The maliciousness of an APP usually manifests as a series of actions, and a malicious behavior can be completed only when behaviors, permissions, content, and the like cooperate with one another, so combinations of features reflect the maliciousness of the APP to a great extent. The factorization machine extracts feature combinations mainly through the inner products of the latent vectors of the individual feature dimensions, so that the model can retain the combination information among the features.

Preferably, the hidden layer (Hidden Layers) of the AI model according to the embodiment of the present invention is a feed-forward neural network that includes hidden units (Hidden units) and sigmoid functions (Sigmoid functions). It comprises 6 hidden layers, the number of neuron nodes in each layer ranges from 32 to 256, and the specific numbers of neuron nodes of the layers may be 64, 128, 256, 128, 64, and 32, respectively. The hidden layers are used for extracting high-order features in the data and for separating the features of APPs with different maliciousness in a high-dimensional space.

The output layer (Output Layers) of the AI model in the embodiment of the invention is obtained by combining the forward calculation results of the factorization machine layer and the hidden layer, and the final number of output nodes is 2 or 1: if it is 2, the outputs are respectively the probability that the sample is a non-malicious application and the probability that it is a malicious application; if it is 1, the output is the probability that the sample is a non-malicious application or the probability that it is a malicious application.
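The forward calculation through the four parts might be sketched as below for the single-output-node case. This is an untrained toy with random placeholder weights, not the model of the embodiment: the function names, the latent dimension k = 4, the weight initialization, and in particular the way the two branches are merged (adding the factorization machine score to the hidden-layer score before a final sigmoid) are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fm_score(x, w, v):
    """First-order term plus pairwise terms <v_i, v_j> * x_i * x_j for i < j."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(v[0])):
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)  # equals the sum over i < j for factor f
    return linear + pairwise


def dense(inputs, weights, biases, activate=True):
    """One fully connected layer, optionally followed by a sigmoid."""
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(wi * xi for wi, xi in zip(row, inputs)) + bias
        outputs.append(sigmoid(z) if activate else z)
    return outputs


def init_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def init_params(n_features, k=4, sizes=(64, 128, 256, 128, 64, 32), seed=0):
    """Random placeholder parameters; a real model would learn these by training."""
    rng = random.Random(seed)
    hidden, n_in = [], n_features
    for n_out in sizes:
        hidden.append(init_layer(rng, n_in, n_out))
        n_in = n_out
    return {
        "w": [rng.uniform(-0.1, 0.1) for _ in range(n_features)],
        "v": [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_features)],
        "hidden": hidden,
        "out": init_layer(rng, n_in, 1),
    }


def malicious_probability(x, params):
    h = x
    for weights, biases in params["hidden"]:      # six hidden layers
        h = dense(h, weights, biases)
    deep = dense(h, *params["out"], activate=False)[0]
    # Assumed merge: add both branch scores, then squash to a probability.
    return sigmoid(fm_score(x, params["w"], params["v"]) + deep)


x = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # a digital feature vector of 0s and 1s
print(0.0 < malicious_probability(x, init_params(len(x))) < 1.0)  # True
```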

Preferably, the detection requirement of a low false alarm scene can be met by setting a probability threshold of malicious application and non-malicious application.
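How a probability threshold affects the decision can be illustrated with a short sketch. The function name `is_malicious` and the threshold values 0.5 and 0.9 are hypothetical; the embodiment does not state specific thresholds.

```python
def is_malicious(p_malicious: float, threshold: float = 0.5) -> bool:
    """Report an application as malicious only if its probability reaches the threshold."""
    return p_malicious >= threshold


scores = [0.55, 0.80, 0.97]
print([is_malicious(p) for p in scores])       # [True, True, True]
print([is_malicious(p, 0.9) for p in scores])  # [False, False, True]
```

Raising the threshold reports fewer applications as malicious, which lowers false alarms at the cost of possibly missing some malicious applications.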

Finally, model training is carried out in a back propagation mode; and adjusting parameters, iteration times, a model structure and the like of the AI model by taking cross validation as a means, taking accuracy, recall rate, precision rate and F value as standards and considering the influences of time factors and business backgrounds, and finally obtaining the AI model meeting expected requirements.

Based on any optional embodiment, in step 403, inputting the second digital feature vector and the malicious label of the application installation package sample into the constructed AI model, and training the AI model specifically includes:

Step 403.1: converting the malicious label of the application program installation package sample into the malicious label of the second digital feature vector;

Step 403.2: inputting the second digital feature vector and the malicious label of the second digital feature vector into the constructed AI model, and training the AI model.

In the embodiment of the invention, the digital feature vector is obtained by performing feature transformation on static information extracted from an application program installation package sample, and the malicious labels of the application program installation package sample are the result of labeling the malicious nature of APP after analyzing a large amount of mainstream malicious software of an Android platform, one application program installation package sample corresponds to one digital feature vector and corresponds to one malicious label, so that the digital feature vector of the application program installation package sample and the malicious labels can be corresponded, and the malicious labels of the application program installation package sample are converted into the malicious labels of the digital feature vector, namely one digital feature vector corresponds to one malicious label.

Thus, in the embodiment of the present invention, one input data of the AI model is the digital feature vector, and the other input data is the malicious label corresponding to the digital feature vector, so as to train the AI model.

According to the embodiment of the invention, the digital feature vector is corresponding to the malicious label, and the generalization capability of the AI model can be improved by labeling the digital feature vector.

Based on any optional embodiment, the converting the malicious label of the application installation package sample into the malicious label of the second digital feature vector in step 403.1 specifically includes:

In the embodiment of the invention, when the malicious labels of the application program installation package samples are converted into malicious labels of the digital feature vectors, data entries whose digital feature vector and malicious label are both identical inevitably occur. For example, suppose applications A, B, and C are each found to be malicious after analysis. Assuming that the digital feature vectors obtained after static information extraction and feature transformation of applications A, B, and C are completely the same, say 0010000000, and that the malicious labels of A, B, and C are all malicious, say 1, three completely identical data entries (0010000000, 1) will appear. The embodiment of the invention de-duplicates data entries whose digital feature vector and corresponding malicious label are completely the same, which can greatly reduce the data volume, thereby improving the generalization capability of the AI model and accelerating its training. In tests in a laboratory scenario, de-duplication reduced the data volume by more than one order of magnitude.

Further, if the malicious labels corresponding to the same digital feature vector include both malicious and non-malicious labels, the digital feature vector is marked as malicious if the number of malicious labels is greater than the number of non-malicious labels, and is otherwise marked as non-malicious.

For example, suppose that after static information extraction and feature transformation, applications A, B, C, D, and E all yield the same digital feature vector, say 0010000000; the labels of A, B, and C are all malicious, say 1, and the labels of D and E are both non-malicious, say 0. The number of labels 1 corresponding to the digital feature vector 0010000000 is then greater than the number of labels 0, so the digital feature vector 0010000000 is marked as 1.

According to the embodiment, the digital feature vectors are subjected to duplicate removal and duplicate marking according to the number of malicious labels and non-malicious labels corresponding to the digital feature vectors, so that the data volume is further reduced, the data processing pressure of the AI model is relieved, and meanwhile, the generalization capability of the AI model is further improved.
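The de-duplication and re-marking described above can be sketched as follows, using the example of applications A to E. The function name `dedupe_and_label` is hypothetical; labels are assumed to be 1 for malicious and 0 for non-malicious, as in the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def dedupe_and_label(samples: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Keep one entry per digital feature vector, labeled by majority vote."""
    votes = defaultdict(Counter)
    for vector, label in samples:
        votes[vector][label] += 1
    # Malicious (1) only if malicious labels outnumber non-malicious ones;
    # otherwise, including ties, the vector is marked non-malicious (0).
    return {
        vector: 1 if counts[1] > counts[0] else 0
        for vector, counts in votes.items()
    }


samples = [
    ("0010000000", 1),  # application A
    ("0010000000", 1),  # application B
    ("0010000000", 1),  # application C
    ("0010000000", 0),  # application D
    ("0010000000", 0),  # application E
]
print(dedupe_and_label(samples))  # {'0010000000': 1}
```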

The training data of the AI model of the embodiment of the invention is on the order of millions: in the training environment there are more than 3.96 million malicious applications and more than 880,000 non-malicious applications, i.e., about 5 million training samples in total. The digital feature vectors of all application programs are obtained, and the corresponding digital feature vectors are labeled and de-duplicated using the malicious labels of the application programs, yielding about 500,000 digital feature vectors with malicious or non-malicious labels, of which about 220,000 are marked as malicious and about 280,000 as non-malicious. From these labeled and de-duplicated digital feature vectors, a training set is selected for training the AI model.

In the testing process, test data is selected by adopting a strategy which does not intersect with the training set, specifically, 121946 malicious applications and 81757 non-malicious applications are selected, and the testing results are shown in table 3.

As can be seen from table 3, of the total 203703 test data, 119508 were truly malicious and predicted to be malicious; 81349 were truly non-malicious and predicted to be non-malicious; 2438 were truly malicious and predicted to be non-malicious; and 408 were truly non-malicious and predicted to be malicious.

TABLE 3

Therefore, the indexes of the AI model of the embodiment of the invention can be calculated as follows:

the detection rate is: 119508/(119508+2438) = 98.00%;

the false alarm rate is: 408/(408+81349) = 0.50%;

the accuracy is: (119508+81349)/203703 = 98.60%;

therefore, the AI model provided by the embodiment of the invention has extremely high detection rate and accuracy, extremely low false alarm rate and very good beneficial effect.
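The three indexes can be recomputed with a short sketch. The function name `evaluate` is hypothetical. The false-negative and false-positive counts are derived from the test-set sizes stated above (121946 malicious and 81757 non-malicious applications) and the two correctly predicted counts, rather than being entered directly.

```python
def evaluate(n_malicious: int, n_benign: int, true_pos: int, true_neg: int) -> dict:
    """Compute detection rate, false alarm rate, and accuracy from test counts."""
    false_neg = n_malicious - true_pos  # malicious predicted as non-malicious
    false_pos = n_benign - true_neg     # non-malicious predicted as malicious
    return {
        "detection_rate": true_pos / (true_pos + false_neg),
        "false_alarm_rate": false_pos / (false_pos + true_neg),
        "accuracy": (true_pos + true_neg) / (n_malicious + n_benign),
    }


metrics = evaluate(n_malicious=121946, n_benign=81757, true_pos=119508, true_neg=81349)
print(f"{metrics['detection_rate']:.2%}")    # 98.00%
print(f"{metrics['false_alarm_rate']:.2%}")  # 0.50%
print(f"{metrics['accuracy']:.2%}")          # 98.60%
```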

In summary, in the embodiment of the present invention, based on analysis of application programs of the Android platform and in combination with mobile application security knowledge, first, a plurality of features such as file size, whether the application is packed (shelled), and the number of supported languages are extracted from the three dimensions of behavior, authority, and content; second, data preprocessing such as feature transformation is performed on the feature data; third, multiple rounds of model training are performed using the redesigned model based on a deep learning algorithm, and the model is brought to the expected target through a model evaluation strategy; and finally, features are extracted from application programs of unknown maliciousness and fed to the model for prediction, thereby completing the maliciousness judgment of the application programs.

Static information is the cornerstone of AI model training and the basis of malicious application detection. The static information contains information such as the specific behaviors, authorities, sensitive character strings, and certificates of the application program; in these dimensions, malicious application programs are strongly distinguishable from normal application programs, which makes the static information an important basis for judging the maliciousness of an application program.

Feature transformation is a guarantee of the AI model effect. The feature transformation is based on the data, and from the model perspective, the final effect of the model is ensured while the information contained in the data can be maximally extracted.

Model design is the key point for the success of AI models. And designing a proper model to convert information in the data into knowledge to the maximum extent, and further solidifying the knowledge for judging the maliciousness of the application program.

In summary, the embodiment of the invention performs static feature extraction and data statistics on massive Android platform applications, performs algorithm redesign based on a deep learning algorithm, further performs model training on data after feature extraction, and finally obtains an AI model for detecting malicious applications of the Android platform. The method and the device solve the problems of difficult rule extraction, low coverage, poor expansibility, easiness in bypassing and the like existing in the traditional method for detecting the malicious application program based on the manual extraction rule, and have higher accuracy and timeliness.

Fig. 5 is a schematic diagram of a malicious application detection apparatus based on an AI model according to an embodiment of the present invention. An embodiment of the present invention further provides an AI model-based malicious application detection apparatus, as shown in fig. 5, including a static information extraction module 501, a feature transformation module 502, and a malicious application detection module 503:

the static information extraction module 501 is configured to analyze a target file in an application installation package to be detected, and extract static information in the target file, where the static information includes at least one of behavior dimension, authority dimension, and content dimension;

the feature transformation module 502 is configured to process the static information into a digital feature vector in a feature transformation manner, where the digital feature vector is composed of 0 and 1;

the malicious application detection module 503 is configured to input the digital feature vector into the trained AI model to obtain a maliciousness detection result for the application program to be detected; the AI model operates on the input digital feature vector and outputs the probability that the application program corresponding to the digital feature vector is a malicious application and/or a non-malicious application.

The malicious application detection device based on the AI model according to the embodiment of the present invention is used for executing the technical solution of the malicious application detection method based on the AI model shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again.
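The cooperation of modules 501, 502 and 503 can be sketched as a simple pipeline. This is an illustrative arrangement under stated assumptions, not the apparatus itself: the class `DetectionApparatus`, its field names and the stub functions are hypothetical, and the stubs stand in for real installation package parsing and for a trained AI model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

StaticInfo = Dict[str, List[str]]


@dataclass
class DetectionApparatus:
    extract: Callable[[str], StaticInfo]          # role of module 501
    transform: Callable[[StaticInfo], List[int]]  # role of module 502
    predict_proba: Callable[[List[int]], float]   # trained model used by module 503

    def detect(self, package_path: str) -> Dict[str, float]:
        info = self.extract(package_path)
        vector = self.transform(info)
        # The digital feature vector must consist of 0 and 1 only.
        if any(bit not in (0, 1) for bit in vector):
            raise ValueError("feature vector must consist of 0 and 1")
        p = self.predict_proba(vector)
        return {"malicious": p, "non_malicious": 1.0 - p}


# Hypothetical stubs so that the sketch runs end to end.
def stub_extract(package_path: str) -> StaticInfo:
    return {"permissions": ["android.permission.SEND_SMS"]}


def stub_transform(info: StaticInfo) -> List[int]:
    vocab = ["android.permission.SEND_SMS", "android.permission.INTERNET"]
    return [1 if name in info.get("permissions", []) else 0 for name in vocab]


def stub_model(vector: List[int]) -> float:
    return 0.75 if vector[0] == 1 else 0.25


if __name__ == "__main__":
    apparatus = DetectionApparatus(stub_extract, stub_transform, stub_model)
    print(apparatus.detect("sample.apk"))
```

Passing the three stages in as callables keeps the sketch independent of any particular parser or model, so each stage can be replaced without touching the other two.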

Fig. 6 is a schematic diagram of an electronic device framework according to an embodiment of the invention. Referring to fig. 6, an embodiment of the invention provides an electronic device, including: a processor 610, a communications interface 620, a memory 630 and a bus 640, wherein the processor 610, the communications interface 620 and the memory 630 communicate with one another through the bus 640. The processor 610 may call logic instructions in the memory 630 to perform a method comprising: analyzing a target file in an application program installation package to be detected, and extracting static information from the target file, wherein the static information comprises information in at least one of a behavior dimension, an authority dimension and a content dimension; processing the static information into a digital feature vector by feature transformation, wherein the digital feature vector consists of 0 and 1; and inputting the digital feature vector into a trained AI model to obtain a maliciousness detection result for the application program to be detected, wherein the AI model operates on the input digital feature vector and outputs the probability that the application program corresponding to the digital feature vector is a malicious application and/or a non-malicious application.

An embodiment of the present invention discloses a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium. The computer program includes program instructions which, when executed by a computer, enable the computer to execute the methods provided by the above method embodiments, for example a method comprising: analyzing a target file in an application program installation package to be detected, and extracting static information from the target file, wherein the static information comprises information in at least one of a behavior dimension, an authority dimension and a content dimension; processing the static information into a digital feature vector by feature transformation, wherein the digital feature vector consists of 0 and 1; and inputting the digital feature vector into a trained AI model to obtain a maliciousness detection result for the application program to be detected, wherein the AI model operates on the input digital feature vector and outputs the probability that the application program corresponding to the digital feature vector is a malicious application and/or a non-malicious application.

Embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions cause a computer to perform the methods provided by the above method embodiments, for example a method comprising: analyzing a target file in an application program installation package to be detected, and extracting static information from the target file, wherein the static information comprises information in at least one of a behavior dimension, an authority dimension and a content dimension; processing the static information into a digital feature vector by feature transformation, wherein the digital feature vector consists of 0 and 1; and inputting the digital feature vector into a trained AI model to obtain a maliciousness detection result for the application program to be detected, wherein the AI model operates on the input digital feature vector and outputs the probability that the application program corresponding to the digital feature vector is a malicious application and/or a non-malicious application.

Those of ordinary skill in the art will understand that: the implementation of the above-described apparatus embodiments or method embodiments is merely illustrative, wherein the processor and the memory may or may not be physically separate components, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware. With this understanding in mind, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a USB flash drive, a removable hard disk, a ROM/RAM, a magnetic disk or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the method according to the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A malicious application detection method based on an AI model is characterized by comprising the following steps:

2. The method of claim 1, wherein the behavior dimension contains behavior information for an application runtime;

3. The method according to claim 1 or 2, wherein the processing the static information into a digital feature vector by a feature transformation manner specifically comprises:

4. The method according to claim 3, wherein the converting the numerical data in each dimension data into a coded number consisting of 0 and 1 by one-hot coding after performing feature transformation specifically comprises:

5. The method according to claim 3 or 4, wherein the converting the numerical data in each dimension data into a coded number consisting of 0 and 1 by one-hot coding after performing feature transformation specifically comprises:

6. The method according to claim 3 or 4, wherein the converting the numerical data in each dimension data into a coded number consisting of 0 and 1 by one-hot coding after performing feature transformation specifically comprises:

7. The method according to claim 3 or 4, wherein the converting the numerical data in each dimension data into a coded number consisting of 0 and 1 by one-hot coding after performing feature transformation specifically comprises:

8. The method according to any of claims 1-7, wherein the network structure of the AI model comprises four parts: an input layer, a decomposition machine layer, a hidden layer and an output layer; the AI model is obtained by training according to the following method:

9. The method according to claim 8, wherein the step of inputting the second digital feature vector and the malicious label of the application installation package sample into the constructed AI model and training the AI model comprises:

10. The method according to claim 9, wherein the converting the malicious label of the application installation package sample into the malicious label of the second digital feature vector comprises:

11. An electronic device, comprising:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein:

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 10.

12. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 10.