CN115934501A

CN115934501A - Application program detection method and device, storage medium and electronic equipment

Info

Publication number: CN115934501A
Application number: CN202211013587.0A
Authority: CN
Inventors: 骆媛媛; 邹一心
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2022-08-23
Filing date: 2022-08-23
Publication date: 2023-04-07

Abstract

The disclosed embodiment relates to an application program detection method and device, a storage medium and an electronic device, and relates to the technical field of network security, wherein the method comprises the following steps: acquiring application data of an application program to be detected; acquiring a text vector of the application data, fitting the text vector through a trained detection model to acquire a plurality of classification values, and performing integrated classification on the classification values to determine a detection result of the application data; and determining a compliance detection result according to the detection result, and performing corresponding operation according to the compliance detection result. The application program detection method and the application program detection device can improve the accuracy of application program detection.

Description

Application program detection method and device, storage medium and electronic equipment

Technical Field

The embodiment of the disclosure relates to the technical field of network security, and in particular relates to an application program detection method, an application program detection device, a computer-readable storage medium and an electronic device.

Background

Applications involve the collection and use of large amounts of personally sensitive information, and face a number of privacy risks.

In the related art, applications are typically evaluated by combining manual review with tools. This approach suffers from low detection efficiency, high labor cost and low accuracy.

It should be noted that the information disclosed in the above Background section is only intended to enhance understanding of the background of the present disclosure, and therefore may include information that does not constitute prior art already known to a person of ordinary skill in the art.

Disclosure of Invention

The present disclosure is directed to an application detection method, an application detection apparatus, a computer-readable storage medium, and an electronic device, which overcome, at least to some extent, the problems of low security and poor accuracy caused by the limitations and disadvantages of the related art.

According to an aspect of the present disclosure, there is provided an application detection method including: acquiring application data of an application program to be detected; acquiring a text vector of the application data, fitting the text vector through a trained detection model to acquire a plurality of classification values, and performing integrated classification on the classification values to determine a detection result of the application data; and determining a compliance detection result according to the detection result, and performing corresponding operation according to the compliance detection result.

In an exemplary embodiment of the present disclosure, the detection model includes a first model, a second model, and a third model; the fitting of the text vector through the trained detection model to obtain a plurality of classification values, and the integrated classification of the classification values to determine the detection result of the application data includes: respectively extracting features of the text vectors through a first model, a second model and a third model in the detection model to obtain a plurality of classification values; and performing integrated voting on the plurality of classification values to determine a detection result of the text vector.

In an exemplary embodiment of the present disclosure, the method further comprises: acquiring reference application data of an application program, and screening the reference application data based on a judgment rule to generate sample data; and training a detection model according to the sample data to obtain the trained detection model.

In an exemplary embodiment of the present disclosure, the acquiring reference application data of an application includes: acquiring page data of the application program, and extracting link information according to the page data; and acquiring an installation package of the application program according to the link information, and acquiring the reference application data based on the installation package.

In an exemplary embodiment of the present disclosure, the training a detection model according to the sample data to obtain the trained detection model includes: dividing the sample data into a training set and a test set; updating the model parameters of the detection model according to the training set so as to train the detection model; and testing the detection model according to the test set to obtain the trained detection model.

In an exemplary embodiment of the present disclosure, the obtaining a text vector of the application data includes: performing feature extraction on the application data through a multilayer network in a feature extraction model to obtain a text vector of the application data.

In an exemplary embodiment of the present disclosure, the extracting the feature of the application data through a multi-layer network in a feature extraction model to obtain a text vector of the application data includes: extracting the characteristics of the application data through a self-attention mechanism layer in each layer of the network to obtain characteristic vectors; and carrying out full-connection processing on the characteristic vectors based on a feedforward neural network layer in each layer of network to obtain the text vectors of the application data.

According to an aspect of the present disclosure, there is provided an application detecting apparatus including: the data acquisition module is used for acquiring application data of the application program to be detected; the integrated classification module is used for acquiring a text vector of the application data, fitting the text vector through a trained detection model to acquire a plurality of classification values, and performing integrated classification on the classification values to determine a detection result of the application data; and the detection result determining module is used for determining a compliance detection result according to the detection result and performing corresponding operation according to the compliance detection result.

According to an aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the application detection method of any one of the above.

According to an aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any of the above application detection methods via execution of the executable instructions.

In the application detection method, the application detection device, the computer-readable storage medium and the electronic device provided in the embodiments of the present disclosure, on one hand, a trained detection model is used to fit a text vector of application data, and a plurality of classification values are integrated and classified to determine a detection result of the text vector, so that large-scale and fine-grained evaluation detection is performed based on the application data and the integrated classification, problems existing in the application are efficiently and accurately located, and the detection efficiency and accuracy are improved. On the other hand, the detection model is used for receiving multi-dimensional application data as input, multi-dimensional information can be comprehensively considered, the comprehensiveness and accuracy of application program detection are improved, and the application range is enlarged.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It should be apparent that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived by those of ordinary skill in the art without inventive effort.

Fig. 1 schematically illustrates a flowchart of an application detection method according to an embodiment of the present disclosure.

Fig. 2 schematically illustrates a flowchart of acquiring authority data according to an embodiment of the present disclosure.

Fig. 3 schematically illustrates a schematic diagram of obtaining sample data by performing tagging according to an embodiment of the present disclosure.

Fig. 4 schematically illustrates a detailed flow diagram of integrated voting according to an embodiment of the present disclosure.

Fig. 5 schematically illustrates a schematic diagram of obtaining a detection result according to a model according to an embodiment of the present disclosure.

Fig. 6 schematically illustrates a specific flowchart of detecting an application according to an embodiment of the present disclosure.

Fig. 7 schematically shows a block diagram of an application detection apparatus according to an embodiment of the present disclosure.

Fig. 8 schematically illustrates a block diagram of an electronic device according to an embodiment of the disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.

Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

In order to solve the problems in the related art, in the embodiments of the present disclosure, an application detection method is provided. Referring to fig. 1, the application detection method mainly includes the following steps:

Step S110, acquiring application data of an application program to be detected;

Step S120, obtaining a text vector of the application data, fitting the text vector through a trained detection model to obtain a plurality of classification values, and performing integrated classification on the classification values to determine a detection result of the application data;

Step S130, determining the compliance detection result of the application program to be detected according to the detection result.

In the embodiment of the disclosure, the trained detection model is used for fitting the text vector of the application data, the classification values are integrated and classified to determine the detection result of the text vector, and large-scale and fine-grained evaluation and detection are performed based on the application data and the integration and classification to obtain the compliance detection result, so that the problems of the application program are efficiently and accurately positioned, and the detection efficiency and accuracy are improved. In addition, the detection model is used for receiving multi-dimensional application data as input, multi-dimensional information can be comprehensively considered, the comprehensiveness and accuracy of application program detection are improved, and the application range is enlarged.

Next, each step of the application detection method will be described in detail with reference to fig. 1.

In step S110, application data of an application to be detected is acquired.

In the embodiment of the present disclosure, the application to be detected may be various types of applications, such as a shopping application, an interactive application, a game application, and the like, and may include other types of applets associated with some third-party applications. The application to be detected may be an application under various operating systems, and is not limited herein.

The application data refers to data associated with the application program to be detected, and may include, but is not limited to, authority data, user data, operation data, and the like. The authority data may be a decompiled APK file; the user data may be a privacy policy and double-list text; the operation data may be dynamic execution behavior data. The decompiled APK file is obtained by decompiling the downloaded installation package (APK package) of the application program to be detected. The user data and the operation data can be obtained after the APK package is installed. The privacy policy and the double-list text include a list of collected personal information and a list of personal information shared with third parties; the list of collected personal information needs to be updated dynamically in real time, and the collection frequency may be set to once per minute or according to actual requirements. The dynamic execution behavior data includes captured data packets and page screenshots. The data may be captured using Appium, the page screenshots may be obtained by automatic screen capture, and the character information in the page screenshots is then recognized and extracted based on an OCR (Optical Character Recognition) algorithm. For dynamic execution, the software may be run in a sandbox, either by automatic traversal or by manual user operation, or may be run on a connected real device, which is not limited here. A sandbox is a virtual system program, that is, a tool used in network security to test the behavior of untrusted files or applications in an isolated environment. Appium is an automation framework for mobile terminals and can be used to test native applications, mobile web applications and hybrid applications. Appium is cross-platform, so test cases for different platforms can be written with a single set of APIs. For example, operation instructions may be sent to Appium, and Appium drives the mobile device to complete different actions according to the different instructions.

In some embodiments, the decompiled APK file may include, but is not limited to, bytecode files, signature files, manifest files, and resource files. To obtain the decompiled APK file, the installation package (APK) of the application program first needs to be determined. Specifically, the basic information of the mobile APP can be crawled from the application market, the APP installation package can be downloaded, and the decompiled APK file is then generated from the installation package. Based on the Scrapy crawler framework, the returned page data can be parsed and extracted by using regular expressions, an XPath parsing library, CSS selectors and the like. The page data refers to the content of the whole web page. The link information (the download link) and the basic information are then extracted from the page data by using a regular expression or the like; the basic information may include, for example, but is not limited to, the APP version number, the developer, the update time and other application-related information. On this basis, the download link can be passed to the File Pipeline for download management and scheduling. The File Pipeline is a feature provided by Scrapy, an application framework written for crawling website data and extracting structured data; it can be used to turn a link into a downloaded APK installation package and to store the downloaded package in a folder.
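By way of illustration only, the extraction of link information and basic information from page data with regular expressions might be sketched as follows. The page markup, the field names and the patterns are hypothetical and are not taken from the disclosure; real application-market pages will differ.

```python
import re

# Hypothetical page markup; real application-market pages will differ.
PAGE_DATA = """
<div class="app-info">
  <span class="version">Version: 3.2.1</span>
  <span class="developer">Developer: Example Co.</span>
  <a class="download" href="https://market.example.com/dl/demo_app.apk">Download</a>
</div>
"""

# Assumed patterns that only fit the illustrative markup above.
LINK_PATTERN = re.compile(r'href="(https?://[^"]+\.apk)"')
VERSION_PATTERN = re.compile(r"Version:\s*([\d.]+)")
DEVELOPER_PATTERN = re.compile(r"Developer:\s*([^<]+)")


def extract_app_info(page_data: str) -> dict:
    """Extract the download link and basic information from page data."""
    link = LINK_PATTERN.search(page_data)
    version = VERSION_PATTERN.search(page_data)
    developer = DEVELOPER_PATTERN.search(page_data)
    return {
        "download_link": link.group(1) if link else None,
        "version": version.group(1) if version else None,
        "developer": developer.group(1).strip() if developer else None,
    }


info = extract_app_info(PAGE_DATA)
```

A missing field is returned as `None`, so a caller can decide whether the target data was actually extracted before continuing.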

The application data can be used for representing the detection use case, and the application data can comprise decompiled APK files, privacy policies, double-list texts and dynamic execution behavior data, so that the detection use cases of multiple dimensions can be obtained, and the comprehensiveness and the accuracy of the application data are improved.

Fig. 2 schematically shows a flowchart of acquiring authority data. Referring to fig. 2, the flow mainly includes the following steps:

In step S201, a request is simulated. The request may be a request to crawl the page data of a web page.

In step S202, the response is analyzed, and data is extracted. Here, page data of the entire web page is extracted.

In step S203, it is determined whether target data is extracted; if yes, go to step S204; if not, go to step S201. The target data may be link information.

In step S204, the target data is packaged into an Item object and sent to the File Pipeline. An Item object is a simple container for collecting crawled data, and it provides a dictionary-like API.

In step S205, the installation package is downloaded.

In step S206, the installation package is saved to the file system.

In step S207, the data is saved to the database.

In the embodiment of the disclosure, the application data and the basic information can be acquired through the link information and the page data, and the comprehensiveness and the accuracy of the application data are improved.
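The flow of steps S201 to S207 can be sketched, under stated assumptions, with the standard library alone. The function `crawl_package`, the injected `fetch` callable, the `packages` table and the `max_attempts` limit are all hypothetical; the disclosure itself relies on a crawler framework and does not specify a retry limit.

```python
import re
import sqlite3
import tempfile
from pathlib import Path
from typing import Callable, Optional

# Assumed link format: any href ending in ".apk".
LINK_PATTERN = re.compile(r'href="([^"]+\.apk)"')


def crawl_package(
    page_url: str,
    fetch: Callable[[str], bytes],  # hypothetical downloader supplied by the caller
    out_dir: Path,
    db: sqlite3.Connection,
    max_attempts: int = 3,  # assumption: the source does not bound the retries
) -> Optional[Path]:
    """Run steps S201-S207 for one page and return the saved package path."""
    for _ in range(max_attempts):
        page_data = fetch(page_url).decode("utf-8", errors="replace")  # S201, S202
        match = LINK_PATTERN.search(page_data)  # S203: was target data extracted?
        if match is None:
            continue  # no target data, go back to S201
        item = {"page_url": page_url, "file_url": match.group(1)}  # S204
        package_bytes = fetch(item["file_url"])  # S205: download the package
        path = out_dir / item["file_url"].rsplit("/", 1)[-1]
        path.write_bytes(package_bytes)  # S206: save to the file system
        db.execute(
            "INSERT INTO packages (page_url, file_url, path) VALUES (?, ?, ?)",
            (item["page_url"], item["file_url"], str(path)),
        )  # S207: save to the database
        db.commit()
        return path
    return None


# Usage with an in-memory stand-in for the network (hypothetical URLs).
PAGES = {
    "https://market.example.com/app":
        b'<a href="https://market.example.com/dl/demo.apk">Download</a>',
    "https://market.example.com/dl/demo.apk": b"PK\x03\x04",
}
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE packages (page_url TEXT, file_url TEXT, path TEXT)")
saved = crawl_package(
    "https://market.example.com/app", PAGES.__getitem__, Path(tempfile.mkdtemp()), db
)
```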

In step S120, a text vector of the application data is obtained, the text vector is fitted through a trained detection model to obtain a plurality of classification values, and the classification values are integrated and classified to determine a detection result of the application data.

In the embodiment of the disclosure, the feature extraction of the application data can be performed through the feature extraction model to obtain the text vector, and the trained detection model is used for fitting the text vector to obtain the detection result of the application data. The feature extraction model can be connected with the trained detection model in series to obtain the detection result of the application data.

In some embodiments, the text vector may be a vector representing the detection use case that corresponds to the application data of the application to be detected, so that each detection use case corresponds to a text vector. The feature extraction model may be a pre-trained BERT model. Specifically, feature extraction is performed on the application data through the feature extraction model to obtain a feature vector, and full-connection processing is performed on the feature vector to obtain the text vector of the application data. The BERT model may use the encoder layers of a bidirectional Transformer model for feature extraction. Illustratively, the Transformer model is generally used with multiple stacked layers; that is, the feature extraction model includes a multilayer network, and each layer of the network may include two sub-layers, referred to here as the first layer and the second layer. The first layer may be a multi-head attention mechanism layer and the second layer may be a feedforward neural network layer. Using a multi-head attention mechanism layer allows features to be extracted from different perspectives, which improves the comprehensiveness and accuracy of feature extraction. When the current word is computed, the words in its context are used at the same time, so that long-distance dependencies between words can be extracted. Because the computation for each word does not depend on the computation for other words, the features of all words can be computed in parallel, giving the model strong language representation and feature extraction capability.
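To make the attention step concrete, the following is a minimal single-head sketch of scaled dot-product self-attention. It is a simplification and not the disclosed model: it assumes that the queries, keys and values are all the input itself (no learned projection matrices) and uses one head rather than several.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix) -> Matrix:
    """Single-head self-attention with Q = K = V = x (assumed, no learned weights)."""
    d = len(x[0])
    out = []
    for query in x:
        # Score the current word against every word in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)
        # Each output row is a weighted mix of all word vectors (the context).
        out.append([sum(w * value[j] for w, value in zip(weights, x)) for j in range(d)])
    return out


# Two toy word vectors chosen only for illustration.
attended = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

Each output row depends only on its own query and the shared keys and values, which is why the rows can be computed in parallel.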

The feedforward neural network layer may be a two-layer fully connected structure and may be activated by a ReLU function. The feedforward neural network layer is used to integrate all the feature vectors, that is, to perform a two-layer linear mapping on the output of the previous layer with a nonlinear activation function applied in between. The fully connected layer represented by the feedforward neural network layer serves to map the learned distributed feature representation to the sample label space, which may be implemented by a convolution operation. If the input of the feedforward neural network layer (the second layer) is x, the output of this sub-layer is given by formula (1):

$$\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)\,W_2 + b_2 \qquad (1)$$

Where x is the input matrix, $W_i$ represents the weight matrix of the i-th linear operation, and $b_i$ is the corresponding bias (intercept) term.
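As a worked instance of formula (1), the following sketch evaluates the feedforward sub-layer in plain Python. The helper names and the toy weights are chosen only for illustration and are not part of the disclosure.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def ffn(x: Matrix, w1: Matrix, b1: List[float], w2: Matrix, b2: List[float]) -> Matrix:
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row of x."""
    # First linear mapping followed by the ReLU activation.
    hidden = [[max(0.0, v + b) for v, b in zip(row, b1)] for row in matmul(x, w1)]
    # Second linear mapping.
    return [[v + b for v, b in zip(row, b2)] for row in matmul(hidden, w2)]


# Toy values chosen only for illustration.
x = [[1.0, -2.0]]
w1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [[2.0], [3.0]]
b2 = [0.5]
out = ffn(x, w1, b1, w2, b2)  # ReLU zeroes the -2.0, so out == [[2.5]]
```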

As noted above, the Transformer is usually used by stacking multiple layers, and each layer is divided into two sub-layers, namely a multi-head attention mechanism layer and a feedforward neural network layer. If the input of a sub-layer is x, the output of that sub-layer is given by formula (2):

$$y = \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x)) \qquad (2)$$

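Where Sublayer(x) is the function implemented by the sub-layer itself, that is, by the multi-head attention mechanism layer or by the feedforward neural network layer.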
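Formula (2) can likewise be sketched in a few lines. This is an illustration under assumptions: the layer normalization below omits the learned scale and shift parameters that practical implementations usually carry, and the toy sub-layer merely stands in for attention or the feedforward network.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(v: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def apply_sublayer(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """y = LayerNorm(x + Sublayer(x)): residual connection, then normalization."""
    residual = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual)


# Toy sub-layer standing in for attention or the feedforward network.
y = apply_sublayer([1.0, 2.0, 3.0], lambda v: [2.0 * a for a in v])
```

The residual term x + Sublayer(x) lets each sub-layer refine its input rather than replace it, which is the usual motivation for this arrangement.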

In the multilayer network of the feature extraction model, each layer of the network may include a multi-head self-attention mechanism layer and a feedforward neural network layer. Feature extraction may be performed on the application data through the multi-head self-attention mechanism layer in the first layer of the network to obtain a first feature vector, and full-connection processing may be performed on the first feature vector by the feedforward neural network layer in the first layer of the network to obtain a first text vector of the application data. Feature extraction is then performed on the first text vector through the multi-head self-attention mechanism layer in the second layer of the network to obtain a second feature vector, and full-connection processing is performed on the second feature vector by the feedforward neural network layer in the second layer of the network to obtain a second text vector of the application data. The data is processed in this way, in sequence, by the multi-head self-attention mechanism layer and the feedforward neural network layer of each layer of the network until all layers have been executed, so as to obtain the text vector of the application data.

After the text vector of the application program to be detected is obtained, classification prediction can be performed on the text vector according to the detection model connected with the feature extraction model in series, so that a prediction result is obtained. In order to ensure the accuracy of the model processing, the detection model may be first subjected to model training to obtain a trained detection model. In the disclosed embodiment, the detection model may include a first model, a second model, and a third model. The first model may be a CNN (Convolutional Neural Networks) learner, the second model may be an RNN (Recurrent Neural Networks) learner, and the third model may be an LSTM (Long Short-Term Memory) learner.

In some embodiments, the model parameters of the first model, the second model, and the third model may be adjusted according to the sample data to train the first model, the second model, and the third model, respectively. In particular, the sample data may be partitioned into a training set and a test set, for example by randomly dividing the sample data into the training set and the test set in a given proportion. Further, the CNN model may be trained on the training set to obtain a CNN learner; the RNN model may be trained on the training set to obtain an RNN learner; and the LSTM model may be trained on the training set to obtain an LSTM learner. A trained detection model is then generated from the CNN learner, the RNN learner and the LSTM learner.
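A minimal sketch of the random proportional split follows. The 80/20 proportion, the fixed seed and the function name `split_samples` are assumptions made for illustration; the disclosure does not fix a proportion.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(
    samples: Sequence[T], test_ratio: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Randomly divide samples into (training set, test set) by test_ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled samples: (text, label).
samples = [(f"text {i}", i % 2) for i in range(10)]
train_set, test_set = split_samples(samples, test_ratio=0.2)
```

The same training set would then be passed to each of the three models, so that the learners differ in architecture rather than in the data they see.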

The first model, the second model, and the third model may be trained simultaneously or sequentially, and the training order of the models is not particularly limited here.

After the trained detection model is obtained, the trained detection model can be verified based on the test set to judge whether the performance of the trained detection model meets the performance requirements. Performance can be evaluated by model accuracy. If the accuracy of the trained detection model is smaller than the preset value, the model parameters of the detection model need to be continuously adjusted. And if the accuracy of the trained detection model is greater than the preset value, determining the trained detection model according to the current model parameters.
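The accuracy check against the preset value might look like the sketch below. The preset of 0.9 and the function names are hypothetical; the text does not state what happens when the accuracy exactly equals the preset value, and this sketch treats that case as not yet meeting the requirement.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of test samples whose predicted class matches the label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def meets_requirement(
    predictions: Sequence[int], labels: Sequence[int], preset: float = 0.9
) -> bool:
    """True when accuracy is greater than the preset value (equality: assumed False)."""
    return accuracy(predictions, labels) > preset


# Toy test-set results: three of four predictions are correct.
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

When `meets_requirement` returns `False`, training would continue with further parameter adjustment; when it returns `True`, the current parameters define the trained detection model.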

In some embodiments, the sample data required for model training may be obtained in the following manner: acquiring reference application data of an application program, and screening the reference application data based on a judgment rule to generate the sample data. The application program here is an application program used for training the model; it may be the same as or different from the application program to be detected, which is not specifically limited herein. The reference application data is data corresponding to this application program, and its type may be the same as the type of the application data of the application program to be detected. Specifically, it may include, but is not limited to, authority data, user data, operation data, and the like of the application program. The authority data may be a decompiled APK file; the user data may be a privacy policy and double-list text; the operation data may be dynamic execution behavior data. The reference application data is acquired in the same manner as the application data, which is not described again here.

It should be noted that after the reference application data is acquired, it may be screened according to the evaluation rule (that is, the judgment rule), so that the part of the reference application data that meets the evaluation rule is selected as sample data. Screening here may be understood as annotation. For example, evaluation rules may be obtained from a rule set, and the reference application data is then labeled based on the evaluation rules to obtain the sample data. Specifically, a rule set may be obtained first. The rule set may be determined according to the various application standard rules published by the relevant organizations, such as rule 1, rule 2, rule 3, and so on. Moreover, the rule set can be updated over time to ensure its accuracy and timeliness. The rule set may be parsed to make real-time adjustments to the detection use cases. The rule set may include evaluation rules and non-evaluation rules, and the evaluation rules are obtained from the rule set in order to obtain the sample data.

In some embodiments, the collected reference application data may be preprocessed and annotated to create an annotated corpus. The data preprocessing includes word segmentation, sentence length statistics, noise elimination, and the like, and is not specifically limited here. The evaluation rules may be crawled from relevant websites or applications. The evaluation rules may include, but are not limited to, published notices of applications that violate regulations or infringe user rights and interests, test reports from professional organizations, or other types of rules. On this basis, the reference application data may be labeled according to the evaluation rules, and the labeled data is used as the sample data. In addition, the reference application data may be labeled automatically with a labeling tool such as brat, doccano or marktool, or labeled manually, to obtain an annotated corpus that is used as the sample data.

Referring to fig. 3, reference application data 304 may be acquired from the authority data 301, the user data 302, and the operation data 303, and the reference application data 304 is annotated according to the evaluation rules 305 to obtain sample data 306.
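A rule-based annotation pass of the kind shown in fig. 3 could be sketched as follows. The two rules, the record fields and the label convention (1 for a record that violates at least one rule, 0 otherwise) are hypothetical examples, not rules taken from any published rule set.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class EvaluationRule:
    name: str
    violated_by: Callable[[Dict[str, Any]], bool]  # hypothetical predicate


# Hypothetical rules; real rules would come from the parsed rule set.
RULES = [
    EvaluationRule(
        "undeclared_location_permission",
        lambda d: "ACCESS_FINE_LOCATION" in d["permissions"]
        and "location" not in d["privacy_policy"].lower(),
    ),
    EvaluationRule(
        "missing_third_party_list",
        lambda d: not d["third_party_list"],
    ),
]


def label_reference_data(
    records: List[Dict[str, Any]], rules: List[EvaluationRule]
) -> List[Dict[str, Any]]:
    """Annotate each record with the rules it violates and a binary label."""
    labeled = []
    for record in records:
        violations = [rule.name for rule in rules if rule.violated_by(record)]
        labeled.append(
            {"data": record, "violations": violations, "label": 1 if violations else 0}
        )
    return labeled


# Toy reference application data combining authority data and user data.
records = [
    {"permissions": ["ACCESS_FINE_LOCATION"],
     "privacy_policy": "We collect your phone number.",
     "third_party_list": ["ad SDK"]},
    {"permissions": ["CAMERA"],
     "privacy_policy": "We use the camera and location.",
     "third_party_list": ["map SDK"]},
]
sample_data = label_reference_data(records, RULES)
```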

In the embodiment of the disclosure, the evaluation rule is acquired from the rule set, the reference application data is labeled based on the evaluation rule to obtain the sample data, and the detection model is trained on the sample data, so that the accuracy and reliability of the detection model can be improved. Static analysis and dynamic analysis are combined: the privacy policy, the double-list text, the permission requests, the API calls and the behavior information during dynamic operation are extracted as mixed feature vectors, so that problems in the APP are located efficiently and accurately from multiple dimensions, and the accuracy and comprehensiveness of the sample data are improved.

After the trained detection model is obtained, fitting prediction can be performed on the text vector of the application data based on the trained detection model, and a plurality of classification values are obtained. It should be noted that each application to be detected may include multiple detection use cases with multiple dimensions, and each text vector of the application data may be used to represent a vector of a detection use case. Based on the above, each detection case can be individually fitted and predicted through the trained detection model, so that the detection result of each detection case is obtained.

Since the trained detection model may include a plurality of models, the plurality of classification values correspond to the plurality of models one-to-one. Illustratively, the text vector may be fitted through the first model to obtain a first classification value, fitted through the second model to obtain a second classification value, and fitted through the third model to obtain a third classification value. The first model, the second model, and the third model may fit the text vector at the same time or one after another, which is not specifically limited herein. A classification value is the output obtained by passing the text vector through one of the models, and represents the prediction made individually by that model in the trained detection model.

Because a single learner is prone to under-fitting or over-fitting, a plurality of individual learners can be trained and then combined through a combination strategy to form a strong learner with good generalization performance.

Referring to fig. 4, a CNN learner 401 is obtained by training a CNN model on the training data; an RNN learner 402 is obtained by training an RNN model on the training data; and an LSTM learner 403 is obtained by training an LSTM model on the training data. In the detection model 400, the CNN learner 401 outputs classification value 1, the RNN learner 402 outputs classification value 2, and the LSTM learner 403 outputs classification value 3. The integrated classifier may then perform integrated voting on the plurality of classification values to generate the detection result as the output result 404. The detection result may be a compliance detection result or a non-compliance detection result, which may be specifically determined according to actual requirements.

Integrated voting fuses the outputs of the independently predicting models to obtain an overall detection result for each detection case of the application program to be detected. Integrated voting may include regression voting and classification voting. In regression voting, the prediction result is the average of the prediction results of all models. In classification voting, the prediction result is the class that occurs most often among the predictions of all models. Classification voting can be further divided into hard voting and soft voting: the prediction result of hard voting is the class that receives the most votes, and the prediction result of soft voting is the class whose predicted probabilities, summed over all models, are highest. In the embodiment of the present disclosure, any of these voting methods may be used to fuse the classification values generated by the models in the trained detection model to obtain the final detection result. The detection result may be represented, for example, as a classification value indicating the category of the application to be detected, such as whether it is a compliance detection result or a non-compliance detection result. The detection result can be used to evaluate whether the application program collects user information in violation of the rules, which facilitates accurate detection.
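
The three voting methods can be illustrated with a minimal sketch. The function names, the class labels "compliant" and "non_compliant", and the probability values are assumptions chosen for illustration and do not come from the disclosure.

```python
from collections import Counter
from typing import Dict, Sequence

def regression_vote(predictions: Sequence[float]) -> float:
    """Regression voting: average of all model predictions."""
    return sum(predictions) / len(predictions)

def hard_vote(labels: Sequence[str]) -> str:
    """Hard voting: the class predicted by the most models."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probabilities: Sequence[Dict[str, float]]) -> str:
    """Soft voting: the class with the highest summed probability."""
    totals: Dict[str, float] = {}
    for per_model in probabilities:
        for label, p in per_model.items():
            totals[label] = totals.get(label, 0.0) + p
    return max(totals, key=totals.get)

# Illustrative class probabilities from three learners (made-up numbers).
probs = [
    {"compliant": 0.55, "non_compliant": 0.45},  # e.g. CNN learner
    {"compliant": 0.60, "non_compliant": 0.40},  # e.g. RNN learner
    {"compliant": 0.10, "non_compliant": 0.90},  # e.g. LSTM learner
]
# Each learner's hard label is its most probable class.
labels = [max(p, key=p.get) for p in probs]

hard_result = hard_vote(labels)
soft_result = soft_vote(probs)
mean_compliant = regression_vote([p["compliant"] for p in probs])
```

In this example the two classification voting methods disagree: two of the three learners vote "compliant", so hard voting returns "compliant", while the third learner is confident enough that the summed probabilities favor "non_compliant" under soft voting. With three learners and two classes, hard voting cannot tie.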

Fig. 5 schematically shows a flow chart of processing according to a model, and referring to fig. 5, application data 501 of an application program to be detected may be input to a feature extraction model 502, and a text vector 504 of the application data may be obtained through a plurality of networks 503 in the feature extraction model. The text vector is input into the detection model 505, the classification value 1 is obtained through the first model 5051, the classification value 2 is obtained through the second model 5052, the classification value 3 is obtained through the third model 5053, and a plurality of classification values such as the classification value 1, the classification value 2, the classification value 3 and the like are integrated and classified to obtain the detection result 506.

In the embodiment of the disclosure, a compliance detection model for the collection and use of user information by APPs is established based on a machine learning (CNN-RNN-LSTM) integration method. APP compliance detection judgment rules are formed from a rule set composed of various rules; the collected decompiled APK files, privacy policies, double-list texts, and dynamic execution behavior data are used together to train the model; and the detection result is then determined by integrated classification. The compliance of APP collection and use of user information can thus be evaluated automatically, at large scale and at fine granularity. Static analysis and dynamic analysis are combined, and privacy policies, double-list texts, permission requests, API calls, and behavior information during dynamic execution are extracted as mixed feature vectors, so that problems of the APP can be located efficiently and accurately from multiple dimensions, improving both the accuracy and the efficiency of application program compliance detection. A BERT model is used for pre-training, which considers the context both before and after a word; three neural network algorithms are used together; and the classification values output by the three learners are voted on in an integrated classifier to determine the detection result, which can improve accuracy.

In step S130, a compliance detection result is determined according to the detection result, and a corresponding operation is performed according to the compliance detection result.

In the embodiment of the present disclosure, the application to be detected may have a plurality of detection cases, and the detection result of each detection case may be either a compliance detection result or a non-compliance detection result. On the basis of step S120, the compliance detection results may be screened from all the detection results of the application to be detected. The compliance detection results may be some or all of the detection results. Screening may be performed according to a screening rule: a detection result that satisfies the screening rule is determined to be a compliance detection result, and one that does not is determined to be a non-compliance detection result. The screening rule may be configured according to actual requirements and is not specifically limited herein. In this way, the compliance detection results can be screened out of all the detection results, and the corresponding operation can be performed based on them. The corresponding operation may be statistical analysis or use as an auxiliary reference; it is set according to the application scenario and is not particularly limited herein.
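
A minimal sketch of screening followed by a simple statistical analysis is given below. The `DetectionResult` fields, the dimension names, and the rule `is_compliant` are hypothetical; the disclosure leaves the screening rule configurable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class DetectionResult:
    case_id: str    # hypothetical identifier of the detection case
    dimension: str  # e.g. "permission", "privacy_policy", "double_list"
    label: str      # "compliant" or "non_compliant"

ScreeningRule = Callable[[DetectionResult], bool]

def screen(
    results: List[DetectionResult], rule: ScreeningRule
) -> Tuple[List[DetectionResult], List[DetectionResult]]:
    """Split results into those that satisfy the rule and those that do not."""
    passed = [r for r in results if rule(r)]
    failed = [r for r in results if not rule(r)]
    return passed, failed

def compliance_rate_by_dimension(
    results: List[DetectionResult], rule: ScreeningRule
) -> Dict[str, float]:
    """One possible statistical analysis: share of passing cases per dimension."""
    totals: Dict[str, int] = {}
    passed: Dict[str, int] = {}
    for r in results:
        totals[r.dimension] = totals.get(r.dimension, 0) + 1
        passed[r.dimension] = passed.get(r.dimension, 0) + (1 if rule(r) else 0)
    return {d: passed[d] / totals[d] for d in totals}

# Assumed screening rule: keep results labeled compliant.
def is_compliant(result: DetectionResult) -> bool:
    return result.label == "compliant"

all_results = [
    DetectionResult("c1", "permission", "compliant"),
    DetectionResult("c2", "permission", "non_compliant"),
    DetectionResult("c3", "privacy_policy", "compliant"),
    DetectionResult("c4", "double_list", "non_compliant"),
]
compliant, non_compliant = screen(all_results, is_compliant)
rates = compliance_rate_by_dimension(all_results, is_compliant)
```

Passing the rule in as a function keeps the screening step independent of any particular rule, which matches the statement that the rule is configured according to actual requirements.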

In the embodiment of the disclosure, the detection and evaluation criteria cover permissions, privacy policy text, double lists and the like, and the application data collected accordingly includes decompiled APK files, privacy policy and double-list texts, and dynamic execution behavior data. Because the application data has multiple dimensions, detection cases in multiple dimensions such as permissions, privacy policy text, and double lists can be detected; the coverage is wide, and comprehensiveness and reliability are improved.

Fig. 6 schematically shows a specific flowchart of application detection, and referring to fig. 6, the specific flowchart mainly includes the following steps:

In step S601, a rule set is obtained, which may be a set of compliance detection criteria for the collection and use of user information by applications.

In step S602, a detection model is constructed based on the CNN-RNN-LSTM integration method.

In step S603, basic information of the application is crawled and the installation package APK is downloaded.

In step S604, application data of the APP is collected, where the application data includes decompiled APK files, privacy policies and double-list texts, and dynamic execution behavior data.

In step S605, the collected application data is preprocessed and labeled to establish a labeled corpus.

In step S606, the labeled corpus is divided into a training set and a test set, and pre-trained using a feature extraction model.

In step S607, the detection model is trained and tested.

In step S608, the compliance detection result is output and statistical analysis is performed.
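
The division of the labeled corpus into a training set and a test set (step S606) and the testing in step S607 can be sketched as follows. The split ratio, the random seed, the sample texts, and the keyword-based predictor are all assumptions made for illustration; the predictor merely stands in for the trained detection model.

```python
import random
from typing import Callable, List, Tuple

Sample = Tuple[str, int]  # (text, label); 1 = non-compliant, 0 = compliant

def split_corpus(
    corpus: List[Sample], test_ratio: float = 0.2, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle reproducibly, then split into (training set, test set)."""
    shuffled = list(corpus)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict: Callable[[str], int], test_set: List[Sample]) -> float:
    """Fraction of test samples whose predicted label matches the annotation."""
    correct = sum(1 for text, label in test_set if predict(text) == label)
    return correct / len(test_set)

# Made-up labeled corpus: odd-numbered policies are annotated non-compliant.
corpus: List[Sample] = [
    (
        f"policy {i} shares data without consent"
        if i % 2
        else f"policy {i} asks for consent first",
        i % 2,
    )
    for i in range(10)
]

# Placeholder for the trained detection model.
def keyword_predictor(text: str) -> int:
    return 1 if "without consent" in text else 0

train_set, test_set = split_corpus(corpus, test_ratio=0.2, seed=0)
test_accuracy = accuracy(keyword_predictor, test_set)
```

Fixing the seed makes the split reproducible, so that repeated training and testing runs are evaluated on the same held-out samples.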

According to the technical scheme, a compliance detection model for the collection and use of user information by APPs is built based on a machine learning integration method. The machine learning algorithm takes the multidimensional features of the training samples as input, is not limited by existing fixed discrimination logic, comprehensively considers features related to privacy compliance, and discovers the internal relation between the features of multidimensional application data and the detection cases for illegal collection and use of user information. Large-scale, fine-grained evaluation and detection can be performed. Detection efficiency is high, labor cost can be reduced, and problems of the APP can be located efficiently and accurately, which addresses the low detection efficiency, high labor cost, and uneven report quality encountered when detecting illegal collection and use of user information by APPs. In addition, statistical analysis can be carried out on the basis of the compliance detection result, and the output can serve as a reference for formulating personal information security policies and for scientific and effective review by supervisory departments, which improves convenience. Furthermore, because application data of multiple dimensions is collected, such as the decompiled APK file, the privacy policy, the double-list text, and the dynamic execution behavior data, detection cases such as permissions, privacy policy text, and double lists can be detected; the coverage is wide and comprehensiveness is improved.

The disclosure also provides an application program detection device. Referring to fig. 7, the application detection device 700 mainly includes the following modules:

a data obtaining module 701, configured to obtain application data of an application to be detected;

an integrated classification module 702, configured to obtain a text vector of the application data, fit the text vector through a trained detection model to obtain a plurality of classification values, and perform integrated classification on the plurality of classification values to determine a detection result of the application data;

a detection result determining module 703, configured to determine a compliance detection result according to the detection result, and perform a corresponding operation according to the compliance detection result.
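
The cooperation of the three modules can be sketched as a simple composition. The class name, the method `detect`, and the stub callables are hypothetical; they only show how the output of one module feeds the next and are not the disclosed implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ApplicationDetectionDevice:
    obtain_data: Callable[[str], List[str]]      # role of module 701
    classify: Callable[[List[str]], List[str]]   # role of module 702
    determine: Callable[[List[str]], List[str]]  # role of module 703

    def detect(self, app_id: str) -> List[str]:
        """Chain the modules: data -> detection results -> compliance results."""
        application_data = self.obtain_data(app_id)
        detection_results = self.classify(application_data)
        return self.determine(detection_results)

# Stub modules with made-up behavior, for illustration only.
def stub_obtain(app_id: str) -> List[str]:
    return ["privacy policy text", "permission: READ_CONTACTS"]

def stub_classify(data: List[str]) -> List[str]:
    return ["non_compliant" if "READ_CONTACTS" in item else "compliant" for item in data]

def stub_determine(results: List[str]) -> List[str]:
    return [r for r in results if r == "compliant"]

device = ApplicationDetectionDevice(stub_obtain, stub_classify, stub_determine)
compliance_results = device.detect("demo-app")
```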

In an exemplary embodiment of the disclosure, the extracting the feature of the application data through a multi-layer network in a feature extraction model to obtain a text vector of the application data includes: extracting the characteristics of the application data through a self-attention mechanism layer in each layer of the network to obtain characteristic vectors; and carrying out full-connection processing on the characteristic vectors based on a feedforward neural network layer in each layer of network to obtain the text vectors of the application data.
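
One layer consisting of a self-attention sub-layer followed by a feed-forward sub-layer can be sketched in plain Python as follows. This is a simplified, single-head sketch under stated assumptions: the weight matrices are illustrative inputs, and the residual connections, layer normalization, bias terms, and multiple attention heads used in BERT-style encoders are omitted.

```python
import math
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x: Matrix, wq: Matrix, wk: Matrix, wv: Matrix) -> Matrix:
    """Scaled dot-product self-attention over token vectors x."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(k[0]))
    scores = [
        [sum(qi * ki for qi, ki in zip(q_row, k_row)) / scale for k_row in k]
        for q_row in q
    ]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v)

def feed_forward(x: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    """Position-wise fully connected processing with a ReLU in between."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def encoder_layer(x: Matrix, wq: Matrix, wk: Matrix, wv: Matrix,
                  w1: Matrix, w2: Matrix) -> Matrix:
    """Self-attention layer followed by the feed-forward layer."""
    return feed_forward(self_attention(x, wq, wk, wv), w1, w2)

# Two tokens with two features each; identity weights keep the example readable.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
output = encoder_layer(tokens, identity, identity, identity, identity, identity)
```

With identity weights, each output row equals that token's attention weights, so every row sums to 1 and each token attends most strongly to itself. Stacking several such layers gives the multi-layer network referred to above.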

It should be noted that, the specific details of each module in the application detection apparatus have been described in detail in the corresponding application detection method, and therefore are not described herein again.

It should be noted that although several modules or units of the device for performing actions are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.

Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."

An electronic device 800 according to this embodiment of the disclosure is described below with reference to fig. 8. The electronic device 800 shown in fig. 8 is only an example and should not bring any limitations to the functionality and scope of use of the embodiments of the present disclosure.

As shown in fig. 8, the electronic device 800 is in the form of a general purpose computing device. The components of the electronic device 800 may include, but are not limited to: at least one processing unit 810, at least one storage unit 820, a bus 830 that couples various system components including the storage unit 820 and the processing unit 810, and a display unit 840.

The storage unit 820 stores program code that is executable by the processing unit 810 to cause the processing unit 810 to perform steps according to various exemplary embodiments of the present disclosure as described in the "exemplary methods" section above in this specification. For example, the processing unit 810 may perform the steps as shown in fig. 1.

The storage unit 820 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 8201 and/or a cache memory unit 8202, and may further include a read only memory unit (ROM) 8203.

Storage unit 820 may also include a program/utility module 8204 having a set (at least one) of program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 830 may be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 800 may also communicate with one or more external devices 900 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 800, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 800 to communicate with one or more other computing devices. Such communication may occur over input/output (I/O) interfaces 850. Also, the electronic device 800 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 860. As shown, the network adapter 860 communicates with the other modules of the electronic device 800 via the bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or an electronic device, etc.) to execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.

According to the program product for implementing the above method of the embodiments of the present disclosure, it may employ a portable compact disc read only memory (CD-ROM) and include program codes, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not so limited, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

Furthermore, the above-described drawings are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims

1. An application detection method, comprising:

acquiring application data of an application program to be detected;

acquiring a text vector of the application data, fitting the text vector through a trained detection model to acquire a plurality of classification values, and performing integrated classification on the classification values to determine a detection result of the application data;

and determining a compliance detection result according to the detection result, and performing corresponding operation according to the compliance detection result.

2. The application detection method of claim 1, wherein the detection model comprises a first model, a second model, and a third model; the fitting of the text vector through the trained detection model to obtain a plurality of classification values, and the integrated classification of the classification values to determine the detection result of the application data includes:

respectively extracting features of the text vector through a first model, a second model and a third model in the detection model to obtain a plurality of classification values;

and performing integrated voting on the plurality of classification values to determine a detection result of the text vector.

3. The application detection method of claim 1, further comprising:

acquiring reference application data of an application program, and screening the reference application data based on a judgment rule to generate sample data;

and training a detection model according to the sample data to obtain the trained detection model.

4. The method according to claim 3, wherein the obtaining reference application data of the application program comprises:

acquiring page data of the application program, and extracting link information according to the page data;

and acquiring an installation package of the application program according to the link information, and acquiring the reference application data based on the installation package.

5. The method according to claim 3, wherein the training a detection model according to the sample data to obtain the trained detection model comprises:

dividing the sample data into a training set and a test set;

updating the model parameters of the detection model according to the training set so as to train the detection model;

and testing the detection model according to the test set to obtain the trained detection model.

6. The method of claim 1, wherein the obtaining the text vector of the application data comprises:

and performing feature extraction on the application data through a multilayer network in a feature extraction model to obtain a text vector of the application data.

7. The method for detecting an application program according to claim 6, wherein the extracting the feature of the application data through a multi-layer network in a feature extraction model to obtain a text vector of the application data comprises:

extracting the characteristics of the application data through a self-attention mechanism layer in each layer of the network to obtain characteristic vectors;

and carrying out full-connection processing on the characteristic vectors based on a feedforward neural network layer in each layer of network to obtain the text vectors of the application data.

8. An application detection apparatus, comprising:

the data acquisition module is used for acquiring application data of the application program to be detected;

the integrated classification module is used for acquiring a text vector of the application data, fitting the text vector through a trained detection model to acquire a plurality of classification values, and performing integrated classification on the classification values to determine a detection result of the application data;

and the detection result determining module is used for determining a compliance detection result according to the detection result and carrying out corresponding operation according to the compliance detection result.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the application detection method of any one of claims 1 to 7.

10. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the application detection method of any of claims 1 to 7 via execution of the executable instructions.