CN112395884A - Android API semantic relation map construction method based on code document - Google Patents

Android API semantic relation map construction method based on code document

Info

Publication number
CN112395884A
Authority
CN
China
Prior art keywords
api
android
semantic
relationship
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011274561.2A
Other languages
Chinese (zh)
Other versions
CN112395884B (en)
Inventor
杨珉
张源
张晓寒
张谧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-11-15
Filing date
2020-11-15
Publication date
2021-02-23
Application filed by Fudan University
Priority to CN202011274561.2A
Publication of CN112395884A
Application granted
Publication of CN112395884B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/547 Remote procedure calls [RPC]; Web services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Stored Programmes (AREA)

Abstract

The invention belongs to the technical field of mobile security, and in particular relates to a method for constructing an Android API semantic relation map from code documentation. The Android system exposes a large number of APIs, and rich semantic relationships exist among them. These relationships have important application value: in machine learning tasks that use APIs as input features, they can give the model stronger generalization. The method comprises three parts: classifying Android API semantic relationships; summarizing templates that can express API relationships in a generalized way and generating those templates iteratively; and constructing the relation map based on natural language processing and the templates. The method can construct the semantic relation map between Android APIs comprehensively and accurately.

Description

Android API semantic relation map construction method based on code document
Technical Field
The invention belongs to the technical field of mobile security, and in particular relates to Android security and natural language processing methods.
Background
The Android system is currently the most popular mobile operating system. According to data reported by International Data Corporation (IDC), Android held 86.6% of the mobile phone market as of 2019. Besides smartphones and tablets, Android is also widely used in other smart devices, such as wearables, smart cars, smart home devices, and various IoT devices.
The Android system provides a large number of Application Programming Interfaces (APIs). Taking Android 10.0, the latest version at the time of filing, as an example, this version has nearly 60,000 APIs. Through these APIs, an Android program can access system resources and services, interact with the system and the user, and provide its functions.
Many semantic relationships are implied between Android APIs. For example, two APIs may use the same permission, or one API may have to be called after another. However, the Android APIs are organized only by package, class, and similar structures, which neither reflect these semantic relationships effectively nor classify them systematically.
The semantic relationships of Android APIs have important application value. A great deal of current work builds on Android APIs, such as malicious behavior analysis and malware detection, and machine learning based methods are especially prevalent. However, these methods all treat each API as an independent input feature and ignore the semantic relationships between APIs, which reduces the generalization of the machine learning model. For example, Android malware evolves rapidly to circumvent existing analysis methods, but to keep its malicious behavior consistent across old and new versions it often substitutes APIs that are semantically very similar to the ones it replaces. If the model learns the semantic similarity between APIs, the malware detection rate can be improved and the generalization of the model enhanced.
An API semantic relation map can be constructed from the Android code documentation using natural language processing techniques. Natural language processing is becoming one of the most important technologies of the information age and is widely used in fields such as language translation and question answering. It helps a computer process and understand natural language; for Android APIs, it helps to understand the semantic information in API descriptions and thus to construct the API semantic relation map accurately.
Disclosure of Invention
The invention aims to provide a code-document-based method for constructing an Android API semantic relation map that captures the semantic relationships between Android APIs comprehensively and accurately, and thereby to provide help or inspiration for fields such as Android malicious behavior analysis and malware detection.
In the method provided by the invention, API semantic relationships are mined from the official Android code documentation. Specifically, the Android API semantic relationships are classified, templates that can express API relationships in a generalized way are summarized together with an iterative method for generating them, and the API semantic relation map is constructed from the official Android code documentation based on natural language processing and the templates.
The overall architecture of the construction method is shown in FIG. 1. It consists of three parts: an Android API semantic relationship classification method; templates that can express API relationships in a generalized way, together with their iterative generation method; and a relation map construction method based on natural language processing and the templates. The specific steps are as follows.
(I) Android API semantic relationship classification
The official Android documentation was studied first and found to contain rich information, with the API as its most basic element. Besides APIs, the documentation also covers classes, packages, and permissions. Classes and packages are basic elements of Java, while a permission describes the resources an Android API needs during execution; all of these are semantically related to the APIs.
The invention classifies the semantic relationships between an API and other APIs, classes, packages, and permissions, and defines 5 major categories of relationships: structural, prototype, usage, reference, and permission relationships. They are described as follows.
(1) Structural relationships. This category describes how APIs, classes, packages, and permissions relate in terms of code organization. The Android API documentation is hierarchical, organized from top to bottom by package, class, and API. Each package contains a number of classes, and each class is a separate HTML file, so a class can be regarded as the smallest file unit of the API documentation; each class document describes the basic hierarchy information and all APIs of that class. For example, a class belongs to a package; similarly, an API belongs to a class, and one class inherits from another.
(2) Prototype relationships. This category describes relationships involving an API's parameter types, return value type, and thrown exception types. The parameter types specify the types of the values passed to the API when it is called, the return value type specifies the type of value returned after the call, and the thrown exception types specify which exceptions must be caught if an error occurs during the call. All of these are part of the API's usage information and directly reflect what the API does.
(3) Usage relationships. This category describes the usage specifications and methods of an API. Some Android APIs are similar in function and usage, and the usage relationships capture the alternatives mentioned in the descriptions, such as one API being functionally similar to another and replaceable by it. In practice, besides calling an API correctly according to its prototype, the usage conditions between APIs also matter, such as the use of one API depending on another.
(4) Reference relationships. This category describes one API referencing another API, class, package, or permission, for example, a relationship indicating that using one API requires consulting another.
(5) Permission relationships. This category concerns the permissions an API uses. A permission describes the resources the Android API needs during execution and often reflects the function and sensitivity of the API.
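The four entity kinds and five relationship categories described above can be captured in a small data model. The following is a minimal sketch under stated assumptions, not the patent's implementation: the names `EntityKind`, `RelationCategory`, `Entity`, and `Relation` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class EntityKind(Enum):
    """The four kinds of node that appear in the documentation."""
    API = "api"
    CLASS = "class"
    PACKAGE = "package"
    PERMISSION = "permission"


class RelationCategory(Enum):
    """The five major relationship categories."""
    STRUCTURAL = "structural"
    PROTOTYPE = "prototype"
    USAGE = "usage"
    REFERENCE = "reference"
    PERMISSION = "permission"


@dataclass(frozen=True)
class Entity:
    kind: EntityKind
    name: str


@dataclass(frozen=True)
class Relation:
    """One directed edge of the relation map: source -> target."""
    source: Entity
    category: RelationCategory
    target: Entity


# Example edge: an API belongs to a class (a structural relationship).
socket_api = Entity(EntityKind.API, "java.net.Socket.getInputStream()")
socket_class = Entity(EntityKind.CLASS, "java.net.Socket")
edge = Relation(socket_api, RelationCategory.STRUCTURAL, socket_class)
```

Frozen dataclasses are hashable, so edges built this way can be stored in a set to avoid duplicates.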
(II) Templates for generalized representation of API relationships and iterative template generation
In the official Android documentation, information related to API semantics consists mainly of structured and unstructured semantic information; FIG. 2 shows an example API document. The invention parses the structured semantic information level by level, from the top of the hierarchy to the bottom, to obtain the structured API semantic relationships. The unstructured semantic information is the description text in the official Android documentation, which describes the functions, usage, notes, and similar information of an API in natural language. For this text, the invention summarizes templates that can express API relationships in a generalized way and provides an iterative technique for generating them.
Table 1 shows the templates summarized in the invention, with examples, for 4 relationships: the conditional relationship, the alternative relationship, the reference relationship, and the permission relationship.
TABLE 1 templates that can generalize the representation of API relationships
(Table 1 is provided as an image in the original publication.)
To summarize templates that can express API relationships in a generalized way, the invention provides a semi-automatic strategy that iteratively generates templates for relationship matching. The generation steps are as follows:
First, 1% of the classes are randomly selected; all of their description information is NLP-preprocessed, and templates are extracted from it manually.
Then another 1% of the classes are randomly selected and, after NLP preprocessing, matched against the templates from the previous step to extract relationships. For each sentence that contains an API, class, package, or permission, if a template matches and a relationship is extracted, the sentence is removed. The remaining sentences, from which no relationship was extracted, are checked manually; new templates are extracted from them and added to the template set.
These steps are repeated until the template set no longer grows.
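The loop above can be sketched as follows. This is a simplified illustration under stated assumptions: `grow_template_set` and `propose_templates` are hypothetical names, templates are modeled as plain regular expressions, the manual inspection step is replaced by a callback, and the filtering to sentences that mention an API, class, package, or permission is assumed to have happened upstream.

```python
import random
import re
from typing import Callable, Dict, Iterable, List, Set


def grow_template_set(
    classes: Dict[str, List[str]],
    propose_templates: Callable[[List[str]], Iterable[str]],
    sample_fraction: float = 0.01,
    seed: int = 0,
) -> Set[str]:
    """Grow a set of regex templates until a sampling round adds nothing new.

    `classes` maps a class name to its preprocessed description sentences.
    `propose_templates` stands in for the manual step: it receives the
    sentences that no template matched and returns new regex templates.
    """
    rng = random.Random(seed)
    remaining = list(classes)
    rng.shuffle(remaining)  # random sampling without replacement
    batch_size = max(1, int(len(classes) * sample_fraction))
    templates: Set[str] = set()

    while remaining:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        sentences = [s for name in batch for s in classes[name]]
        # Keep only the sentences that no existing template explains.
        unmatched = [
            s for s in sentences
            if not any(re.search(t, s) for t in templates)
        ]
        new_templates = set(propose_templates(unmatched)) - templates
        if not new_templates and templates:
            break  # the template set stopped growing
        templates |= new_templates
    return templates


def fake_reviewer(sentences: List[str]) -> List[str]:
    """Hypothetical stand-in for a human reviewer who knows two patterns."""
    known = [r"call .+ before .+", r"use .+ instead"]
    return [p for p in known if any(re.search(p, s) for s in sentences)]


docs = {
    "A": ["call open before read"],
    "B": ["use getMccString instead"],
    "C": ["call bind before listen"],
}
all_at_once = grow_template_set(docs, fake_reviewer, sample_fraction=1.0)
```

With a small `sample_fraction` the loop can stop before every pattern has been seen, which mirrors the trade-off of the sampling strategy: the stopping rule is a heuristic, not a completeness guarantee.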
(III) building a relation graph based on natural language processing and templates
The method uses Natural Language Processing (NLP) techniques to preprocess the description information, including sentence splitting, stemming, co-reference resolution, and naming normalization, and then matches the result against the predefined templates to extract API semantic relationships. The specific steps are as follows:
sentence splitting: a long description is split into individual sentences, using the period as the delimiter;
stemming: each word in the API description is processed to remove its affixes and obtain its root;
reference resolution: pronouns in the description information are processed to determine the nouns or noun phrases they refer to; in particular, the present invention also uses claim-based reference resolution;
naming normalization: APIs, classes, packages, and permissions appear under multiple naming forms in the API documentation, so the different names of each entity are replaced with a predefined name, based on predefined naming patterns, to normalize the representation of the entity; for example, the INTERNET permission appears in several forms in the description information, and the invention unifies them into one form.
After the NLP-based preprocessing is complete, the invention matches the text against the previously summarized templates using regular expressions and constructs the API semantic relation map.
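As an illustration of the flow just described, the sketch below chains sentence splitting, root-form reduction, and regular-expression template matching. It is a toy under stated assumptions: a small lookup table (`LEMMAS`) stands in for a real NLP toolkit, the two templates and the API name `example.Api.fetch()` are hypothetical, and reference resolution and naming normalization are left out, so the subject of every sentence is simply taken to be the owning API.

```python
import re

# Toy lemma table standing in for a real lemmatizer (hypothetical entries).
LEMMAS = {
    "requires": "require", "required": "require",
    "calls": "call", "called": "call",
    "is": "be", "are": "be",
}

# Two illustrative templates; each named group captures the target entity.
TEMPLATES = [
    ("uses_permission",
     re.compile(r"\brequire (?P<target>android\.permission\.\w+)", re.I)),
    ("refer_to",
     re.compile(r"\bsee also (?P<target>[\w.]+)", re.I)),
]


def split_sentences(text):
    """Split on a period followed by whitespace or end of text.

    Dots inside qualified names such as java.net.Socket are preserved.
    """
    parts = re.split(r"\.(?:\s+|$)", text)
    return [p.strip() for p in parts if p.strip()]


def lemmatize(sentence):
    """Replace each known word form by its root; leave other tokens as-is."""
    return " ".join(LEMMAS.get(tok.lower(), tok) for tok in sentence.split())


def extract_relations(owner_api, description):
    """Return (source, relation, target) triples found in a description."""
    triples = []
    for sentence in split_sentences(description):
        normalized = lemmatize(sentence)
        for relation, pattern in TEMPLATES:
            match = pattern.search(normalized)
            if match:
                triples.append((owner_api, relation, match.group("target")))
    return triples


description = (
    "This method requires android.permission.INTERNET. "
    "See also java.net.Socket."
)
triples = extract_relations("example.Api.fetch()", description)
```

Reducing words to root forms before matching is what lets a single template such as `require ...` cover "requires", "required", and similar surface forms.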
The invention has the beneficial effects that:
(1) the invention designs and implements a code-document-based technique for constructing an Android API semantic map;
(2) the invention provides a classification of Android API semantics that can fully cover the various semantic relationships among Android APIs;
(3) the API semantic relation map constructed by the invention has important application value and can provide help or inspiration for fields such as Android malicious behavior analysis and malware detection.
Drawings
FIG. 1 is a diagram of the overall architecture of the present invention.
FIG. 2 is an example of an API code document.
FIG. 3 shows the Android API semantic relationship classification.
Detailed Description
The invention designs and implements a code-document-based technique for constructing an Android API semantic relation map, making full use of the official Android code documentation to mine API semantic relationships. This section details specific implementations of the invention.
Operating environment
The method is based on the official code documentation of Android 10.0, the latest version at the time of filing. The documentation is fetched from the Android website with a Python crawler so that API semantic relationships can be extracted from it. spaCy (an NLP toolkit implemented in Python) is used to implement NLP techniques such as sentence splitting, stemming, and reference resolution.
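Once a class page has been downloaded, its description text has to be separated from the markup. The sketch below shows one way to do that with only the Python standard library. The HTML snippet and the assumption that descriptions live in `<p>` elements are hypothetical and do not reflect the actual layout of the Android reference pages; `DescriptionExtractor` is likewise an illustrative name.

```python
from html.parser import HTMLParser


class DescriptionExtractor(HTMLParser):
    """Collect the visible text of every <p> element in a page."""

    def __init__(self):
        super().__init__()
        self._depth = 0       # > 0 while inside a <p> element
        self._chunks = []     # text pieces of the current paragraph
        self.paragraphs = []  # finished paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace left over from the markup.
                text = " ".join("".join(self._chunks).split())
                if text:
                    self.paragraphs.append(text)
                self._chunks = []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


# Hypothetical page fragment used only for illustration.
html_page = """<h3>getMcc</h3>
<p>This method was deprecated in API level 28.
   Use <code>getMccString()</code> instead.</p>"""

extractor = DescriptionExtractor()
extractor.feed(html_page)
```

Text inside inline tags such as `<code>` is kept because the extractor tracks only whether it is inside a paragraph, not which inline element it is in.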
(I) Android API semantic relationship classification method
The invention divides the Android API semantic relationships into 5 categories and, to capture richer semantics, further divides these 5 categories into 10 relationships in the implementation, as shown in FIG. 3. Some implementation details of the classification method are described below.
The structural category is divided, according to how the documentation is organized, into the class_of, function_of, and inheritance relationships. class_of is the relationship from a class to a package, for example, the class belongs to the java.net package; function_of is the relationship from an API to a class, such as the BluetoothDevice.getAddress() API belongs to the android. The inheritance relationship describes inheritance between classes, such as a javax.
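Because the structural relationships follow the package, class, and API hierarchy, they can largely be read off a fully qualified name. The sketch below assumes names of the form `package.Class.method()` with no parameter list and no nested classes; `structural_edges` is a hypothetical helper, and the parent class has to be supplied separately since it is not part of the name.

```python
def structural_edges(qualified_api, parent_class=None):
    """Derive class_of / function_of / inheritance triples from names.

    `qualified_api` looks like "java.net.Socket.getInputStream()".
    """
    # Everything before the last dot is the owning class.
    class_name, _, _method = qualified_api.rpartition(".")
    # Everything before the class's simple name is the package.
    package, _, _simple_name = class_name.rpartition(".")
    edges = [
        (qualified_api, "function_of", class_name),
        (class_name, "class_of", package),
    ]
    if parent_class:
        edges.append((class_name, "inheritance", parent_class))
    return edges


edges = structural_edges(
    "java.net.Socket.getInputStream()", parent_class="java.lang.Object"
)
```

A real extractor would take the class and package from the parsed page structure rather than from string splitting, which breaks on nested classes and on signatures that spell out parameter types.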
The prototype category is divided into the uses_parameter, returns, and throws relationships. uses_parameter describes what kinds of parameters the API uses, for example, the parameter of java. returns describes what type of value the API returns, such as the java.net.Socket.getInputStream() API returns a value of type java.io.InputStream; the throws relationship describes what kind of exception the API throws, such as the locationmanager.
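The three prototype relationships map directly onto the parts of a method signature. A minimal sketch, assuming the signature has already been parsed into a record: `ApiPrototype` and `prototype_edges` are hypothetical names, and skipping `void` return types is a choice made for this example, not something the text specifies.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ApiPrototype:
    """Parsed signature of one API (hypothetical record layout)."""
    name: str
    parameter_types: List[str] = field(default_factory=list)
    return_type: str = "void"
    exception_types: List[str] = field(default_factory=list)


def prototype_edges(proto):
    """Emit uses_parameter / returns / throws triples for one API."""
    edges = [(proto.name, "uses_parameter", t) for t in proto.parameter_types]
    if proto.return_type != "void":  # assumption: void yields no edge
        edges.append((proto.name, "returns", proto.return_type))
    edges += [(proto.name, "throws", t) for t in proto.exception_types]
    return edges


get_input_stream = ApiPrototype(
    name="java.net.Socket.getInputStream()",
    return_type="java.io.InputStream",
    exception_types=["java.io.IOException"],
)
proto_edges = prototype_edges(get_input_stream)
```

Two APIs that share a parameter or return type end up connected through the same type node, which is one way the map exposes similarity between otherwise unrelated APIs.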
The usage category is divided into the conditional relationship and the alternative relationship. The conditional relationship describes a possible dependency between one API and another, for example, one API may only be used after another API has been called. The alternative relationship describes that one API is close to another in usage and function and can be replaced by it; for example, from Android 9 onward it is recommended to use the android.telephony.CellIdentityGsm.getMccString() API in place of the android.telephony.CellIdentityGsm.getMcc() API.
The reference category has a refer_to relationship, which describes one API referring to another API, class, package, or permission, such as the android
android.media.AudioManager.getVibrateSetting()。
The permission category has a uses_permission relationship, which describes what permission an API needs at run time, for example, java.
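The ten relationships and their five parent categories can be written down as a lookup table. The spelling of the relationship names below (for example `inheritance`, `throws`, and `uses_permission`) is an assumption where the source text is garbled, and `group_by_category` is a hypothetical helper.

```python
from collections import defaultdict

# Relationship name -> major category (10 relationships, 5 categories).
RELATION_CATEGORY = {
    "class_of": "structural",
    "function_of": "structural",
    "inheritance": "structural",
    "uses_parameter": "prototype",
    "returns": "prototype",
    "throws": "prototype",
    "conditional": "usage",
    "alternative": "usage",
    "refer_to": "reference",
    "uses_permission": "permission",
}


def group_by_category(mapping):
    """Invert the table: category -> list of relationship names."""
    grouped = defaultdict(list)
    for relation, category in mapping.items():
        grouped[category].append(relation)
    return dict(grouped)


grouped = group_by_category(RELATION_CATEGORY)
```

The first six relationships come from structured parts of the documentation, while the last four are the ones the templates extract from free-text descriptions.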
(II) Templates for generalized representation of API relationships and iterative template generation method
To extract relationships from the unstructured document information, the invention summarizes templates that can express API relationships in a generalized way. Following the steps of the iterative template generation technique, the template set stopped growing after 5% of the documents had been manually analyzed, yielding 217 templates covering the conditional, alternative, reference, and permission-use relationships.
For the conditional relationship there are 186 templates, such as "call ... before ... be call", "before ... return", and "wait ... for"; for the alternative relationship there are 22 templates, such as "place by ...", "use ... instead", and "be deprecate, use ..."; for the refer_to relationship there are 5 templates, such as "see also ...", "query ...", and "refer to ..."; for the uses_permission relationship there are 4 templates, such as "require permission ...", "require ... permission", and "be grant ... permission".
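Templates of this shape can be turned into regular expressions mechanically. The sketch below assumes a template is a sequence of literal root-form words with `...` marking each gap to capture; `compile_template` and the `GAP` marker are hypothetical, and the patent does not describe how its templates are encoded.

```python
import re

GAP = "..."  # marks a slot that captures an entity or phrase


def compile_template(template):
    """Turn a template such as 'use ... instead' into a compiled regex.

    Literal words are escaped. A gap in the middle becomes a lazy capture
    group; a gap at the end captures the rest of the sentence.
    """
    tokens = template.split()
    parts = []
    for index, token in enumerate(tokens):
        if token == GAP:
            is_last = index == len(tokens) - 1
            parts.append(r"(.+)" if is_last else r"(.+?)")
        else:
            parts.append(re.escape(token))
    pattern = r"\s+".join(parts)
    # Word boundaries stop 'use' from matching inside 'reuse'.
    if tokens[0] != GAP:
        pattern = r"\b" + pattern
    if tokens[-1] != GAP:
        pattern += r"\b"
    return re.compile(pattern, re.IGNORECASE)


conditional = compile_template("call ... before ... be call")
alternative = compile_template("use ... instead")
reference = compile_template("refer to ...")

m1 = conditional.search("call connect() before getInputStream() be call")
m2 = alternative.search("use getMccString() instead")
m3 = reference.search("refer to java.net.Socket")
```

Because the templates are written in root forms, they are meant to be applied to sentences that have already been through the preprocessing steps, not to raw documentation text.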
(III) relation map construction method based on natural language processing and template
During natural language processing, the root "require" is obtained for word forms such as "requires" and "required"; for the sentence "This method requires Internet", "This method" is resolved to the API to which the sentence belongs; and the INTERNET permission, which appears in several forms in the description information, is uniformly replaced by the invention with android.
From the official code documentation of Android 10.0, the invention finally extracts 59125 APIs, 7368 classes, 446 packages, and 270 permissions. Between them, a total of 121345 relationships are extracted; Table 2 shows the details of the constructed API semantic relation map.
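The reference resolution and naming normalization examples above can be approximated with two substitution rules. This is a rule-based sketch under stated assumptions: the patterns, the helper names, and the choice of `android.permission.INTERNET` as the canonical form are illustrative, and `example.Api.fetch()` is a made-up API name.

```python
import re

# Phrases that refer back to the API being documented.
SELF_REFERENCE = re.compile(r"\bthis (?:method|api|function)\b", re.IGNORECASE)

# Surface forms of one permission; the optional prefixes come first so the
# canonical name is matched as a whole and left unchanged.
INTERNET_FORMS = re.compile(
    r"\b(?:android\.permission\.|Manifest\.permission\.)?"
    r"INTERNET(?:\s+permission)?\b",
    re.IGNORECASE,
)
CANONICAL_INTERNET = "android.permission.INTERNET"


def resolve_self_reference(sentence, owner_api):
    """Replace 'This method' and similar phrases with the owning API."""
    # A function replacement avoids backslash processing of owner_api.
    return SELF_REFERENCE.sub(lambda _match: owner_api, sentence)


def normalize_internet_permission(sentence):
    """Rewrite every known form of the INTERNET permission to one name."""
    return INTERNET_FORMS.sub(CANONICAL_INTERNET, sentence)


sentence = "This method require Internet"
resolved = resolve_self_reference(sentence, "example.Api.fetch()")
normalized = normalize_internet_permission(resolved)
```

After both steps the sentence contains two normalized entity names, so a template such as `require ...` can yield a uses_permission triple between them.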
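Extracted triples like these can be stored as a labeled directed graph and then queried or tallied by relationship type. The sketch below uses only the standard library; `RelationGraph` is a hypothetical class, and the three edges are examples taken from the description, not the patent's actual data set.

```python
from collections import Counter, defaultdict


class RelationGraph:
    """Directed multigraph: source -> set of (relation, target) edges."""

    def __init__(self):
        self._out = defaultdict(set)

    def add(self, source, relation, target):
        self._out[source].add((relation, target))

    def neighbors(self, source, relation=None):
        """Targets reachable from `source`, optionally for one relation."""
        edges = self._out.get(source, ())
        return sorted(t for r, t in edges if relation in (None, r))

    def relation_counts(self):
        """Number of edges per relationship type."""
        return Counter(r for edges in self._out.values() for r, _ in edges)


graph = RelationGraph()
graph.add(
    "android.telephony.CellIdentityGsm.getMcc()",
    "alternative",
    "android.telephony.CellIdentityGsm.getMccString()",
)
graph.add("java.net.Socket.getInputStream()", "returns", "java.io.InputStream")
graph.add("java.net.Socket.getInputStream()", "function_of", "java.net.Socket")

alternatives = graph.neighbors(
    "android.telephony.CellIdentityGsm.getMcc()", "alternative"
)
counts = graph.relation_counts()
```

A query such as `neighbors(api, "alternative")` is the kind of lookup a malware classifier could use to treat a newly substituted API as related to the one it replaced.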
TABLE 2 relationships extracted from android 10.0 code documents
(Table 2 is provided as an image in the original publication.)

Claims (1)

1. A method for constructing an android API semantic relation map based on a code document is characterized by comprising the following specific steps:
(I) Android API semantic relationship classification
The API is the most basic element in the official Android documentation, which further comprises classes, packages and permissions;
classifying the semantic relationships between an API and other APIs, classes, packages and permissions;
(1) a structural relationship, wherein the category describes the relationship between different APIs, classes, packages and permissions in code organization; the Android API documentation has a hierarchical structure and is organized by packages, classes and APIs from the top layer to the bottom layer; each package contains a plurality of classes, each class being a separate HTML file; a class is the minimum file unit of the API documentation, and each class document describes basic hierarchical information and all APIs in the class;
(2) a prototype relationship, the category describing relationships related to the parameter types, return value type and thrown exception types of the API; the parameter types specify the types of the values passed to the API when it is called, the return value type specifies the type of value returned after the API is called, and the thrown exception types specify which exceptions need to be caught if an exception occurs during the API call; these are related to the usage information of the API and directly reflect the function of the API;
(3) a usage relationship, the category describing the usage specifications and methods of the API; among Android APIs, there are APIs that are similar in function and usage, and the usage relationships focus on the alternatives in the description; in an actual scenario, in addition to calling correctly according to the prototype of the API, the usage conditions between APIs need to be considered;
(4) a reference relationship, wherein the category describes a relationship in which one API references another API, a class, a package or a permission, and the reference relationship indicates that the use of one API requires reference to another API;
(5) a permission relationship, wherein the category concerns the permissions used by the API, and a permission describes the resources required by the Android API during execution and reflects the function and the sensitivity of the API;
(II) Templates for generalized representation of API relationships and iterative template generation
In the official Android documentation, information related to API semantics mainly comprises structured semantic information and unstructured semantic information; the structured semantic information is parsed in sequence according to the hierarchy from the top layer to the bottom layer to obtain the structured API semantic relationships; the unstructured semantic information is the description information in the official Android documentation, which describes information such as the functions, usage and notes of an API in natural language; accordingly, templates capable of expressing API relationships in a generalized way are summarized, and an iterative generation method for the templates is provided;
there are 4 relationships that the templates can express in a generalized way, namely a conditional relationship, an alternative relationship, a reference relationship and a permission-use relationship; the iterative generation steps are as follows:
first, randomly selecting 1% of the classes, NLP-preprocessing all description information of the classes, and extracting templates manually;
then, randomly selecting another 1% of the classes, performing NLP preprocessing, and matching with the templates from the previous step to extract relationships; for each sentence that contains an API, class, package or permission, if the sentence matches a template and a relationship is extracted, the sentence is removed; the remaining sentences, from which no relationship was extracted, are manually checked, and templates are extracted from them and added to the template set;
repeating the steps until the template set no longer grows;
(III) building a relation graph based on natural language processing and templates
Preprocessing the description information by using natural language processing techniques, wherein the preprocessing comprises sentence splitting, stemming, reference resolution and naming normalization, and then matching based on the predefined templates to extract API (application programming interface) semantic relationships; the specific steps are as follows:
sentence splitting, namely splitting a long description into individual sentences by using the period as the delimiter;
stemming, namely removing the affixes of each word in the API description to obtain its root;
reference resolution, namely determining the nouns or noun phrases referred to by pronouns in the description information;
naming normalization, wherein multiple naming forms exist for APIs, classes, packages and permissions in the API documentation, and the different names of the APIs, classes, packages and permissions are replaced by predefined names based on predefined naming patterns so as to normalize the representation of each entity;
after the preprocessing based on natural language processing is completed, matching is performed against the templates summarized above by using regular expressions, and the API semantic relation map is constructed.
CN202011274561.2A 2020-11-15 2020-11-15 Android API semantic relation map construction method based on code document Active CN112395884B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011274561.2A CN112395884B (en) 2020-11-15 2020-11-15 Android API semantic relation map construction method based on code document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011274561.2A CN112395884B (en) 2020-11-15 2020-11-15 Android API semantic relation map construction method based on code document

Publications (2)

Publication Number Publication Date
CN112395884A (en) 2021-02-23
CN112395884B (en) 2022-04-12

Family

ID=74599369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011274561.2A Active CN112395884B (en) 2020-11-15 2020-11-15 Android API semantic relation map construction method based on code document

Country Status (1)

Country Link
CN (1) CN112395884B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113849163A (en) * 2021-10-09 2021-12-28 中国科学院软件研究所 API (application program interface) document map-based operating system intelligent programming method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130054612A1 (en) * 2006-10-10 2013-02-28 Abbyy Software Ltd. Universal Document Similarity
CN106843849A (en) * 2016-12-28 2017-06-13 南京大学 A kind of automatic synthesis method of the code model of the built-in function based on document
CN109299610A (en) * 2018-10-02 2019-02-01 复旦大学 Dangerous sensitizing input verifies recognition methods in Android system
CN109739994A (en) * 2018-12-14 2019-05-10 复旦大学 A kind of API knowledge mapping construction method based on reference documents
CN111797242A (en) * 2020-06-29 2020-10-20 哈尔滨工业大学 Code abstract generation method based on code knowledge graph and knowledge migration

Also Published As

Publication number Publication date
CN112395884B (en) 2022-04-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant