CN106843849B - Automatic synthesis method of code model based on library function of document

Publication number
CN106843849B
Authority
CN
China
Prior art keywords
model
code
syntax tree
class
candidate
Legal status
Active
Application number
CN201611233727.XA
Other languages
Chinese (zh)
Other versions
CN106843849A (en)
Inventor
翟娟
赵建华
黄建军
马仕青
张翔宇
谭琳
秦锋
Current Assignee
Nanjing University
Original Assignee
Nanjing University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3608 Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/73 Program documentation

Abstract

The invention relates to a method for automatically synthesizing code models of library functions from their documentation, which comprises the following steps: 1. extracting useful information from the document; 2. generating a syntax tree for each sentence using a natural language processing tool; 3. performing structural transformations on the syntax tree generated in step 2 to produce a plurality of syntax tree variants; 4. analyzing the syntax trees generated in step 3, identifying the parameters, program structures and operation semantics in them, and generating candidate code models; 5. checking the candidate models generated in step 4 and deleting those whose behavior is inconsistent with the original class library. The method combines natural language processing with automated testing and has successfully generated code models for the Java container classes. The generated code models can effectively improve the correctness and efficiency of other program analysis techniques, thereby addressing the analysis difficulties caused by missing or overly complex class library source code.

Description

Automatic synthesis method of code model based on library function of document
Technical Field
The invention relates to a method for automatically synthesizing code models of library functions from documentation. It uses natural language processing and automated testing to generate the code models automatically, so as to improve the correctness and efficiency of other program analysis techniques. The invention belongs to the field of software engineering and program synthesis.
Background
Class libraries are widely used in modern programs. Their behavior is an integral part of the behavior of the software, so they should be analyzed whenever the software is analyzed. Analyzing class libraries is very difficult, however. In many cases the source code of a class library is not available. Even when it is available, the code is often very complex: it may contain highly optimized code, rely on intricate engineering techniques, or be implemented in several languages. All of these factors make class library source code very hard to analyze.
At present, many research efforts build models of class libraries by hand and analyze the models in place of the libraries. Manual modeling is not only time consuming but also error prone. Other research efforts track the relationships between inputs and outputs by executing programs dynamically. These approaches depend on the sufficiency of the test cases, and input-output dependencies cannot reflect the precise behavior of a class library.
Disclosure of Invention
The technical problem is as follows: the documentation of a class library usually contains rich information describing the library's behavior. The invention therefore aims to extract useful information from the documentation and, from that information, automatically generate a code model of the class library by combining natural language processing with automated testing. The code model simulates the behavior of the class library, addresses the analysis difficulties caused by missing or overly complex source code, and effectively improves the effectiveness and efficiency of other program analysis techniques.
The technical scheme is as follows: given a Java API function, the invention uses a natural language processing tool to generate a syntax tree for each sentence of its documentation, then identifies the parameters and program structures in the syntax tree to generate an intermediate representation in the form of a tree. The intermediate representation is then matched against the tree templates of a predefined set of primitives, where each primitive consists of a tree template and the code template corresponding to it. During matching, the invention tries to cover the intermediate representation with the tree templates of several primitives. When a set of tree templates that completely covers the intermediate representation is found, the nodes of each tree template are instantiated with the corresponding node information from the intermediate representation. The result of the instantiation is the code fragment for the subtree matched by that tree template, and the code fragments are combined to generate a code model. Because of the ambiguity of natural language, uncertainty in parameter identification and similar factors, each sentence may correspond to several intermediate representations, and each intermediate representation may be covered in several ways, so several candidate code models may be generated. The method comprises the following steps:
Step 1: extract description information about classes and functions from the document, such as function declarations and descriptions of function behavior.
Step 2: perform equivalence analysis, redundant information deletion and statement enhancement on the information extracted in step 1.
Step 3: generate a syntax tree for each natural language sentence processed in step 2 using a natural language processing tool, which gives the part of speech of each word and labels the different phrases.
Step 4: perform node transformations on the syntax tree generated in step 3 to generate a plurality of variants, where different variants represent different semantics.
Step 5: identify the nodes that represent parameters in the syntax trees generated in step 4, identify the program structures (loop structures and branch structures) in the syntax trees, and generate a corresponding intermediate representation for each syntax tree.
Step 6: combine the intermediate representations generated in step 5 with the given primitives to synthesize the corresponding code fragments, then combine all the code fragments corresponding to the intermediate representations of one function to generate a model of that function, and then combine the models of the different functions to generate a model of the class.
Step 7: test the candidate code models generated in step 6 with a testing tool, and filter out the candidate models whose behavior is inconsistent with the class library.
Beneficial effects: the code model generated by the method simulates the behavior of the class library while being simpler to implement. Its complexity is low, it does not call native methods and the like, its average number of lines of code is 1/3 of that of the corresponding functions in the original class library, and its function calls are concise and clear. The code model can effectively assist other program analysis techniques, such as specification generation for library functions, static taint analysis and dynamic slicing. Specifically:
(1) The generated code model was applied to static taint analysis. The results show that using the code model effectively improves the accuracy of static taint analysis, discovers information leakage paths that cannot be discovered using the source code, and also improves analysis efficiency.
(2) The generated code model was applied to dynamic slicing. The results show that the slices generated using the code model are far smaller than those generated using a naive model, and that analysis efficiency improves.
Drawings
FIG. 1 is a flow chart of a method for automatic construction of a code model based on a library function of a document.
FIG. 2 is a diagram of an example document, that of the indexOf method in the ArrayList class, according to an embodiment of the present invention.
FIG. 3 is a first syntax tree diagram according to an embodiment of the present invention.
FIG. 4 is a second syntax tree diagram according to an embodiment of the present invention.
FIG. 5 is a third syntax tree diagram according to an embodiment of the present invention.
Fig. 6 is a schematic intermediate representation of an embodiment of the invention.
Fig. 7 is a schematic overlay of an embodiment of the invention.
FIG. 8 is a code fragment of an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following examples and figures of the specification.
FIG. 1 is a flow chart of a method for automatic construction of a code model based on a library function of a document. The embodiment provides an automatic synthesis method of a code model based on a library function of a document. The method comprises the following steps: 1. extracting useful information from the document; 2. preprocessing the extracted description sentences; 3. generating a syntax tree for each sentence using a natural language processing tool; 4. performing structural transformations on the syntax tree generated in step 3 to produce a plurality of syntax tree variants; 5. analyzing the syntax trees generated in step 4, identifying the parameters, program structures and operation semantics in them, and generating intermediate representations; 6. generating candidate code models; 7. checking the candidate models generated in step 6 and deleting those whose behavior is inconsistent with the original class library. The method combines natural language processing with automated testing and has successfully generated code models for the Java container classes. The generated code models can effectively improve the correctness and efficiency of other program analysis techniques, thereby addressing the analysis difficulties caused by missing or overly complex class library source code.
The present embodiment is described in detail by taking the document of the indexOf method in the ArrayList class shown in fig. 2 as an example.
First, extracting information from the document
In the specific implementation, the invention takes a Javadoc document in HTML format as input and extracts information about classes and functions from it. For classes, mainly the package name and the class declaration are extracted. For functions, the following are mainly extracted: (1) the function declaration; (2) the parameter names and their explanations; (3) the statements describing the function's behavior; (4) the statements describing the function's return value; (5) the statements describing the exceptions thrown by the function and the conditions under which they are thrown.
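As a minimal sketch of this extraction step, the Python fragment below pulls the five kinds of function information out of a simplified Javadoc entry. The class and field names (`FunctionDoc`, `JavadocExtractor`) and the HTML layout of `SNIPPET` are assumptions made for illustration; real Javadoc pages are more deeply nested, and the patent does not describe its extractor at this level of detail.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class FunctionDoc:
    """Hypothetical record for the information extracted per function."""
    declaration: str = ""
    behavior: str = ""
    params: dict = field(default_factory=dict)   # parameter name -> explanation
    returns: str = ""
    throws: dict = field(default_factory=dict)   # exception -> condition


class JavadocExtractor(HTMLParser):
    """Collects text from <pre>, <div>, <dt> and <dd> in a simplified entry."""

    def __init__(self):
        super().__init__()
        self.doc = FunctionDoc()
        self._tag = None        # block tag currently being read
        self._section = None    # label of the last <dt> seen
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("pre", "div", "dt", "dd"):
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # inline tags such as <code> are ignored
        text = " ".join("".join(self._buf).split())  # normalise whitespace
        if tag == "pre":
            self.doc.declaration = text
        elif tag == "div":
            self.doc.behavior = text
        elif tag == "dt":
            self._section = text.rstrip(":")
        elif self._section in ("Parameters", "Throws"):
            name, _, desc = text.partition(" - ")
            target = self.doc.params if self._section == "Parameters" else self.doc.throws
            target[name] = desc
        elif self._section == "Returns":
            self.doc.returns = text
        self._tag = None


SNIPPET = """
<pre>public int indexOf(Object o)</pre>
<div class="block">Returns the index of the first occurrence of the specified
element in this list, or -1 if this list does not contain the element.</div>
<dl><dt>Parameters:</dt><dd><code>o</code> - element to search for</dd>
<dt>Returns:</dt><dd>the index of the first occurrence of the specified element
in this list, or -1 if this list does not contain the element</dd></dl>
"""


def extract(html):
    parser = JavadocExtractor()
    parser.feed(html)
    parser.close()
    return parser.doc
```

Calling `extract(SNIPPET)` yields a record whose `declaration`, `params` and `returns` fields feed the later steps.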
Second, preprocessing
The invention preprocesses the extracted description sentences, mainly in the following three respects:
(1) Equivalence analysis: words are divided into equivalence classes according to a domain dictionary to reduce repeated processing. For example, in the behavior description of a function, "insert" and "add" are semantically equivalent.
(2) Redundant information deletion: the invention attempts to delete sentences that merely explain other sentences. For example, the sentence beginning with "More formally" in fig. 2 further explains the sentence before it.
(3) Statement enhancement: the statements in the return value descriptions and exception descriptions in Javadoc are often incomplete, and the invention attempts to complete them.
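The three preprocessing operations can be sketched as small text functions. This is only an illustration: the dictionary contents, the "More formally" prefix test and the way a return fragment is completed are assumptions, not the patent's actual rules.

```python
import re

# Hypothetical domain dictionary: word -> representative of its equivalence class.
EQUIVALENTS = {"insert": "add", "inserts": "adds"}


def normalize(sentence):
    """Equivalence analysis: replace each word by its class representative."""
    def swap(match):
        word = match.group(0)
        return EQUIVALENTS.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", swap, sentence)


def drop_redundant(sentences):
    """Redundant information deletion: drop sentences that only restate others."""
    return [s for s in sentences if not s.lower().startswith("more formally")]


def enhance_return(fragment):
    """Statement enhancement: complete an incomplete return-value fragment."""
    return "Returns " + fragment.strip().rstrip(".") + "."


sentences = [
    "Returns the index of the first occurrence of the specified element in this list.",
    "More formally, returns the lowest index i such that o equals get(i).",
]
kept = drop_redundant(sentences)          # only the first sentence survives
completed = enhance_return("the index of the first occurrence")
```

Keeping the three operations separate makes it easy to apply them only to the document parts they target, for example `enhance_return` only to the return value description.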
Third, generating syntax trees for natural language sentences
The invention uses the natural language processing tool Stanford Parser to generate a syntax tree for each preprocessed sentence, identifying the parts of speech of the words, the phrase structure and so on. Some words have a fixed part of speech in the computing domain, but Stanford Parser has no domain knowledge, so the invention includes a part-of-speech restriction module that makes Stanford Parser tag words related to the computer program domain with the expected part of speech. For the sentence "Returns the index of the first occurrence of the specified element in this list, or -1 if this list does not contain the element" in FIG. 2, Stanford Parser generates the syntax tree shown in FIG. 3.
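To make the data involved concrete, the sketch below reads a syntax tree written in the bracketed form commonly used for constituency parses and then applies a part-of-speech restriction. It does not call Stanford Parser; the `Node` class, the `FORCED_TAGS` entry and the post-hoc relabelling are assumptions for illustration, whereas the module described above constrains the parser itself.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                        # phrase label, POS tag, or the word itself
    children: list = field(default_factory=list)

    def words(self):
        """Leaf words in sentence order."""
        if not self.children:
            return [self.label]
        return [w for child in self.children for w in child.words()]


def parse_brackets(text):
    """Parse a bracketed tree such as '(NP (DT the) (NN element))'."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            node = Node(tokens[pos + 1])
            pos += 2
            while tokens[pos] != ")":
                node.children.append(read())
            pos += 1                  # consume ")"
            return node
        leaf = Node(tokens[pos])
        pos += 1
        return leaf

    return read()


# Hypothetical restriction table: word -> part of speech expected in this domain.
FORCED_TAGS = {"specified": "JJ"}


def restrict(node, forced_tags=FORCED_TAGS):
    """Relabel preterminals whose word has a fixed part of speech in the domain."""
    if len(node.children) == 1 and not node.children[0].children:
        word = node.children[0].label.lower()
        node.label = forced_tags.get(word, node.label)
    for child in node.children:
        restrict(child, forced_tags)
    return node


tree = restrict(parse_brackets("(NP (DT the) (VBN specified) (NN element))"))
```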
Fourth, transforming the syntax tree structure to generate multiple variants
The invention performs structural transformations on the syntax tree generated by Stanford Parser to generate different variants, each of which represents different semantics.
To deal with the ambiguity of natural language, Stanford Parser can return K syntax trees with different semantics when the user specifies a value of K. If K is set to a large value to increase the probability of obtaining a correct syntax tree, parsing takes a long time, which may cause performance problems. Moreover, even a large K cannot guarantee that a correct syntax tree is generated.
The sentences analyzed by the invention belong to a specific domain, and a syntax tree expressing the correct semantics can be obtained by structurally transforming the syntax tree. Ambiguities often arise when ",", "or" and "and" appear in a sentence. In this case, the invention generates the correct syntax tree by taking the syntax tree with the highest probability and moving ",", "or" or "and", together with all of their right siblings, up or down several times. For the syntax tree in fig. 3, the invention obtains the syntax tree shown in fig. 4 by moving the ",", "or" and "-1 if this list does not contain the element" nodes up five times. The syntax tree in fig. 4 expresses the meaning that the sentence is really intended to convey.
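The upward move can be sketched as a tree operation that re-attaches a node and all of its right siblings one level higher, directly after their former parent. The tree below is a hand-made simplification, not the tree of fig. 3, and the function names and path encoding (a list of child indices from the root) are assumptions.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

    def words(self):
        if not self.children:
            return [self.label]
        return [w for child in self.children for w in child.words()]


def move_up(root, path):
    """Copy root; lift the node at `path` and its right siblings one level up.

    Returns the new tree and the new path of the lifted node.
    """
    if len(path) < 2 or path[-1] == 0:
        raise ValueError("node cannot be moved up")
    root = copy.deepcopy(root)
    grand = root
    for index in path[:-2]:
        grand = grand.children[index]
    parent_index, index = path[-2], path[-1]
    parent = grand.children[parent_index]
    moved = parent.children[index:]
    del parent.children[index:]
    # Inserting right after the former parent keeps the word order unchanged.
    grand.children[parent_index + 1:parent_index + 1] = moved
    return root, path[:-2] + [parent_index + 1]


def variants(root, path, max_moves):
    """Original tree plus one variant per successive upward move."""
    result = [root]
    for _ in range(max_moves):
        if len(path) < 2 or path[-1] == 0:
            break
        root, path = move_up(root, path)
        result.append(root)
    return result


leaf = Node
tree = Node("S", [Node("VP", [leaf("returns"), Node("NP", [leaf("index"),
       Node("PP", [leaf("in"), Node("NP", [leaf("list"), leaf(","),
       leaf("or"), leaf("-1")])])])])])
trees = variants(tree, [0, 1, 1, 1, 1], max_moves=5)   # path points at ","
```

Each variant attaches ", or -1" at a different height, so each reads the alternative as belonging to a different phrase. The later testing step decides which reading is kept.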
Fifth, generating an intermediate representation
The invention constructs an intermediate representation in the form of a tree structure by identifying parameters and program structures in a syntax tree based on domain knowledge.
(1) Identifying parameters
The way documents describe parameters is not arbitrary but follows regular patterns. The invention therefore uses rules to identify the words or phrases in a sentence that refer to parameters. Because natural language is ambiguous, when the system cannot determine whether a word or phrase describes a parameter, it models both cases, one treating the description as a parameter and the other treating it as not a parameter, and leaves it to the model filter in the last step to exclude the wrong case. For the syntax tree in fig. 4, the invention associates "the specified element" with the parameter o. At this step it cannot determine whether "the element" describes the parameter o, so it keeps both possibilities as candidates.
In Javadoc, when a phrase of the form "this WORD" appears, where WORD is the name of the class currently being processed or an abbreviation of that name, the invention adds a "this" label to the subtree. The label indicates that the object of the operation is the instance on which the operation is performed. For example, the "this list" appearing in fig. 4 indicates that the operation is for an object of the class java.
After the identification of the parameters, the present invention transforms the syntax tree of FIG. 4 into the syntax tree of FIG. 5.
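The "model both cases" strategy amounts to enumerating every combination of choices for the uncertain phrases. The sketch below assumes that an earlier rule pass has already sorted the phrases into certain and uncertain matches, and it treats "abbreviation of the class name" as a simple substring test. Both are simplifying assumptions, as are the function names.

```python
from itertools import product


def param_candidates(phrases, certain, uncertain):
    """Enumerate phrase -> parameter assignments (None means 'not a parameter').

    certain:   phrases known to describe a parameter
    uncertain: phrases that may or may not describe a parameter
    """
    options = []
    for phrase in phrases:
        if phrase in certain:
            options.append([certain[phrase]])
        elif phrase in uncertain:
            options.append([uncertain[phrase], None])   # keep both cases
        else:
            options.append([None])
    return [dict(zip(phrases, combo)) for combo in product(*options)]


def is_this_reference(phrase, class_name):
    """True for 'this WORD' where WORD abbreviates the class name (assumed: substring)."""
    words = phrase.lower().split()
    return len(words) == 2 and words[0] == "this" and words[1] in class_name.lower()


candidates = param_candidates(
    ["the specified element", "this list", "the element"],
    certain={"the specified element": "o"},
    uncertain={"the element": "o"},
)
```

Here `candidates` holds two assignments, one where "the element" is the parameter `o` and one where it is not. Each of them leads to its own intermediate representation.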
(2) Identifying program structures
By default, the invention assumes that program statements execute sequentially, so it only needs to recognize loop structures and branch structures in the syntax tree.
For loop structures: in natural language, plural nouns and singular nouns modified by "each" often mean that the behavior described in a sentence is executed repeatedly, so a loop structure is required. Sometimes, however, the loop is less obvious. For example, the phrase "the first occurrence of" implies an order of loop iteration, so a loop structure is also needed. In this case the invention adds an "ltr" (left to right) label to the node of the syntax tree and deletes the subtree representing "the first occurrence of".
For branch structures, the words "if" and "where" represent an if branch and "otherwise" represents an else branch. The invention adds the corresponding labels ("if" and "else") to the corresponding subtrees and deletes the nodes representing these words. For the syntax tree in FIG. 5, the invention deletes the node representing "if", generating the "-1 tag: if" node shown in FIG. 6. Note that the invention also carefully judges whether the condition is stated positively or negatively in order to determine the if condition. "contain" in FIG. 5 is modified by "does not", which means that the behavior in the if branch is triggered when the result of the "contain" behavior is the opposite. In this case the invention deletes the subtree representing "does not" and adds the label "-" to the tree node representing "contain", as in the "contain: -" node in FIG. 6.
After the identification of the program structure, the present invention generates the intermediate representation in FIG. 6 from the syntax tree shown in FIG. 5.
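The labelling rules above can be illustrated on plain text instead of on syntax trees. The sketch below is a deliberate simplification: the cue tables, the "loop" label for "each" and the string handling are assumptions, whereas the method itself works on tree nodes.

```python
# Hypothetical cue table: phrase -> label added when the phrase is removed.
LOOP_CUES = {"the first occurrence of": "ltr"}
NEGATIONS = ("does not ", "do not ")


def tag_loop(text):
    """Remove loop cues from the text and return (text, labels)."""
    labels = set()
    for cue, label in LOOP_CUES.items():
        if cue in text:
            text = " ".join(text.replace(cue, " ").split())
            labels.add(label)
    if "each" in text.split():
        labels.add("loop")          # assumed label name for 'each'
    return text, labels


def split_branch(text):
    """Split 'VALUE if CONDITION' into (value, condition, negated)."""
    value, separator, condition = text.partition(" if ")
    if not separator:
        return text, None, False
    negated = False
    for negation in NEGATIONS:
        if negation in condition:
            condition = condition.replace(negation, "", 1)
            negated = True           # corresponds to the "-" label
    return value.strip(), condition.strip(), negated


loop_text, loop_labels = tag_loop(
    "the index of the first occurrence of the specified element")
value, condition, negated = split_branch(
    "-1 if this list does not contain the element")
```

In this example `value` is `"-1"`, the condition loses its "does not", and `negated` records that the branch fires on the opposite outcome of "contain".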
Sixth, generating candidate code models
The invention identifies the operation semantics in the intermediate representation to generate candidate code models. It defines a number of primitives, each comprising a tree template and the code template corresponding to that tree template. Starting from the root node of the intermediate representation, the invention covers the intermediate representation with tree templates. Once the intermediate representation is completely covered, it instantiates the code templates corresponding to the tree templates to generate the code fragment for the intermediate representation. For the intermediate representation in fig. 6, the invention generates the cover shown in fig. 7 and the code fragment shown in fig. 8. After the code fragment of each intermediate representation has been generated, the code fragments for the intermediate representations of all sentences in the document of a function are combined to generate a code model of the function, and then the code models of the functions are combined to generate a code model of the class. A sentence may have several intermediate representations, so the invention may generate several code models for a function and, likewise, several code models for a class.
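A very small version of the covering step is sketched below. The intermediate representation, the primitive table and the Java-like code templates are all invented for illustration and do not reproduce figs. 6 to 8. Each primitive here covers a single node, whereas the tree templates described above may span several nodes.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class IR:
    label: str
    tags: frozenset = frozenset()
    children: tuple = ()


# Hypothetical primitives: (label, tags, arity, code template).
# Templates avoid Java braces so that str.format can fill {0}, {1}.
PRIMITIVES = [
    ("index", frozenset({"ltr"}), 2,
     "for (int i = 0; i < {0}.size(); i++) if ({0}.get(i).equals({1})) return i;"),
    ("value", frozenset({"if"}), 2, "if ({1}) return {0};"),
    ("contain", frozenset({"-"}), 2, "!{0}.contains({1})"),
    ("contain", frozenset(), 2, "{0}.contains({1})"),
    ("seq", frozenset(), 2, "{0}\n{1}"),
]


def synthesize(node):
    """Return every code fragment obtainable by covering `node` with primitives."""
    if not node.children:
        return [node.label]                      # leaves stand for themselves
    child_options = [synthesize(child) for child in node.children]
    fragments = []
    for label, tags, arity, template in PRIMITIVES:
        if (label, tags, arity) == (node.label, node.tags, len(node.children)):
            for combo in product(*child_options):
                fragments.append(template.format(*combo))
    return fragments                             # empty: node cannot be covered


this, o = IR("this"), IR("o")
ir = IR("seq", children=(
    IR("index", frozenset({"ltr"}), (this, o)),
    IR("value", frozenset({"if"}),
       (IR("-1"), IR("contain", frozenset({"-"}), (this, o)))),
))
fragments = synthesize(ir)
```

Because `synthesize` returns a list, adding a second primitive with the same label and tags automatically yields several candidate fragments for the same intermediate representation.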
Seventh, removing erroneous candidate models
Not all syntax trees represent the correct semantics, so the invention uses testing to filter out the wrong code models. For each candidate class model, the invention first compiles the model and generates unit test cases for it using the Randoop tool. These test cases capture the behavior of the candidate class model. The test cases are then run against the original JDK class library. If a test case fails, the candidate class model and the original class library behave inconsistently, and the invention discards the candidate class model. If all test cases of a candidate class model pass, the invention considers the candidate to be the expected code model of the class library.
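The filtering idea (record the candidate's behavior as test cases, then replay them on the original library) can be sketched as follows. This is not Randoop: plain random inputs stand in for its test generation, a Python function stands in for the JDK class, and all names are hypothetical.

```python
import random


def reference_index_of(items, o):
    """Stand-in for the original library function."""
    return items.index(o) if o in items else -1


def candidate_first(items, o):
    """Candidate model: index of the first occurrence."""
    for i, item in enumerate(items):
        if item == o:
            return i
    return -1


def candidate_last(items, o):
    """Erroneous candidate model: index of the last occurrence."""
    for i in range(len(items) - 1, -1, -1):
        if items[i] == o:
            return i
    return -1


def generate_tests(candidate, count, rng):
    """Record the candidate's own behaviour as (items, o, expected) cases."""
    tests = []
    for _ in range(count):
        items = [rng.randint(0, 3) for _ in range(rng.randint(0, 6))]
        o = rng.randint(0, 3)
        tests.append((items, o, candidate(items, o)))
    return tests


def filter_candidates(candidates, reference, count=200, seed=0):
    """Keep the candidates whose recorded behaviour the reference reproduces."""
    rng = random.Random(seed)
    kept = []
    for candidate in candidates:
        tests = generate_tests(candidate, count, rng)
        if all(reference(items, o) == expected for items, o, expected in tests):
            kept.append(candidate)
    return kept


kept = filter_candidates([candidate_first, candidate_last], reference_index_of)
```

As in the described method, passing all generated tests shows agreement only on the inputs tried, so the quality of the filter depends on how well the generated tests exercise the candidate.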
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (5)

1. A method for automatically synthesizing a code model of a library function based on a document, characterized by comprising the following steps:
step one, obtaining useful information from the document;
step two, generating a syntax tree for each natural language sentence;
step three, transforming the syntax tree to generate a plurality of variants of the syntax tree, wherein each variant represents different semantics;
step four, identifying parameters, program structures and operation semantics in the syntax tree to generate candidate code models;
step five, checking the candidate models, and deleting the candidate models whose behavior is inconsistent with the original class library;
wherein step four comprises:
step 4.1, associating tree nodes in the syntax tree with the variables, program structures and operation semantics that form program statements;
step 4.2, analyzing the associated syntax tree, and translating the syntax tree into code fragments;
step 4.3, combining the code fragments corresponding to all sentences in the document of one function to generate a code model of the function;
step 4.4, combining the code models of all functions in a class to generate a code model of the class.
2. The automated synthesis method according to claim 1, wherein: the document information includes the package name, the class name, function declarations, function behavior descriptions, function return value descriptions, and information on the exceptions thrown by functions.
3. The automated synthesis method according to claim 1, wherein: redundant information contained in the document is removed, and an attempt is made to complete incomplete information.
4. The automated synthesis method according to claim 1, wherein: some of the tree nodes in the syntax tree are moved up or pushed down to generate different variants of the syntax tree, each variant representing one possible semantic interpretation; in this way, the ambiguity inherent in natural language is resolved.
5. The automated synthesis method according to claim 1, wherein step five comprises:
step 5.1, compiling each candidate class model and automatically generating unit test cases for it using the Randoop tool, wherein the test cases capture the behavior of the candidate class model;
step 5.2, running the test cases generated in step 5.1 on the original JDK class library using JUnit; if a test case fails, which means that the candidate class model and the original class library behave inconsistently, the candidate class model is discarded; if all test cases of a candidate class model pass, the candidate is considered to be the expected code model of the class library.
Application Number Priority Date Filing Date Title Status
CN201611233727.XA 2016-12-28 2016-12-28 Automatic synthesis method of code model based on library function of document Active

Publications (2)

Publication Number Publication Date
CN106843849A 2017-06-13
CN106843849B 2020-04-14

Family ID: 59114277

Country Status (1)

Country Link
CN (1) CN106843849B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101464799A (en) * 2009-01-16 2009-06-24 天津大学 MPI parallel programming system based on visual modeling and automatic skeleton code generation method
CN101814065B (en) * 2009-02-23 2014-07-30 富士通株式会社 Syntactic analysis device and syntactic analysis method
CN104461566B (en) * 2014-12-25 2017-10-20 南京大学 A kind of JCOP extension implementation methods of behavior variant based on object instance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Automatic Model Generation from Documentation for Java API Functions";翟娟 等;《2016 IEEE/ACM 38th IEEE International Conference on Software-Engineering》;20160522;第380-391页 *

Also Published As

Publication number Publication date
CN106843849A (en) 2017-06-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant