CN106843849B

CN106843849B - Automatic synthesis method of code model based on library function of document

Info

Publication number: CN106843849B
Application number: CN201611233727.XA
Authority: CN
Inventors: 翟娟; 赵建华; 黄建军; 马仕青; 张翔宇; 谭琳; 秦锋
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2016-12-28
Filing date: 2016-12-28
Publication date: 2020-04-14
Anticipated expiration: 2036-12-28
Also published as: CN106843849A

Abstract

The invention relates to a document-based method for automatically synthesizing code models of library functions, comprising the following steps: 1. extracting useful information from the document; 2. generating a syntax tree for each sentence using a natural language processing tool; 3. structurally transforming the syntax tree generated in step 2 to produce multiple syntax tree variants; 4. analyzing the syntax trees generated in step 3, identifying the parameters, program structures, and operation semantics they contain, and generating candidate code models; 5. checking the candidate models from step 4 and deleting those whose behavior is inconsistent with the original class library. The method combines natural language processing and automated testing techniques and has successfully generated code models for the Java container classes. The generated code models can effectively improve the correctness and efficiency of other program analysis techniques, thereby addressing the analysis difficulties caused by missing or overly complex class library source code.

Description

Automatic synthesis method of code model based on library function of document

Technical Field

The invention relates to a document-based method for automatically synthesizing code models of library functions. It mainly addresses the automatic generation of code models for library functions by using natural language processing and automated testing techniques, so as to improve the correctness and efficiency of other program analysis techniques. The invention belongs to the field of software engineering and program synthesis.

Background

Class libraries are widely used in modern programs. Their behavior is an integral part of the behavior of the software, so they should be analyzed whenever the software is analyzed. Analyzing class libraries is very difficult, however. First, in many cases the source code of a class library is not available. Second, even when the source code is available, it is often very complex: it may contain highly optimized code, rely on intricate engineering tricks, or be implemented in multiple languages. All of these factors make class library source code hard to analyze.

Much existing research builds models of class libraries by hand and analyzes the models in place of the libraries themselves. Manual modeling is time-consuming and error-prone. Other research tracks the relationships between inputs and outputs by executing programs dynamically. This approach depends on the adequacy of the test cases, and input-output dependencies cannot capture the precise behavior of a class library.

Disclosure of Invention

Technical problem: class library documentation usually contains rich information describing the behavior of the library. The invention therefore aims to extract useful information from the documentation and, based on this information, automatically generate code models for the class library by combining natural language processing and automated testing techniques. A code model simulates the behavior of the class library. It addresses the analysis difficulties caused by missing or overly complex source code and effectively improves the effectiveness and efficiency of other program analysis techniques.

Technical solution: given the documentation of a Java API function, the invention uses a natural language processing tool to generate a syntax tree for each sentence, then identifies the parameters and program structures in the syntax tree to generate an intermediate representation in the form of a tree, and then matches the intermediate representation against the tree templates of a predefined set of primitives. Each primitive consists of a tree template and the code template corresponding to that tree template. During matching, the invention tries to cover the intermediate representation with the tree templates of several primitives. When a set of tree templates is found that completely covers the intermediate representation, the nodes in each tree template are instantiated with the information of the corresponding nodes in the intermediate representation. The instantiated result is the code fragment for the subtree matched by that tree template, and the code fragments are combined to generate a code model. Because natural language is ambiguous and parameter identification is uncertain, a sentence may correspond to several intermediate representations, and each intermediate representation may be covered in several ways, so multiple candidate code models may be generated. The method comprises the following steps:

Step 1: extract descriptive information about classes and functions from the document, such as function declarations and descriptions of function behavior.

Step 2: perform equivalence analysis, redundant information deletion, and sentence enhancement on the information extracted in step 1.

Step 3: use a natural language processing tool to generate a syntax tree for each natural language sentence processed in step 2; the tree gives the part of speech of each word and labels the different phrases.

Step 4: apply node transformations to the syntax tree generated in step 3 to produce multiple variants, where different variants represent different semantics.

Step 5: identify the nodes that represent parameters in the syntax trees generated in step 4, identify the program structures (loop structures and branch structures) from the syntax tree, and generate a corresponding intermediate representation for each syntax tree.

Step 6: combine the intermediate representations generated in step 5 with the given primitives to synthesize the corresponding code fragments; combine all code fragments corresponding to the intermediate representations of one function to generate a model of that function; then combine the models of the different functions to generate a model of the class.

Step 7: test the candidate code models generated in step 6 with a testing tool and filter out the candidate models whose behavior is inconsistent with the class library.

Advantageous effects: the code models generated by this document-based method simulate the behavior of the class library. The model code is simpler to implement and less complex, and it does not call native methods or the like. On average, a code model has 1/3 as many lines of code as the corresponding function in the original class library, and its function calls are simple and clear. The code models can effectively assist other program analysis techniques, such as specification generation for library functions, static taint analysis, and dynamic slicing. Specifically:

(1) The generated code models were applied to static taint analysis. The results show that using the code models effectively improves the accuracy of static taint analysis, reveals information leakage paths that cannot be found by analyzing the source code, and also improves analysis efficiency.

(2) The generated code models were applied to dynamic slicing. The results show that slices produced with the code models are far smaller than slices produced with a naive model, and that analysis efficiency can be improved.

Drawings

FIG. 1 is a flow chart of the document-based method for automatically constructing code models of library functions.

FIG. 2 shows the documentation of the indexOf method of the ArrayList class, used as the example in an embodiment of the present invention.

FIG. 3 is a first syntax tree diagram according to an embodiment of the present invention.

FIG. 4 is a second syntax tree diagram according to an embodiment of the present invention.

FIG. 5 is a third syntax tree diagram according to an embodiment of the present invention.

FIG. 6 is a schematic diagram of the intermediate representation according to an embodiment of the present invention.

FIG. 7 is a schematic diagram of the covering according to an embodiment of the present invention.

FIG. 8 shows a code fragment according to an embodiment of the present invention.

Detailed Description

The invention is further illustrated by the following examples and figures of the specification.

FIG. 1 is a flow chart of the document-based method for automatically constructing code models of library functions. This embodiment provides a document-based method for automatically synthesizing code models of library functions. The method comprises the following steps: 1. extracting useful information from the document; 2. preprocessing the extracted description sentences; 3. generating a syntax tree for each sentence using a natural language processing tool; 4. structurally transforming the syntax tree generated in step 3 to produce multiple syntax tree variants; 5. analyzing the syntax trees generated in step 4 and identifying the parameters, program structures, and operation semantics to build an intermediate representation; 6. generating candidate code models; 7. checking the candidate models from step 6 and deleting those whose behavior is inconsistent with the original class library. The method combines natural language processing and automated testing techniques and has successfully generated code models for the Java container classes. The generated code models can effectively improve the correctness and efficiency of other program analysis techniques, thereby addressing the analysis difficulties caused by missing or overly complex class library source code.

This embodiment is described in detail using the documentation of the indexOf method of the ArrayList class, shown in FIG. 2, as an example.

First, extracting information from the document

In the implementation, the invention takes Javadoc documentation in HTML format as input and extracts information about classes and functions from it. For classes, it mainly extracts the package name and the class declaration. For functions, it mainly extracts (1) the function declaration; (2) the parameter names and their explanations; (3) the sentences describing the behavior of the function; (4) the sentences describing the return value; and (5) the exceptions thrown by the function together with the conditions under which they are thrown.
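The extraction of function-level information can be illustrated with a minimal Java sketch. It is not the patented implementation: the `FunctionDoc` and `JavadocExtractor` names are hypothetical, the `SAMPLE` fragment is a simplified hand-written stand-in for real Javadoc HTML (whose markup differs between JDK versions), and regular expressions are used only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for the per-function information listed above. */
final class FunctionDoc {
    String declaration = "";
    final Map<String, String> params = new LinkedHashMap<>();     // name -> explanation
    final List<String> behavior = new ArrayList<>();              // behavior sentences
    String returns = "";
    final Map<String, String> exceptions = new LinkedHashMap<>(); // type -> condition
}

final class JavadocExtractor {
    // Simplified fragment written for this sketch; real Javadoc markup is richer.
    static final String SAMPLE =
        "<pre>public int indexOf(Object o)</pre>"
      + "<div class=\"block\">Returns the index of the first occurrence of the specified "
      + "element in this list, or -1 if this list does not contain the element. "
      + "More formally, returns the lowest index i such that "
      + "(o==null ? get(i)==null : o.equals(get(i))), or -1 if there is no such index.</div>"
      + "<dl><dt>Parameters:</dt><dd><code>o</code> - element to search for</dd>"
      + "<dt>Returns:</dt><dd>the index of the first occurrence of the specified element "
      + "in this list, or -1 if this list does not contain the element</dd></dl>";

    /** Removes tags and non-breaking spaces from an HTML snippet. */
    private static String text(String html) {
        return html.replaceAll("<[^>]+>", "").replace("&nbsp;", " ").trim();
    }

    /** Returns the plain text of the first capture group, or "" if there is no match. */
    private static String first(String regex, String html) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(html);
        return m.find() ? text(m.group(1)) : "";
    }

    /** Splits a "name - explanation" entry; only one entry per section is handled here. */
    private static void putPair(String entry, Map<String, String> target) {
        int dash = entry.indexOf(" - ");
        if (dash > 0) {
            target.put(entry.substring(0, dash), entry.substring(dash + 3));
        }
    }

    static FunctionDoc extract(String html) {
        FunctionDoc doc = new FunctionDoc();
        doc.declaration = first("<pre>(.*?)</pre>", html);
        String block = first("<div class=\"block\">(.*?)</div>", html);
        if (!block.isEmpty()) {
            // Naive sentence split: a period followed by whitespace ends a sentence.
            for (String sentence : block.split("(?<=\\.)\\s+")) {
                doc.behavior.add(sentence);
            }
        }
        doc.returns = first("<dt>Returns:</dt>\\s*<dd>(.*?)</dd>", html);
        putPair(first("<dt>Parameters:</dt>\\s*<dd>(.*?)</dd>", html), doc.params);
        putPair(first("<dt>Throws:</dt>\\s*<dd>(.*?)</dd>", html), doc.exceptions);
        return doc;
    }
}
```

Calling `JavadocExtractor.extract(JavadocExtractor.SAMPLE)` fills the declaration, one parameter entry, two behavior sentences, and the return description. A production extractor would more likely walk the HTML with a proper parser than rely on regular expressions.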

Second, preprocessing

The invention preprocesses the extracted description sentences in three main ways:

(1) Equivalence analysis: words are divided into equivalence classes according to a domain dictionary to reduce repeated processing. For example, in the behavior description of a function, "insert" and "add" are semantically equivalent.

(2) Redundant information deletion: the invention attempts to delete sentences that merely explain other sentences. For example, the sentence beginning with "More formally" in FIG. 2 further explains the sentence before it.

(3) Sentence enhancement: the sentences in the return value and exception descriptions of Javadoc are often incomplete, and the invention attempts to complete them.
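The three preprocessing operations can be sketched as follows. The `Preprocessor` class, the tiny dictionary, and the concrete completion rules (prefixing "Returns" or "Throws") are assumptions made for illustration; the source does not specify how the domain dictionary is built or how incomplete sentences are completed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class Preprocessor {
    // Hypothetical domain dictionary: word -> representative of its equivalence class.
    static final Map<String, String> EQUIVALENTS = Map.of("insert", "add", "inserts", "adds");

    /** (1) Equivalence analysis: replaces each word by its class representative. */
    static String normalize(String sentence) {
        String[] tokens = sentence.split(" ");
        for (int i = 0; i < tokens.length; i++) {
            String rep = EQUIVALENTS.get(tokens[i].toLowerCase());
            if (rep != null) {
                // Keep the capitalization of the original token.
                boolean capitalized = Character.isUpperCase(tokens[i].charAt(0));
                tokens[i] = capitalized
                        ? Character.toUpperCase(rep.charAt(0)) + rep.substring(1)
                        : rep;
            }
        }
        return String.join(" ", tokens);
    }

    /** (2) Redundant information deletion: drops sentences that only restate the previous one. */
    static List<String> dropRedundant(List<String> sentences) {
        List<String> kept = new ArrayList<>();
        for (String s : sentences) {
            if (!s.startsWith("More formally")) {
                kept.add(s);
            }
        }
        return kept;
    }

    /** (3) Sentence enhancement, assumed rule: give a return description a verb. */
    static String enhanceReturn(String fragment) {
        return fragment.startsWith("Returns") ? fragment : "Returns " + fragment;
    }

    /** (3) Sentence enhancement, assumed rule: turn "Type - condition" into a sentence. */
    static String enhanceThrows(String exceptionType, String condition) {
        return "Throws " + exceptionType + " " + condition;
    }
}
```

For instance, `Preprocessor.enhanceReturn("the index of the first occurrence")` yields a complete sentence starting with "Returns", which a parser can then treat like an ordinary behavior sentence.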

Third, generating syntax trees for natural language sentences

The invention uses the natural language processing tool Stanford Parser to generate a syntax tree for each preprocessed sentence and to identify the part of speech of each word, the phrase structure, and so on. Some words have a fixed part of speech in the computing domain, but Stanford Parser has no such domain knowledge, so the invention includes a part-of-speech restriction module that makes Stanford Parser tag words related to the programming domain with the expected part of speech. For the sentence "Returns the index of the first occurrence of the specified element in this list, or -1 if this list does not contain the element" in FIG. 2, Stanford Parser generates the syntax tree shown in FIG. 3.

Fourth, transforming the syntax tree structure to generate multiple variants

The invention carries out structural transformation on the syntax tree generated by the Stanford Parser to generate different variants, and each variant represents different semantics.

To deal with natural language ambiguity, Stanford Parser can return K syntax trees with different semantics when the user specifies a value K. Setting K to a large value increases the probability of obtaining a correct syntax tree, but parsing then takes much longer and may cause performance problems. Moreover, even a large K does not guarantee that a correct syntax tree is produced.

The sentences analyzed by the invention belong to a specific domain, so a syntax tree expressing the correct semantics can be obtained by structurally transforming a parsed tree. Ambiguity typically arises when "or", "and", or "," appears in a sentence. In this case, the invention takes the syntax tree with the highest probability and generates the correct tree by moving the "or", "and", or "," node together with all of its right siblings up or down several levels. For the syntax tree in FIG. 3, moving the "," node, the "or" node, and the "-1 if this list does not contain the element" node up five times yields the syntax tree shown in FIG. 4, which expresses the meaning the sentence actually intends.
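The upward move described above can be sketched on a small tree. The `SyntaxNode` and `TreeTransform` names and the bracketed printing are hypothetical, and the rule that the moved nodes are reattached immediately to the right of their former parent is an assumption about how a move is realized.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal syntax tree node with a parent link. */
final class SyntaxNode {
    final String label;
    SyntaxNode parent;
    final List<SyntaxNode> children = new ArrayList<>();

    SyntaxNode(String label, SyntaxNode... kids) {
        this.label = label;
        for (SyntaxNode kid : kids) {
            kid.parent = this;
            children.add(kid);
        }
    }

    /** Prints the tree in bracketed form, e.g. (NP index (PP of occurrence)). */
    @Override
    public String toString() {
        if (children.isEmpty()) {
            return label;
        }
        StringBuilder sb = new StringBuilder("(" + label);
        for (SyntaxNode child : children) {
            sb.append(' ').append(child);
        }
        return sb.append(')').toString();
    }
}

final class TreeTransform {
    /**
     * Moves the node and all of its right siblings one level up, placing them
     * directly after their former parent. Returns false when the move is not
     * possible (no grandparent, or the move would leave the parent empty).
     */
    static boolean moveUp(SyntaxNode node) {
        SyntaxNode parent = node.parent;
        if (parent == null || parent.parent == null) {
            return false;
        }
        int from = parent.children.indexOf(node);
        if (from == 0) {
            return false;
        }
        SyntaxNode grand = parent.parent;
        List<SyntaxNode> tail = parent.children.subList(from, parent.children.size());
        List<SyntaxNode> moved = new ArrayList<>(tail);
        tail.clear(); // removes the moved nodes from the old parent
        grand.children.addAll(grand.children.indexOf(parent) + 1, moved);
        for (SyntaxNode n : moved) {
            n.parent = grand;
        }
        return true;
    }
}
```

Applying `moveUp` repeatedly to the "," node and recording the tree after each call produces one variant per level, which mirrors how moving the node several times yields different candidate readings of the same sentence.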

Fifth, generating an intermediate representation

The invention constructs an intermediate representation in the form of a tree structure by identifying parameters and program structures in a syntax tree based on domain knowledge.

(1) Identifying parameters

The way documentation describes parameters is not arbitrary but follows regular patterns, so the invention uses rules to identify the words or phrases in a sentence that refer to parameters. Because natural language is ambiguous, when the system cannot determine whether a word or phrase describes a parameter, it models both cases, one treating the phrase as a parameter and one not, and relies on the model filtering in the final step to exclude the wrong case. For the syntax tree in FIG. 4, the invention associates "the specified element" with the parameter o. At this step it cannot determine whether "the element" also describes the parameter o, so it keeps both possibilities as candidates.

In Javadoc, when the phrase "this WORD" appears and WORD is the name of the class currently being processed or an abbreviation of that name, the invention adds a "this" label to the subtree. The label indicates that the operation acts on the instance on which the method is invoked. For example, "this list" in FIG. 4 indicates that the operation is for an object of the class java.

After the identification of the parameters, the present invention transforms the syntax tree of FIG. 4 into the syntax tree of FIG. 5.
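The two-case treatment of uncertain phrases can be sketched as an enumeration of bindings. The `ParamCandidates` class and the map-based representation of a binding are hypothetical; the sketch only shows how n uncertain phrases lead to 2^n candidates that are later filtered by testing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ParamCandidates {
    /**
     * Returns every way of resolving the uncertain phrases. Each candidate maps
     * phrases to parameter names; an uncertain phrase is either bound to the
     * given parameter or left out (treated as not describing a parameter).
     */
    static List<Map<String, String>> enumerate(
            Map<String, String> certain, List<String> uncertain, String param) {
        List<Map<String, String>> candidates = new ArrayList<>();
        for (int mask = 0; mask < (1 << uncertain.size()); mask++) {
            Map<String, String> binding = new LinkedHashMap<>(certain);
            for (int i = 0; i < uncertain.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    binding.put(uncertain.get(i), param);
                }
            }
            candidates.add(binding);
        }
        return candidates;
    }
}
```

For the indexOf sentence, calling `enumerate` with "the specified element" bound to `o` and "the element" marked as uncertain gives two candidates: one in which only "the specified element" refers to `o`, and one in which both phrases do.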

(2) Identifying program structure

By default, program statements are executed sequentially, so the invention only needs to identify loop structures and branch structures in the syntax tree.

For loop structures: in natural language, plural nouns and singular nouns modified by "each" often mean that the behavior described in the sentence is executed repeatedly, which requires a loop. Sometimes the loop is less obvious. For example, the phrase "the first occurrence of" implies an iteration order and therefore also requires a loop. In this case the invention adds an "ltr" (left to right) label to the corresponding node of the syntax tree and deletes the subtree representing "the first occurrence of".

For branch structures, the words "if" and "where" indicate an if branch and "otherwise" indicates an else branch. The invention adds the corresponding label ("if" or "else") to the affected subtree and deletes the node for the marker word itself. For the syntax tree in FIG. 5, the invention deletes the node representing "if" and generates the "-1 tag: if" node shown in FIG. 6. The invention also checks carefully whether the condition is stated positively or negatively in order to determine the if condition. In FIG. 5, "contain" is modified by "does not", which means that the behavior in the if branch is triggered when the "contain" check fails. In this case the invention deletes the subtree representing "does not" and adds the label "-" to the tree node representing "contain", giving the "contain: -" node in FIG. 6.

After the identification of the program structure, the present invention generates the intermediate representation in FIG. 6 from the syntax tree shown in FIG. 5.
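The labeling of loop, branch, and negation markers can be sketched as a single tree rewrite. The `IrNode` and `StructureTagger` names are hypothetical, the marker phrases are matched as whole node texts, and attaching each label to the marker's parent node is a simplifying assumption; the actual placement of labels follows the figures.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Minimal intermediate-representation node carrying a list of labels. */
final class IrNode {
    final String text;
    final List<String> tags = new ArrayList<>();
    final List<IrNode> children = new ArrayList<>();

    IrNode(String text, IrNode... kids) {
        this.text = text;
        children.addAll(List.of(kids));
    }

    /** Prints "text:tags" for labeled nodes, in bracketed tree form. */
    @Override
    public String toString() {
        String head = tags.isEmpty() ? text : text + ":" + String.join(",", tags);
        if (children.isEmpty()) {
            return head;
        }
        StringBuilder sb = new StringBuilder("(" + head);
        for (IrNode child : children) {
            sb.append(' ').append(child);
        }
        return sb.append(')').toString();
    }
}

final class StructureTagger {
    /** Deletes marker children in place and records each one as a label on its parent. */
    static void tag(IrNode node) {
        for (Iterator<IrNode> it = node.children.iterator(); it.hasNext();) {
            IrNode child = it.next();
            switch (child.text) {
                case "if":
                    node.tags.add("if");
                    it.remove();
                    break;
                case "otherwise":
                    node.tags.add("else");
                    it.remove();
                    break;
                case "does not":
                    node.tags.add("-"); // negated condition
                    it.remove();
                    break;
                case "the first occurrence of":
                    node.tags.add("ltr"); // left-to-right loop
                    it.remove();
                    break;
                default:
                    tag(child);
            }
        }
    }
}
```

Running `tag` on a clause node "-1" with the children "if" and "contain" (where "contain" has the children "does not", "this list", and "the element") leaves a "-1" node labeled "if" over a "contain" node labeled "-", which corresponds to the shape of the nodes described for FIG. 6.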

Sixth, generating candidate code models

The invention identifies the operation semantics in the intermediate representation to generate candidate code models. It defines a set of primitives, each consisting of a tree template and the code template corresponding to that tree template. Starting from the root node of the intermediate representation, the invention covers the intermediate representation with tree templates. Once the intermediate representation is completely covered, the code templates corresponding to the tree templates are instantiated to generate the code fragment for the intermediate representation. For the intermediate representation in FIG. 6, the invention generates the covering shown in FIG. 7 and the code fragment shown in FIG. 8. After the code fragment for each intermediate representation has been generated, the code fragments for all sentences in the documentation of a function are combined to generate a code model of the function, and the code models of the functions are then combined to generate a code model of the class. A sentence may have several intermediate representations, so the invention may generate several code models for a function and, likewise, several code models for a class.
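Covering and instantiation can be sketched with a deliberately simplified primitive set. In this sketch each tree template is a single node identified by its operation name and number of children, whereas the primitives described above may span larger subtrees; the `Ir` and `Synthesizer` names, the operation names, the `$i` placeholders, and the code templates are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Minimal intermediate-representation node: an operation with ordered children. */
final class Ir {
    final String op;
    final List<Ir> kids = new ArrayList<>();

    Ir(String op, Ir... kids) {
        this.op = op;
        this.kids.addAll(List.of(kids));
    }
}

final class Synthesizer {
    // Hypothetical primitives: "op/arity" (tree template) -> code template.
    // $0, $1, ... are replaced by the code generated for the matching child.
    static final Map<String, String> PRIMITIVES = Map.of(
        "first-index/2",
        "for (int i = 0; i < $0.size(); i++) { if (java.util.Objects.equals($0.get(i), $1)) return i; }",
        "return/1", "return $0;",
        "seq/2", "$0 $1");

    /**
     * Covers the tree top-down. Returns the instantiated code, or null when
     * some inner node matches no primitive, i.e. the covering is incomplete.
     */
    static String cover(Ir node) {
        if (node.kids.isEmpty()) {
            return node.op; // leaves carry identifiers or literals verbatim
        }
        String code = PRIMITIVES.get(node.op + "/" + node.kids.size());
        if (code == null) {
            return null;
        }
        for (int i = 0; i < node.kids.size(); i++) {
            String sub = cover(node.kids.get(i));
            if (sub == null) {
                return null;
            }
            code = code.replace("$" + i, sub);
        }
        return code;
    }
}
```

Covering `seq(first-index(this, o), return(-1))` produces a left-to-right scan followed by `return -1;`. A tree containing an operation without a primitive yields `null`, which corresponds to discarding an intermediate representation that cannot be completely covered.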

Seventh, removing erroneous candidate models

Not every syntax tree represents the correct semantics, so the invention uses testing to filter out wrong code models. For each candidate class model, the invention first compiles the model and uses the Randoop tool to generate unit test cases for it; these test cases capture the behavior of the candidate class model. The test cases are then run against the original JDK class library. If any test case fails, the candidate class model and the original class library behave inconsistently, and the invention discards the candidate. If all test cases of a candidate class model pass, the invention regards the candidate as the expected code model of the class library.
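The filtering idea can be illustrated without a test generator. In the sketch below, exhaustive enumeration of small inputs stands in for the generated unit tests, and each candidate is compared directly with `java.util.ArrayList.indexOf`; the `ModelFilter` class and the two candidates (`firstIndex`, and `lastIndex` for a reading that lost the left-to-right order) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.BiFunction;

final class ModelFilter {
    /** Candidate model: left-to-right scan, -1 when the element is absent. */
    static int firstIndex(List<Object> list, Object o) {
        for (int i = 0; i < list.size(); i++) {
            if (Objects.equals(list.get(i), o)) {
                return i;
            }
        }
        return -1;
    }

    /** Wrong candidate: scans right to left, so it finds the last occurrence. */
    static int lastIndex(List<Object> list, Object o) {
        for (int i = list.size() - 1; i >= 0; i--) {
            if (Objects.equals(list.get(i), o)) {
                return i;
            }
        }
        return -1;
    }

    /** Returns true when the candidate agrees with ArrayList.indexOf on all small inputs. */
    static boolean consistent(BiFunction<List<Object>, Object, Integer> candidate) {
        Object[] values = {null, "a", "b"};
        // Build every list of length 0..3 over the three values.
        List<List<Object>> inputs = new ArrayList<>();
        inputs.add(new ArrayList<>());
        int start = 0;
        for (int len = 0; len < 3; len++) {
            int end = inputs.size();
            for (int k = start; k < end; k++) {
                for (Object v : values) {
                    List<Object> longer = new ArrayList<>(inputs.get(k));
                    longer.add(v);
                    inputs.add(longer);
                }
            }
            start = end;
        }
        for (List<Object> input : inputs) {
            for (Object target : values) {
                int expected = new ArrayList<Object>(input).indexOf(target);
                if (candidate.apply(input, target) != expected) {
                    return false; // behavior differs from the original class
                }
            }
        }
        return true;
    }
}
```

With these definitions, `ModelFilter.consistent(ModelFilter::firstIndex)` holds, while `lastIndex` is rejected because it disagrees with the library on lists that contain the target more than once, such as a list holding "a" twice.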

While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims

1. A method for automatically synthesizing a code model based on a library function of a document is characterized by comprising the following steps:

step one, obtaining useful information in a document;

step two, generating a grammar tree for the natural language sentence;

step three, transforming the syntax tree to generate a plurality of variants of the syntax tree, wherein each variant represents different semantics;

step four, identifying parameters, program structures and operation semantics in the syntax tree to generate candidate code models;

step five, checking the candidate model, and deleting the candidate model with inconsistent behavior with the original class library;

in the fourth step:

step 4.1, associating tree nodes in the syntax tree with variables, program structures and operation semantics forming program statements;

step 4.2, analyzing the associated syntax tree, and translating the syntax tree into code segments;

step 4.3, combining the code segments corresponding to all sentences in the document of one function to generate a code model of the function;

step 4.4, combining the code models of all functions in a class to generate a code model of the class.

2. The automated synthesis method according to claim 1, wherein: the document information includes the package name, class name, function declaration, function behavior description, function return value description, and information on the exceptions thrown by the function.

3. The automated synthesis method according to claim 1, wherein: redundant information contained in the document is removed and attempts are made to complete the incomplete information.

4. The automated synthesis method according to claim 1, wherein: moving up or pushing down part of tree nodes in the syntax tree to generate different variants of the syntax tree, wherein each variant represents one possible semantic information; in this way, the ambiguity inherent to natural language is resolved.

5. The automated synthesis method according to claim 1, wherein in step five:

step 5.1, compiling any candidate class model and automatically generating unit test cases for the candidate class model by using the Randoop tool, wherein the test cases embody the behavior of the candidate class model;

step 5.2, running the test cases generated in step 5.1 on the original JDK class library by using JUnit; if any test case fails, which means that the candidate class model and the original class library have inconsistent behaviors, the candidate class model is discarded; if all test cases of a candidate class model pass, the candidate is considered to be the code model of the expected class library.