CN113419720A

CN113419720A - Automatic judgment method for necessity of abbreviation expansion for source code

Info

Publication number: CN113419720A
Application number: CN202110762787.5A
Authority: CN
Inventors: 刘辉; 罗晓青; 姜艳杰
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2021-07-06
Filing date: 2021-07-06
Publication date: 2021-09-21
Anticipated expiration: 2041-07-06
Also published as: CN113419720B

Abstract

The invention relates to a method for automatically judging whether an abbreviation in source code needs to be expanded, and belongs to the technical field of computer software quality maintenance. First, common abbreviations that are frequently used by different developers in similar contexts are collected from a corpus of source code using data mining techniques. A given abbreviation is not expanded if it matches at least one of the common abbreviations found. From the same corpus, probability distributions of the lengths of different types of identifiers are calculated. An abbreviation is also not expanded if its full term is contained in its surrounding context, i.e. in the same source code line. Abbreviations that none of these heuristics decides to keep are expected to be replaced by their full terms. For a given test project, most of the abbreviations are classified correctly, with only a very small proportion classified incorrectly. The method is very accurate in selecting abbreviations that do not need to be expanded, with precision as high as 98% and recall as high as 96%.

Description

Automatic judgment method for necessity of abbreviation expansion for source code

Technical Field

The invention relates to a method for automatically judging whether an abbreviation needs to be expanded into a complete term, and belongs to the technical field of computer software quality maintenance.

Background

Identifiers account for a large portion (70%) of software source code. These identifiers, which are composed of natural language terms, are a major source of information for understanding software. Meaningful identifiers are very helpful in understanding source code, so well-chosen identifiers are particularly important.

Abbreviations are widely used to shorten identifiers. Developers often replace a series of terms in an identifier with a short abbreviation. For example, "e" is often used to denote "exception", "XMLParser" is used to denote "ExtensibleMarkupLanguageParser", and so on. Appropriate abbreviations can greatly facilitate typing, typesetting, and reading lengthy source code.

However, abbreviations can also significantly reduce the readability and maintainability of software source code if used improperly. For example, "s" (for "students") and "ds" (for "data sequence") are inappropriately used abbreviations. Apart from the code's authors, other developers may find it difficult to ascertain the exact meaning of these abbreviations, which may lead to misunderstanding and improper use of the software.

To this end, the prior art has proposed automated methods that provide the full term for a given abbreviation. For example, a software developer can use these tools to replace abbreviations with their full terms by renaming. However, there has been no automated method for determining whether an abbreviation needs to be expanded, i.e., whether the abbreviation should be replaced with its corresponding full term. Making such decisions is often challenging for inexperienced software developers and maintenance personnel, because the decisions have no quantitative guidance and depend entirely on the experience and intuition of the developer.

Therefore, it is important to find a method by which a software development tool, a software quality maintenance tool, or the like can automatically determine whether an abbreviation needs to be expanded.

Disclosure of Invention

The invention aims to solve the technical problem of automatically judging whether an abbreviation needs to be expanded or not in the process of developing and maintaining computer software, namely, automatically judging whether the abbreviation should be replaced by a corresponding complete term or not.

The rationale of the method is that an abbreviation should not be expanded if expanding it would result in a lengthy identifier, or if a developer or maintainer could easily find the meaning (i.e., the full term) of the abbreviation based on domain knowledge or the context of the abbreviation.

Following this principle, the invention provides a series of heuristics for selecting abbreviations that do not need to be expanded. First, common abbreviations that are frequently used by different developers in similar contexts are collected from a corpus of source code using data mining techniques. The key to the data mining is to transform the problem of mining common abbreviations into the widely studied maximum clique problem. A given abbreviation is not expanded if it matches at least one of the common abbreviations found. Second, from the same corpus, probability distributions of the lengths of different types of identifiers (e.g., variable names and method names) are computed. The probability distribution specifies the likelihood that an identifier of type T consists of exactly n characters. The heuristic is as follows: an abbreviation is not expanded if expanding it would reduce the probability of the length of its enclosing identifier (the identifier that contains the abbreviation). Finally, an abbreviation is not expanded if its full term is contained in its surrounding context, i.e. in the same source code line. Abbreviations that none of these heuristics decides to keep are expected to be replaced with their full terms.

Advantageous effects

The method is the first automated method for judging whether an abbreviation in source code needs to be expanded, and its heuristics focus on different aspects of an abbreviation, namely length, popularity, and context. It has the following beneficial effects:

1. For a given test project, most (95%) abbreviations are classified correctly, with only 5% classified incorrectly.

2. The method is very accurate in selecting abbreviations that do not need to be expanded, with precision as high as 98% and recall as high as 96%.

Drawings

FIG. 1 is a schematic flow diagram of the process of the present invention.

Detailed Description

The invention is further illustrated and described in detail below with reference to the figures and examples.

The present invention is realized by the following technical means.

As shown in FIG. 1, the method for automatically determining the necessity of expanding an abbreviation in source code includes an offline mining phase and an online classification phase. In the offline mining phase, the probability distribution of identifier lengths and the common abbreviations are learned from a corpus of source code. In the online classification phase, a series of heuristic-based filters is applied to determine whether a given abbreviation needs to be expanded.

The method specifically comprises the following steps:

Step 1: Analyze identifier lengths.

First, all identifiers (i.e., names of software entities) are extracted from a corpus of software source code and classified according to the type of entity (e.g., variable name, method name, class name, etc.). For each type of identifier, a probability distribution of its length is calculated.

In this step, the extracted identifiers are classified into 5 types, namely variable names, parameter names, method names, class names, and field names, regardless of whether they contain abbreviations. For each type of identifier, a probability distribution P of its length is calculated, where P(T, n) denotes the probability that an identifier of type T is composed of exactly n characters.
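
As a minimal sketch of how P(T, n) might be estimated, the following Python fragment counts identifier lengths per type and normalizes the counts. The function name `length_distribution` and the toy corpus are assumptions made for illustration; extracting identifiers from real source code is not shown.

```python
from collections import Counter, defaultdict


def length_distribution(identifiers):
    """Estimate P(T, n) from (type, name) pairs.

    Returns a dict mapping each identifier type T to a dict {n: P(T, n)}.
    """
    counts = defaultdict(Counter)
    for id_type, name in identifiers:
        counts[id_type][len(name)] += 1
    distribution = {}
    for id_type, length_counts in counts.items():
        total = sum(length_counts.values())
        distribution[id_type] = {n: c / total for n, c in length_counts.items()}
    return distribution


# Hypothetical toy corpus; a real corpus would contain far more identifiers.
corpus = [
    ("variable", "idx"),
    ("variable", "cnt"),
    ("variable", "index"),
    ("variable", "total"),
    ("method", "getName"),
    ("method", "parseXml"),
]
P = length_distribution(corpus)
print(P["variable"][3])  # 0.5
```

Lengths that never occur for a type are simply absent from the inner dictionary, so a caller would treat a missing length as probability 0.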

Step 2: Extract from the source code corpus the abbreviations whose maximum clique size (i.e., number of vertices) is not less than a predefined threshold β.

Lexically identical abbreviations, wherever they are extracted from, are represented in one graph: each abbreviation occurrence is a node, and the weight of an edge represents the contextual similarity between the two occurrences it connects. If there is a large subgraph in which every pair of nodes is connected by an edge, the subgraph (called a maximum clique) represents a common abbreviation that is widely used in similar contexts.

In this step, abbreviations are identified and extracted from the source code corpus using graph-based data mining techniques. Each extracted abbreviation is represented as a tuple that includes the text and the context of the abbreviation, where the context consists of identifiers. These identifiers are vectorized using the Paragraph2Vector algorithm. One significant advantage of the Paragraph2Vector algorithm is that semantically related texts yield similar vectors, so the similarity of the vectors produced by Paragraph2Vector can be used to represent the similarity of the original contexts.
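
As a minimal sketch of this representation, the following Python fragment models an extracted abbreviation as a (text, context) tuple and compares two already-vectorized contexts. The class name `AbbreviationOccurrence`, the toy vectors, and the choice of cosine similarity are assumptions made for illustration; the description states only that the similarity of the resulting vectors is used and does not name a particular measure. The vectorization itself is not shown.

```python
import math
from typing import NamedTuple, Sequence, Tuple


class AbbreviationOccurrence(NamedTuple):
    """One extracted abbreviation: its text plus the identifiers forming its context."""
    text: str
    context: Tuple[str, ...]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two context vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical occurrence and hypothetical vectors standing in for vectorizer output.
occurrence = AbbreviationOccurrence(text="msg", context=("sendMsg", "msgQueue"))
vector_a = [0.9, 0.1, 0.3]
vector_b = [0.8, 0.2, 0.4]
print(occurrence.text, round(cosine_similarity(vector_a, vector_b), 3))
```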

For each group of lexically identical abbreviations, an undirected graph G is constructed, where nodes represent the abbreviations in the group and the weights of edges represent the contextual similarity between the abbreviations. To identify common abbreviations that are often used by different developers in similar contexts, the problem is translated into the widely studied maximum clique problem. To mine common abbreviations used in highly similar contexts, the edge between two vertices is deleted if the contextual similarity of the two abbreviations (represented by the two vertices) is less than a predefined threshold α. The maximum clique of the resulting graph represents a popular abbreviation that is often used in highly similar contexts, and the size of the clique indicates the popularity of the abbreviation. To remove less popular abbreviations, only the abbreviations whose maximum clique size (i.e., number of vertices) is not less than the predefined threshold β are retained.
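
The graph construction and clique search can be sketched as follows. This is an illustrative Python fragment, not the patented implementation: the description does not say which maximum clique algorithm is used, so the Bron-Kerbosch algorithm with pivoting is an assumption, as are the function names, the node labels `o1` to `o4`, the similarity values, and the toy thresholds.

```python
from itertools import combinations


def build_graph(nodes, similarity, alpha):
    """Keep an edge only when the contextual similarity is at least alpha."""
    adjacency = {node: set() for node in nodes}
    for u, v in combinations(nodes, 2):
        if similarity[frozenset((u, v))] >= alpha:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


def maximum_clique(adjacency):
    """Return one largest clique, found by Bron-Kerbosch with pivoting."""
    best = set()

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            if len(clique) > len(best):
                best = set(clique)
            return
        # Pivot on the node with the most neighbours among the candidates.
        pivot = max(candidates | excluded, key=lambda n: len(adjacency[n] & candidates))
        for node in list(candidates - adjacency[pivot]):
            expand(clique | {node}, candidates & adjacency[node], excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return best


# Hypothetical occurrences of one abbreviation and their pairwise context similarities.
nodes = ["o1", "o2", "o3", "o4"]
similarity = {
    frozenset(("o1", "o2")): 0.9,
    frozenset(("o1", "o3")): 0.8,
    frozenset(("o2", "o3")): 0.7,
    frozenset(("o1", "o4")): 0.2,
    frozenset(("o2", "o4")): 0.1,
    frozenset(("o3", "o4")): 0.5,
}
ALPHA, BETA = 0.4, 3  # toy thresholds chosen for this tiny example
graph = build_graph(nodes, similarity, ALPHA)
clique = maximum_clique(graph)
is_common = len(clique) >= BETA
print(sorted(clique), is_common)  # ['o1', 'o2', 'o3'] True
```

Finding a maximum clique is NP-hard in general, so an exact search like this one is practical only when the pruned graphs stay small or sparse.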

Step 3: Apply the length heuristic, using the probability distribution of identifier lengths calculated for each type of identifier in step 1.

For a given abbreviation in an identifier id of type T, replacing the abbreviation with its full term increases the length of id from k characters to j characters. If P(T, j) < P(T, k), the abbreviation is not expanded. The rationale for this heuristic is that expanding the abbreviation would make the length of the identifier less acceptable (i.e., less common), so it is better not to expand it. Otherwise, the other heuristics are applied to the abbreviation to reach a final decision.
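
The comparison P(T, j) < P(T, k) can be sketched in a few lines of Python. The function name `keeps_abbreviation_by_length`, the probability values, and the use of a plain first-occurrence string replacement to build the expanded identifier are assumptions made for illustration.

```python
def keeps_abbreviation_by_length(P, id_type, identifier, abbreviation, full_term):
    """Return True when expansion would make the identifier's length less probable."""
    expanded = identifier.replace(abbreviation, full_term, 1)
    k, j = len(identifier), len(expanded)
    lengths = P.get(id_type, {})
    # Lengths never seen in the corpus are treated as probability 0.
    return lengths.get(j, 0.0) < lengths.get(k, 0.0)


# Hypothetical length probabilities for variable names.
P = {"variable": {1: 0.02, 6: 0.12, 8: 0.09}}

# "maxIdx" (6 characters) would become "maxIndex" (8 characters): 0.09 < 0.12.
print(keeps_abbreviation_by_length(P, "variable", "maxIdx", "Idx", "Index"))  # True

# "s" (1 character) would become "students" (8 characters): 0.09 is not below 0.02.
print(keeps_abbreviation_by_length(P, "variable", "s", "s", "students"))  # False
```

A result of False does not mean the abbreviation is expanded immediately; it only means this heuristic defers to the ones that follow.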

Step 4: Apply the popularity heuristic, using the set of maximum cliques generated in step 2 by identifying abbreviations in the source code corpus through data mining. The specific steps are as follows:

Search all abbreviations in the project that contains the abbreviation abb_i, and count the abbreviations that are lexically identical to abb_i. If this number is greater than the threshold γ, the abbreviation is not expanded. If there is a maximum clique that is lexically identical to abb_i, and the average contextual similarity between the nodes within the clique and abb_i is greater than the threshold β, the abbreviation is not expanded.
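
The two popularity checks of this step can be sketched as follows. All names, the toy project, and the numeric values are assumptions made for illustration; the similarities between the abbreviation's context and the clique nodes are assumed to be precomputed, and the similarity threshold is passed in as a plain parameter.

```python
from collections import Counter


def is_frequent_in_project(abbreviation, project_abbreviations, gamma):
    """True if the abbreviation occurs more than gamma times in its project."""
    return Counter(project_abbreviations)[abbreviation] > gamma


def matches_mined_clique(similarities_to_clique, threshold):
    """True if the average similarity to the clique's nodes exceeds the threshold."""
    if not similarities_to_clique:
        return False  # no lexically identical clique was mined
    average = sum(similarities_to_clique) / len(similarities_to_clique)
    return average > threshold


def keeps_abbreviation_by_popularity(abbreviation, project_abbreviations, gamma,
                                     similarities_to_clique, threshold):
    """Keep the abbreviation if either popularity check succeeds."""
    return (is_frequent_in_project(abbreviation, project_abbreviations, gamma)
            or matches_mined_clique(similarities_to_clique, threshold))


# Hypothetical project in which "ctx" occurs 30 times and "ds" twice.
project = ["ctx"] * 30 + ["ds"] * 2
print(keeps_abbreviation_by_popularity("ctx", project, 25, [], 0.4))         # True
print(keeps_abbreviation_by_popularity("ds", project, 25, [0.2, 0.3], 0.4))  # False
```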

Step 5: If the abbreviation abb_i and its full term appear on the same line of source code that defines the enclosing identifier, the abbreviation is not expanded.

In step 5, the following method is used to identify abbreviations that are easy to interpret from their context:

Step A: First, for the abbreviation abb_i, the entire line of source code in which its enclosing identifier is defined is extracted as the context of the abbreviation, denoted CTX(abb_i);

Step B: The context CTX(abb_i) is then split at spaces, capital letters, and special characters (e.g., "(" and ")") into a sequence of tokens, denoted Seq(abb_i);

Step C: Let the full term of the abbreviation be <ω_1, ..., ω_n>. If every word ω_1, ..., ω_n has an equivalent token in Seq(abb_i), the abbreviation is not expanded. Two words are equivalent if they are the same or share the same root. For example, "thread" and "threads" are not identical, but they share the same root ("thread"), and thus they are considered equivalent.

Through steps A, B and C, it can be determined efficiently whether the abbreviation and its full term appear on the same line of source code that defines the enclosing identifier.
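
Steps A to C can be sketched as follows. The regular expression, the function names, and in particular the crude suffix-stripping `naive_root` are assumptions made for illustration; the description does not say how roots are obtained, and a real implementation would more likely use a proper stemmer.

```python
import re

# Runs of capitals (acronyms), capitalized or lowercase words, and digit runs.
TOKEN_PATTERN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def tokenize(line):
    """Split a source line at spaces, capital letters, and special characters."""
    return [token.lower() for token in TOKEN_PATTERN.findall(line)]


def naive_root(word):
    """Very rough stand-in for a stemmer: strip a few common English suffixes."""
    word = word.lower()
    if word.endswith("ss"):
        return word
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def context_contains_full_term(line, full_term_words):
    """True if every word of the full term has an equivalent token on the line."""
    roots = {naive_root(token) for token in tokenize(line)}
    return all(naive_root(word) in roots for word in full_term_words)


# Hypothetical source lines and full terms.
print(tokenize("XMLParser p = new XMLParser();"))
print(context_contains_full_term("ThreadPool tp = new ThreadPool();", ["thread", "pool"]))  # True
print(context_contains_full_term("List<Thread> ts = getThreads();", ["threads"]))           # True
print(context_contains_full_term("int s = 0;", ["students"]))                               # False
```

In the first two lines the full term is visible next to the abbreviation, so a reader can resolve it without expansion; in the last line nothing on the line explains "s".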

Step 6: If none of the preceding steps has decided that the abbreviation abb_i should be left unexpanded, abb_i is expanded.

Thus, through steps 1 through 6, an automated method of determining whether an abbreviation needs to be expanded is completed.
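
The online classification phase as a whole can be sketched as a chain of filters in which the first heuristic that fires keeps the abbreviation and anything left over is expanded. The function name `classify`, the filter names, and the precomputed flags on each occurrence are assumptions made for illustration; they stand in for the checks of steps 3 to 5.

```python
def classify(occurrence, filters):
    """Return ("keep", filter_name) for the first filter that fires, else ("expand", None)."""
    for name, keeps_abbreviation in filters:
        if keeps_abbreviation(occurrence):
            return ("keep", name)
    return ("expand", None)


# Hypothetical stand-ins for steps 3 to 5; each reads a precomputed flag.
FILTERS = [
    ("length", lambda occ: occ["length_less_probable"]),
    ("popularity", lambda occ: occ["is_common"]),
    ("context", lambda occ: occ["full_term_on_line"]),
]

common = {"length_less_probable": False, "is_common": True, "full_term_on_line": False}
obscure = {"length_less_probable": False, "is_common": False, "full_term_on_line": False}
print(classify(common, FILTERS))   # ('keep', 'popularity')
print(classify(obscure, FILTERS))  # ('expand', None)
```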

Examples

This embodiment details the steps and effects of applying the method to determine whether abbreviations need to be expanded, using open source projects on 5 different topics.

The open source software shown in Table 2 was tested in the hardware environment shown in Table 1.

Table 1: hardware environment configuration information table

Hardware environment configuration	Processor model	Memory	Operating system
Test environment	3.4GHz Core i7-6700	16 GB	64-bit Windows 10

Table 2: basic information table of open source software

Table 3: number of abbreviations in a project that need to be expanded versus unexpanded

Project	Number of samples	Positive samples	Negative samples	Ratio of positive samples
Mail	346	71	275	21%
Doc	346	88	258	25%
DrJ	378	63	315	17%
Dubbo	377	52	325	14%
jEdit	371	75	296	20%
TOTAL	1,818	349	1,469	19%

For the open source projects shown in Table 2, abbreviations are sampled from each project and a manual decision is made as to which ones need expansion.

The size of each sample is determined by the number of abbreviations in the test project. The minimum sample size was calculated using a sample size calculator with a margin of error of 5% and a confidence level of 95%. All samples are drawn randomly. For each of the 1818 sampled abbreviations, a human determines whether the abbreviation needs to be expanded. The resulting data set (called the golden set) is used as a benchmark in the later evaluation. The abbreviations in the golden set fall into two categories. The first category consists of positive samples, i.e., abbreviations that need to be expanded. The others belong to the second category, called negative samples.
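
The description does not say which sample size calculator was used. As one common possibility, the following Python sketch applies Cochran's formula with a finite population correction; the formula choice, the function name `minimum_sample_size`, the conservative proportion of 0.5, and the population of 10,000 abbreviations are all assumptions made for illustration.

```python
import math


def minimum_sample_size(population, margin_of_error=0.05, z_score=1.96, proportion=0.5):
    """Cochran's formula with finite population correction, rounded up.

    z_score=1.96 corresponds to a 95% confidence level.
    """
    n0 = (z_score ** 2) * proportion * (1 - proportion) / (margin_of_error ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population))


# Hypothetical population: a project containing 10,000 abbreviations.
print(minimum_sample_size(10_000))  # 370
```

Under this formula the required sample grows only slowly with the population, which would be consistent with per-project samples of similar size.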

The method is applied to the golden set, and the generated results are compared with the manual decisions. A suggestion for a given abbreviation is correct if and only if it is the same as the manual decision for that abbreviation. The classification metrics, namely accuracy, precision, and recall, are then calculated.

Three thresholds are used: α, β, and γ. α is the minimum contextual similarity between abbreviations in the same clique. β is the minimum size of the maximum clique of a common abbreviation. γ is the minimum number of times a domain-specific common abbreviation should appear in a single project (referred to as the minimum popularity of domain abbreviations). The evaluation is repeated with different threshold values. Note that only one threshold is changed at a time, in order to reveal the effect of each threshold explicitly.

Specifically, 1818 abbreviations from five open source projects were selected for the experiment. First, 3 developers manually decided whether each abbreviation should be expanded. All three developers had more than three years of Java experience. They were asked to make their decisions independently and, where they disagreed, to discuss the cases together until they reached agreement. Then the method of the present invention was applied to these abbreviations, the results were compared with the manual decisions, and the performance of the method was evaluated. The evaluation results are shown in Table 4. The table shows that most (95%) abbreviations are classified correctly, with only 5% classified incorrectly. Moreover, the method is very accurate in selecting abbreviations that do not need to be expanded: in retrieving negative samples, it achieves a precision of 98% and a recall of 96%. Performance varies slightly across projects on different topics. For example, the minimum and maximum precisions are 93% (on DocFetcher) and 96% (on DavMail and DrJava), respectively, indicating that the method is accurate.

Table 4: method performance

In Table 4, negative sample precision = number of true negative samples / (number of true negative samples + number of false negative samples);

negative sample recall = number of true negative samples / (number of true negative samples + number of false positive samples);

positive sample precision = number of true positive samples / (number of true positive samples + number of false positive samples);

positive sample recall = number of true positive samples / (number of true positive samples + number of false negative samples).
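
As a worked illustration of these definitions, the following Python sketch computes accuracy together with precision and recall for both classes from confusion counts, where a positive sample is an abbreviation that needs expansion and a negative sample is one that should be kept. The function name and the counts are hypothetical and are not taken from Table 4.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy plus per-class precision and recall from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "negative_precision": tn / (tn + fn),
        "negative_recall": tn / (tn + fp),
        "positive_precision": tp / (tp + fp),
        "positive_recall": tp / (tp + fn),
    }


# Hypothetical confusion counts, not taken from Table 4.
metrics = classification_metrics(tp=40, tn=90, fp=10, fn=10)
print(metrics["negative_precision"], metrics["positive_recall"])  # 0.9 0.8
```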

By varying the thresholds, it was found that decreasing any threshold decreases the recall of positive samples and increases the recall of negative samples. Increasing any threshold increases the precision of negative sample retrieval and decreases the precision of positive sample retrieval. The maximum accuracy is achieved at the default threshold values (α = 0.4, β = 15, γ = 25), while decreasing or increasing a threshold reduces the accuracy of the method.

The results of the evaluation of 1818 abbreviations from 5 open source applications show that the method is accurate with up to 95% accuracy.

While the foregoing is directed to the preferred embodiment of the present invention, it is not intended that the invention be limited to the embodiment and the drawings disclosed herein. Equivalents and modifications may be made without departing from the spirit of the disclosure, which is to be considered as within the scope of the invention.

Claims

1. A method for automatically judging the expansion necessity of an abbreviation oriented to source codes is characterized by comprising the following steps:

step 1: analyzing the length of the identifier;

firstly, extracting all identifiers from a corpus of software source codes, and classifying the identifiers according to the types of entities, wherein the types comprise variable names, parameter names, method names, class names and field names; for each type of identifier, calculating a probability distribution of its length;

step 2: extracting, from a source code corpus, abbreviations whose maximum clique size is not smaller than a predefined threshold β;

step 3: using the probability distribution of identifier lengths calculated for each type of identifier in step 1;

for a given abbreviation in an identifier id of type T, replacing the abbreviation with its full term will increase the length of id from k characters to j characters; if P(T, j) < P(T, k), the abbreviation is not expanded;

step 4: using the set of maximum cliques generated in step 2 by identifying abbreviations in the source code corpus through data mining;

searching all abbreviations in the project that contains the abbreviation abb_i, and counting the abbreviations that are lexically identical to abb_i; if this number is greater than the threshold γ, the abbreviation is not expanded; if there is a maximum clique that is lexically identical to abb_i, and the average contextual similarity between the nodes within the clique and abb_i is greater than the threshold β, the abbreviation is not expanded;

step 5: if the abbreviation abb_i and its full term appear on the same line of source code that defines the enclosing identifier, the abbreviation is not expanded;

step A: first, for the abbreviation abb_i, the entire line of source code in which its enclosing identifier is defined is extracted as the context of the abbreviation, denoted CTX(abb_i);

step B: the context CTX(abb_i) is then split at spaces, capital letters, and special characters into a sequence of tokens, denoted Seq(abb_i);

step C: let the full term of the abbreviation be <ω_1, ..., ω_n>; if every word ω_1, ..., ω_n has an equivalent token in Seq(abb_i), the abbreviation is not expanded; two words are equivalent if they are the same or share the same root;

through steps A, B and C, it can be determined efficiently whether the abbreviation and its full term appear on the same line of source code that defines the enclosing identifier;

step 6: if none of the preceding steps has decided that the abbreviation should be left unexpanded, the abbreviation is expanded.

2. The method for automatically determining the necessity of expanding abbreviations oriented to source codes as claimed in claim 1, wherein in step 2, the abbreviations are recognized and extracted from the source code corpus by using a graph-based data mining technology, and each obtained abbreviation is represented as a tuple, wherein the tuple comprises texts and contexts of the abbreviations; wherein, the context consists of an identifier, and the identifier is vectorized by using a Paragraph2Vector algorithm;

for each group of abbreviations, constructing an undirected graph G, wherein nodes represent the abbreviations in the group, and weights of edges represent the context similarity between the abbreviations;

deleting the edge between two vertices if the contextual similarity of the two abbreviations is less than a predefined threshold α; the maximum clique of the resulting graph represents a popular abbreviation, and the size of the clique represents the popularity of the abbreviation; only abbreviations whose maximum clique size is not smaller than the predefined threshold β are retained.