CN116185487A - Feature attachment reconstruction method based on code multi-level calling association - Google Patents

Feature attachment reconstruction method based on code multi-level calling association

Info

Publication number
CN116185487A
Authority
CN
China
Prior art keywords: class, vector, word, matrix, attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211046947.7A
Other languages
Chinese (zh)
Inventor
施重阳
毛赛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202211046947.7A priority Critical patent/CN116185487A/en
Publication of CN116185487A publication Critical patent/CN116185487A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/70 Software maintenance or management
    • G06F 8/72 Code refactoring
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/30 Creation or generation of source code
    • G06F 8/31 Programming languages or programming paradigms
    • G06F 8/315 Object-oriented languages
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a feature attachment reconstruction method based on code multi-level calling association, and belongs to the technical field of computer software reconstruction. The method uses pre-trained models to produce embedded representations of attributes and methods, reducing labor cost and the influence of subjective factors. By exploiting deeper and broader text information, it can fully represent the implicit features contained in the code and overcomes the limitations of numerical metric values. The method reduces the complexity of the neural network structure in the model and lowers time and resource consumption; the method embedding technique code2vec fuses the semantic and textual characteristics of method code, so that more comprehensive information is extracted; the hierarchical structure and call relationships in the code are emphasized, strengthening the modeling of the feature attachment characteristic and improving model performance. The method comprehensively considers the hierarchical nature of code and its special call relationships, can compute the correlation between a method and a class more directly, and improves reconstruction accuracy.

Description

Feature attachment reconstruction method based on code multi-level calling association
Technical Field
The invention relates to a method for reconstructing the feature attachment code smell, in particular to a feature attachment reconstruction method based on code multi-level calling association, and belongs to the technical field of computer software reconstruction.
Background
Software reconstruction (refactoring) is an important link in software code maintenance: the source code is modified, without changing the external behavior of the code, so as to improve the internal structure of the computer program.
In essence, software reconstruction is a way of tidying up program code that not only allows code to be upgraded and maintained while avoiding the introduction of new errors, but also optimizes the code structure. The individual steps of reconstruction are simple, for example moving a variable from one class to another, extracting a code fragment from a function in some class to form another function, or moving certain code into a parent or child class within an inheritance hierarchy. Accumulated, these small modifications can radically improve the design quality of the code. The key to software reconstruction is locating where the code needs to be modified.
A code bad smell is a potential problem that affects the design structure of software and arises from irregular code writing. To make such problems easier to classify, Fowler et al. defined 22 code bad smells, including long method, feature attachment, god class and others.
Feature attachment (commonly known as feature envy) is a common bad-smell type. A method communicates with the methods or variables of another class far more frequently than with the inside of the class where it resides; this is typical feature attachment. The reconstruction solution for feature attachment is to move the method into the class it depends on more, so as to improve the clarity, extensibility and reusability of the source code.
To reconstruct feature attachment, researchers have proposed many reconstruction approaches, including methods based on traditional metric values, methods based on deep (machine) learning, reconstruction methods based on ensemble learning, and so on. However, these methods still have shortcomings. For example, most traditional metric-based methods rely on heuristic rules and on the choice of metrics and manually defined thresholds; for the same code bad smell, different people may choose different metrics and criteria, resulting in low consistency between detectors and therefore inaccurate reconstruction.
With the development of deep learning technology, using neural networks to mine features from massive data allows the problem to be solved more objectively and accurately. Applying deep learning in the field of feature attachment reconstruction usually combines detection and reconstruction into a binary classification problem, namely whether feature attachment exists between a method and a class. A neural network captures the association between the method and the classes and outputs prediction probabilities, and the class with the highest probability, i.e. the class most likely to have feature attachment with the method, is selected as the reconstruction target class. However, most deep-learning-based reconstruction models focus on the textual semantic features of code and ignore the hierarchical characteristics and call relationships of object-oriented languages, which are not only what distinguishes code from natural language but also important characteristics of this bad smell. In addition, existing deep-learning-based feature attachment reconstruction models make insufficient use of code text information, so the semantic features learned by the neural network are sparse, which affects reconstruction performance.
Disclosure of Invention
The purpose of the invention is to overcome the defects of the prior art and creatively provide a feature attachment reconstruction method based on code multi-level calling association, so as to solve the technical problems that code text information is not fully utilized and that the code hierarchy, call relationships and the like are ignored, which affects reconstruction.
The invention is realized by the following technical scheme.
In general, the class in which a method is located is referred to as the containing class of the method, the class into which a reconstructed method is moved is referred to as the target class of the method, and a class that may become the target class of the method is referred to as a candidate class of the method. The reconstruction task for feature attachment is to locate the target class of the method.
A feature attachment reconstruction method based on code multi-level calling association comprises call function computation, class-level code representation and dependency calculation, and specifically comprises the following steps:
Step 1: For the data to be reconstructed, fragment the code in units of methods and extract the different attribute elements, including the method formal parameters, the method return value, and the names of other methods, attributes and classes called inside the method body. Through this classification, the method call relationship and the method interaction relationship are introduced into feature attachment reconstruction.
Specifically, a piece of data to be reconstructed includes a method M to be reconstructed, a containing class (the class in which the method is located), and several candidate classes (classes that may have feature attachment with the method). Each class in the piece of data to be reconstructed is handled as follows:
The code of the class is fragmented in units of methods. Suppose the class contains N methods; static methods, get/set methods and constructors are deleted, leaving m methods after the class is screened:
class = [m_1, m_2, ..., m_m] (1)
For any one of the m methods, the method formal parameters parameter, the method return value return, and the names of other methods, attributes and classes called inside the method body, call, are extracted and collectively used as the attributes of the method.
The method formal parameters and the method return value are defined as the method interaction relationship, and the names of other methods, attributes and classes called inside the method body are defined as the method call relationship. Let a_ji denote the i-th attribute of the j-th method in the class, where subscript j ranges from 1 to m and subscript i ranges from 1 to n; the attribute set of the j-th method in the class is Method_j:
Method_j = [parameter_j, return_j, call_j] = [a_j1, a_j2, ..., a_jn] (2)
where parameter_j denotes the formal parameters of the j-th method, return_j denotes the return value of the j-th method, and call_j denotes the method call relationship of the j-th method. Each of these three elements contains a different number of attributes.
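Purely for illustration, the attribute set of formula (2) can be held in a small data structure. The following Python sketch (Python is the development platform named in the embodiment below) uses assumed, illustrative class and field names that are not part of the invention:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MethodAttributes:
    """Attribute set Method_j of one method: formal parameters, return value,
    and the names called inside the method body (methods, attributes, classes)."""
    parameters: List[str] = field(default_factory=list)    # parameter_j
    return_value: List[str] = field(default_factory=list)  # return_j
    calls: List[str] = field(default_factory=list)         # call_j

    def all_attributes(self) -> List[str]:
        # Concatenation [a_j1, a_j2, ..., a_jn] of formula (2).
        return self.parameters + self.return_value + self.calls

# Example for a method such as: void save(UserRecord record) { db.insert(record); }
m = MethodAttributes(parameters=["UserRecord", "record"],
                     return_value=["void"],
                     calls=["db", "insert"])
print(m.all_attributes())
```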
Step 2: According to the camel-case naming convention, segment all the code attributes output in step 1 into words, splitting on letter-case changes, digits, underscores, slashes and other English symbols, and then process and filter the segmented word sequences.
After word segmentation, each code attribute corresponds to a word sequence. The word sequence is processed and screened by the following method:
step 2.1: if the obtained word sequence has a single capital or lowercase English letter and the word has no practical meaning, deleting the letter.
Step 2.2: the resulting word sequence is all converted to lowercase.
Step 2.3: and (2) splicing the attribute word sequences corresponding to each method according to the attribute set of the method output in the step (1) to obtain a long word sequence containing all the attributes in the method.
The number of words contained in the long word sequence Method_word_j is the sum of the numbers of words segmented from the code attributes in the corresponding set:
Method_word_j = <parameter_j, return_j, call_j> (3)
             = concat(w_1, w_2, ..., w_n) (4)
where parameter_j, return_j and call_j respectively denote the formal parameters, return value and call relationship of the j-th method in a class, and subscript j ranges from 1 to m; w_i is a word segmented from the corresponding code attribute, subscript i ranges from 1 to n, and n denotes the number of w_i. concat(·) is a function that connects its inputs; concat(w_1, w_2, ..., w_n) connects w_1, w_2, ..., w_n into one long word sequence. In this step, the same processing is performed on all the sets output in step 1.
Step 3: Convert the long word sequence Method_word output in step 2 into a long sentence containing n words, input each word of the sentence into the word embedding layer, and convert each word of the sentence into a word vector. The function of the word embedding layer is to convert each input word into a numerical vector, referred to as a word vector.
The word embedding layer converts each word into a word vector, as expressed in formula (5):
V(Method_word) = V(concat(w_1, w_2, ..., w_n)) = [V(w_1), V(w_2), ..., V(w_n)] (5)
where V(·) denotes the word embedding function, i.e. the input is converted into the corresponding word vector; V(w_i) denotes the word vector converted from w_i, and subscript i ranges from 1 to n. The same processing is performed on all the sets output by step 2.
Preferably, the Word embedding function is Word2vec and the dimension of the Word vector is 256.
Formula (5) shows that converting Method_word into word vectors is equivalent to converting each w_i contained in Method_word into the corresponding word vector V(w_i).
Step 4: Concatenate the word vectors V(w_i) in V(Method_word) output in step 3 to generate an n × k matrix, where k is the vector dimension of each word and n denotes the n attribute words of a method:
A_j = matrix(V(w_1), V(w_2), ..., V(w_n)) (6)
where A_j denotes the matrix formed by the n attribute word vectors corresponding to the j-th method of a class; its format is n × k and it is called the attribute semantic matrix. Subscript j ranges from 1 to m. The same processing is performed on the word vectors corresponding to all the methods processed in step 3.
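A sketch of steps 3 and 4 under the assumption that the word embedding layer is implemented with the gensim Word2Vec library (the description only states that Word2vec with 256-dimensional word vectors is preferred; the toy corpus and hyperparameters below are placeholders):

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus: one "sentence" per method, i.e. its long word sequence Method_word_j.
corpus = [
    ["user", "record", "record", "id", "void", "db", "insert"],
    ["user", "name", "string", "get", "name"],
]

# Word embedding layer V(.): 256-dimensional word vectors, as preferred in step 3.
w2v = Word2Vec(sentences=corpus, vector_size=256, min_count=1, seed=1)

def attribute_semantic_matrix(method_words):
    """Step 4: stack the word vectors V(w_1)...V(w_n) of one method into the
    n x k attribute semantic matrix A_j (here k = 256)."""
    return np.stack([w2v.wv[w] for w in method_words])

A_1 = attribute_semantic_matrix(corpus[0])
print(A_1.shape)  # (7, 256): n attribute words x k dimensions
```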
Step 5: Perform a second splicing of the attribute semantic matrices output in step 4. The splicing condition this time is: all matrices other than that of the j-th method are spliced in order, i.e. from A_1 to A_{j-1} and from A_{j+1} to A_m. The spliced matrix is denoted A_r and is called the attribute correlation matrix corresponding to the j-th method:
A_r = concat(A_1, ..., A_{j-1}, A_{j+1}, ..., A_m) (7)
where the matrix A_r has the format m_n × k; m_n denotes the sum of the numbers of attribute word vectors contained in all methods except the j-th, and k, the number of columns, is the dimension of the word vector.
Step 6: Perform a matrix transposition on the attribute correlation matrix A_r output in step 5; the output transposed matrix is A_r^T:
A_r^T = transpose(A_r) (8)
where the matrix A_r^T has the format k × m_n.
Step 7: Multiply the matrix A_r^T output in step 6 with the attribute semantic matrix A_j output in step 4 by dot product; the resulting matrix is q_a, whose meaning is the correlation matrix between the attributes of the j-th method and the other methods:
q_a = A_j · A_r^T (9)
where the matrix A_j has the format n × k and the matrix A_r^T has the format k × m_n, so the product matrix q_a has the format n × m_n; n denotes the number of attribute word vectors contained in the j-th method, and m_n denotes the sum of the numbers of attribute word vectors contained in all methods except the j-th.
At this point, the element q_xy in row x and column y of the matrix q_a is the correlation value between the x-th attribute word of the j-th method and the y-th attribute word of all the methods in the class other than the j-th method.
Matrix multiplication is chosen because the multiplication is in fact a dot product of two 1 × k vectors, and the vector dot product reflects the correlation between two vectors.
Step 8: Sum the values of each row of the matrix q_a output in step 7 to obtain a new matrix q_ax, called the attribute call matrix:
q_ax = sum_[x](q_a) (10)
where sum_[x](·) denotes summing the rows of the matrix, and the attribute call matrix q_ax has the format n × 1, with n denoting the number of attribute word vectors contained in the j-th method. Row x of q_ax contains a single value that represents the correlation of the x-th attribute word with the class it belongs to; since all selected attributes are related to code calls, the matrix q_ax is called the attribute call matrix.
Step 9: Sum the column values of the matrix q_ax output in step 8 to obtain the value c_j:
c_j = sum_[y](q_ax) (11)
where sum_[y](·) denotes summing the columns of the matrix, and the value c_j denotes the call coefficient of the j-th method.
For all methods in a class, the call coefficient corresponding to each method is computed according to steps 5 to 9; the set of method call coefficients of the class is denoted D_c:
D_c = [c_1, c_2, ..., c_m] (12)
where c_j denotes the method call coefficient of the j-th method in the class, there are m methods in total, and subscript j ranges from 1 to m.
Thus, through steps 1 to 9, the computation of the call function is completed.
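As an illustration of steps 5 to 9, the following NumPy sketch computes the call coefficient of every method in a class from its attribute semantic matrices; the matrix names follow the description above, and the random matrices merely stand in for real word embeddings:

```python
import numpy as np

def call_coefficients(semantic_matrices: list) -> list:
    """Steps 5-9: for each method j, build the attribute correlation matrix A_r
    from all other methods, compute q_a = A_j · A_r^T, sum its rows into the
    attribute call matrix q_ax, and sum q_ax into the call coefficient c_j."""
    coefficients = []
    for j, A_j in enumerate(semantic_matrices):
        others = [A for i, A in enumerate(semantic_matrices) if i != j]
        A_r = np.concatenate(others, axis=0)     # step 5: m_n x k
        q_a = A_j @ A_r.T                        # steps 6-7: n x m_n correlation matrix
        q_ax = q_a.sum(axis=1, keepdims=True)    # step 8: n x 1 attribute call matrix
        c_j = float(q_ax.sum())                  # step 9: call coefficient of method j
        coefficients.append(c_j)
    return coefficients

rng = np.random.default_rng(0)
# Three methods with 4, 6 and 5 attribute words, each embedded in k = 256 dimensions.
mats = [rng.normal(size=(n, 256)) for n in (4, 6, 5)]
print(call_coefficients(mats))  # D_c = [c_1, c_2, c_3]
```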
Step 10: Through step 1, the code is fragmented in units of methods to obtain the set of m methods in a class, class = [m_1, m_2, ..., m_m]. The source code of each method is input to the method embedding representation model, which converts a method into a method vector. The function of the method embedding layer is to convert each input method into a numerical vector, called a method vector, as shown in the formula:
Vec(method_j) = code2vec(method_j) (13)
where Vec(·) denotes the method embedding representation function, i.e. the j-th input method is converted into the corresponding method vector, and code2vec(·) denotes processing the method code with the code2vec model. code2vec is a code representation model that converts an input code fragment into a code vector; here the input of code2vec is the j-th method, so the method vector of the j-th method is output, and subscript j ranges from 1 to m. The dimension of the method vector is preferably 384; it has been shown for the code2vec model that the best experimental results are obtained when the method vector dimension is set to 384.
Each of the m methods in a class is passed through the method embedding representation model of step 10 to obtain the method vector corresponding to each method; the set of the m method vectors of the class is denoted M:
M = [Vecm_1, Vecm_2, ..., Vecm_m] (14)
where Vecm_j denotes the method vector of the j-th method, and subscript j ranges from 1 to m.
Step 11: Step 9 outputs the set of method call coefficients of a class, D_c = [c_1, c_2, ..., c_m], and step 10 outputs the set of method vectors of the class, M = [Vecm_1, Vecm_2, ..., Vecm_m]. A weighted sum of the two sets is computed, and the result is Vec_class:
Vec_class = Σ_{j=1}^{m} c_j · Vecm_j (15)
where Vec_class is the vector of the class: the method vectors corresponding to the m methods of the class are each multiplied by their call coefficient and then all added together, and the resulting new vector value represents the class-level code vector.
Thus, through steps 1 to 11, the code vector representation of one class is obtained.
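A sketch of steps 10 and 11: the class vector is the call-coefficient-weighted sum of the method vectors. The helper embed_method below only stands in for a trained code2vec model (code2vec is a separately published code representation tool; its real invocation is not reproduced here), so the returned vectors are placeholders:

```python
import numpy as np

def embed_method(method_source: str) -> np.ndarray:
    """Placeholder for the code2vec method embedding of step 10.
    In the real pipeline this would feed the method source code to a trained
    code2vec model and return its 384-dimensional method vector."""
    rng = np.random.default_rng(abs(hash(method_source)) % (2**32))
    return rng.normal(size=384)

def class_vector(method_sources: list, call_coeffs: list) -> np.ndarray:
    """Step 11: weighted sum of method vectors Vecm_j with call coefficients c_j,
    giving the class-level code vector Vec_class of formula (15)."""
    vectors = [embed_method(src) for src in method_sources]   # M = [Vecm_1, ..., Vecm_m]
    return sum(c * v for c, v in zip(call_coeffs, vectors))   # Σ c_j · Vecm_j

sources = ["void save(UserRecord record) { db.insert(record); }",
           "String getName() { return name; }"]
vec_class = class_vector(sources, call_coeffs=[1.7, 0.4])
print(vec_class.shape)  # (384,)
```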
A piece of data to be reconstructed comprises a method to be reconstructed, a containing class (the class in which the method resides) and one or more candidate classes (classes that may have feature attachment with the method). The class-level code vectors corresponding to the containing class and the several candidate classes in one piece of data are computed according to steps 1 to 11, forming a class vector set:
Vec_C = [Vec_class_1, Vec_class_2, ..., Vec_class_z] (16)
where Vec_C denotes the vector set of all the classes (the containing class and the candidate classes) corresponding to one method to be reconstructed, Vec_class_i denotes the class vector of the i-th class corresponding to the method to be reconstructed, subscript i ranges from 1 to z, and z denotes the number of classes corresponding to the method to be reconstructed.
Step 12: Process the method M to be reconstructed.
The method to be reconstructed is a standalone code fragment. It is input into the method embedding representation model, which converts the method into a method vector. The function of the method embedding layer is to convert each input method into a numerical vector, called a method vector, as in formula (17):
Vec_M = code2vec(M) (17)
where Vec_M denotes the method vector of the method to be reconstructed, Vec(·) denotes the method embedding representation function, i.e. the method code is converted into the corresponding method vector, and code2vec(·) denotes processing the method code with the code2vec model; code2vec is a code representation model that converts the input code fragment into a code vector, and here the input of code2vec is the method to be reconstructed. The dimension of the method vector is preferably 384.
Step 13: The output of step 12 is the vector of the method to be reconstructed, and steps 1 to 11 finally output the class vectors of all the classes corresponding to the method. The relationship between the vector of the method to be reconstructed and each class vector is evaluated with the cosine similarity formula:
Dep_c_i = cos(θ) = (Vec_M · Vec_class_i) / (||Vec_M|| × ||Vec_class_i||)
        = ( Σ_{j=1}^{d} Vec_M[j] × Vec_class_i[j] ) / ( sqrt(Σ_{j=1}^{d} Vec_M[j]^2) × sqrt(Σ_{j=1}^{d} Vec_class_i[j]^2) ) (18)
where Dep_c_i, the similarity between the method to be reconstructed and the corresponding class i, is called the dependency value; cos(θ) denotes the cosine function, i.e. the cosine of the angle between the two vectors is computed to evaluate their similarity, here the similarity between the method vector Vec_M to be reconstructed and the corresponding i-th class vector Vec_class_i. The cosine is computed as the dot product of the two vectors divided by the product of their norms; the vector dot product is computed by multiplying the corresponding elements of the two vectors and then summing, and ||·|| denotes the norm of a vector, i.e. the square root of the sum of the squares of its elements. d denotes the vector dimension, whose value is preferably 384; j ranges from 1 to d, subscript i ranges from 1 to z, and z denotes the number of classes corresponding to the method to be reconstructed.
Step 14: From the set Dep_c formed by the dependency values Dep_c_i output in step 13, select the maximum value Dep_max:
Dep_max = Max[Dep_c_1, Dep_c_2, ..., Dep_c_z] (19)
where Dep_max denotes the maximum dependency value between the method to be reconstructed and its corresponding classes, Max(·) denotes taking the maximum value of the set, and the class corresponding to this maximum value is the target class into which the method to be reconstructed needs to be moved.
Thus, the reconstruction of one piece of data is completed. The reconstruction result is the target class. If the target class is the containing class of the method to be reconstructed, the method and its containing class do not exhibit the feature attachment smell, and no method-move operation is required; if the target class is not the containing class of the method to be reconstructed, feature attachment exists between the method and that class, and the method needs to be moved into the target class to eliminate the feature attachment and complete the reconstruction.
For each piece of data, the target class is obtained according to the procedure of steps 1 to 14, which completes the reconstruction.
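Steps 12 to 14 amount to a cosine similarity followed by taking a maximum; a minimal sketch, assuming the method vector and class vectors have already been computed as above (class names are illustrative):

```python
import numpy as np

def dependency(vec_m: np.ndarray, vec_class: np.ndarray) -> float:
    """Formula (18): cosine similarity between the method vector to be
    reconstructed and one class vector, i.e. the dependency value Dep_c_i."""
    return float(vec_m @ vec_class / (np.linalg.norm(vec_m) * np.linalg.norm(vec_class)))

def target_class(vec_m: np.ndarray, class_vectors: dict) -> str:
    """Step 14: the class with the maximum dependency value Dep_max is the target
    class; if it equals the containing class, no move is needed."""
    deps = {name: dependency(vec_m, v) for name, v in class_vectors.items()}
    return max(deps, key=deps.get)

rng = np.random.default_rng(1)
vec_m = rng.normal(size=384)
classes = {"ContainingClass": rng.normal(size=384), "CandidateA": rng.normal(size=384)}
print(target_class(vec_m, classes))
```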
Advantageous effects
Compared with existing feature attachment reconstruction methods, the method has the following advantages:
1. Compared with feature attachment reconstruction methods based on heuristic rules, the method uses pre-trained models to produce embedded representations of attributes and methods, eliminating human interference and reducing labor cost and the influence of subjective factors.
2. Compared with metric-based feature attachment reconstruction methods, the method utilizes deeper and broader text information, can fully represent the implicit features contained in the code, and overcomes the limitations of numerical metric values.
3. Compared with feature attachment reconstruction methods based on deep (machine) learning, the method reduces the complexity of the neural network structure in the model and lowers time and resource consumption; the method embedding technique code2vec fuses the semantic and textual characteristics of method code, so that more comprehensive information is extracted; the hierarchical structure and call relationships in the code are emphasized, strengthening the modeling of the feature attachment characteristic and improving model performance.
4. The method proposes a new code representation, namely a class-level code vector representation. It comprehensively considers the hierarchical nature of code and its special call relationships, and can compute the correlation between a method and a class more directly.
5. Compared with feature attachment reconstruction methods based on deep (machine) learning, the method improves reconstruction accuracy.
Drawings
FIG. 1 is a block diagram of the method of the present invention.
From left to right, the figure shows the A. call function module, the B. class-level code representation module and the C. dependency calculation module. The candidate classes and containing class in the middle module are the candidate classes and containing class corresponding to the method M to be reconstructed in the right part of the figure, and the method set in each class is denoted [m_1, m_2, ..., m_m], where m_i denotes the code of the i-th method in the class and subscript i ranges from 1 to m. This method set feeds two parts: the method call function layer and the method embedding representation layer. The method embedding representation layer is a code2vec model, whose function is to convert a code fragment into a numerical vector; a method code fragment is used as the input of code2vec, and the corresponding method vector is output after processing. The set of method vectors in a class is denoted [Vecm_1, Vecm_2, ..., Vecm_m], where Vecm_i denotes the method vector of the i-th method.
The method call function layer corresponds to the left part of the figure, i.e. the A. call function module, which proceeds from bottom to top as follows. The input of this module is the method set [m_1, m_2, ..., m_m]. Through code segmentation and extraction, the formal parameters, return value and the names of other methods, attributes and classes called inside the method body, i.e. the attributes of a method, are spliced into a word sequence; [a_i1, a_i2, ..., a_in] represents the sequence of all attribute words in the i-th method. The attribute word embedding layer then outputs the set of attribute word vectors; this layer is a word2vec model, which converts a word into a numerical vector. The attribute splicing layer then splices all the attribute word vectors corresponding to one method into a matrix, namely the attribute semantic matrix A_i, where A_i denotes the attribute semantic matrix corresponding to the i-th method. Matrix transposition, vector dot products and related operations are then computed to obtain the attribute call matrix, and summation yields the method call coefficient. The output of the call function module is the set of method call coefficients in one class, [c_1, c_2, ..., c_m].
In the B. class-level code representation module, the class vector Vec_class is calculated as the weighted sum of the method call coefficients and the method vectors.
In the right part of the figure, the C. dependency calculation module, the method to be reconstructed is used as input and passed through the method embedding representation layer (also a code2vec model) to obtain the method vector Vec_M. This method vector and the class vector Vec_class output by the class-level code representation module are passed through the dependency correlation layer, which is a cosine similarity, to output the dependency value Dep_c of the two vectors. The method to be reconstructed computes its dependency value with each of its corresponding classes, and the class with the maximum value is selected as the reconstruction target class of the method. If the target class is the containing class of the method, the method has no feature attachment and no reconstruction is needed; if the target class is not the containing class of the method, the method has feature attachment and needs to be moved to the location of the target class.
Detailed Description
The process according to the invention is described in further detail below with reference to the figures and examples.
Examples
This embodiment adopts the method provided by the invention and builds a feature attachment reconstruction system based on code multi-level calling association; the system uses a Python development platform and the Keras library. The data generation tool MoveMethodGenerator from GitHub is used, with the open-source project address https://github.com/JetBrains-Research/MoveMethodGenerator. At the same time, source code provided by the JUnit project, a software testing tool on GitHub, is adopted; the address of the JUnit open-source project is https://github.
The data generation tool MoveMethodGenerator is used to extract the relevant information of all methods in the JUnit project source code and to record the method to be reconstructed, the containing class and the candidate classes. The specific steps are as follows:
Step 1: Delete the static methods, get/set methods and constructors of all classes corresponding to the method to be reconstructed; the remaining methods form a method set. Extract attributes for each method in the method set, with the following specific operations:
The methods in all the JUnit source code are screened with the data generation tool MoveMethodGenerator, and the methods for which a move operation can be completed are selected. For each selected method, the method formal parameters, the method return value, and the names of other methods, attributes and classes called inside the method body are extracted, and these words form the code attribute set.
Step 2: According to the camel-case naming convention, segment all the code attributes output in step 1 into words by case letters, digits, underscores, slashes and other English symbols; after segmentation, perform the case conversion and filtering operations, and splice the attribute word sequences corresponding to each method to obtain a long word sequence containing all the attributes of the method.
Step 3: For the word sequence output in step 2, input each word of the sequence into the attribute embedding layer to complete the conversion from word to word vector, and splice the converted word vectors to form the attribute semantic matrix.
Step 4: Compute the attribute correlation matrix corresponding to the j-th method. All the attribute semantic matrices obtained in step 3, except the one corresponding to the j-th method, are spliced to obtain the attribute correlation matrix corresponding to the j-th method.
Step 5: Transpose the attribute correlation matrix output in step 4.
Step 6: Multiply the attribute semantic matrix output in step 3 by the transposed attribute correlation matrix output in step 5, i.e. perform a dot product of each row vector of the former with each column vector of the latter, to obtain the attribute call matrix.
Step 7: Sum each column of the attribute call matrix output in step 6 and output a numerical value, namely the method call coefficient. Compute the method call coefficients of the methods in a class to form the method call coefficient set of the class.
Step 8: Input the method set of step 1 into the method embedding representation model code2vec to obtain the method vector of each method, forming the method vector set of a class.
Step 9: Compute the weighted sum of the method call coefficient set output in step 7 and the method vector set output in step 8 to obtain a new vector, namely the class vector.
Step 10: Input the code fragment of the method to be reconstructed from step 1 into the method embedding representation model code2vec to obtain the method vector of the method to be reconstructed.
Step 11: Compute the cosine similarity between the class vector output in step 9 and the vector of the method to be reconstructed output in step 10, namely the dependency value of the two vectors.
Step 12: Compute the dependency value between the method to be reconstructed and each class according to steps 1 to 11, and select the class corresponding to the maximum dependency value as the target class of the method reconstruction.
If the target class is the containing class of the method, the method has no feature attachment and no reconstruction is required; if the target class is not the containing class of the method, the method has feature attachment and needs to be moved to the location of the target class.
The above is the calculation flow of the invention for one piece of data (method to be reconstructed, containing class, candidate classes); performing it for each piece of data in the JUnit project completes the feature attachment reconstruction of the project.
Through the above steps, the ratio of the number of correctly reconstructed target classes to the total amount of data is computed statistically, namely the reconstruction accuracy on this Java project.
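The reconstruction accuracy is simply the fraction of data items whose reconstructed target class matches the expected one; a minimal sketch of this statistic (the example class names are illustrative):

```python
def reconstruction_accuracy(predictions: list, labels: list) -> float:
    """Ratio of correctly reconstructed target classes to the total number of
    data items (method to be reconstructed, containing class, candidate classes)."""
    correct = sum(1 for pred, true in zip(predictions, labels) if pred == true)
    return correct / len(labels)

preds = ["ClassA", "ClassB", "ClassB", "ClassC"]
truth = ["ClassA", "ClassB", "ClassA", "ClassC"]
print(f"accuracy = {reconstruction_accuracy(preds, truth):.2f}")  # 0.75
```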
To illustrate the reconstruction effect of the invention, under the same conditions and with the same experimental data, the reconstruction accuracy values obtained with the JDeodorant tool, the JMove tool and a convolutional neural network (CNN) method are compared with that of the proposed method, as shown in Table 1.
Table 1 Comparison of the effects of the four reconstruction approaches
(Table 1 is provided as an image in the original publication; it lists the reconstruction accuracy of JDeodorant, JMove, the CNN-based method and the proposed method on the JUnit project.)
The following conclusion can be drawn from Table 1: the feature attachment reconstruction method based on code multi-level calling association strengthens the call relationships among levels by exploiting the attribute-method-class hierarchy of the Java language, combines attribute texts from multiple aspects, and adopts two different embedding representation models for the different characteristics of attributes and methods, so that the semantic and structural features of the code can be comprehensively represented. Therefore, compared with the traditional tools and the CNN-based reconstruction method, its reconstruction accuracy is higher, which verifies its effectiveness.

Claims (6)

1. A feature attachment reconstruction method based on code multi-level calling association, characterized by comprising the following steps:
step 1: for the data to be reconstructed, fragmenting the code in units of methods and extracting different attribute elements, including method formal parameters, method return values, and the names of other methods, attributes and classes called inside the method body; through this classification, the method call relationship and the method interaction relationship are introduced into feature attachment reconstruction;
step 2: according to the camel-case naming convention, segmenting all the code attributes output in step 1 into words by case letters, digits, underscores, slashes and other English symbols, and then processing and filtering the segmented word sequences;
after word segmentation, each code attribute corresponds to a word sequence; the word sequence is processed and filtered as follows:
step 2.1: if the resulting word sequence contains single uppercase or lowercase English letters and the words have no practical meaning, deleting the letters;
step 2.2: converting all words of the resulting word sequence to lowercase;
step 2.3: according to the attribute set of each method output in step 1, splicing the attribute word sequences corresponding to the method to obtain a long word sequence containing all the attributes of the method;
wherein the number of words contained in the long word sequence Method_word_j is the sum of the numbers of words segmented from the code attributes in the corresponding set:
Method_word_j = <parameter_j, return_j, call_j> = concat(w_1, w_2, ..., w_n)
wherein parameter_j, return_j and call_j respectively denote the formal parameters, return value and call relationship of the j-th method in a class, and subscript j ranges from 1 to m; w_i is a word segmented from the corresponding code attribute, subscript i ranges from 1 to n, and n denotes the number of w_i; concat(·) is a function that connects its inputs, and concat(w_1, w_2, ..., w_n) connects w_1, w_2, ..., w_n into one long word sequence; in this step, the same processing is performed on all the sets output in step 1;
step 3: according to the Method_word output in step 2, converting the long word sequence Method_word into a long sentence containing n words, inputting each word of the sentence into the word embedding layer, and converting each word of the sentence into a word vector; wherein the function of the word embedding layer is to convert each input word into a numerical vector, referred to as a word vector;
the word embedding layer converts each word into a word vector, represented as follows:
V(Method_word) = V(concat(w_1, w_2, ..., w_n)) = [V(w_1), V(w_2), ..., V(w_n)]
wherein V(·) denotes the word embedding function, i.e. the input is converted into the corresponding word vector; V(w_i) denotes the word vector converted from w_i, and subscript i ranges from 1 to n; the same processing is performed on all the sets output by step 2;
step 4: concatenating the word vectors V(w_i) in V(Method_word) output in step 3 to generate an n × k matrix, where k is the vector dimension of each word and n denotes the n attribute words of a method:
A_j = matrix(V(w_1), V(w_2), ..., V(w_n))
wherein A_j denotes the matrix formed by the n attribute word vectors corresponding to the j-th method of a class; its format is n × k and it is called the attribute semantic matrix; subscript j ranges from 1 to m; the same processing is performed on the word vectors corresponding to all the methods processed in step 3;
step 5: performing a second splicing of the attribute semantic matrices output in step 4;
the splicing condition is: all matrices other than that of the j-th method are spliced in order, i.e. from A_1 to A_{j-1} and from A_{j+1} to A_m; the spliced matrix is denoted A_r and is called the attribute correlation matrix corresponding to the j-th method:
A_r = concat(A_1, ..., A_{j-1}, A_{j+1}, ..., A_m)
wherein the matrix A_r has the format m_n × k; m_n denotes the sum of the numbers of attribute word vectors contained in all methods except the j-th, and k, the number of columns, is the dimension of the word vector;
step 6: performing a matrix transposition on the attribute correlation matrix A_r output in step 5, the output transposed matrix being A_r^T:
A_r^T = transpose(A_r)
wherein the matrix A_r^T has the format k × m_n;
step 7: multiplying the matrix A_r^T output in step 6 with the attribute semantic matrix A_j output in step 4 by dot product, the resulting matrix being q_a, whose meaning is the correlation matrix between the attributes and the other methods:
q_a = A_j · A_r^T
wherein the matrix A_j has the format n × k and the matrix A_r^T has the format k × m_n, so the product matrix q_a has the format n × m_n; n denotes the number of attribute word vectors contained in the j-th method, and m_n denotes the sum of the numbers of attribute word vectors contained in all methods except the j-th;
at this point, the element q_xy in row x and column y of the matrix q_a denotes the correlation value between the x-th attribute word of the j-th method and the y-th attribute word of all the methods in the class other than the j-th method;
step 8: summing the values of each row of the matrix q_a output in step 7 to obtain a new matrix q_ax, called the attribute call matrix q_ax:
q_ax = sum_[x](q_a)
wherein sum_[x](·) denotes summing the rows of the matrix, and the attribute call matrix q_ax has the format n × 1, with n denoting the number of attribute word vectors contained in the j-th method; row x of the matrix q_ax contains a single value representing the correlation of the x-th attribute word with the class it belongs to, and the matrix q_ax is called the attribute call matrix;
step 9: summing the column values of the matrix q_ax output in step 8 to obtain the value c_j:
c_j = sum_[y](q_ax)
wherein sum_[y](·) denotes summing the columns of the matrix, and the value c_j denotes the call coefficient of the j-th method;
for all methods in a class, the call coefficient corresponding to each method is computed according to steps 5 to 9, and the set of method call coefficients of the class is denoted D_c:
D_c = [c_1, c_2, ..., c_m]
wherein c_j denotes the method call coefficient of the j-th method in the class, there are m methods in total, and subscript j ranges from 1 to m;
step 10: through step 1, the code is fragmented in units of methods to obtain the set of m methods in a class, class = [m_1, m_2, ..., m_m]; the source code of each method is input to the method embedding representation model, which converts a method into a method vector; the function of the method embedding layer is to convert each input method into a numerical vector, called a method vector, as shown in the formula:
Vec(method_j) = code2vec(method_j)
wherein Vec(·) denotes the method embedding representation function, i.e. the j-th input method is converted into the corresponding method vector; code2vec(·) denotes processing the method code with the code2vec model; code2vec is a code representation model that converts an input code fragment into a code vector; here the input of code2vec is the j-th method, so the method vector of the j-th method is output, and subscript j ranges from 1 to m;
each of the m methods in a class is passed through the method embedding representation model of step 10 to obtain the method vector corresponding to each method, and the set of the m method vectors of the class is M:
M = [Vecm_1, Vecm_2, ..., Vecm_m]
wherein Vecm_j denotes the method vector of the j-th method, and subscript j ranges from 1 to m;
step 11: step 9 outputs the set of method call coefficients of a class, D_c = [c_1, c_2, ..., c_m], and step 10 outputs the set of method vectors of the class, M = [Vecm_1, Vecm_2, ..., Vecm_m]; a weighted sum of the two sets is computed, and the result is Vec_class:
Vec_class = Σ_{j=1}^{m} c_j · Vecm_j
wherein Vec_class is the vector of the class: the method vectors corresponding to the m methods of the class are each multiplied by their call coefficient and then all added together, and the resulting new vector value represents the class-level code vector;
a piece of data to be reconstructed comprises a method to be reconstructed, a containing class and one or more candidate classes; the class-level code vectors corresponding to the containing class and the several candidate classes in one piece of data are computed according to steps 1 to 11, forming a class vector set:
Vec_C = [Vec_class_1, Vec_class_2, ..., Vec_class_z]
wherein Vec_C denotes the vector set of all classes (the containing class and the candidate classes) corresponding to one method to be reconstructed, Vec_class_i denotes the class vector of the i-th class corresponding to the method to be reconstructed, subscript i ranges from 1 to z, and z denotes the number of classes corresponding to the method to be reconstructed;
step 12: processing the method M to be reconstructed;
the method to be reconstructed is a standalone code fragment; the method is input into the method embedding representation model, which converts the method into a method vector; the function of the method embedding layer is to convert each input method into a numerical vector, called a method vector, represented as follows:
Vec_M = code2vec(M)
wherein Vec_M denotes the method vector of the method to be reconstructed, Vec(·) denotes the method embedding representation function, i.e. the method code is converted into the corresponding method vector, and code2vec(·) denotes processing the method code with the code2vec model; code2vec is a code representation model that converts the input code fragment into a code vector, and here the input of code2vec is the method to be reconstructed;
step 13: the output of step 12 is the vector of the method to be reconstructed, and steps 1 to 11 finally output the class vectors of all the classes corresponding to the method; the relationship between the vector of the method to be reconstructed and each class vector is evaluated with the cosine similarity formula:
Dep_c_i = cos(θ) = (Vec_M · Vec_class_i) / (||Vec_M|| × ||Vec_class_i||) = ( Σ_{j=1}^{d} Vec_M[j] × Vec_class_i[j] ) / ( sqrt(Σ_{j=1}^{d} Vec_M[j]^2) × sqrt(Σ_{j=1}^{d} Vec_class_i[j]^2) )
wherein Dep_c_i, the similarity between the method to be reconstructed and the corresponding class i, is called the dependency value; cos(θ) denotes the cosine function, i.e. the cosine of the angle between the two vectors is computed to evaluate their similarity, here the similarity between the method vector Vec_M to be reconstructed and the corresponding i-th class vector Vec_class_i; the cosine is computed as the dot product of the two vectors divided by the product of their norms; the vector dot product is computed by multiplying the corresponding elements of the two vectors and then summing, and ||·|| denotes the norm of a vector, i.e. the square root of the sum of the squares of its elements; d denotes the vector dimension, j ranges from 1 to d, subscript i ranges from 1 to z, and z denotes the number of classes corresponding to the method to be reconstructed;
step 14: from the set Dep_c formed by the dependency values Dep_c_i output in step 13, selecting the maximum value Dep_max:
Dep_max = Max[Dep_c_1, Dep_c_2, ..., Dep_c_z]
wherein Dep_max denotes the maximum dependency value between the method to be reconstructed and its corresponding classes, Max(·) denotes taking the maximum value of the set, and the class corresponding to this maximum value is the target class into which the method to be reconstructed needs to be moved;
the reconstruction result is the target class; if the target class is the containing class of the method to be reconstructed, the method and its containing class do not exhibit the feature attachment smell and no method-move operation is required; if the target class is not the containing class of the method to be reconstructed, feature attachment exists between the method and that class, and the method needs to be moved into the target class to eliminate the feature attachment and complete the reconstruction.
2. The feature attachment reconstruction method based on code multi-level calling association as claimed in claim 1, wherein in step 1, a piece of data to be reconstructed comprises a method M to be reconstructed, a containing class and several candidate classes;
each class in the piece of data to be reconstructed is handled as follows:
the code of the class is fragmented in units of methods; class contains N methods; static methods, get/set methods and constructors are deleted, and m methods are left after the class is screened:
class = [m_1, m_2, ..., m_m]
for any one of the m methods, the method formal parameters parameter, the method return value return, and the names of other methods, attributes and classes called inside the method body, call, are extracted and collectively referred to as the attributes of the method;
the method formal parameters and the method return value are defined as the method interaction relationship, and the names of other methods, attributes and classes called inside the method body are defined as the method call relationship; wherein a_ji denotes the i-th attribute of the j-th method in the class, subscript j ranges from 1 to m, subscript i ranges from 1 to n, and the attribute set of the j-th method in the class is Method_j:
Method_j = [parameter_j, return_j, call_j] = [a_j1, a_j2, ..., a_jn]
wherein parameter_j denotes the formal parameters of the j-th method, return_j denotes the return value of the j-th method, and call_j denotes the method call relationship of the j-th method; each of these three elements contains a different number of attributes.
3. The method of claim 1, wherein in step 3, the Word embedding function is Word2vec and the dimension of the Word vector is 256.
4. The method of claim 1, wherein in step 10, the dimension of the method vector is 384.
5. The method of claim 1, wherein in step 12, the dimension of the method vector is 384.
6. The method of claim 1, wherein in step 13, the vector dimension d is set to 384.
CN202211046947.7A 2022-08-30 2022-08-30 Feature attachment reconstruction method based on code multi-level calling association Pending CN116185487A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211046947.7A CN116185487A (en) 2022-08-30 2022-08-30 Feature attachment reconstruction method based on code multi-level calling association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211046947.7A CN116185487A (en) 2022-08-30 2022-08-30 Feature attachment reconstruction method based on code multi-level calling association

Publications (1)

Publication Number Publication Date
CN116185487A true CN116185487A (en) 2023-05-30

Family

ID=86449552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211046947.7A Pending CN116185487A (en) 2022-08-30 2022-08-30 Feature attachment reconstruction method based on code multi-level calling association

Country Status (1)

Country Link
CN (1) CN116185487A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination