WO2022121146A1 - Method and apparatus for determining importance of code segment

Info

Publication number: WO2022121146A1
Application number: PCT/CN2021/081731
Authority: WO
Inventors: 舒俊淮; 陈湘萍; 金舒原; 郑子彬
Original assignee: 中山大学
Priority date: 2020-12-07
Filing date: 2021-03-19
Publication date: 2022-06-16
Also published as: CN112417852B; CN112417852A

Abstract

A method and apparatus for determining the importance of a code segment. The method for determining the importance of a code segment comprises: generating a target classification model by using a preset classification model training process; when a code segment to be annotated is received (101), extracting a first feature vector of said code segment (102); and inputting the first feature vector into the target classification model, and outputting an importance determination result of said code segment (103). By means of the method, the importance of a code to be annotated can be determined, thereby facilitating the normalization of annotation behavior of software development and maintenance personnel, and keeping a code annotation amount within a suitable range.

Description

A method and device for judging the importance of a code fragment

This application claims the priority of the Chinese patent application with the application number 202011418126.2 and the invention titled "A method and device for determining the importance of a code segment", which was filed with the China Patent Office on December 07, 2020, the entire contents of which are incorporated by reference in this application.

Technical field

The invention relates to the field of computer technology, and in particular, to a method and device for judging the importance of code segments.

Background art

Research directions in intelligent software engineering include software repository mining, program code understanding, automatic code generation, and automatic comment generation. The purpose is to help software developers improve the efficiency of development and maintenance. In recent years, with the rise of machine learning and deep learning, researchers working on intelligent software engineering problems have begun to explore the use of these techniques and have obtained many encouraging results. Examples include using information retrieval and recommendation system technology to help developers make better use of open source software repositories; machine learning-based methods that identify important words in code to help other tasks understand the program correctly; and code generation based on convolutional neural networks.

The automatic generation of code comments is a hot topic in intelligent software engineering research. Code comments help readers understand the intentions and ideas of code authors, and play an important role in software maintenance, code reuse, and collaborative team development. Research on this technology aims to have a machine automatically generate comments for a given code fragment, so as to reduce the time that software developers spend writing code comments and improve development efficiency. Using machine learning and deep learning techniques, researchers have recast the problem as a "translation task" in natural language processing. By using sequence-to-sequence models (that is, models that take a sequence of text as input and output a sequence of text), researchers "translate" the code language into natural language and treat the resulting natural language as the comment for the corresponding code fragment.

However, existing techniques for predicting the location of code comments simply convert the code text into feature vectors to train the model. This practice is fairly common in natural language processing, but it is equivalent to treating the text of the code as natural language and uses only the textual information of the code, so the results are not ideal. In addition, this approach does not make sufficient use of the text information: it simply converts words into features without considering the distribution of the text. This may lead to lower accuracy in determining the location of code comments, so that reasonable suggestions cannot be given to developers, which in turn reduces the productivity of software developers and maintainers.

SUMMARY OF THE INVENTION

The present invention provides a method and device for judging the importance of code fragments. It addresses the technical problem that existing techniques for predicting the position of code comments are limited to treating the code text as unstructured plain text and make little use of features in multiple dimensions, so that the accuracy of determining the location of code comments is low, reasonable suggestions cannot be given to developers, and the work efficiency of software development and maintenance personnel is reduced.

A method for judging the importance of a code fragment provided by the present invention includes:

Receive code snippets to be annotated;

extracting the first feature vector of the code fragment to be annotated;

Inputting the first feature vector into the target classification model, and outputting the result of judging the importance of the code fragment to be annotated;

Wherein, the target classification model is generated through a preset classification model training process.

Optionally, the classification model training process includes:

Get annotated code files from pre-built software repositories;

Divide the annotated code file in units of functions to generate multiple training code fragments;

setting a first preset label for the training code snippet with a preset type annotation;

Setting a second preset label for the training code snippet without a preset type annotation;

extracting the second feature vector of each of the training code fragments respectively;

Using a plurality of the second feature vectors to train a preset initial classification model to obtain a target classification model.

Optionally, the first feature vector includes a grammatical feature vector, a text feature vector, a structural feature vector and a relational feature vector, and the step of extracting the first feature vector of the code fragment to be annotated includes:

converting the code fragment to be annotated into an abstract syntax tree;

Extract the statement type information of the code fragment to be annotated from the abstract syntax tree;

According to the statistical result of the statement type information, determine the grammatical feature vector corresponding to the code fragment to be annotated;

Extract the target word from the to-be-annotated code fragment according to the preset variable word division rule;

According to the statistical result of the target word, determine the text feature vector corresponding to the code fragment to be annotated;

According to the complexity of the to-be-annotated code fragment, determine the structural feature vector corresponding to the to-be-annotated code fragment;

Based on the number of function calls of the to-be-annotated code segment, a relational feature vector corresponding to the to-be-annotated code segment is determined.

Optionally, the statement type information includes the occurrence frequency, quantity and frequency distribution of multiple statement types, and the step of determining the grammatical feature vector corresponding to the code fragment to be annotated according to the statistical result of the statement type information includes:

Counting the frequency distribution of the multiple statement types, and determining the statement frequency distribution feature;

Counting the number of the multiple statement types, and determining the statement number feature;

Counting the total number of statements corresponding to the multiple statement types, and determining the total statement number feature;

Using the first preset word feature conversion model to convert the multiple statement types into statement type features respectively;

Using the frequency of occurrence as a weight, weighted summation is performed on a plurality of the statement type features to determine the total statement type features;

The statement frequency distribution feature, the statement number feature, the total statement number feature, and the total statement type feature are spliced together to generate a grammatical feature vector corresponding to the code segment to be annotated.

Optionally, the preset variable word division rule includes a camel case rule or an underscore rule, and the step of extracting the target word from the to-be-annotated code fragment according to the preset variable word division rule includes:

Extract words from the to-be-annotated code snippet;

Split the words using the camel case rule or the underscore rule to determine the words to be processed;

Delete preset stop words from the words to be processed to obtain words to be extracted;

Extract the stems of the words to be extracted to generate the target words.

Optionally, the target word includes a plurality of words to be counted, and the step of determining the text feature vector corresponding to the code fragment to be annotated according to the statistical result of the target word, includes:

Count the total number of the plurality of words to be counted, and determine the total word number feature;

Count the number of types of the plurality of words to be counted, and determine the characteristics of the number of word types;

Calculate the variance of the frequency of occurrence of each of the words to be counted in the plurality of words to be counted, and determine the word variance feature;

Count the proportion of non-English words in the plurality of words to be counted, and determine the non-word proportion feature;

Using the second preset word feature conversion model to convert the plurality of words to be counted into word features respectively;

Taking the frequency of occurrence of each described word to be counted as a weight, weighted summation is performed on all the described word features to generate total word features;

The total word quantity feature, the word type quantity feature, the word variance feature, the non-word proportion feature, and the total word feature are spliced together to generate a text feature vector corresponding to the code segment to be annotated.

Optionally, the step of determining the structural feature vector corresponding to the to-be-annotated code segment according to the complexity of the to-be-annotated code segment includes:

Count the number of lines of code in the to-be-annotated code snippet, and determine the feature of the number of lines;

Count the number of nested statements in the to-be-annotated code fragment, and determine the feature of the number of nested statements;

Count the maximum number of nesting levels in the code fragment to be annotated, and determine the feature of the maximum number of nested levels;

Count the number of formal parameters in the to-be-annotated code fragment, and determine the quantity characteristics of the formal parameters;

Respectively determine the feature of the number of words of the longest statement in the code fragment to be annotated, the feature of the number of API calls, the feature of the number of variables, the feature of the number of identifiers, and the feature of the number of internal annotations of the code fragment to be annotated, and splice them in sequence to generate a comprehensive feature;

The line number feature, the nested statement number feature, the maximum nesting level feature, the formal parameter quantity feature, and the comprehensive feature are spliced together to generate a structural feature vector corresponding to the code segment to be annotated.

Optionally, the number of function calls includes the number of additional functions called and the number of times the function is called, and the step of determining the relationship feature vector corresponding to the code fragment to be annotated based on the number of function calls of the code fragment to be annotated includes:

Traverse the to-be-annotated code file to which the to-be-annotated code fragment belongs, and determine the number of extra functions called by the to-be-annotated code fragment and the number of times the to-be-annotated code fragment is called;

The number of called extra functions and the called times are spliced together to generate a relational feature vector corresponding to the code segment to be annotated.
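As a rough illustration of the two relational quantities, the following sketch uses Python's built-in `ast` module on Python source as a stand-in for a parser of the target language. The function name `relational_features`, the restriction to calls made by plain name, and the sample file are all assumptions made for this example and are not taken from the application.

```python
import ast
from typing import List

def _called_names(node: ast.AST) -> List[str]:
    """Names of functions called by plain name (e.g. f(x)) inside `node`."""
    return [
        n.func.id
        for n in ast.walk(node)
        if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
    ]

def relational_features(file_source: str, function_name: str) -> List[int]:
    """Return [number of other functions called, number of times called]."""
    tree = ast.parse(file_source)
    functions = {
        n.name: n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)
    }
    target = functions[function_name]
    # Distinct functions that the target calls, excluding recursion.
    callees = {name for name in _called_names(target) if name != function_name}
    # Calls to the target anywhere in the file, minus calls from inside itself.
    times_called = (
        _called_names(tree).count(function_name)
        - _called_names(target).count(function_name)
    )
    return [len(callees), times_called]

file_source = """
def helper(x):
    return x + 1

def compute(values):
    return [helper(v) for v in values] + [len(values)]

def main():
    compute([1, 2])
    compute([3])
"""
compute_features = relational_features(file_source, "compute")
helper_features = relational_features(file_source, "helper")
```

Splicing the two counts into a two-element list mirrors the splicing step described above; a real implementation would also need to resolve method calls and overloads, which this sketch ignores.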

Optionally, the step of inputting the first feature vector into the target classification model and outputting the result of judging the importance of the to-be-annotated code segment includes:

inputting the first feature vector into the target classification model;

When the output of the target classification model is the first preset label, determine that the importance judgment result of the code segment to be annotated is important;

When the output of the target classification model is the second preset label, it is determined that the importance judgment result of the code segment to be annotated is not important.

The present invention also provides a device for judging the importance of a code segment, including:

The code fragment receiving module is used to receive the code fragment to be annotated;

a first feature vector extraction module for extracting the first feature vector of the code fragment to be annotated;

an importance output module, for inputting the first feature vector into the target classification model, and outputting the result of judging the importance of the code fragment to be annotated;

Wherein, the target classification model is generated by a preset classification model training module.

As can be seen from the above technical solutions, the present invention has the following advantages:

The present invention generates a target classification model through a preset classification model training process. When a code fragment to be annotated is received, a first feature vector is extracted from it, and the first feature vector is then input into the target classification model to obtain the importance judgment result of the code fragment to be annotated. This solves the technical problem that existing techniques for predicting the location of code comments are limited to treating the code text as unstructured plain text and make little use of features in multiple dimensions, so that the accuracy of determining the location of code comments is low, reasonable suggestions cannot be given to developers, and the work efficiency of software developers and maintainers is reduced. The importance of the code to be annotated can thus be judged efficiently, so as to standardize the commenting behavior of software developers and maintainers and keep the amount of code comments within an appropriate range.

Description of drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the following briefly introduces the accompanying drawings used in the description of the embodiments or the prior art. Obviously, the drawings in the following description show only some embodiments of the present invention. For those of ordinary skill in the art, other drawings can also be obtained based on these drawings without any creative effort.

FIG. 1 is a flowchart of steps of a method for judging the importance of a code segment provided by an embodiment of the present invention;

FIG. 2 is a flowchart of steps of a method for judging the importance of a code segment provided by an optional embodiment of the present invention;

FIG. 3 is an example diagram of a nested statement in an embodiment of the present invention;

FIG. 4 is a flowchart of steps of a method for judging the importance of a code segment provided by another embodiment of the present invention;

FIG. 5 is a structural block diagram of an apparatus for judging the importance of a code segment according to an embodiment of the present invention.

Detailed description

The embodiments of the present invention provide a method and device for judging the importance of code fragments. They address the technical problem that existing techniques for predicting the position of code comments are limited to treating code text as unstructured plain text and make little use of features in multiple dimensions, so that the accuracy of determining the location of code comments is low, reasonable suggestions cannot be given to developers, and the work efficiency of software development and maintenance personnel is reduced.

In order to make the purpose, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the embodiments described below are only some, but not all, embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

Please refer to FIG. 1. FIG. 1 is a flowchart of steps of a method for judging the importance of a code segment provided by an embodiment of the present invention.

Step 101, receiving the code fragment to be annotated;

In this embodiment of the present invention, in order to determine the importance of a code fragment more accurately and to better support downstream tasks such as judging the position of code comments, the code fragment to be annotated that is input by the user can be received before the user annotates it, so that the importance judgment process can be performed.

The code fragment to be annotated may be a Java code fragment, etc., which is not limited in the embodiment of the present invention.

Step 102, extracting the first feature vector of the code fragment to be annotated;

After the to-be-annotated code fragment is received, the first feature vector, which covers grammatical features, text features, structural features, relational features, and so on, is extracted from it as the input of the subsequent model, and the importance of the to-be-annotated code fragment is judged based on these features.

Step 103, inputting the first feature vector into the target classification model, and outputting the result of judging the importance of the code fragment to be annotated;

In a specific implementation, the target classification model is generated through a preset classification model training process. After the target classification model is obtained, the first feature vector is input into it to judge the importance of the code fragment to be annotated, and the importance judgment result indicates whether the code fragment is important.

Please refer to FIG. 2, which is a flowchart of steps of a method for judging the importance of a code segment provided by an optional embodiment of the present invention.

Step 201, receiving the code fragment to be annotated;

Before step 201, so that the importance of the code segment to be annotated can be judged quickly, a target classification model can be generated in advance through a classification model training process, which includes the following steps S1-S6:

S1. Obtain annotated code files from a preset software repository;

In the embodiment of the present invention, in order to obtain enough reliable training data, code files of Java projects that have a long maintenance history and were open sourced by large international companies or organizations are first obtained from the software project repository GitHub; these are the annotated code files.

S2, dividing the annotated code file in units of functions to generate a plurality of training code fragments;

S3, setting a first preset label for the training code snippet with a preset type annotation;

S4, setting a second preset label for the training code fragment that does not have a preset type annotation;

In a specific implementation, an annotated code file often includes multiple code fragments, and the purpose of the present invention is to judge the importance of code fragments. To this end, the annotated code file can first be divided in units of functions, so that each function becomes one training code fragment, and each training code fragment is then labeled according to whether or not it has a preset type annotation.

Specifically, a first preset label can be set for a training code snippet with a preset type annotation, to identify the code snippet as important, and a second preset label can be set for a training code snippet without the preset type annotation, to identify the snippet as unimportant. The preset type annotation may be a function header annotation or the like, the first preset label may be 1, and the second preset label may be 0; the embodiment of the present invention does not limit the annotation type or label form.
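The labeling in steps S3 and S4 can be sketched as follows. This is only a minimal illustration: it assumes that each training code fragment is available as a string that includes any comment placed directly before the function, and the helper names `has_header_comment` and `label_fragments` are hypothetical.

```python
from typing import List, Tuple

FIRST_PRESET_LABEL = 1   # marks a fragment as important (the text suggests 1)
SECOND_PRESET_LABEL = 0  # marks a fragment as unimportant (the text suggests 0)

def has_header_comment(fragment: str) -> bool:
    """Assumed check: the fragment text begins with a comment opener."""
    return fragment.lstrip().startswith(("/**", "/*", "//"))

def label_fragments(fragments: List[str]) -> List[Tuple[str, int]]:
    """Pair each training fragment with its preset label."""
    return [
        (f, FIRST_PRESET_LABEL if has_header_comment(f) else SECOND_PRESET_LABEL)
        for f in fragments
    ]

fragments = [
    "/** Adds two numbers. */\nint add(int a, int b) { return a + b; }",
    "void noop() {}",
]
labels = [label for _, label in label_fragments(fragments)]
```

Splitting a real Java file into functions together with their header comments would normally be done with a parser rather than with string prefixes; the prefix test here only stands in for that step.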

S5, extract the second feature vector of each described training code fragment respectively;

S6. Use a plurality of the second feature vectors to train a preset initial classification model to obtain a target classification model.

In this embodiment of the present invention, after the training code fragments are acquired, a second feature vector also needs to be extracted from each training code fragment. The type of the second feature vector is the same as that of the first feature vector; that is, the second feature vector also includes a grammatical feature vector, a text feature vector, a structural feature vector and a relational feature vector, and the extraction method is the same as that of the first feature vector.

After obtaining the second feature vector of each training code segment, a plurality of second feature vectors may be used to form a training set, and the training set is used to train a preset initial classification model to obtain a target classification model.

It is worth mentioning that the initial classification model may be a random forest model or other classification models, which is not limited in this embodiment of the present invention.

The specific training process can be as follows: the data set is randomly divided into 10 equal parts, 1 part is taken as the test set each time, and the other 9 parts are used as the training set. Use the training set to train the model, and use the test set to test the effect of the model. When the effect of the model on the test set is no longer improved for 20 consecutive iterations, record the number of iterations corresponding to the best effect. Repeat the above training process 10 times, so that each of the 10 equally divided data sets has been used as a test set to obtain 10 optimal number of iterations. Average these 10 iterations as the number of iterations when we finally train the model. Finally, we use the full amount of data to train a random forest model. When the number of model iterations reaches the preset value, the training is complete.
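The procedure above can be sketched in plain Python as 10-fold cross-validation with a patience of 20 iterations. The sketch leaves the model itself abstract: `make_scorer` is a hypothetical callback that, given training and test indices, returns a function mapping an iteration count to the score on the test fold. How an "iteration" maps onto a random forest (for example, the number of trees) is not stated in the text and is left open here.

```python
import random
from statistics import mean
from typing import Callable, List

Scorer = Callable[[int], float]

def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Randomly divide sample indices into k (nearly) equal parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def best_iteration(score_at: Scorer, patience: int = 20, max_iter: int = 1000) -> int:
    """Stop after `patience` consecutive iterations without improvement."""
    best_score, best_iter = float("-inf"), 0
    for it in range(1, max_iter + 1):
        score = score_at(it)
        if score > best_score:
            best_score, best_iter = score, it
        elif it - best_iter >= patience:
            break
    return best_iter

def choose_iterations(
    n_samples: int,
    make_scorer: Callable[[List[int], List[int]], Scorer],
    k: int = 10,
    patience: int = 20,
) -> int:
    """Average the best iteration count over the k train/test splits."""
    folds = k_fold_indices(n_samples, k)
    best = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        best.append(best_iteration(make_scorer(train_idx, test_idx), patience))
    return round(mean(best))

# Toy scorer (an assumption for demonstration): the score peaks at 30 iterations.
toy = lambda train_idx, test_idx: (lambda it: -(it - 30) ** 2)
final_iterations = choose_iterations(100, toy)
```

The returned value would then be used as the preset number of iterations when the final model is trained on the full data set.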

In this embodiment of the present invention, the first feature vector includes a grammatical feature vector, a text feature vector, a structural feature vector, and a relational feature vector, and the above step 102 may be replaced by the following steps 202-208:

Step 202, converting the code fragment to be annotated into an abstract syntax tree;

Abstract Syntax Tree (AST), or simply Syntax tree, is an abstract representation of the grammatical structure of source code. It represents the syntax structure of the programming language in the form of a tree, and each node on the tree represents a structure in the source code.

In the embodiment of the present invention, in order to enable the grammatical structure of the code segment to be annotated to be vividly embodied, the code segment to be annotated can be converted into an abstract syntax tree, so as to facilitate subsequent extraction of syntactic feature vectors.

Step 203, extracting the statement type information of the code fragment to be annotated from the abstract syntax tree;

Since the abstract syntax tree can reflect each syntax structure in the code fragment to be annotated, that is, can reflect the statement type information of the code fragment to be annotated, the statement type information of the code fragment to be annotated can be extracted from the abstract syntax tree, Including but not limited to IfStmt (if statement), ForStmt (for loop statement), WhileStmt (while loop statement) and so on.
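Steps 202 and 203 can be illustrated with Python's built-in `ast` module. This is a stand-in only: the embodiment mentions Java code and node names such as IfStmt and ForStmt, whereas the sketch below parses Python source and therefore yields Python node names such as `If` and `For`. The function name `statement_type_counts` is an assumption for this example.

```python
import ast
from collections import Counter

def statement_type_counts(source: str) -> Counter:
    """Convert the code to an AST and count how often each statement type occurs."""
    tree = ast.parse(source)
    return Counter(
        type(node).__name__
        for node in ast.walk(tree)
        if isinstance(node, ast.stmt)
    )

snippet = """
def total_positive(values):
    total = 0
    for v in values:
        if v > 0:
            total += v
    return total
"""
counts = statement_type_counts(snippet)
```

The resulting counter holds the occurrence count of each statement type, from which the frequency distribution, the number of distinct types, and the total number of statements can all be derived.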

Step 204, according to the statistical result of the statement type information, determine the grammatical feature vector corresponding to the code fragment to be annotated;

The embodiment of the present invention relates to the grammatical feature vector of the code segment to be annotated, and the purpose is to describe the grammatical information of the code segment in the code language. The grammatical feature vector corresponding to the code segment to be annotated can be determined by the statistical result of the statement type information.

Wherein, the grammatical feature vector may include: the statement frequency distribution feature, describing the frequency distribution of the different statement types; the statement number feature, giving the number of different statement types (that is, with identical statement types deduplicated); the total statement number feature, giving the total number of statements; and the total statement type feature, obtained by weighting the statement type features.

In this embodiment of the present invention, the statement type information includes the occurrence frequency, quantity, and frequency distribution of multiple statement types, and step 204 may include the following sub-steps:

In the embodiment of the present invention, the statement frequency distribution feature can be determined by counting the frequency distribution of the statement types; the statement number feature can be determined by counting the number of statement types; and the total statement number feature can be determined by counting the total number of all statements. A first preset word feature conversion model, such as a Word2Vec model, is used to convert each statement type into a corresponding statement type feature, and the statement type features are then weighted by the occurrence frequency of each statement type and summed to determine the total statement type feature. Finally, the statement frequency distribution feature, the statement number feature, the total statement number feature, and the total statement type feature are spliced to obtain the grammatical feature vector representing the code fragment to be annotated.

It is worth mentioning that all the above statistical processes can be performed in parallel.

Word2vec is a group of related models used to generate word features. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. The network takes a word as input and must predict the words in adjacent positions. After training, the word2vec model can be used to map each word to a feature vector that represents the relationships between words; this vector is taken from the hidden layer of the neural network.
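The splicing of the four grammatical features can be sketched as follows. The sketch assumes a fixed vocabulary of statement types, so that the frequency distribution has a fixed length, and uses a small hand-written `embeddings` dictionary in place of a trained Word2Vec model; both are assumptions for this example, as is the function name.

```python
from collections import Counter
from typing import Dict, List, Sequence

def grammatical_feature_vector(
    counts: Counter,
    vocabulary: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    total = sum(counts.values())
    # Statement frequency distribution over the assumed fixed vocabulary.
    freq = [counts.get(t, 0) / total if total else 0.0 for t in vocabulary]
    # Number of distinct statement types (deduplicated).
    distinct = float(sum(1 for c in counts.values() if c > 0))
    # Frequency-weighted sum of the statement type features.
    weighted = [0.0] * dim
    for stmt_type, c in counts.items():
        vec = embeddings.get(stmt_type, [0.0] * dim)
        for i in range(dim):
            weighted[i] += (c / total) * vec[i]
    # Splice: distribution, distinct count, total count, weighted feature.
    return freq + [distinct, float(total)] + weighted

counts = Counter({"IfStmt": 2, "ForStmt": 1, "ReturnStmt": 1})
vocabulary = ["IfStmt", "ForStmt", "WhileStmt", "ReturnStmt"]
embeddings = {  # stand-in values, not produced by a real model
    "IfStmt": [1.0, 0.0],
    "ForStmt": [0.0, 1.0],
    "ReturnStmt": [1.0, 1.0],
}
grammar_vec = grammatical_feature_vector(counts, vocabulary, embeddings, dim=2)
```

With these inputs the vector has eight entries: four for the distribution, two counts, and a two-dimensional weighted feature.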

Step 205, extracting the target word from the code fragment to be annotated according to the preset variable word division rule;

The embodiment of the present invention also needs to acquire a text feature vector of the code fragment, the purpose of which is to describe the distribution of the text in the code fragment and to capture the meaning of the words and their context information. Before the text feature vector is obtained, the to-be-annotated code snippet also needs to be preprocessed to obtain the target words.

In a specific implementation, the target word may be extracted from the to-be-annotated code segment according to a preset variable word division rule.

Optionally, the preset variable word division rules include a camel case rule or an underscore rule, and step 205 may include the following sub-steps:

Extract words from the to-be-annotated code snippet;

The stems in the words to be extracted are extracted to generate a target word.

In a specific implementation, code authors tend to name variables by combining multiple English words in camel case style or underscore style. Therefore, words can first be extracted from the code fragment to be annotated; generally, words are separated by separators such as spaces, brackets, or semicolons. The words are then split by the camel case rule or the underscore rule to determine the words to be processed. Next, the preset stop words are deleted to obtain the words to be extracted; the preset stop words are function words that have no actual meaning, such as "the", "is", "at", and "on". Because the same stem may appear in different word forms, the stems of the words to be extracted can also be extracted to reduce the number of distinct words, which yields the target words.

Meanwhile, in order to facilitate subsequent operations, all target words may be uniformly processed into lowercase form, which is not limited in this embodiment of the present invention.
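The preprocessing in step 205 can be sketched as below. The regular expressions, the stop word set (limited to the four examples given above), and the suffix-stripping `naive_stem` function are assumptions for illustration; in particular, `naive_stem` is a crude stand-in for a real stemming algorithm.

```python
import re
from typing import List

STOP_WORDS = {"the", "is", "at", "on"}  # the examples given in the text

def split_identifier(identifier: str) -> List[str]:
    """Split an identifier by the underscore rule, then the camel case rule."""
    parts: List[str] = []
    for chunk in identifier.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return parts

def naive_stem(word: str) -> str:
    """Crude suffix stripping; a stand-in for a proper stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def target_words(code: str) -> List[str]:
    """Extract, split, lowercase, filter and stem the words of a snippet."""
    words: List[str] = []
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code):
        for part in split_identifier(token):
            word = part.lower()
            if word not in STOP_WORDS:
                words.append(naive_stem(word))
    return words

words = target_words("int userCount = get_active_users(isAdmin);")
```

Here `userCount` is split by the camel case rule, `get_active_users` by the underscore rule, the stop word "is" from `isAdmin` is dropped, and "users" is reduced to "user".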

Step 206, according to the statistical result of the target word, determine the text feature vector corresponding to the code fragment to be annotated;

Further, the target word includes a plurality of words to be counted, and step 206 may include the following sub-steps:

In the embodiment of the present invention, the text feature vector includes: the total word quantity feature, the word type quantity feature (that is, with identical words deduplicated), the word variance feature, the non-word proportion feature, and the total word feature obtained by weighting the word features. The total word quantity feature, the word type quantity feature, and the word variance feature measure the distribution of the target words in the code to be annotated. Non-English words are words "made up" by the code author, such as variable names with no practical meaning; the non-word proportion feature therefore measures how interpretable the words are. The total word feature carries contextual information about the words.

In a specific implementation, the total number of the words to be counted is counted to determine the total word quantity feature; the number of distinct words among the words to be counted is counted to determine the word type quantity feature; the variance of the occurrence frequencies of the words to be counted is calculated to determine the word variance feature; and the proportion of non-English words among the words to be counted is counted to determine the non-word proportion feature. The second preset word feature conversion model is used to convert each word to be counted into a word feature, and all the word features are weighted by the occurrence frequency of the corresponding word and summed to generate the total word feature. Finally, the total word quantity feature, the word type quantity feature, the word variance feature, the non-word proportion feature, and the total word feature are spliced to generate the text feature vector corresponding to the code segment to be annotated.

The above statistical process may also be performed in parallel, and the second preset word feature conversion model may be a Word2Vec model, etc., which is not limited in this embodiment of the present invention.
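As a non-limiting illustration, the sub-steps of step 206 might be sketched in Python as follows. The function name `text_feature_vector`, the dictionary `embeddings` (standing in for the output of a word feature conversion model such as Word2Vec), and the set `english_vocab` are hypothetical. The sketch also assumes that "frequency of occurrence" means relative frequency, that the non-word ratio is computed over word occurrences, and that words without an embedding are skipped; the embodiment does not fix these details.

```python
from collections import Counter


def text_feature_vector(words, embeddings, english_vocab, dim):
    """Splice the five text features into one flat list of floats.

    words         -- target words after splitting, stop-word removal and stemming
    embeddings    -- hypothetical mapping word -> list of `dim` floats
    english_vocab -- hypothetical set of words treated as real English words
    """
    total = len(words)
    if total == 0:
        return [0.0] * (4 + dim)

    counts = Counter(words)
    # Relative frequency of each distinct word (assumed meaning of "frequency").
    freqs = {word: count / total for word, count in counts.items()}

    # Word variance feature: population variance of the relative frequencies.
    mean = sum(freqs.values()) / len(freqs)
    variance = sum((f - mean) ** 2 for f in freqs.values()) / len(freqs)

    # Non-word ratio feature: share of occurrences not found in the vocabulary.
    non_word_ratio = sum(1 for w in words if w not in english_vocab) / total

    # Total word feature: frequency-weighted sum of the word features.
    total_word_feature = [0.0] * dim
    for word, freq in freqs.items():
        vector = embeddings.get(word)
        if vector is None:  # assumption: words without an embedding are skipped
            continue
        for i in range(dim):
            total_word_feature[i] += freq * vector[i]

    return [float(total), float(len(counts)), variance, non_word_ratio] + total_word_feature
```

The four scalar features come first and the weighted embedding last, so the resulting vector has a fixed length of `4 + dim` regardless of how many words the code fragment contains.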

Step 207, according to the complexity of the to-be-annotated code fragment, determine the structural feature vector corresponding to the to-be-annotated code fragment;

In an example of the present invention, the step 207 may include the following sub-steps:

The embodiment of the present invention also needs to determine the structural features of the code fragment to be annotated, in order to measure its complexity. The structural features of a function code fragment are features that describe its structure: the number of lines of code, the number of nested statements, the maximum level of nested statements, whether the function has formal parameters and the number of formal parameters, the number of words in the longest statement, the number of API calls, the number of variables, the number of identifiers, the number of internal annotations, and so on. A nested statement is a statement that contains another statement; Figure 3 shows an example in which a for loop statement contains an if conditional statement. The complexity of a code fragment is positively related to its importance.

It is worth mentioning that the statistical process of the above structural features can be carried out in parallel, and not all of the above structural features necessarily need to be used, which is not limited in this embodiment of the present invention.
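As a non-limiting illustration, a few of the structural features might be computed as in the following Python sketch. It uses Python's own `ast` module on a Python function as a stand-in for the function code fragments of the embodiment. The names `structural_features`, `max_nesting_level`, and `COMPOUND` are hypothetical, and treating a "nested statement" as a compound statement that contains another compound statement is an assumption of this sketch.

```python
import ast

# Statement types treated as "containers" for nesting purposes (an assumption).
COMPOUND = (ast.For, ast.While, ast.If, ast.With, ast.Try)


def max_nesting_level(node, depth=0):
    """Return the deepest level of compound statements found under `node`."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, COMPOUND) else depth
        deepest = max(deepest, max_nesting_level(child, child_depth))
    return deepest


def structural_features(source):
    """Return [lines of code, nested statements, max nesting level,
    formal parameters, calls] for the first function in `source`."""
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))

    # Non-blank lines only.
    lines_of_code = len([line for line in source.splitlines() if line.strip()])

    # Compound statements that contain at least one other compound statement.
    nested_statements = sum(
        1
        for n in ast.walk(func)
        if isinstance(n, COMPOUND)
        and any(isinstance(m, COMPOUND) for m in ast.walk(n) if m is not n)
    )

    formal_parameters = len(func.args.args)
    calls = sum(isinstance(n, ast.Call) for n in ast.walk(func))

    return [lines_of_code, nested_statements, max_nesting_level(func),
            formal_parameters, calls]
```

For a function whose body is a for loop containing an if statement, as in the Figure 3 example, this sketch reports one nested statement and a maximum nesting level of two. The remaining structural features (longest statement, variables, identifiers, internal annotations) could be counted in the same traversal.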

Step 208: Determine a relational feature vector corresponding to the to-be-annotated code segment based on the number of function calls of the to-be-annotated code segment.

In another example of the present invention, the number of function calls includes the number of additional functions called by the code fragment to be annotated and the number of times the code fragment is called. Step 208 may include the following sub-steps:

Traverse the to-be-annotated code file to which the to-be-annotated code fragment belongs, and determine the number of the called extra functions and the called number of times of the to-be-annotated code fragment;

In this embodiment of the present invention, it is also necessary to analyze the interconnection between different function code fragments. To this end, an approach similar to a social network or a directed graph can be used: the out-degree value is defined as the number of additional functions called within the current code fragment to be annotated, and the in-degree value is defined as the number of times the code fragment is called by other functions. The entire code file to which the code fragment to be annotated belongs is then scanned and traversed to determine the out-degree value and the in-degree value of the code fragment, that is, the number of additional functions called and the number of times called, and the two features are spliced to generate the relational feature vector corresponding to the code fragment to be annotated.
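As a non-limiting illustration, the out-degree and in-degree statistics might be sketched in Python as follows. The function name `relational_features` and the `call_graph` mapping (each function name mapped to the list of function names it calls, one entry per call site) are hypothetical. The sketch assumes that the out-degree counts distinct other functions called by the fragment and that the in-degree counts every call made to the fragment by other functions; how such a call graph is extracted from the code file is outside the sketch.

```python
def relational_features(call_graph, target):
    """Return [out_degree, in_degree] for the function named `target`.

    call_graph -- hypothetical mapping: function name -> list of called
                  function names within the same code file
    """
    callees = call_graph.get(target, [])
    # Out-degree: number of distinct additional functions the target calls.
    out_degree = len({name for name in callees if name != target})
    # In-degree: number of times the target is called by other functions.
    in_degree = sum(
        calls.count(target)
        for caller, calls in call_graph.items()
        if caller != target
    )
    return [out_degree, in_degree]
```

The two values are returned already spliced, so the result can be appended directly to the other feature vectors.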

It is worth mentioning that steps 202-204 as a whole, steps 205-206 as a whole, and steps 207 and 208 can be executed in parallel.

Step 209, inputting the first feature vector into the target classification model, and outputting the result of judging the importance of the code fragment to be annotated;

In a specific implementation, the step 209 may include the following sub-steps:

inputting the first feature vector into the target classification model;

In the embodiment of the present invention, after the first feature vector is obtained, it may be input into the target classification model, which makes a comprehensive judgment based on the first feature vector and produces the model output. When the output of the target classification model is the first preset label, the importance judgment result of the code fragment to be annotated is determined to be important; when the output of the target classification model is the second preset label, the importance judgment result of the code fragment to be annotated is determined to be not important.

Referring to Fig. 4, Fig. 4 shows a flowchart of steps of a method for judging the importance of a code segment provided by an embodiment of the present invention.

Collect Java project code files from software repositories; divide project code files by function; function code snippets with function header annotations are marked with 1, otherwise marked with 0; required features are extracted from function code snippets; required features include grammatical, textual, structural and relational features;

The syntax feature extraction process includes: preparing to extract syntax features; converting function code fragments into abstract syntax trees; obtaining statement type information of the function code fragments from the abstract syntax trees; counting the frequency distribution of different statement types; counting the number of different statement types; counting the total number of statements; converting statement types into features and computing their weighted sum according to frequency of occurrence; splicing to obtain the syntax features.
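As a non-limiting illustration, the syntax feature extraction process might be sketched in Python as follows, again using Python's `ast` module on Python source as a stand-in for the abstract syntax tree of a Java function. The names `syntax_feature_vector`, `known_types` (a fixed list of statement type names that gives the frequency distribution a fixed length), and `type_embeddings` (standing in for the first preset word feature conversion model) are hypothetical, and relative frequency is assumed as the weight.

```python
import ast
from collections import Counter


def syntax_feature_vector(source, known_types, type_embeddings, dim):
    """Splice frequency distribution, type count, statement count and the
    frequency-weighted statement type feature into one flat list."""
    tree = ast.parse(source)
    # Statement type name of every statement node in the tree.
    types = [type(node).__name__ for node in ast.walk(tree)
             if isinstance(node, ast.stmt)]
    total = len(types)
    counts = Counter(types)

    # Statement frequency distribution over a fixed list of statement types.
    distribution = [counts[t] / total if total else 0.0 for t in known_types]

    # Total statement type feature: frequency-weighted sum of type features.
    weighted = [0.0] * dim
    for type_name, count in counts.items():
        vector = type_embeddings.get(type_name)
        if vector is None:  # assumption: types without a feature are skipped
            continue
        for i in range(dim):
            weighted[i] += (count / total) * vector[i]

    return distribution + [float(len(counts)), float(total)] + weighted
```

With `k` known statement types, the resulting vector has length `k + 2 + dim`.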

The text feature extraction process includes: preparing to extract text features; extracting words in function code fragments; dividing variable words according to the camel case rule or underscore rule; uniformly converting words to lowercase; deleting stop words; stemming; counting the total number of words; counting the number of types of words used; calculating the variance of the frequency of occurrence of different words; counting the proportion of non-English words; converting words into features and computing their weighted sum according to frequency of occurrence; splicing to obtain text features.
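As a non-limiting illustration, the word preprocessing part of this process (splitting by the camel case rule or underscore rule, lowercasing, stop-word deletion, and stemming) might be sketched in Python as follows. The names `split_identifier`, `naive_stem`, `target_words`, and the `STOP_WORDS` set are hypothetical. The stemmer is a deliberately naive suffix stripper used only as a placeholder for a real stemming algorithm, and digits inside identifiers are simply dropped.

```python
import re

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "in"}


def split_identifier(identifier):
    """Split an identifier on underscores and camel case boundaries."""
    parts = []
    for chunk in identifier.split("_"):
        # An acronym run (e.g. HTTP), or an optionally capitalised word.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk))
    return parts


def naive_stem(word):
    """Placeholder stemmer: strip one common suffix, keeping >= 3 letters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def target_words(code):
    """Return the preprocessed target words of a code fragment."""
    words = []
    for identifier in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code):
        for part in split_identifier(identifier):
            word = part.lower()
            if word not in STOP_WORDS:
                words.append(naive_stem(word))
    return words
```

The list returned by `target_words` corresponds to the words to be counted, on which the statistics of the text feature vector are then computed.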

The structural feature extraction process includes: preparing to extract structural features; counting the number of lines of code in the function code fragment; counting the number of nested statements in the function code fragment; counting the maximum level of nested statements in the function code fragment; counting the number of formal parameters in the function code fragment; counting the number of words in the longest statement in the function code fragment; counting the number of API calls in the function code fragment; counting the number of variables in the function code fragment; counting the number of identifiers in the function code fragment; counting the number of internal annotations in the function code fragment; splicing to obtain structural features.

The relational feature extraction process includes: preparing to extract relational features; defining the concepts of out-degree and in-degree values; counting out-degree and in-degree values of each function; splicing to obtain relational features.

After the above four extraction processes are executed in parallel, their outputs are spliced to obtain the final feature of each function code fragment; the classification model is trained on these features together with the labels to obtain the target classification model, where each training pass outputs a result indicating whether the function code fragment is important;

When a new function code fragment is received, the final feature of the function code fragment is extracted; the final feature is input into the target classification model, which outputs whether the function code fragment is important.
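As a non-limiting illustration, the splicing, training, and judgment steps of the flow above might be sketched in Python as follows. The embodiment does not specify the type of classification model, so a nearest-centroid classifier is swapped in here purely as a small, dependency-free placeholder. All function names are hypothetical, and the labels 1 and 0 follow the marking of function code fragments with and without function header annotations.

```python
def splice(syntax, text, structure, relation):
    """Concatenate the four feature vectors into the final feature."""
    return list(syntax) + list(text) + list(structure) + list(relation)


def train_nearest_centroid(vectors, labels):
    """Placeholder training step: one mean vector (centroid) per label."""
    centroids = {}
    for label in set(labels):
        members = [v for v, l in zip(vectors, labels) if l == label]
        centroids[label] = [sum(column) / len(members) for column in zip(*members)]
    return centroids


def predict(centroids, vector):
    """Return the label whose centroid is closest to `vector`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(centroids[label], vector))


def judge_importance(centroids, vector):
    """Map the model output to the importance judgment result."""
    # Label 1 plays the role of the first preset label, 0 the second.
    return "important" if predict(centroids, vector) == 1 else "not important"
```

In practice any classifier that is trained on labeled feature vectors and outputs one of the two preset labels could take the place of the two placeholder functions.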

Please refer to FIG. 5. FIG. 5 is a structural block diagram of an apparatus for judging the importance of a code segment according to an embodiment of the present invention.

A code fragment receiving module 501, configured to receive code fragments to be annotated;

The first feature vector extraction module 502 is used to extract the first feature vector of the code fragment to be annotated;

Importance output module 503, for inputting the first feature vector into the target classification model, and outputting the importance judgment result of the code fragment to be annotated;

Optionally, the classification model training module includes:

The annotated code file receiving submodule is used to obtain the annotated code file from the preset software repository;

a file division submodule, which is used to divide the annotated code file in units of functions to generate a plurality of training code fragments;

a first label setting submodule, used for setting a first preset label for the training code snippet with a preset type annotation;

A second label setting submodule, configured to set a second preset label for the training code fragment that does not have a preset type annotation;

The second feature vector extraction submodule is used to extract the second feature vector of each of the training code fragments respectively;

The classification model training sub-module is used for training a preset initial classification model by using a plurality of the second feature vectors to obtain a target classification model.

Optionally, the first feature vector includes a syntax feature vector, a text feature vector, a structural feature vector and a relational feature vector, and the first feature vector extraction module 502 includes:

A conversion submodule for converting the code fragment to be annotated into an abstract syntax tree;

A statement type information extraction submodule, used for extracting the statement type information of the code fragment to be annotated from the abstract syntax tree;

a grammatical feature vector determination submodule, configured to determine the grammatical feature vector corresponding to the code fragment to be annotated according to the statistical result of the statement type information;

A target word extraction submodule, used for extracting target words from the to-be-annotated code fragment according to a preset variable word division rule;

Text feature vector determination submodule, for determining the text feature vector corresponding to the code fragment to be annotated according to the statistical result of the target word;

a structural feature vector determination submodule, configured to determine the structural feature vector corresponding to the to-be-annotated code segment according to the complexity of the to-be-annotated code segment;

The relationship feature vector determination submodule is configured to determine the relationship feature vector corresponding to the to-be-annotated code segment based on the number of function calls of the to-be-annotated code segment.

Optionally, the statement type information includes the occurrence frequency, quantity and frequency distribution of multiple statement types, and the grammatical feature vector determination submodule includes:

a statement frequency distribution feature determining unit, configured to count the frequency distribution of the multiple statement types and determine the statement frequency distribution feature;

A statement quantity feature determining unit, configured to count the number of the multiple statement types and determine the statement quantity feature;

A total statement quantity feature determining unit, used to count the total number of statements corresponding to the multiple statement types, and determine the total statement number feature;

a statement type feature conversion unit, configured to convert the multiple statement types into statement type features respectively by adopting the first preset word feature conversion model;

a total statement type feature determining unit, configured to use the frequency of occurrence as a weight to perform a weighted summation on the plurality of statement type features and determine the total statement type feature;

A grammatical feature vector generating unit, configured to splice the statement frequency distribution feature, the statement quantity feature, the total statement quantity feature and the total statement type feature to generate a grammatical feature vector corresponding to the code segment to be annotated.

Optionally, the preset variable word division rule includes a camel case rule or an underscore rule, and the target word extraction submodule includes:

A word extraction unit for extracting words from the to-be-annotated code fragment;

a to-be-processed word determination unit, configured to determine the words to be processed from the extracted words by using the camel case rule or the underscore rule;

a word to be extracted determination unit, used for deleting preset stop words from the to-be-processed word to obtain the to-be-extracted word;

The target word determination unit is used for extracting the stem in the words to be extracted to generate the target word.

Optionally, the target word includes a plurality of words to be counted, and the text feature vector determination submodule includes:

A total word quantity feature determination unit, used to count the total number of the multiple words to be counted, and determine the total word quantity feature;

A word type quantity feature determining unit, used to count the number of types of the plurality of words to be counted, and determine the word type quantity feature;

a word variance feature determining unit, used to calculate the variance of the frequency of occurrence of each word to be counted among the multiple words to be counted, to determine the word variance feature;

a non-word ratio feature determining unit, used to count the ratio of non-English words in the plurality of words to be counted, and determine the non-word ratio feature;

A word feature conversion unit, configured to convert the plurality of words to be counted into word features respectively by adopting a second preset word feature conversion model;

The total word feature generating unit is used for taking the frequency of occurrence of each described word to be counted as a weight, and performing weighted summation on all the described word features to generate a total word feature;

The text feature vector determination unit is used for splicing the total word quantity feature, the word type quantity feature, the word variance feature, the non-word ratio feature and the total word feature to generate the text feature vector corresponding to the code fragment to be annotated.

Optionally, the structural feature vector determination submodule includes:

a line number feature determination unit, used to count the number of lines of code in the to-be-annotated code fragment to determine the line number feature;

a nested statement quantity feature determination unit, used to count the number of nested statements in the code fragment to be annotated, and determine the nested statement quantity feature;

The maximum nesting level feature determination unit is used to count the maximum nesting level in the code fragment to be annotated, and determine the maximum nesting level feature;

a formal parameter quantity feature determination unit, used to count the number of formal parameters in the to-be-annotated code fragment, and determine the formal parameter quantity feature;

The comprehensive feature determination unit is used to respectively count the word quantity feature of the longest statement in the code fragment to be annotated, the API call quantity feature of the code fragment to be annotated, the variable quantity feature of the code fragment to be annotated, the identifier quantity feature of the code fragment to be annotated and the internal annotation quantity feature of the code fragment to be annotated, and to splice them sequentially to generate a comprehensive feature;

Structural feature vector generation unit, used for splicing the line number feature, the nested statement quantity feature, the maximum nesting level feature, the formal parameter quantity feature and the comprehensive feature to generate the structural feature vector corresponding to the code fragment to be annotated.

Optionally, the number of function calls includes the number of additional functions called and the number of times called, and the relationship feature vector determination submodule includes:

a function invocation number determination unit, configured to traverse the to-be-annotated code file to which the to-be-annotated code fragment belongs, and determine the number of the additional functions called and the called times of the to-be-annotated code fragment;

A relational feature vector generating unit, configured to concatenate the number of called extra functions and the called times to generate a relational feature vector corresponding to the code segment to be annotated.

Optionally, the importance output module 503 includes:

a feature vector input submodule, for inputting the first feature vector into the target classification model;

an importance determination submodule, configured to determine the importance judgment result of the code fragment to be annotated as important when the output of the target classification model is the first preset label;

The importance negation sub-module is configured to determine that the importance judgment result of the code segment to be annotated is not important when the output of the target classification model is the second preset label.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the system, device and unit described above may refer to the corresponding process in the foregoing method embodiments, which will not be repeated here.

In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. The division of the units is only a logical function division, and in actual implementation there may be other division methods: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. On the other hand, the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.

As mentioned above, the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some technical features thereof can be equivalently replaced, and that these modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

A method for judging the importance of code snippets, comprising:

Receive code snippets to be annotated;

extracting the first feature vector of the code fragment to be annotated;

Inputting the first feature vector into the target classification model, and outputting the result of judging the importance of the code fragment to be annotated;

Wherein, the target classification model is generated through a preset classification model training process.
The method according to claim 1, wherein the classification model training process comprises:

Obtain annotated code files from a preset software repository;

Divide the annotated code file in units of functions to generate multiple training code fragments;

setting a first preset label for the training code snippet with a preset type annotation;

Setting a second preset label for the training code snippet without a preset type annotation;

extracting the second feature vector of each of the training code fragments respectively;

Using a plurality of the second feature vectors to train a preset initial classification model to obtain a target classification model.
The method for judging the importance of a code segment according to claim 1, wherein the first feature vector comprises a syntax feature vector, a text feature vector, a structural feature vector and a relational feature vector, and the step of extracting the first feature vector of the code fragment to be annotated comprises:

converting the code fragment to be annotated into an abstract syntax tree;

Extract the statement type information of the code fragment to be annotated from the abstract syntax tree;

According to the statistical result of the statement type information, determine the grammatical feature vector corresponding to the code fragment to be annotated;

Extract the target word from the to-be-annotated code fragment according to the preset variable word division rule;

According to the statistical result of the target word, determine the text feature vector corresponding to the code fragment to be annotated;

According to the complexity of the to-be-annotated code fragment, determine the structural feature vector corresponding to the to-be-annotated code fragment;

Based on the number of function calls of the to-be-annotated code segment, a relational feature vector corresponding to the to-be-annotated code segment is determined.
The method for judging the importance of a code segment according to claim 3, wherein the statement type information includes the occurrence frequency, quantity and frequency distribution of multiple statement types, and the step of determining, according to the statistical result of the statement type information, the grammatical feature vector corresponding to the code fragment to be annotated comprises:

Counting the frequency distribution of the multiple statement types, and determining the statement frequency distribution feature;

Counting the number of the multiple statement types, and determining the statement quantity feature;

Counting the total number of statements corresponding to the multiple statement types, and determining the total statement quantity feature;

Using the first preset word feature conversion model to convert the multiple statement types into statement type features respectively;

Using the frequency of occurrence as a weight, weighted summation is performed on a plurality of the statement type features to determine the total statement type features;

The statement frequency distribution feature, the statement quantity feature, the total statement number feature, and the total statement type feature are spliced together to generate a grammatical feature vector corresponding to the code segment to be annotated.
The method for judging the importance of a code segment according to claim 3, wherein the preset variable word division rule includes a camel case rule or an underscore rule, and the step of extracting the target word from the code fragment to be annotated according to the preset variable word division rule comprises:

Extract words from the to-be-annotated code snippet;

Determine the words to be processed from the extracted words using the camel case rule or the underscore rule;

Delete preset stop words from the words to be processed to obtain words to be extracted;

The stems in the words to be extracted are extracted to generate a target word.
The method for judging the importance of a code segment according to claim 3 or 5, wherein the target word includes a plurality of words to be counted, and the step of determining, according to the statistical result of the target word, the text feature vector corresponding to the code fragment to be annotated comprises:

Count the total number of the plurality of words to be counted, and determine the total word number feature;

Count the number of types of the plurality of words to be counted, and determine the characteristics of the number of word types;

Calculate the variance of the frequency of occurrence of each of the words to be counted in the plurality of words to be counted, and determine the word variance feature;

Count the proportion of non-English words in the plurality of words to be counted, and determine the non-word proportion feature;

Using the second preset word feature conversion model to convert the plurality of words to be counted into word features respectively;

Taking the frequency of occurrence of each described word to be counted as a weight, weighted summation is performed on all the described word features to generate total word features;

The total word quantity feature, the word type quantity feature, the word variance feature, the non-word ratio feature, and the total word feature are spliced together to generate a text feature vector corresponding to the code segment to be annotated.
The method for judging the importance of a code segment according to claim 3, wherein the step of determining the structural feature vector corresponding to the code segment to be annotated according to the complexity of the code segment to be annotated comprises:

Count the number of lines of code in the to-be-annotated code snippet, and determine the feature of the number of lines;

Count the number of nested statements in the to-be-annotated code fragment, and determine the feature of the number of nested statements;

Count the maximum number of nesting levels in the code fragment to be annotated, and determine the feature of the maximum number of nested levels;

Count the number of formal parameters in the to-be-annotated code fragment, and determine the formal parameter quantity feature;

Respectively count the word quantity feature of the longest statement in the code fragment to be annotated, the API call quantity feature of the code fragment to be annotated, the variable quantity feature of the code fragment to be annotated, the identifier quantity feature of the code fragment to be annotated and the internal annotation quantity feature of the code fragment to be annotated, and splice them sequentially to generate a comprehensive feature;

The line number feature, the nested statement quantity feature, the maximum nesting level feature, the formal parameter quantity feature, and the comprehensive feature are spliced together to generate a structural feature vector corresponding to the code segment to be annotated.
The method for judging the importance of a code segment according to claim 3, wherein the number of function calls includes the number of additional functions called and the number of times called, and the step of determining, based on the number of function calls of the code fragment to be annotated, the relational feature vector corresponding to the code fragment to be annotated comprises:

Traverse the to-be-annotated code file to which the to-be-annotated code fragment belongs, and determine the number of the called extra functions and the called number of times of the to-be-annotated code fragment;

The number of called extra functions and the called times are spliced together to generate a relational feature vector corresponding to the code segment to be annotated.
The method for judging the importance of a code segment according to claim 2, wherein the step of inputting the first feature vector into the target classification model and outputting the result of judging the importance of the code fragment to be annotated comprises:

inputting the first feature vector into the target classification model;

When the output of the target classification model is the first preset label, determine that the importance judgment result of the code segment to be annotated is important;

When the output of the target classification model is the second preset label, it is determined that the importance judgment result of the code segment to be annotated is not important.
A device for judging the importance of a code segment, comprising:

The code fragment receiving module is used to receive the code fragment to be annotated;

A first feature vector extraction module, for extracting the first feature vector of the code fragment to be annotated;

an importance output module, for inputting the first feature vector into the target classification model, and outputting the result of judging the importance of the code fragment to be annotated;

Wherein, the target classification model is generated by a preset classification model training module.