WO2022121146A1 - Method and apparatus for determining the importance of a code segment - Google Patents

Method and apparatus for determining the importance of a code segment

Info

Publication number
WO2022121146A1
WO2022121146A1 (PCT/CN2021/081731)
Authority
WO
WIPO (PCT)
Prior art keywords
annotated
code
feature
word
feature vector
Prior art date
Application number
PCT/CN2021/081731
Other languages
English (en)
Chinese (zh)
Inventor
舒俊淮
陈湘萍
金舒原
郑子彬
Original Assignee
中山大学 (Sun Yat-sen University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中山大学 (Sun Yat-sen University)
Publication of WO2022121146A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/253 - Grammatical analysis; Style critique
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/216 - Parsing using statistical methods

Definitions

  • the invention relates to the field of computer technology, and in particular, to a method and device for judging the importance of code segments.
  • the research direction of intelligent software engineering includes software warehouse mining, program code understanding, automatic code generation, automatic annotation generation, etc.
  • the purpose is to help software developers improve the efficiency of development and maintenance.
  • researchers working on intelligent software engineering problems have begun to explore the possibility of using these advanced technologies to solve related research problems, and have obtained many encouraging results: for example, information retrieval and recommendation system technology that helps developers make better use of open-source software repositories; machine-learning-based methods that identify important words in code to help other tasks understand the program correctly; code generation technology based on convolutional neural networks; and so on.
  • code comments can help us understand the intentions and ideas of code authors, and play an important role in software maintenance, code reuse, and team collaborative development.
  • the research of this technology aims to automatically generate comments for a given code fragment by a machine, so as to reduce the time that software developers spend on writing code comments and improve development efficiency.
  • with machine learning and deep learning techniques, researchers have reformulated the problem as a "translation task" in natural language processing.
  • using sequence-to-sequence models from natural language processing (i.e., models that take a text sequence as input and output a text sequence), researchers "translate" the code language into natural language, and treat the resulting natural language as comments for the corresponding code snippets.
  • in these approaches, the code text is simply converted into feature vectors to train the model.
  • This practice is fairly common in natural language processing, but it is equivalent to treating the code as natural language and uses only the textual information of the code, so the effect of this approach is not ideal.
  • in other words, this method does not make sufficient use of the available information: it simply converts words into features without considering the distribution of the text, which may lead to lower accuracy in determining where code comments should be placed, so that reasonable suggestions cannot be given to developers, which in turn reduces the productivity of software developers and maintainers.
  • the present invention provides a method and device for judging the importance of code fragments, which solves the technical problem that the existing technology for predicting the position of code comments treats code text only as unstructured plain text, so that features from multiple dimensions are under-utilized, the accuracy of determining the location of code comments is low, and reasonable suggestions cannot be given to developers, which reduces the work efficiency of software development and maintenance personnel.
  • a method for judging the importance of a code fragment provided by the present invention includes:
  • the target classification model is generated through a preset classification model training process.
  • the classification model training process includes:
  • the first feature vector includes a grammatical feature vector, a text feature vector, a structural feature vector and a relational feature vector
  • the step of extracting the first feature vector of the code fragment to be annotated includes:
  • a relational feature vector corresponding to the to-be-annotated code segment is determined.
  • the statement type information includes the occurrence frequency, quantity, and frequency distribution of multiple statement types, and the step of determining the grammatical feature vector corresponding to the code fragment to be annotated according to the statistical result of the statement type information includes:
  • weighted summation is performed on a plurality of the statement type features to determine the total statement type features
  • the statement frequency distribution feature, the statement number feature, the total statement number feature, and the total statement type feature are spliced together to generate a grammatical feature vector corresponding to the code segment to be annotated.
  • the preset variable word division rule includes a camel-case rule or an underscore rule
  • the step of extracting the target word from the to-be-annotated code fragment according to the preset variable word division rule includes:
  • the stems in the words to be extracted are extracted to generate a target word.
  • the target word includes a plurality of words to be counted
  • the step of determining the text feature vector corresponding to the code fragment to be annotated according to the statistical result of the target word includes:
  • weighted summation is performed on all the described word features to generate total word features
  • the total word quantity feature, the word type quantity feature, the word variance feature, the non-word proportion feature, and the total word feature are spliced together to generate a text feature vector corresponding to the code segment to be annotated.
  • the step of determining the structural feature vector corresponding to the to-be-annotated code segment according to the complexity of the to-be-annotated code segment includes:
  • the line number feature, the nested statement number feature, the maximum nesting level feature, the shape parameter quantity feature, and the comprehensive feature are spliced together to generate a structural feature vector corresponding to the code segment to be annotated.
  • the number of function calls includes the number of additional functions called and the number of times the function is called
  • the step of determining the relationship feature vector corresponding to the code fragment to be annotated based on the number of function calls of the code fragment to be annotated includes:
  • the number of called extra functions and the called times are spliced together to generate a relational feature vector corresponding to the code segment to be annotated.
  • the step of inputting the first feature vector into the target classification model and outputting the result of judging the importance of the to-be-annotated code segment includes:
  • the present invention also provides a device for judging the importance of a code segment, including:
  • the code fragment receiving module is used to receive the code fragment to be annotated
  • a first feature vector extraction module for extracting the first feature vector of the code fragment to be annotated
  • an importance output module for inputting the first feature vector into the target classification model, and outputting the result of judging the importance of the code fragment to be annotated
  • the target classification model is generated by a preset classification model training module.
  • the present invention has the following advantages:
  • the present invention generates a target classification model through a preset classification model training process; when a code fragment to be annotated is received, a first feature vector is extracted from it, and the first feature vector is then input into the target classification model to obtain the importance judgment result for the code fragment to be annotated.
  • This solves the technical problem that the existing technology for predicting the location of code comments treats code text only as unstructured plain text, so that features from multiple dimensions are under-utilized, the accuracy of determining the location of code comments is low, and reasonable suggestions cannot be given to developers, which reduces the work efficiency of software developers and maintainers.
  • In this way, the importance of the code to be annotated can be judged efficiently, the commenting behavior of software developers and maintainers can be optimized, and the amount of code comments can be kept within an appropriate range.
  • FIG. 1 is a flowchart of steps of a method for judging the importance of a code segment provided by an embodiment of the present invention
  • FIG. 2 is a flowchart of steps of a method for judging the importance of a code segment provided by an optional embodiment of the present invention
  • FIG. 3 is an example diagram of a nested statement in an embodiment of the present invention.
  • FIG. 4 is a flowchart of steps of a method for judging the importance of a code segment provided by another embodiment of the present invention.
  • FIG. 5 is a structural block diagram of an apparatus for judging the importance of a code segment according to an embodiment of the present invention.
  • the embodiments of the present invention provide a method and device for judging the importance of code fragments, which are used to solve the technical problem that the existing technology for predicting the position of code comments treats code text only as unstructured plain text, so that features from multiple dimensions are under-utilized, the accuracy of determining the location of code comments is low, and reasonable suggestions cannot be given to developers, which reduces the work efficiency of software development and maintenance personnel.
  • FIG. 1 is a flowchart of steps of a method for judging the importance of a code segment provided by an embodiment of the present invention.
  • a method for judging the importance of a code fragment provided by the present invention includes:
  • Step 101 receiving the code fragment to be annotated
  • in order to determine the importance of the code fragment more accurately and to better support downstream tasks such as judging the position of a code comment, the code fragment to be annotated can first be received from user input before the importance judgment process is performed.
  • the code fragment to be annotated may be a Java code fragment, etc., which is not limited in the embodiment of the present invention.
  • Step 102 extracting the first feature vector of the code fragment to be annotated
  • after receiving the code fragment to be annotated, the first feature vector, such as grammatical features, text features, structural features, and relational features, is extracted from it as the input of the subsequent model, and the importance judgment process for the code fragment is performed based on these features.
  • Step 103 inputting the first feature vector into the target classification model, and outputting the result of judging the importance of the code fragment to be annotated;
  • the target classification model is generated through a preset classification model training process; after the target classification model is obtained, the first feature vector is input into it to perform the process of judging the importance of the code fragment to be annotated, thereby determining whether the code fragment is important as the importance judgment result.
  • the present invention generates a target classification model through a preset classification model training process; when a code fragment to be annotated is received, a first feature vector is extracted from it, and the first feature vector is then input into the target classification model to obtain the importance judgment result for the code fragment to be annotated.
  • This solves the technical problem that the existing technology for predicting the location of code comments treats code text only as unstructured plain text, so that features from multiple dimensions are under-utilized, the accuracy of determining the location of code comments is low, and reasonable suggestions cannot be given to developers, which reduces the work efficiency of software developers and maintainers.
  • In this way, the importance of the code to be annotated can be judged efficiently, the commenting behavior of software developers and maintainers can be optimized, and the amount of code comments can be kept within an appropriate range.
  • FIG. 2 is a flowchart of steps of a method for judging the importance of a code segment provided by an optional embodiment of the present invention.
  • a method for judging the importance of a code fragment provided by the present invention includes:
  • Step 201 receiving the code fragment to be annotated
  • a target classification model can be generated through a classification model training process in advance, and the classification model training process includes the following steps S1-S6:
  • the purpose of the present invention is to judge the importance of code fragments. To do this, the annotated code file can first be divided in units of functions, splitting it into one training code fragment per function, and each training code fragment is then labeled according to whether it carries a preset type of annotation.
  • a first preset label can be set for the training code snippet with a preset type annotation, which is used to identify that the code snippet is important
  • a second preset label can be set for the training code snippet without the preset type annotation, which is used to identify that the snippet is unimportant.
  • the preset type annotation may be a function header annotation or the like
  • the first preset tag may be 1
  • the second preset tag may be 0, and the embodiment of the present invention does not limit the annotation type and tag form.
  • after the training code fragments are acquired, it is also necessary to extract a second feature vector from each training code fragment.
  • the type of the second feature vector is the same as that of the first feature vector, that is, the second feature vector also includes a grammatical feature vector, a text feature vector, a structural feature vector, and a relational feature vector, and the extraction method is the same as that of the first feature vector.
  • a plurality of second feature vectors may be used to form a training set, and the training set is used to train a preset initial classification model to obtain a target classification model.
  • the initial classification model may be a random forest model or other classification models, which is not limited in this embodiment of the present invention.
  • the specific training process can be as follows: the data set is randomly divided into 10 equal parts; each time, 1 part is taken as the test set and the other 9 parts are used as the training set. The training set is used to train the model, and the test set is used to evaluate it. When the model's performance on the test set has not improved for 20 consecutive iterations, the number of iterations corresponding to the best performance is recorded. This process is repeated 10 times, so that each of the 10 parts has served as the test set once, yielding 10 optimal iteration counts. These 10 counts are averaged to obtain the number of iterations for the final training run. Finally, a random forest model is trained on the full data set; when the number of model iterations reaches this preset value, training is complete.
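The fold-splitting and iteration-averaging procedure above can be sketched as follows; `train_and_eval` is a hypothetical callback standing in for the actual training-with-early-stopping step, which would return the iteration count at which the test-set score stopped improving for 20 rounds.

```python
import random

def choose_iteration_count(dataset, train_and_eval, n_folds=10):
    """Run the 10-fold procedure described above and average the
    per-fold optimal iteration counts for the final training run."""
    random.shuffle(dataset)
    fold_size = len(dataset) // n_folds
    folds = [dataset[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    best_iters = []
    for i in range(n_folds):
        test_set = folds[i]
        # the remaining 9 parts form the training set
        train_set = [x for j, f in enumerate(folds) if j != i for x in f]
        best_iters.append(train_and_eval(train_set, test_set))
    # average the 10 optimal iteration counts
    return sum(best_iters) / len(best_iters)
```

The averaged value would then be used as the fixed iteration budget when retraining on the full data set.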
  • the first feature vector includes a grammatical feature vector, a text feature vector, a structural feature vector, and a relational feature vector, and the above step 102 may be replaced by the following steps 202-208:
  • Step 202 converting the code fragment to be annotated into an abstract syntax tree
  • an Abstract Syntax Tree is an abstract representation of the grammatical structure of source code: it represents the syntax structure of the programming language in the form of a tree, and each node on the tree represents a construct in the source code.
  • in order to make the grammatical structure of the code segment to be annotated explicit, the code segment can be converted into an abstract syntax tree, which facilitates the subsequent extraction of the syntactic feature vector.
  • Step 203 extracting the statement type information of the code fragment to be annotated from the abstract syntax tree
  • the abstract syntax tree can reflect each syntax structure in the code fragment to be annotated, that is, can reflect the statement type information of the code fragment to be annotated
  • therefore, the statement type information of the code fragment to be annotated can be extracted from the abstract syntax tree, including but not limited to IfStmt (if statement), ForStmt (for loop statement), WhileStmt (while loop statement), and so on.
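As an illustration of walking a syntax tree to collect statement types, the following sketch uses a toy dict-based tree; the `{"type": ..., "children": [...]}` node shape is an assumption for illustration, and a real implementation would traverse the node objects produced by a Java parser.

```python
from collections import Counter

def count_statement_types(node, counts=None):
    """Walk a toy AST and count how often each node type occurs."""
    if counts is None:
        counts = Counter()
    counts[node["type"]] += 1
    for child in node.get("children", []):
        count_statement_types(child, counts)
    return counts

# toy AST for: for (...) { if (...) { ... } }
ast = {"type": "ForStmt", "children": [{"type": "IfStmt", "children": []}]}
```

The resulting counter directly supplies the occurrence frequencies and counts used by the grammatical features described below.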
  • Step 204 determine the grammatical feature vector corresponding to the code fragment to be annotated
  • the embodiment of the present invention constructs a grammatical feature vector for the code segment to be annotated, whose purpose is to describe the grammatical information of the code segment in the programming language.
  • the grammatical feature vector corresponding to the code segment to be annotated can be determined by the statistical result of the statement type information.
  • the grammatical feature vector may include: a statement frequency distribution feature describing the frequency distribution of different statement types; a statement number feature counting the number of different statement types (that is, with duplicate statement types removed); a total statement number feature counting the total number of statements; and a total statement type feature obtained by weighting statement-type features by their occurrence frequency.
  • the statement type information includes the occurrence frequency, quantity, and frequency distribution of multiple statement types, and step 204 may include the following sub-steps:
  • weighted summation is performed on a plurality of the statement type features to determine the total statement type features
  • the statement frequency distribution feature, the statement number feature, the total statement number feature, and the total statement type feature are spliced together to generate a grammatical feature vector corresponding to the code segment to be annotated.
  • specifically: the statement frequency distribution feature can be determined by counting the frequency distribution of each statement type; the statement number feature by separately counting the number of each statement type; and the total statement number feature by counting the total number of statements. A first preset word feature conversion model, such as a Word2Vec model, is used to convert each statement type into a corresponding statement type feature, and the statement type features are then weighted and summed using the occurrence frequency of each statement type as the weight to determine the total statement type feature. Finally, the statement frequency distribution feature, the statement number feature, the total statement number feature, and the total statement type feature are spliced to obtain the grammatical feature vector representing the code fragment to be annotated.
  • Word2Vec is a group of related models used to generate word features. These models are shallow, two-layer neural networks trained to reconstruct the linguistic context of words: given an input word, the network predicts the words in adjacent positions. After training, the Word2Vec model can map each word to a feature vector (the hidden layer of the network), which can be used to represent the relationships between words.
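The frequency-weighted summation and splicing described above can be sketched as follows; the small hand-written vectors stand in for the statement-type features a Word2Vec-style model would produce, and the function names are illustrative.

```python
def weighted_type_feature(type_freqs, type_vectors):
    """Frequency-weighted sum of per-type feature vectors."""
    dim = len(next(iter(type_vectors.values())))
    total = [0.0] * dim
    for stmt_type, freq in type_freqs.items():
        vec = type_vectors[stmt_type]
        total = [a + freq * b for a, b in zip(total, vec)]
    return total

def grammar_feature(freq_dist, per_type_counts, total_count, total_type_vec):
    """Splice (concatenate) the four groups of features into one vector."""
    return list(freq_dist) + list(per_type_counts) + [total_count] + list(total_type_vec)
```

With a real embedding model, `type_vectors` would be looked up from the trained Word2Vec table instead of being hand-written.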
  • Step 205 extracting the target word from the code fragment to be annotated according to the preset variable word division rule
  • the embodiment of the present invention also needs to acquire a text feature vector for the code fragment, whose purpose is to describe the distribution of the text in the code fragment and to capture the meaning and context information of the words. Before obtaining the text feature vector, the code snippet to be annotated must be preprocessed to obtain the target words.
  • the target word may be extracted from the to-be-annotated code segment according to a preset variable word division rule.
  • the preset variable word division rules include a camel-case rule or an underscore rule
  • step 205 may include the following sub-steps:
  • the stems in the words to be extracted are extracted to generate a target word.
  • words can first be extracted from the code fragment to be annotated.
  • words are separated by delimiters such as spaces, brackets, or semicolons; the extracted words are then divided by the camel-case rule or the underscore rule to determine the words to be processed, and the preset stop words are deleted to obtain the words to be extracted.
  • the preset stop words are function words with no concrete meaning, such as "the", "is", "at", "on"; in addition, the same stem may appear in different word forms, so in order to reduce the number of distinct words, the stems of the words to be extracted can also be extracted to obtain the target words.
  • all target words may be uniformly processed into lowercase form, which is not limited in this embodiment of the present invention.
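A minimal sketch of this preprocessing, with a tiny illustrative stop-word list and an intentionally crude suffix-stripping rule standing in for a real stemmer (such as the Porter stemmer):

```python
import re

STOP_WORDS = {"the", "is", "at", "on"}  # illustrative subset only

def split_identifier(name):
    """Split a variable name by the underscore rule and the camel-case rule."""
    parts = []
    for chunk in name.split("_"):
        parts += re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", chunk)
    return parts

def extract_target_words(identifiers):
    """Lowercase, drop stop words and non-alphabetic tokens, then 'stem'."""
    words = []
    for ident in identifiers:
        for w in split_identifier(ident):
            w = w.lower()
            if w in STOP_WORDS or not w.isalpha():
                continue
            # naive stemming: strip a few common suffixes
            for suf in ("ing", "ed", "s"):
                if w.endswith(suf) and len(w) > len(suf) + 2:
                    w = w[: -len(suf)]
                    break
            words.append(w)
    return words
```

All target words come out uniformly in lowercase form, matching the normalization described above.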
  • Step 206 according to the statistical result of the target word, determine the text feature vector corresponding to the code fragment to be annotated
  • step 206 may include the following sub-steps:
  • weighted summation is performed on all the described word features to generate total word features
  • the total word quantity feature, the word type quantity feature, the word variance feature, the non-word proportion feature, and the total word feature are spliced together to generate a text feature vector corresponding to the code segment to be annotated.
  • the text feature vector includes: the total word quantity feature, the word type quantity feature (i.e., with duplicate words removed), the word variance feature, the non-word proportion feature, and the total word feature obtained by word weighting.
  • the total word quantity, word type quantity, and word variance features measure the distribution of the target words in the code to be annotated; non-English words are variable names "made up" by the code author that have no dictionary meaning.
  • the non-word proportion feature measures how interpretable the words are; the total word feature captures the context information of the words.
  • specifically: the total number of words to be counted is counted to determine the total word quantity feature; the number of distinct words is counted to determine the word type quantity feature; the occurrence frequency of each word is calculated and its variance determines the word variance feature; and the proportion of non-English words among the words to be counted determines the non-word proportion feature. A second preset word feature conversion model is used to convert each word to be counted into a word feature, and the word features are weighted and summed using the occurrence frequency of each word as the weight to generate the total word feature. Finally, the total word quantity feature, the word type quantity feature, the word variance feature, the non-word proportion feature, and the total word feature are spliced to generate the text feature vector corresponding to the code segment to be annotated.
  • the second preset word feature conversion model may be a Word2Vec model, etc., which is not limited in this embodiment of the present invention.
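A sketch of assembling the text feature vector from these statistics; `word_vectors` stands in for a Word2Vec-style embedding table and `dictionary` for an English word list, both assumed for illustration.

```python
from collections import Counter

def text_feature(words, word_vectors, dictionary):
    """Build [total, n_types, variance, non_word_ratio] + weighted embedding."""
    counts = Counter(words)
    total = len(words)
    n_types = len(counts)
    # variance of the relative occurrence frequencies of distinct words
    freqs = [c / total for c in counts.values()]
    mean = sum(freqs) / len(freqs)
    variance = sum((f - mean) ** 2 for f in freqs) / len(freqs)
    # share of tokens that are not recognized English words
    non_word_ratio = sum(c for w, c in counts.items() if w not in dictionary) / total
    # frequency-weighted sum of the per-word feature vectors
    dim = len(next(iter(word_vectors.values())))
    weighted = [0.0] * dim
    for w, c in counts.items():
        weighted = [a + (c / total) * b for a, b in zip(weighted, word_vectors[w])]
    return [total, n_types, variance, non_word_ratio] + weighted
```

In the described method the resulting vector would be spliced with the grammatical, structural, and relational features before classification.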
  • Step 207 according to the complexity of the to-be-annotated code fragment, determine the structural feature vector corresponding to the to-be-annotated code fragment;
  • the step 207 may include the following sub-steps:
  • the line number feature, the nested statement number feature, the maximum nesting level feature, the shape parameter quantity feature, and the comprehensive feature are spliced together to generate a structural feature vector corresponding to the code segment to be annotated.
  • the embodiment of the present invention also needs to determine the structural features of the code segment to be annotated, in order to determine the complexity of the code segment to be annotated.
  • the structural features of a function code fragment are features that describe its structure, namely: the number of lines of code, the number of nested statements, the maximum level of nested statements, whether the function has formal parameters and the number of formal parameters, the number of words in the longest statement, the number of API calls, the number of variables, the number of identifiers, and the number of internal comments.
  • a nested statement is a statement that contains another statement; Figure 3 shows an example in which a for loop statement contains an if conditional statement.
  • the complexity of the code fragment is positively related to its importance.
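Two of these structural metrics (line count and maximum nesting level) can be approximated directly from the source text, as in this sketch; counting brace depth is a simplification that ignores braces inside strings and comments, and a real implementation would measure nesting on the abstract syntax tree instead.

```python
def structural_counts(java_src):
    """Rough structural metrics for a Java snippet: non-empty line count
    and maximum brace-nesting depth as a proxy for nested-statement level."""
    lines = [ln for ln in java_src.splitlines() if ln.strip()]
    depth = max_depth = 0
    for ch in java_src:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    return {"lines": len(lines), "max_nesting": max_depth}
```

Deeper nesting and longer bodies raise the complexity that, per the text above, correlates positively with a fragment's importance.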
  • Step 208 Determine a relational feature vector corresponding to the to-be-annotated code segment based on the number of function calls of the to-be-annotated code segment.
  • the number of function calls includes the number of additional functions called and the number of times they are called.
  • Step 208 may include the following sub-steps:
  • the number of called extra functions and the called times are spliced together to generate a relational feature vector corresponding to the code segment to be annotated.
  • a method similar to a social network or directed graph can be used: the out-degree value of a function is defined as the number of additional functions it calls, and the in-degree value as the number of times it is called by other functions. The code file to which the code fragment to be annotated belongs is then scanned and traversed to determine the out-degree and in-degree values of the code fragment, that is, the number of additional functions called and the number of times it is called, and these two features are spliced to generate the relational feature vector corresponding to the code fragment to be annotated.
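The out-degree/in-degree computation can be sketched over a list of (caller, callee) edges scanned from the code file; the edge-list input shape is an assumption for illustration.

```python
from collections import defaultdict

def call_degrees(calls):
    """Out-degree = number of distinct extra functions a function calls;
    in-degree = number of times it is called by other functions."""
    out_funcs = defaultdict(set)
    in_count = defaultdict(int)
    for caller, callee in calls:
        if caller != callee:  # ignore self-calls
            out_funcs[caller].add(callee)
            in_count[callee] += 1
    return {f: (len(out_funcs[f]), in_count[f])
            for f in set(in_count) | set(out_funcs)}
```

The two numbers returned for the fragment under analysis are exactly the pair that gets spliced into its relational feature vector.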
  • steps 202-204 as a whole, steps 205-206 as a whole, and steps 207 and 208 can be executed in parallel.
  • Step 209 inputting the first feature vector into the target classification model, and outputting the result of judging the importance of the code fragment to be annotated;
  • the step 209 may include the following sub-steps:
  • the first feature vector may be input into the target classification model, and the target classification model comprehensively judges based on the first feature vector to obtain the model output.
  • if the output of the target classification model is the first preset label, the importance judgment result of the code segment to be annotated is determined to be important; if the output is the second preset label, the importance judgment result is determined to be not important.
  • the present invention generates a target classification model through a preset classification model training process; when a code fragment to be annotated is received, a first feature vector is extracted from it, and the first feature vector is then input into the target classification model to obtain the importance judgment result for the code fragment to be annotated.
  • This solves the technical problem that the existing technology for predicting the location of code comments treats code text only as unstructured plain text, so that features from multiple dimensions are under-utilized, the accuracy of determining the location of code comments is low, and reasonable suggestions cannot be given to developers, which reduces the work efficiency of software developers and maintainers.
  • In this way, the importance of the code to be annotated can be judged efficiently, the commenting behavior of software developers and maintainers can be optimized, and the amount of code comments can be kept within an appropriate range.
  • Fig. 4 shows a flowchart of steps of a method for judging the importance of a code segment provided by an embodiment of the present invention.
  • the syntax feature extraction process includes: preparing to extract syntax features; converting the function code segment into an abstract syntax tree; obtaining the statement type information of the function code segment from the abstract syntax tree; counting the frequency distribution of different statement types; counting the number of different statement types; counting the total number of statements; converting statement types into features and weighting them according to their occurrence frequency; and splicing to obtain the grammatical features.
  • the text feature extraction process includes: preparing to extract text features; extracting the words in the function code snippet; dividing variable words according to the camel-case rule or underscore rule; uniformly converting words to lowercase; deleting stop words; stemming; counting the total number of words; counting the number of distinct word types; calculating the variance of the occurrence frequencies of different words; counting the proportion of non-English words; converting words into features and computing a weighted sum according to their occurrence frequency; and splicing to obtain the text features.
  • the structural feature extraction process includes: preparing to extract structural features; counting the number of lines of code in the function code fragment; counting the number of nested statements; counting the maximum level of nested statements; counting the number of formal parameters; counting the number of words in the longest statement; counting the number of API calls; counting the number of variables; counting the number of identifiers and the number of internal comments; and splicing to obtain the structural features.
  • the relational feature extraction process includes: preparing to extract relational features; defining the concepts of the out-degree and in-degree values; counting the out-degree and in-degree values of each function; and splicing to obtain the relational features.
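The out-degree/in-degree idea can be sketched as below: a function's out-degree counts the calls it makes to other functions, and its in-degree counts how often it is called. A real implementation would resolve calls across files; this sketch matches calls by name within one module, which is an assumption.

```python
import ast

def relational_features(code: str) -> dict:
    """Return {function name: (out-degree, in-degree)} for one module."""
    tree = ast.parse(code)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    out_degree = {name: 0 for name in funcs}
    in_degree = {name: 0 for name in funcs}
    for name, node in funcs.items():
        for sub in ast.walk(node):
            # Count only direct calls to functions defined in this module.
            if (isinstance(sub, ast.Call)
                    and isinstance(sub.func, ast.Name)
                    and sub.func.id in funcs):
                out_degree[name] += 1
                in_degree[sub.func.id] += 1
    # The relational feature is the concatenation [out-degree, in-degree].
    return {name: (out_degree[name], in_degree[name]) for name in funcs}

feats = relational_features("def a():\n    b()\n    b()\n\ndef b():\n    pass")
```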
  • the final feature of each function code fragment is obtained by splicing; a classification model is trained on these features in combination with the labels to obtain the target classification model, where each training iteration outputs whether a function code fragment is important;
  • for a function code fragment to be judged, its final feature is extracted and input to the target classification model, which outputs the importance judgment result of the fragment.
  • FIG. 5 is a structural block diagram of an apparatus for judging the importance of a code segment according to an embodiment of the present invention.
  • the present invention also provides a device for judging the importance of a code segment, including:
  • a code fragment receiving module 501 configured to receive code fragments to be annotated
  • the first feature vector extraction module 502 is used to extract the first feature vector of the code fragment to be annotated
  • the target classification model is generated through a preset classification model training process.
  • the classification model training module includes:
  • the annotated code file receiving submodule is used to obtain the annotated code file from the preset software repository;
  • a file division submodule which is used to divide the annotated code file in units of functions to generate a plurality of training code fragments
  • a first label setting submodule used for setting a first preset label for the training code snippet with a preset type annotation
  • a second label setting submodule configured to set a second preset label for the training code fragment that does not have a preset type annotation
  • the second feature vector extraction submodule is used to extract the second feature vector of each of the training code fragments respectively;
  • the classification model training sub-module is used for training a preset initial classification model by using a plurality of the second feature vectors to obtain a target classification model.
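The file-division and label-setting submodules above amount to splitting a code file into function-level fragments and labelling each by whether it carries a preset-type annotation. In this sketch a Python docstring stands in for the "preset type annotation", and labels 1/0 stand in for the first/second preset labels; both are assumptions.

```python
import ast

FIRST_LABEL, SECOND_LABEL = 1, 0  # illustrative preset labels

def label_fragments(source: str) -> dict:
    """Divide a code file by function and label each fragment."""
    tree = ast.parse(source)
    labels = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            # Docstring presence approximates "has a preset type annotation".
            has_annotation = ast.get_docstring(node) is not None
            labels[node.name] = FIRST_LABEL if has_annotation else SECOND_LABEL
    return labels

labels = label_fragments(
    'def documented():\n    """Adds things."""\n    return 1\n\n'
    'def bare():\n    return 2'
)
```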
  • the first feature vector includes a syntax feature vector, a text feature vector, a structural feature vector and a relational feature vector
  • the first feature vector extraction module 502 includes:
  • a statement type information extraction submodule used for extracting the statement type information of the code fragment to be annotated from the abstract syntax tree
  • a grammatical feature vector determination submodule configured to determine the grammatical feature vector corresponding to the code fragment to be annotated according to the statistical result of the statement type information
  • a target word extraction submodule used for extracting target words from the to-be-annotated code fragment according to a preset variable word division rule
  • Text feature vector determination submodule for determining the text feature vector corresponding to the code fragment to be annotated according to the statistical result of the target word
  • a structural feature vector determination submodule configured to determine the structural feature vector corresponding to the to-be-annotated code segment according to the complexity of the to-be-annotated code segment;
  • the relationship feature vector determination submodule is configured to determine the relationship feature vector corresponding to the to-be-annotated code segment based on the number of function calls of the to-be-annotated code segment.
  • the statement type information includes the occurrence frequency, quantity and frequency distribution of multiple statement types
  • the grammatical feature vector determination submodule includes:
  • a statement frequency distribution feature determining unit configured to count the frequency distributions of the multiple statement types to determine the statement frequency distribution feature;
  • a statement quantity feature determining unit configured to count the number of each of the multiple statement types to determine the statement quantity feature;
  • a total statement quantity feature determining unit used to count the total number of statements corresponding to the multiple statement types to determine the total statement quantity feature;
  • a statement type feature conversion unit configured to convert the multiple statement types into statement type features respectively by adopting the first preset word feature conversion model;
  • a total statement type feature determining unit configured to use the frequency of occurrence as a weight to perform a weighted summation on the plurality of statement type features to determine the total statement type feature;
  • a grammatical feature vector generating unit configured to splice the statement frequency distribution feature, the statement quantity feature, the total statement quantity feature and the total statement type feature to generate the grammatical feature vector corresponding to the code segment to be annotated.
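The weighted-summation and splicing units above can be illustrated numerically. The two-dimensional statement-type vectors below are invented stand-ins for the output of the "first preset word feature conversion model", and the scalar statistics spliced in are likewise illustrative.

```python
def weighted_sum(type_vectors, frequencies):
    """Sum each statement-type vector weighted by its occurrence frequency."""
    dims = len(next(iter(type_vectors.values())))
    total = [0.0] * dims
    for t, vec in type_vectors.items():
        for i in range(dims):
            total[i] += frequencies[t] * vec[i]
    return total

type_vectors = {"If": [1.0, 0.0], "Return": [0.0, 1.0]}  # illustrative embeddings
frequencies = {"If": 0.25, "Return": 0.75}               # occurrence frequencies
total_type_feature = weighted_sum(type_vectors, frequencies)
# Splice (concatenate) with the other statistics to form the grammatical vector:
# here [frequency distribution] + [distinct count, total count] + total feature.
grammar_vector = [0.25, 0.75] + [2, 4] + total_type_feature
```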
  • the preset variable word division rules include the camel case rule or the underscore rule
  • the target word extraction submodule includes:
  • a word extraction unit for extracting words from the to-be-annotated code fragment
  • a to-be-processed word determination unit, for determining the words to be processed from the extracted words by using the camel case rule or the underscore rule;
  • a word to be extracted determination unit used for deleting preset stop words from the to-be-processed word to obtain the to-be-extracted word
  • the target word determination unit is used for extracting the stem in the words to be extracted to generate the target word.
  • the target word includes a plurality of words to be counted
  • the text feature vector determination submodule includes:
  • a total word quantity feature determination unit used to count the total number of the multiple words to be counted, and determine the total word quantity feature
  • a word type and quantity feature determining unit used to count the types and quantities of the plurality of words to be counted, and determine the word type and quantity characteristics
  • a word variance feature determining unit used to calculate the variance of the frequency of occurrence of each word to be counted among the multiple words to be counted, to determine the word variance feature
  • a non-word ratio feature determining unit used to count the ratio of non-English words in the plurality of words to be counted, and determine the non-word ratio feature
  • a word feature conversion unit configured to convert the plurality of words to be counted into word features respectively by adopting a second preset word feature conversion model
  • the total word feature generating unit is used for taking the frequency of occurrence of each word to be counted as a weight, and performing a weighted summation on all of the word features to generate the total word feature;
  • the text feature vector determination unit is used for splicing the total word quantity feature, the word type quantity feature, the word variance feature, the non-word ratio feature and the total word feature to generate the text feature vector corresponding to the code fragment to be annotated.
  • the structural feature vector determination submodule includes:
  • a line number feature determination unit used to count the number of lines of code in the to-be-annotated code fragment to determine the line number feature
  • a nested statement quantity feature determination unit, which is used to count the number of nested statements in the code fragment to be annotated, and determine the nested statement quantity feature;
  • the maximum nesting level feature determination unit is used to count the maximum nesting level in the code fragment to be annotated, and determine the maximum nesting level feature
  • a formal parameter quantity feature determination unit used to count the number of formal parameters in the to-be-annotated code fragment, and determine the formal parameter quantity feature;
  • the comprehensive feature determination unit is used to splice, in sequence, the word quantity feature of the longest statement in the code fragment to be annotated, the API call quantity feature, the variable quantity feature, the identifier quantity feature and the internal annotation quantity feature of the code fragment to be annotated, to generate the comprehensive feature;
  • the structural feature vector generation unit is used for splicing the line number feature, the nested statement quantity feature, the maximum nesting level feature, the formal parameter quantity feature and the comprehensive feature to generate the structural feature vector corresponding to the code fragment to be annotated.
  • the number of function calls includes the number of additional functions called and the number of times the function is called
  • the relational feature vector determination submodule includes:
  • a function invocation number determination unit configured to traverse the to-be-annotated code file to which the to-be-annotated code fragment belongs, and determine the number of additional functions called by the code fragment and the number of times the code fragment is called;
  • a relational feature vector generating unit configured to concatenate the number of additional functions called and the number of times called, to generate the relational feature vector corresponding to the code segment to be annotated.
  • the importance output module 503 includes:
  • a feature vector input submodule for inputting the first feature vector into the target classification model
  • an importance determination submodule configured to determine the importance judgment result of the code fragment to be annotated as important when the output of the target classification model is the first preset label
  • the importance negation sub-module is configured to determine that the importance judgment result of the code segment to be annotated is not important when the output of the target classification model is the second preset label.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division; in actual implementation, there may be other division manners.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.


Abstract

Disclosed are a method and an apparatus for determining the importance of a code segment. The method comprises the following steps: generating a target classification model by means of a preset classification model training process; when a code segment to be annotated is received (101), extracting a first feature vector of the code segment (102); and inputting the first feature vector into the target classification model, and outputting an importance determination result for the code segment (103). By means of the method, the importance of code to be annotated can be determined, which helps to standardize the annotation behavior of software development and maintenance personnel and to keep the amount of code annotation within an appropriate range.
PCT/CN2021/081731 2020-12-07 2021-03-19 Method and apparatus for determining the importance of a code segment WO2022121146A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011418126.2A CN112417852B (zh) 2020-12-07 2020-12-07 Method and apparatus for judging the importance of a code segment
CN202011418126.2 2020-12-07

Publications (1)

Publication Number Publication Date
WO2022121146A1 true WO2022121146A1 (fr) 2022-06-16

Family

ID=74775399

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/081731 WO2022121146A1 (fr) 2020-12-07 2021-03-19 Method and apparatus for determining the importance of a code segment

Country Status (2)

Country Link
CN (1) CN112417852B (fr)
WO (1) WO2022121146A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116302043A (zh) * 2023-05-25 2023-06-23 深圳市明源云科技有限公司 Code maintenance problem detection method and apparatus, electronic device, and readable storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417852B (zh) * 2020-12-07 2022-01-25 中山大学 Method and apparatus for judging the importance of a code segment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021410A (zh) * 2016-05-12 2016-10-12 中国科学院软件研究所 Machine learning-based source code comment quality assessment method
CN107943514A (zh) * 2017-11-01 2018-04-20 北京大学 Method and system for mining core code elements in software documentation
CN108170468A (zh) * 2017-12-28 2018-06-15 中山大学 Method and system for automatically detecting the consistency of comments and code
CN108734215A (zh) * 2018-05-21 2018-11-02 上海戎磐网络科技有限公司 Software classification method and apparatus
CN109213520A (zh) * 2018-09-08 2019-01-15 中山大学 Recurrent neural network-based comment point recommendation method and system
US20190197119A1 (en) * 2017-12-21 2019-06-27 Facebook, Inc. Language-agnostic understanding
CN111104159A (zh) * 2019-12-19 2020-05-05 南京邮电大学 Comment localization method based on program analysis and neural networks
CN112417852A (zh) * 2020-12-07 2021-02-26 中山大学 Method and apparatus for judging the importance of a code segment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5867088B2 (ja) * 2012-01-05 2016-02-24 富士電機株式会社 Software creation support apparatus and program for embedded devices
CN103488518A (zh) * 2013-09-11 2014-01-01 上海镜月信息科技有限公司 Code highlighting method based on code importance
CN107870853A (zh) * 2016-09-27 2018-04-03 北京京东尚科信息技术有限公司 Method and apparatus for testing program code path coverage
CN108491208A (zh) * 2018-01-31 2018-09-04 中山大学 Neural network model-based code comment classification method
CN108804323A (zh) * 2018-06-06 2018-11-13 中国平安人寿保险股份有限公司 Code quality monitoring method, device and storage medium
CN109615020A (zh) * 2018-12-25 2019-04-12 深圳前海微众银行股份有限公司 Feature analysis method, apparatus, device and medium based on a machine learning model
CN109753286A (zh) * 2018-12-28 2019-05-14 四川新网银行股份有限公司 Method for counting the number of calls of code methods based on function labels
CN109656615A (zh) * 2018-12-28 2019-04-19 四川新网银行股份有限公司 Method for permission early warning based on the importance of code methods
CN110908709B (zh) * 2019-11-25 2023-05-02 中山大学 Code commit comment prediction method based on key class determination of code changes


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN YUANTAO; TAO JIAJUN; WANG JIN; LIAO ZHUOFAN; XIONG JIE; WANG LEI: "The Image Annotation Method by Convolutional Features from Intermediate Layer of Deep Learning Based on Internet of Things", 2019 15TH INTERNATIONAL CONFERENCE ON MOBILE AD-HOC AND SENSOR NETWORKS (MSN), IEEE, 11 December 2019 (2019-12-11), pages 315 - 320, XP033756578, DOI: 10.1109/MSN48538.2019.00066 *
HUANG YUAN, JIA NAN;ZHOU QIANG;CHEN XIANG-PING;XIONG YING-FEI;LUO XIAO-NAN: "Method Combining Structural and Semantic Features to Support Code Commenting Decision", JOURNAL OF SOFTWARE, vol. 29, no. 8, 13 March 2018 (2018-03-13), pages 2226 - 2242, XP055940807, ISSN: 1000-9825, DOI: 10.13328/j.cnki.jos.005528 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116302043A (zh) * 2023-05-25 2023-06-23 深圳市明源云科技有限公司 Code maintenance problem detection method and apparatus, electronic device, and readable storage medium
CN116302043B (zh) * 2023-05-25 2023-10-10 深圳市明源云科技有限公司 Code maintenance problem detection method and apparatus, electronic device, and readable storage medium

Also Published As

Publication number Publication date
CN112417852B (zh) 2022-01-25
CN112417852A (zh) 2021-02-26

Similar Documents

Publication Publication Date Title
Umer et al. CNN-based automatic prioritization of bug reports
US20220405592A1 (en) Multi-feature log anomaly detection method and system based on log full semantics
CN108874878A (zh) 一种知识图谱的构建系统及方法
Vlas et al. Two rule-based natural language strategies for requirements discovery and classification in open source software development projects
CN108959418A (zh) 一种人物关系抽取方法、装置、计算机装置及计算机可读存储介质
Vlas et al. A rule-based natural language technique for requirements discovery and classification in open-source software development projects
CN110175585B (zh) 一种简答题自动批改系统及方法
CN109857846B (zh) 用户问句与知识点的匹配方法和装置
WO2022121146A1 (fr) Method and apparatus for determining the importance of a code segment
CN111124487A (zh) 代码克隆检测方法、装置以及电子设备
US20220414463A1 (en) Automated troubleshooter
Cabrio et al. Abstract dialectical frameworks for text exploration
CN114217766A (zh) 基于预训练语言微调与依存特征的半自动需求抽取方法
Vineetha et al. A multinomial naïve Bayes classifier for identifying actors and use cases from software requirement specification documents
WO2024087754A1 (fr) Multi-dimensional full-text identification method
Du et al. SemCluster: a semi-supervised clustering tool for crowdsourced test reports with deep image understanding
Zhang et al. A textcnn based approach for multi-label text classification of power fault data
Zhang et al. An Accurate Identifier Renaming Prediction and Suggestion Approach
Jadallah et al. CATE: CAusality Tree Extractor from Natural Language Requirements
Kramer et al. Improvement of a naive Bayes sentiment classifier using MRS-based features
Sawant et al. Deriving requirements model from textual use cases
US20230111052A1 (en) Self-learning annotations to generate rules to be utilized by rule-based system
CN114638225A (zh) 一种基于科技文献图网络的关键词自动抽取方法
CN113779256A (zh) 一种文件审核方法及系统
Praveena et al. Chunking based malayalam paraphrase identification using unfolding recursive autoencoders

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21901863

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21901863

Country of ref document: EP

Kind code of ref document: A1
