WO2022121146A1 - Method and apparatus for determining importance of code segment - Google Patents


Info

Publication number
WO2022121146A1
Authority
WO
WIPO (PCT)
Prior art keywords
annotated
code
feature
word
feature vector
Prior art date
Application number
PCT/CN2021/081731
Other languages
French (fr)
Chinese (zh)
Inventor
舒俊淮
陈湘萍
金舒原
郑子彬
Original Assignee
中山大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中山大学
Publication of WO2022121146A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Definitions

  • the invention relates to the field of computer technology, and in particular, to a method and device for judging the importance of code segments.
  • the research directions of intelligent software engineering include software repository mining, program code understanding, automatic code generation, automatic annotation generation, etc.
  • the purpose is to help software developers improve the efficiency of development and maintenance.
  • researchers working on intelligent software engineering problems have begun to explore the possibility of using these advanced technologies to solve related research problems, and have obtained many encouraging results. For example, information retrieval and recommendation system technology helps developers make better use of open source software repositories; machine-learning-based methods identify important words in code to help other tasks understand the program correctly; convolutional neural networks are used for code generation; etc.
  • code comments can help us understand the intentions and ideas of code authors, and play an important role in software maintenance, code reuse, and team collaborative development.
  • the research of this technology aims to automatically generate comments for a given code fragment by a machine, so as to reduce the time that software developers spend on writing code comments and improve development efficiency.
  • using machine learning and deep learning techniques, researchers have turned the problem into a "translation task" in natural language processing.
  • using sequence-to-sequence models from natural language processing (i.e., the input is a sequence of text and the model outputs a sequence of text), researchers "translate" the code language into natural language and treat the resulting natural language as annotations for the corresponding code snippets.
  • the code text is simply converted into feature vectors to train the model.
  • This practice is fairly common in natural language processing. But doing so is equivalent to treating the text of the code as natural language, and it uses only the textual information of the code, so the results of this method are not ideal.
  • this method also does not make sufficient use of the text information. It simply converts words into features without considering the distribution of the text, which may lead to lower accuracy in determining the location of code comments, so that reasonable suggestions cannot be given to developers, which in turn reduces the productivity of software developers and maintainers.
  • the present invention provides a method and device for judging the importance of code fragments, which solve the technical problem that existing techniques for predicting the position of code comments treat the code text only as unstructured plain text and make little use of features from multiple dimensions.
  • as a result, the accuracy of determining the location of code comments is low and reasonable suggestions cannot be given to developers, which reduces the work efficiency of software development and maintenance personnel.
  • a method for judging the importance of a code fragment provided by the present invention includes:
  • the target classification model is generated through a preset classification model training process.
  • the classification model training process includes:
  • the first feature vector includes a grammatical feature vector, a text feature vector, a structural feature vector and a relational feature vector
  • the step of extracting the first feature vector of the code fragment to be annotated includes:
  • a relational feature vector corresponding to the to-be-annotated code segment is determined.
  • the statement type information includes the occurrence frequency, quantity and frequency distribution of multiple statement types, and the step of determining the grammatical feature vector corresponding to the code fragment to be annotated according to the statistical result of the statement type information includes:
  • weighted summation is performed on a plurality of the statement type features to determine the total statement type features
  • the statement frequency distribution feature, the statement number feature, the total statement number feature, and the total statement type feature are spliced together to generate a grammatical feature vector corresponding to the code segment to be annotated.
  • the preset variable word division rule includes a camel-case rule or an underscore rule
  • the step of extracting the target word from the to-be-annotated code fragment according to the preset variable word division rule includes:
  • the stems in the words to be extracted are extracted to generate a target word.
  • the target word includes a plurality of words to be counted
  • the step of determining the text feature vector corresponding to the code fragment to be annotated according to the statistical result of the target word includes:
  • weighted summation is performed on all the described word features to generate total word features
  • the total word quantity feature, the word type quantity feature, the word variance feature, the non-word proportion feature, and the total word feature are spliced together to generate a text feature vector corresponding to the code segment to be annotated.
  • the step of determining the structural feature vector corresponding to the to-be-annotated code segment according to the complexity of the to-be-annotated code segment includes:
  • the line number feature, the nested statement number feature, the maximum nesting level feature, the formal parameter quantity feature, and the comprehensive feature are spliced together to generate a structural feature vector corresponding to the code segment to be annotated.
  • the number of function calls includes the number of additional functions called and the number of times the function is called
  • the step of determining the relationship feature vector corresponding to the code fragment to be annotated based on the number of function calls of the code fragment to be annotated includes:
  • the number of called extra functions and the called times are spliced together to generate a relational feature vector corresponding to the code segment to be annotated.
  • the step of inputting the first feature vector into the target classification model and outputting the result of judging the importance of the to-be-annotated code segment includes:
  • the present invention also provides a device for judging the importance of a code segment, including:
  • the code fragment receiving module is used to receive the code fragment to be annotated
  • a first feature vector extraction module for extracting the first feature vector of the code fragment to be annotated
  • an importance output module for inputting the first feature vector into the target classification model, and outputting the result of judging the importance of the code fragment to be annotated
  • the target classification model is generated by a preset classification model training module.
  • the present invention has the following advantages:
  • the present invention generates a target classification model through a preset classification model training process; when a code fragment to be annotated is received, a first feature vector is extracted from it, and finally the first feature vector is input into the target classification model to obtain the importance judgment result of the code fragment to be annotated.
  • this solves the technical problem that existing techniques for predicting the location of code comments treat the code text only as unstructured plain text and make little use of features from multiple dimensions, so that the accuracy of determining the location of code comments is low, reasonable suggestions cannot be given to developers, and the work efficiency of software developers and maintainers is reduced.
  • the importance of the code to be annotated can thus be judged efficiently, which helps optimize the commenting behavior of software developers and maintainers and keeps the amount of code comments within an appropriate range.
  • FIG. 1 is a flowchart of steps of a method for judging the importance of a code segment provided by an embodiment of the present invention
  • FIG. 2 is a flowchart of steps of a method for judging the importance of a code segment provided by an optional embodiment of the present invention
  • FIG. 3 is an example diagram of a nested statement in an embodiment of the present invention.
  • FIG. 4 is a flowchart of steps of a method for judging the importance of a code segment provided by another embodiment of the present invention.
  • FIG. 5 is a structural block diagram of an apparatus for judging the importance of a code segment according to an embodiment of the present invention.
  • the embodiments of the present invention provide a method and device for judging the importance of code fragments, which are used to solve the technical problem that existing techniques for predicting the position of code comments treat code text only as unstructured plain text and make little use of features from multiple dimensions,
  • so that the accuracy of determining the location of code comments is low, reasonable suggestions cannot be given to developers, and the work efficiency of software development and maintenance personnel is reduced.
  • FIG. 1 is a flowchart of steps of a method for judging the importance of a code segment provided by an embodiment of the present invention.
  • a method for judging the importance of a code fragment provided by the present invention includes:
  • Step 101 receiving the code fragment to be annotated
  • in order to determine the importance of the code fragment more accurately and to better support downstream tasks such as judging the position of code comments, the code fragment to be annotated that is input by the user can be received first, before the user annotates it, so that the importance judgment process can be performed on it.
  • the code fragment to be annotated may be a Java code fragment, etc., which is not limited in the embodiment of the present invention.
  • Step 102 extracting the first feature vector of the code fragment to be annotated
  • after the to-be-annotated code fragment is received, the first feature vector, such as grammatical features, text features, structural features, and relational features, is extracted from it as the input of the subsequent model, and the importance of the code fragment to be annotated is judged on the basis of these features.
  • Step 103 inputting the first feature vector into the target classification model, and outputting the result of judging the importance of the code fragment to be annotated;
  • the target classification model is generated through a preset classification model training process. After the target classification model is obtained, the first feature vector is input into it to judge the importance of the code fragment to be annotated, and the importance judgment result indicates whether the code fragment is important.
  • FIG. 2 is a flowchart of steps of a method for judging the importance of a code segment provided by an optional embodiment of the present invention.
  • a method for judging the importance of a code fragment provided by the present invention includes:
  • Step 201 receiving the code fragment to be annotated
  • a target classification model can be generated through a classification model training process in advance, and the classification model training process includes the following steps S1-S6:
  • the purpose of the present invention is to judge the importance of code fragments. To do this, the annotated code file can first be divided in units of functions into one training code fragment per function, and each training code fragment can then be labeled according to whether it has an annotation of a preset type.
  • a first preset label, which identifies the code fragment as important, can be set for each training code fragment that has a preset type annotation
  • a second preset label, which identifies the code fragment as unimportant, can be set for each training code fragment that does not have the preset type annotation.
  • the preset type annotation may be a function header annotation or the like
  • the first preset label may be 1
  • the second preset label may be 0; the embodiment of the present invention does not limit the annotation type or the label form.
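  • The labeling rule above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `TrainingFragment` type, its field names, and the sample fragments are hypothetical, and the fragments are assumed to have already been split out of the annotated code file, one per function.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

IMPORTANT = 1    # first preset label
UNIMPORTANT = 0  # second preset label


@dataclass
class TrainingFragment:
    """Hypothetical container for one function-level training fragment."""
    name: str
    body: str
    header_comment: Optional[str]  # None when the function has no header annotation


def label_fragments(fragments: List[TrainingFragment]) -> List[Tuple[str, int]]:
    """Label each fragment 1 if it has a non-empty header annotation, else 0."""
    labeled = []
    for frag in fragments:
        has_comment = bool(frag.header_comment and frag.header_comment.strip())
        labeled.append((frag.name, IMPORTANT if has_comment else UNIMPORTANT))
    return labeled


# Made-up sample data for illustration only.
fragments = [
    TrainingFragment("parseConfig", "...", "/** Parses the config file. */"),
    TrainingFragment("getX", "return x;", None),
]
labels = label_fragments(fragments)
```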
  • after the training code fragment is acquired, it is also necessary to extract a second feature vector of the training code fragment.
  • the type of the second feature vector is the same as that of the first feature vector, that is, the second feature vector also includes a grammatical feature vector, a text feature vector, a structural feature vector and a relational feature vector, and the extraction method is the same as that of the first feature vector.
  • a plurality of second feature vectors may be used to form a training set, and the training set is used to train a preset initial classification model to obtain a target classification model.
  • the initial classification model may be a random forest model or other classification models, which is not limited in this embodiment of the present invention.
  • the specific training process can be as follows: the data set is randomly divided into 10 equal parts; each time, 1 part is taken as the test set and the other 9 parts are used as the training set. The training set is used to train the model and the test set is used to test its effect. When the effect of the model on the test set has not improved for 20 consecutive iterations, the number of iterations corresponding to the best effect is recorded. This training process is repeated 10 times, so that each of the 10 parts is used once as the test set, yielding 10 optimal iteration counts. The average of these 10 iteration counts is used as the number of iterations when the final model is trained. Finally, a random forest model is trained on the full data set; training is complete when the number of model iterations reaches this preset value.
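  • The bookkeeping of this procedure (10-fold split, stopping after 20 iterations without improvement, averaging the best iteration counts) can be sketched independently of any particular model. The function names below are hypothetical, and the per-iteration test scores are assumed to come from whatever classifier is being trained; the sketch does not train a random forest itself.

```python
import random
from typing import List, Sequence, Tuple


def kfold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Randomly split indices 0..n-1 into k near-equal folds; return (train, test) pairs."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, folds[i]))
    return splits


def best_iteration(scores: Sequence[float], patience: int = 20) -> int:
    """Return the 1-based iteration with the best test score, stopping the scan
    once the score has failed to improve for `patience` consecutive iterations."""
    best_score, best_it, stale = float("-inf"), 0, 0
    for it, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_it, stale = score, it, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_it


def average_best_iterations(per_fold_scores: Sequence[Sequence[float]], patience: int = 20) -> int:
    """Average the best iteration count over all folds (rounded to an integer)."""
    bests = [best_iteration(s, patience) for s in per_fold_scores]
    return round(sum(bests) / len(bests))


splits = kfold_indices(100, k=10)
```

  • The value returned by `average_best_iterations` would then be used as the iteration count when the final model is trained on the full data set.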
  • the first feature vector includes a grammatical feature vector, a text feature vector, a structural feature vector, and a relational feature vector, and the above step 102 may be replaced by the following steps 202-208:
  • Step 202 converting the code fragment to be annotated into an abstract syntax tree
  • Abstract Syntax Tree is an abstract representation of the grammatical structure of source code. It represents the syntax structure of the programming language in the form of a tree, and each node on the tree represents a structure in the source code.
  • in order to make the grammatical structure of the code segment to be annotated explicit, the code segment can be converted into an abstract syntax tree, which facilitates the subsequent extraction of the grammatical feature vector.
  • Step 203 extracting the statement type information of the code fragment to be annotated from the abstract syntax tree
  • the abstract syntax tree reflects each syntax structure in the code fragment to be annotated, that is, it reflects the statement type information of the code fragment to be annotated
  • the statement type information of the code fragment to be annotated can therefore be extracted from the abstract syntax tree, including but not limited to IfStmt (if statement), ForStmt (for loop statement), WhileStmt (while loop statement) and so on.
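  • As an illustration of steps 202-203, the sketch below parses a fragment into an abstract syntax tree and collects the type name of every statement node. It uses Python's standard `ast` module on a Python fragment purely as a stand-in: the embodiment mentions Java fragments and node names such as IfStmt, which a Java parser would produce, so the node names here (`If`, `For`, ...) are Python's, not the patent's.

```python
import ast
from typing import List


def statement_types(source: str) -> List[str]:
    """Parse `source` into an AST and return the type name of every statement node."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree) if isinstance(node, ast.stmt)]


# Made-up example fragment.
EXAMPLE = '''
def total(xs):
    s = 0
    for x in xs:
        if x > 0:
            s += x
    return s
'''

types = statement_types(EXAMPLE)
```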
  • Step 204 determine the grammatical feature vector corresponding to the code fragment to be annotated
  • the embodiment of the present invention relates to the grammatical feature vector of the code segment to be annotated, and the purpose is to describe the grammatical information of the code segment in the code language.
  • the grammatical feature vector corresponding to the code segment to be annotated can be determined by the statistical result of the statement type information.
  • the grammatical feature vector may consist of: the frequency distribution feature of the different statement types, the statement quantity feature giving the number of different statement types (that is, identical statement types are deduplicated), the total statement quantity feature giving the total number of statements, and the total statement type feature obtained by weighting the statement type features.
  • the statement type information includes the occurrence frequency, quantity, and frequency distribution of multiple statement types, and step 204 may include the following sub-steps:
  • weighted summation is performed on a plurality of the statement type features to determine the total statement type features
  • the statement frequency distribution feature, the statement number feature, the total statement number feature, and the total statement type feature are spliced together to generate a grammatical feature vector corresponding to the code segment to be annotated.
  • the statement frequency distribution feature can be determined by counting the frequency distribution of each statement type; the statement number feature can be determined by counting the number of different statement types; the total statement number feature can be determined by counting the total number of all statements; a first preset word feature conversion model, such as the Word2Vec model, is used to convert each statement type into the corresponding statement type feature, and then, with the frequency of occurrence of each statement type as the weight,
  • the statement type features are weighted and summed to determine the total statement type feature; finally, the statement frequency distribution feature, the statement number feature, the total statement number feature, and the total statement type feature are spliced to obtain the grammatical feature vector representing the code fragment to be annotated.
  • Word2vec is a group of related models used to generate word features. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. The network takes a word as input and is trained to predict the words in adjacent positions. After training, the word2vec model can be used to map each word to a feature, taken from the hidden layer of the neural network, which can be used to represent the relationships between words.
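  • The four grammatical sub-features and their splicing can be sketched as below. Several things are assumptions made only for illustration: the fixed vocabulary `STATEMENT_VOCAB`, the use of relative frequency as the weight, and the tiny hand-written `TOY_EMBEDDINGS` table, which stands in for a trained Word2Vec model.

```python
from collections import Counter
from typing import Dict, List, Sequence

# Hypothetical fixed vocabulary so that the frequency distribution has a fixed length.
STATEMENT_VOCAB = ["IfStmt", "ForStmt", "WhileStmt", "ReturnStmt"]

# Toy 2-dimensional features standing in for a trained Word2Vec model (assumption).
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "IfStmt": [1.0, 0.0],
    "ForStmt": [0.0, 1.0],
    "WhileStmt": [0.0, 2.0],
    "ReturnStmt": [1.0, 1.0],
}
DIM = 2


def grammatical_feature_vector(stmt_types: Sequence[str]) -> List[float]:
    counts = Counter(stmt_types)
    total = sum(counts.values())
    # Statement frequency distribution feature (relative frequency per vocabulary entry).
    freq = [counts[t] / total if total else 0.0 for t in STATEMENT_VOCAB]
    # Statement number feature: number of distinct statement types.
    distinct = float(len(counts))
    # Total statement type feature: frequency-weighted sum of statement type features.
    weighted = [0.0] * DIM
    for stmt, count in counts.items():
        emb = TOY_EMBEDDINGS.get(stmt, [0.0] * DIM)
        for i in range(DIM):
            weighted[i] += (count / total) * emb[i]
    # Splice: distribution + distinct count + total count + weighted feature.
    return freq + [distinct, float(total)] + weighted


vec = grammatical_feature_vector(["IfStmt", "ForStmt", "IfStmt", "ReturnStmt"])
```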
  • Step 205 extracting the target word from the code fragment to be annotated according to the preset variable word division rule
  • the embodiment of the present invention also needs to acquire a plurality of text feature vectors about the code fragment; the purpose is to describe the distribution of the text in the code fragment and to extract the meaning of the words and their context information. Before the text feature vector is obtained, the code fragment to be annotated also needs to be preprocessed to obtain the target words.
  • the target word may be extracted from the to-be-annotated code segment according to a preset variable word division rule.
  • the preset variable word division rules include the camel-case rule or the underscore rule
  • step 205 may include the following sub-steps:
  • the stems in the words to be extracted are extracted to generate a target word.
  • words can first be extracted from the code fragment to be annotated.
  • the words are separated by separators such as spaces, brackets, or semicolons, and are then divided by the camel-case rule or the underscore rule to determine the words to be processed; the preset stop words are then deleted to obtain the words to be extracted.
  • the preset stop words are function words that have no actual meaning, such as "the", "is", "at", "on", etc. Words with the same stem may appear in different forms; in order to reduce the number of words, the stems of the words to be extracted can also be extracted to obtain the target words.
  • all target words may be uniformly processed into lowercase form, which is not limited in this embodiment of the present invention.
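  • The preprocessing in step 205 can be sketched as follows: split on separators, divide identifiers by the camel-case and underscore rules, lowercase, drop stop words, and stem. The stop-word list is the one given in the text; `naive_stem` is a deliberately crude suffix stripper used only as a stand-in for a real stemmer, and all function names are hypothetical. Language keywords such as `int` are not filtered here.

```python
import re
from typing import List

STOP_WORDS = {"the", "is", "at", "on"}
# Anything that is not a letter, digit or underscore acts as a separator.
SEPARATORS = re.compile(r"[^A-Za-z0-9_]+")
# Camel-case pieces: acronym before a capitalized word, capitalized/lowercase word,
# trailing acronym, or a run of digits.
CAMEL = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def naive_stem(word: str) -> str:
    """Crude stand-in for a stemmer: strip one common suffix if enough letters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_target_words(code: str) -> List[str]:
    words = []
    for token in SEPARATORS.split(code):
        for part in token.split("_"):          # underscore rule
            for piece in CAMEL.findall(part):  # camel-case rule
                piece = piece.lower()
                if piece not in STOP_WORDS:
                    words.append(naive_stem(piece))
    return words


words = extract_target_words("int userCount = getUsers(is_the_owner);")
```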
  • Step 206 according to the statistical result of the target word, determine the text feature vector corresponding to the code fragment to be annotated
  • step 206 may include the following sub-steps:
  • weighted summation is performed on all the described word features to generate total word features
  • the total word quantity feature, the word type quantity feature, the word variance feature, the non-word proportion feature, and the total word feature are spliced together to generate a text feature vector corresponding to the code segment to be annotated.
  • the text feature vector includes: the total word quantity feature, the word type quantity feature (that is, identical words are deduplicated), the word variance feature, the non-word ratio feature, and the total word feature obtained based on word weighting, wherein
  • the three features of the total word quantity feature, the word type quantity feature, and the word variance feature measure the distribution of the target words in the code to be annotated; non-English words are variable names that are "made up" by the code author and have no practical meaning.
  • the non-word ratio feature measures the interpretability of the words; the total word feature contains the contextual information of the words.
  • the total number of words to be counted is counted to determine the total word quantity feature; the number of different words to be counted is counted to determine the word type quantity feature; the frequency of occurrence of each word to be counted is calculated, and the variance of these frequencies determines the word variance feature; the proportion of non-English words among the plurality of words to be counted is counted to determine the non-word ratio feature; the second preset word feature conversion model is used to convert each of the plurality of words to be counted into a word feature; with the frequency of occurrence of each word to be counted as the weight, all the word features are weighted and summed to generate the total word feature;
  • finally, the total word quantity feature, the word type quantity feature, the word variance feature, the non-word ratio feature, and the total word feature are spliced to generate the text feature vector corresponding to the code segment to be annotated.
  • the second preset word feature conversion model may be a Word2Vec model, etc., which is not limited in this embodiment of the present invention.
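  • A minimal sketch of the text feature vector follows. The small `ENGLISH_WORDS` dictionary and the `TOY_WORD_EMBEDDINGS` table are made-up stand-ins for a real word list and a trained Word2Vec model, and two choices are assumptions of this sketch: the variance is computed over raw occurrence counts, and relative frequency is used as the weight.

```python
from collections import Counter
from statistics import pvariance
from typing import Dict, List, Sequence

# Toy dictionary and 2-dimensional word features (assumptions for illustration).
ENGLISH_WORDS = {"user", "count", "get", "owner"}
TOY_WORD_EMBEDDINGS: Dict[str, List[float]] = {
    "user": [1.0, 0.0],
    "count": [0.0, 1.0],
    "get": [1.0, 1.0],
}
DIM = 2


def text_feature_vector(words: Sequence[str]) -> List[float]:
    counts = Counter(words)
    total = len(words)
    if total == 0:
        return [0.0] * (4 + DIM)
    # Word variance feature: population variance of the per-word occurrence counts.
    variance = pvariance([float(c) for c in counts.values()])
    # Non-word ratio feature: share of occurrences that are not dictionary words.
    non_word_ratio = sum(c for w, c in counts.items() if w not in ENGLISH_WORDS) / total
    # Total word feature: frequency-weighted sum of word features.
    weighted = [0.0] * DIM
    for word, count in counts.items():
        emb = TOY_WORD_EMBEDDINGS.get(word, [0.0] * DIM)
        for i in range(DIM):
            weighted[i] += (count / total) * emb[i]
    # Splice: total words, distinct words, variance, non-word ratio, weighted feature.
    return [float(total), float(len(counts)), variance, non_word_ratio] + weighted


vec = text_feature_vector(["user", "count", "get", "user", "tmpx"])
```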
  • Step 207 according to the complexity of the to-be-annotated code fragment, determine the structural feature vector corresponding to the to-be-annotated code fragment;
  • the step 207 may include the following sub-steps:
  • the line number feature, the nested statement number feature, the maximum nesting level feature, the formal parameter quantity feature, and the comprehensive feature are spliced together to generate a structural feature vector corresponding to the code segment to be annotated.
  • the embodiment of the present invention also needs to determine the structural features of the code segment to be annotated, in order to determine the complexity of the code segment to be annotated.
  • the structural features of function code fragments are some features that can describe the structure of function code fragments. These structural features are: the number of lines of code, the number of nested statements, the maximum level of nested statements, whether the function has formal parameters and the number of formal parameters, the number of words in the longest statement, the number of API calls, the number of variables, the number of identifiers and the number of internal annotations, etc.
  • the so-called nested statement refers to a statement containing another statement. Figure 3 shows an example in which a for loop statement contains an if conditional statement.
  • the complexity of the code fragment is positively related to its importance.
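  • Four of the structural features named above (line count, number of nested statements, maximum nesting level, number of formal parameters) can be sketched with Python's standard `ast` module, again using a Python fragment as a stand-in for the Java fragments mentioned in the embodiment. Two points are interpretations of this sketch rather than statements of the patent: a "nested statement" is counted as a compound statement that sits inside another compound statement, and the comprehensive feature is omitted.

```python
import ast
from typing import List

# Compound statement kinds treated as nesting constructs (assumption of this sketch).
NESTING = (ast.If, ast.For, ast.While, ast.With, ast.Try)


def structural_feature_vector(source: str) -> List[int]:
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    line_count = len([ln for ln in source.splitlines() if ln.strip()])
    nested = 0
    max_depth = 0

    def visit(node: ast.AST, depth: int) -> None:
        nonlocal nested, max_depth
        for child in ast.iter_child_nodes(node):
            if isinstance(child, NESTING):
                if depth >= 1:  # compound statement inside another compound statement
                    nested += 1
                max_depth = max(max_depth, depth + 1)
                visit(child, depth + 1)
            else:
                visit(child, depth)

    visit(func, 0)
    n_params = len(func.args.args)  # positional formal parameters only
    return [line_count, nested, max_depth, n_params]


# Made-up example fragment: a for loop containing an if statement.
EXAMPLE = '''
def total(xs, limit):
    s = 0
    for x in xs:
        if x > limit:
            s += x
    return s
'''

features = structural_feature_vector(EXAMPLE)
```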
  • Step 208 Determine a relational feature vector corresponding to the to-be-annotated code segment based on the number of function calls of the to-be-annotated code segment.
  • the number of function calls includes the number of additional functions called and the number of times they are called.
  • Step 208 may include the following sub-steps:
  • the number of called extra functions and the called times are spliced together to generate a relational feature vector corresponding to the code segment to be annotated.
  • a method similar to a social network or a directed graph can be used: the out-degree value defines the number of additional functions called in the current fragment to be annotated, and the in-degree value defines the number of times the fragment is called. Then, the entire code file to which the code fragment to be annotated belongs is scanned and traversed to determine the out-degree value and in-degree value of the code fragment to be annotated, that is, the number of additional functions called and the number of times it is called, and these two features are then spliced to generate the relational feature vector corresponding to the code fragment to be annotated.
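  • The out-degree and in-degree computation can be sketched on a call graph that is assumed to have been extracted already. The graph representation (caller name mapped to a list of callee names, one entry per call site), the function name, and the choice to ignore recursive self-calls are assumptions of this sketch.

```python
from typing import Dict, List


def relational_feature_vector(call_graph: Dict[str, List[str]], target: str) -> List[int]:
    """Return [out-degree, in-degree] of `target` in the file-level call graph."""
    # Out-degree: number of distinct additional functions the target calls.
    out_degree = len(set(call_graph.get(target, [])) - {target})
    # In-degree: number of times other functions call the target.
    in_degree = sum(
        calls.count(target) for caller, calls in call_graph.items() if caller != target
    )
    return [out_degree, in_degree]


# Made-up call graph for one code file.
CALL_GRAPH = {
    "main": ["load", "process", "process"],
    "process": ["validate", "save"],
    "load": [],
    "validate": [],
    "save": [],
}

relational = relational_feature_vector(CALL_GRAPH, "process")
```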
  • steps 202-204 as a whole, steps 205-206 as a whole, step 207, and step 208 can be executed in parallel.
  • Step 209 inputting the first feature vector into the target classification model, and outputting the result of judging the importance of the code fragment to be annotated;
  • the step 209 may include the following sub-steps:
  • the first feature vector may be input into the target classification model, and the target classification model comprehensively judges based on the first feature vector to obtain the model output.
  • if the output of the target classification model is the first preset label, it is determined that the importance judgment result of the code segment to be annotated is important; if the output of the target classification model is the second preset label, it is determined that the importance judgment result of the code segment to be annotated is unimportant.
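  • Step 209 can be sketched as splicing the four feature vectors and mapping the model's output label to a judgment. The `stub_model` below is a made-up threshold rule used only so the sketch runs; in the embodiment the model would be the trained target classification model (for example a random forest).

```python
from typing import Callable, List, Sequence

IMPORTANT = 1    # first preset label
UNIMPORTANT = 0  # second preset label


def first_feature_vector(
    grammatical: Sequence[float],
    text: Sequence[float],
    structural: Sequence[float],
    relational: Sequence[float],
) -> List[float]:
    """Splice the four feature vectors into one input vector."""
    return [float(x) for part in (grammatical, text, structural, relational) for x in part]


def judge_importance(features: Sequence[float], model: Callable[[Sequence[float]], int]) -> str:
    """Map the model's output label to the importance judgment result."""
    label = model(features)
    if label == IMPORTANT:
        return "important"
    if label == UNIMPORTANT:
        return "unimportant"
    raise ValueError(f"unexpected label: {label!r}")


def stub_model(features: Sequence[float]) -> int:
    """Hypothetical stand-in classifier: a simple threshold on the feature sum."""
    return IMPORTANT if sum(features) > 10 else UNIMPORTANT


features = first_feature_vector([0.5, 0.5], [5.0, 4.0], [6, 1, 2, 2], [2, 2])
result = judge_importance(features, stub_model)
```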
  • Fig. 4 shows a flowchart of steps of a method for judging the importance of a code segment provided by an embodiment of the present invention.
  • the syntax feature extraction process includes: preparing to extract syntax features; converting function code segments into abstract syntax trees; obtaining statement type information of function code segments from the abstract syntax tree; counting the frequency distribution of different statement types; counting the number of different statement types ; Count the total number of sentences; convert sentence types into features and weight them according to the frequency of occurrence; concatenate to obtain grammatical features.
  • the text feature extraction process includes: preparing to extract text features; extracting words in function code snippets; dividing variable words according to the camel case rule or underscore rule; uniformly processing words into lowercase; deleting stop words; stemming; counting words count the number of types of words used; calculate the variance of the frequency of occurrence of different words; count the proportion of non-English words; convert words into features and weighted sums according to their frequency of occurrence; splicing to obtain text features.
  • the structural feature extraction process includes: preparing to extract structural features; counting the number of lines of code in the function code fragment; counting the number of nested statements in the function code fragment; counting the maximum level of nested statements in the function code fragment; counting the formal parameters in the function code fragment count; count the number of words in the longest statement in the function snippet; count the number of API calls in the function snippet; count the number of variables in the function snippet; count the identifiers in the function snippet and count the number of identifiers in the function snippet Number of internal annotations; splicing to obtain structural features.
  • the relational feature extraction process includes: preparing to extract relational features; defining the concepts of out-degree and in-degree values; counting out-degree and in-degree values of each function; splicing to obtain relational features.
  • the final features of each function code fragment are obtained by splicing; the classification model is trained on these features together with the labels; the target classification model is obtained, where each training pass outputs a result indicating whether the function code fragment is important;
  • the final feature of the function code fragment is extracted; the final feature is input to the target classification model, which outputs whether the function code fragment is important.
  • FIG. 5 is a structural block diagram of an apparatus for judging the importance of a code segment according to an embodiment of the present invention.
  • the present invention also provides a device for judging the importance of a code segment, including:
  • a code fragment receiving module 501 configured to receive code fragments to be annotated
  • the first feature vector extraction module 502 is used to extract the first feature vector of the code fragment to be annotated
  • the target classification model is generated through a preset classification model training process.
  • the classification model training module includes:
  • the annotated code file receiving submodule is used to obtain the annotated code file from the preset software repository;
  • a file division submodule which is used to divide the annotated code file in units of functions to generate a plurality of training code fragments
  • a first label setting submodule used for setting a first preset label for the training code snippet with a preset type annotation
  • a second label setting submodule configured to set a second preset label for the training code fragment that does not have a preset type annotation
  • the second feature vector extraction submodule is used to extract the second feature vector of each of the training code fragments respectively;
  • the classification model training sub-module is used for training a preset initial classification model by using a plurality of the second feature vectors to obtain a target classification model.
  • the first feature vector includes a syntax feature vector, a text feature vector, a structural feature vector and a relational feature vector
  • the first feature vector extraction module 502 includes:
  • a statement type information extraction submodule used for extracting the statement type information of the code fragment to be annotated from the abstract syntax tree
  • a grammatical feature vector determination submodule configured to determine the grammatical feature vector corresponding to the code fragment to be annotated according to the statistical result of the statement type information
  • a target word extraction submodule used for extracting target words from the to-be-annotated code fragment according to a preset variable word division rule
  • a text feature vector determination submodule, used for determining the text feature vector corresponding to the code fragment to be annotated according to the statistical result of the target word
  • a structural feature vector determination submodule configured to determine the structural feature vector corresponding to the to-be-annotated code segment according to the complexity of the to-be-annotated code segment;
  • the relational feature vector determination submodule is configured to determine the relational feature vector corresponding to the to-be-annotated code segment based on the number of function calls of the to-be-annotated code segment.
  • the statement type information includes the occurrence frequency, quantity and frequency distribution of multiple statement types
  • the grammatical feature vector determination submodule includes:
  • a statement frequency distribution feature determining unit configured to count the frequency distributions of the multiple statement types to determine the statement frequency distribution feature
  • a statement quantity feature determining unit configured to count the number of the multiple statement types and determine the statement quantity feature
  • a total statement quantity feature determining unit used to count the total number of statements corresponding to the multiple statement types, and determine the total statement number feature
  • a statement type feature conversion unit configured to convert the multiple statement types into statement type features respectively by adopting the first preset word feature conversion model
  • a total statement type feature determining unit configured to use the frequency of occurrence as a weight to perform a weighted summation on a plurality of the statement type features to determine the total statement type feature
  • a grammatical feature vector generating unit configured to splice the statement frequency distribution feature, the statement quantity feature, the total statement quantity feature and the total statement type feature to generate a grammatical feature vector corresponding to the code segment to be annotated.
  • the preset variable word division rules include the camel case rule or the underscore rule
  • the target word extraction submodule includes:
  • a word extraction unit for extracting words from the to-be-annotated code fragment
  • a to-be-processed word determination unit, used for applying the camel case rule or the underscore rule to determine the words to be processed from the extracted words;
  • a word to be extracted determination unit used for deleting preset stop words from the to-be-processed word to obtain the to-be-extracted word
  • the target word determination unit is used for extracting the stem in the words to be extracted to generate the target word.
  • the target word includes a plurality of words to be counted
  • the text feature vector determination submodule includes:
  • a total word quantity feature determination unit used to count the total number of the multiple words to be counted, and determine the total word quantity feature
  • a word type and quantity feature determining unit used to count the types and quantities of the plurality of words to be counted, and determine the word type and quantity characteristics
  • a word variance feature determining unit used to calculate the variance of the frequency of occurrence of each word to be counted among the multiple words to be counted, to determine the word variance feature
  • a non-word ratio feature determining unit used to count the ratio of non-English words in the plurality of words to be counted, and determine the non-word ratio feature
  • a word feature conversion unit configured to convert the plurality of words to be counted into word features respectively by adopting a second preset word feature conversion model
  • the total word feature generating unit is used for taking the frequency of occurrence of each of the words to be counted as a weight, and performing weighted summation on all of the word features to generate a total word feature;
  • the text feature vector determination unit is used for splicing the total word quantity feature, the word type quantity feature, the word variance feature, the non-word ratio feature and the total word feature, and generating the text feature vector corresponding to the to-be-annotated code fragment.
  • the structural feature vector determination submodule includes:
  • a line number feature determination unit used to count the number of lines of code in the to-be-annotated code fragment to determine the line number feature
  • a nested statement quantity feature determination unit, which is used to count the number of nested statements in the code fragment to be annotated, and to determine the nested statement quantity feature;
  • the maximum nesting level feature determination unit is used to count the maximum nesting level in the code fragment to be annotated, and determine the maximum nesting level feature
  • a formal parameter quantity feature determination unit used to count the number of formal parameters in the to-be-annotated code fragment, and determine the formal parameter quantity feature
  • the comprehensive feature determination unit is used to count the word quantity feature of the longest statement in the code fragment to be annotated, the API call quantity feature of the code fragment to be annotated, the variable quantity feature of the code fragment to be annotated, the identifier quantity feature of the code fragment to be annotated and the internal annotation quantity feature of the code fragment to be annotated, and to splice them in sequence to generate the comprehensive feature;
  • a structural feature vector generation unit, used for splicing the line number feature, the nested statement quantity feature, the nested maximum level feature, the formal parameter quantity feature and the comprehensive feature to generate the structural feature vector corresponding to the to-be-annotated code fragment.
  • the number of function calls includes the number of additional functions called and the number of times the function is called
  • the relational feature vector determination submodule includes:
  • a function invocation number determination unit configured to traverse the to-be-annotated code file to which the to-be-annotated code fragment belongs, and determine the number of the additional functions called and the called times of the to-be-annotated code fragment;
  • a relational feature vector generating unit configured to concatenate the number of called extra functions and the called times to generate a relational feature vector corresponding to the code segment to be annotated.
  • the importance output module 503 includes:
  • a feature vector input submodule for inputting the first feature vector into the target classification model
  • an importance determination submodule configured to determine the importance judgment result of the code fragment to be annotated as important when the output of the target classification model is the first preset label
  • the importance negation sub-module is configured to determine that the importance judgment result of the code segment to be annotated is not important when the output of the target classification model is the second preset label.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be electrical, mechanical, or in other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.

Abstract

A method and apparatus for determining the importance of a code segment. The method for determining the importance of a code segment comprises: generating a target classification model by using a preset classification model training process; when a code segment to be annotated is received (101), extracting a first feature vector of said code segment (102); and inputting the first feature vector into the target classification model, and outputting an importance determination result of said code segment (103). By means of the method, the importance of a code to be annotated can be determined, thereby facilitating the normalization of annotation behavior of software development and maintenance personnel, and keeping a code annotation amount within a suitable range.

Description

A method and device for judging the importance of a code fragment
This application claims priority to Chinese patent application No. 202011418126.2, entitled "A method and device for judging the importance of a code fragment", filed with the China Patent Office on December 7, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of computer technology, and in particular to a method and device for judging the importance of a code fragment.
Background
Research directions in intelligent software engineering include software repository mining, program code comprehension, automatic code generation, and automatic comment generation, all of which aim to help software developers work more efficiently during development and maintenance. In recent years, with the rise of machine learning and deep learning, researchers in intelligent software engineering have begun to explore whether these techniques can solve their research problems, and have obtained many encouraging results. Examples include information retrieval and recommender system techniques that help developers make better use of open-source software repositories; machine learning methods that identify the important words in code so that other tasks can understand a program correctly; and code generation based on convolutional neural networks.
Automatic generation of code comments is a hot topic in intelligent software engineering research. Code comments help readers understand the intentions and reasoning of the code's author, and they play an important role in software maintenance, code reuse, and collaborative team development. This line of research aims to have a machine automatically generate the comment for a given code fragment, reducing the time software developers spend writing comments and improving development efficiency. Using machine learning and deep learning, researchers have recast the problem as a "translation task" in natural language processing. With a sequence-to-sequence model (that is, a model that takes a text sequence as input and outputs a text sequence), the code language is "translated" into natural language, and the resulting natural language is treated as the comment for the corresponding code fragment.
However, existing techniques for predicting where code comments should be placed simply convert the code text into feature vectors to train a model. This practice is common in natural language processing, but it amounts to treating code text entirely as natural language and uses only the textual information of the code, so the results are not ideal. Moreover, the textual information itself is underused: words are simply converted into features without considering how the text is distributed. This may lower the accuracy of determining comment locations, making it impossible to give developers reasonable suggestions and in turn reducing the efficiency of software development and maintenance personnel.
Summary of the Invention
The present invention provides a method and device for judging the importance of a code fragment. It addresses the technical problem that existing techniques for predicting code comment locations treat code text as unstructured plain text and make little use of features in multiple dimensions, which lowers the accuracy of determining comment locations, prevents reasonable suggestions from being given to developers, and in turn reduces the efficiency of software development and maintenance personnel.
The method for judging the importance of a code fragment provided by the present invention includes:
receiving a code fragment to be annotated;
extracting a first feature vector of the code fragment to be annotated;
inputting the first feature vector into a target classification model, and outputting an importance judgment result for the code fragment to be annotated;
wherein the target classification model is generated through a preset classification model training process.
Optionally, the classification model training process includes:
obtaining annotated code files from a preset software repository;
dividing the annotated code files in units of functions to generate a plurality of training code fragments;
setting a first preset label for each training code fragment that has an annotation of a preset type;
setting a second preset label for each training code fragment that does not have an annotation of the preset type;
extracting a second feature vector of each of the training code fragments respectively;
training a preset initial classification model with the plurality of second feature vectors to obtain the target classification model.
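The labeling part of this training process can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described by the source: it assumes Python source files (the embodiment mentions Java), uses the standard `ast` module to split a file into functions, and treats a docstring as a hypothetical stand-in for the "annotation of a preset type". The name `label_functions` and the label encodings 1 and 0 are also assumptions.

```python
import ast

def label_functions(source):
    """Split a source file into functions and attach a training label.

    Label 1 stands for the first preset label (function has a docstring),
    label 0 for the second preset label (no docstring). Using a docstring
    as the "preset type annotation" is an assumption for illustration.
    """
    samples = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            label = 1 if ast.get_docstring(node) else 0
            samples.append((node.name, label))
    return samples

SOURCE = '''
def documented(x):
    """Return x doubled."""
    return 2 * x

def bare(x):
    return x + 1
'''

labeled = label_functions(SOURCE)
print(labeled)
```

Each labeled function would then be paired with its feature vector to train the initial classification model; this passage does not name a specific model, so that step is left out of the sketch.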
Optionally, the first feature vector includes a grammatical feature vector, a text feature vector, a structural feature vector, and a relational feature vector, and the step of extracting the first feature vector of the code fragment to be annotated includes:
converting the code fragment to be annotated into an abstract syntax tree;
extracting statement type information of the code fragment to be annotated from the abstract syntax tree;
determining the grammatical feature vector corresponding to the code fragment to be annotated according to the statistical result of the statement type information;
extracting target words from the code fragment to be annotated according to a preset variable word division rule;
determining the text feature vector corresponding to the code fragment to be annotated according to the statistical result of the target words;
determining the structural feature vector corresponding to the code fragment to be annotated according to the complexity of the code fragment to be annotated;
determining the relational feature vector corresponding to the code fragment to be annotated based on the number of function calls of the code fragment to be annotated.
Optionally, the statement type information includes the occurrence frequency, quantity, and frequency distribution of multiple statement types, and the step of determining the grammatical feature vector corresponding to the code fragment to be annotated according to the statistical result of the statement type information includes:
counting the frequency distribution of the multiple statement types to determine a statement frequency distribution feature;
counting the number of the multiple statement types to determine a statement quantity feature;
counting the total number of statements corresponding to the multiple statement types to determine a total statement quantity feature;
converting the multiple statement types into statement type features respectively by using a first preset word feature conversion model;
performing a weighted summation on the plurality of statement type features, with the occurrence frequency as the weight, to determine a total statement type feature;
splicing the statement frequency distribution feature, the statement quantity feature, the total statement quantity feature, and the total statement type feature to generate the grammatical feature vector corresponding to the code fragment to be annotated.
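A minimal sketch of these steps, assuming Python source and the standard `ast` module in place of a Java parser. The fixed vocabulary `TYPE_VOCAB` and the toy two-dimensional vectors in `TYPE_VECTORS` are hypothetical stand-ins for the first preset word feature conversion model, and reading the "statement quantity feature" as the number of distinct statement types is an interpretation.

```python
import ast
from collections import Counter

# Hypothetical vocabulary and toy type vectors (stand-ins for a learned model).
TYPE_VOCAB = ["Assign", "AugAssign", "For", "If", "Return"]
TYPE_VECTORS = {"Assign": [1.0, 0.0], "AugAssign": [0.0, 1.0]}

def syntax_features(source):
    """Build a grammatical feature vector from statement-type statistics."""
    tree = ast.parse(source)
    counts = Counter(
        type(node).__name__ for node in ast.walk(tree) if isinstance(node, ast.stmt)
    )
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    # Frequency distribution over the fixed vocabulary.
    freq_distribution = [counts[name] / total for name in TYPE_VOCAB]
    # Frequency-weighted sum of the per-type vectors.
    weighted = [0.0, 0.0]
    for name, count in counts.items():
        vec = TYPE_VECTORS.get(name, [0.0, 0.0])
        weighted = [w + (count / total) * v for w, v in zip(weighted, vec)]
    # Splice: distribution, number of distinct types, total statements, weighted sum.
    return freq_distribution + [float(len(counts)), float(total)] + weighted

SOURCE = '''
def f(xs):
    total = 0
    count = 0
    for x in xs:
        if x > 0:
            total += x
            count += 1
    return total
'''

features = syntax_features(SOURCE)
print(features)
```

The sample function contains eight statements of six types (the function definition itself counts as one), so two assignments contribute a frequency of 0.25 to the first vocabulary slot.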
Optionally, the preset variable word division rule includes a camel case rule or an underscore rule, and the step of extracting target words from the code fragment to be annotated according to the preset variable word division rule includes:
extracting words from the code fragment to be annotated;
applying the camel case rule or the underscore rule to determine words to be processed from the extracted words;
deleting preset stop words from the words to be processed to obtain words to be extracted;
extracting the stems of the words to be extracted to generate the target words.
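The word extraction steps might look like the sketch below. It starts from a list of identifiers rather than raw code, lowercases each piece as the flowchart summary describes, and uses a deliberately crude suffix stripper where a real stemmer (for example a Porter stemmer) would normally be used. `STOP_WORDS`, `split_identifier`, `crude_stem`, and `extract_target_words` are all hypothetical names and values.

```python
import re

# Hypothetical stop-word list; the source only says the list is preset.
STOP_WORDS = {"get", "set", "the", "of", "to"}

def split_identifier(identifier):
    """Split on underscores first, then on camel-case boundaries."""
    parts = []
    for chunk in identifier.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return parts

def crude_stem(word):
    """Very rough stand-in for a real stemmer: strips 'ing' or a final 's'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_target_words(identifiers):
    """Split, lowercase, drop stop words, then stem."""
    words = []
    for identifier in identifiers:
        for part in split_identifier(identifier):
            lowered = part.lower()
            if lowered not in STOP_WORDS:
                words.append(crude_stem(lowered))
    return words

target_words = extract_target_words(["getUserNames", "parse_http_responses"])
print(target_words)
```

The regular expression keeps runs of capitals together, so an identifier such as `parseHTTPResponse` splits into `parse`, `HTTP`, and `Response` rather than single letters.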
Optionally, the target words include a plurality of words to be counted, and the step of determining the text feature vector corresponding to the code fragment to be annotated according to the statistical result of the target words includes:
counting the total number of the plurality of words to be counted to determine a total word quantity feature;
counting the number of distinct words among the plurality of words to be counted to determine a word type quantity feature;
calculating the variance of the occurrence frequency of each distinct word among the plurality of words to be counted to determine a word variance feature;
counting the proportion of non-English words among the plurality of words to be counted to determine a non-word ratio feature;
converting the plurality of words to be counted into word features respectively by using a second preset word feature conversion model;
performing a weighted summation on all of the word features, with the occurrence frequency of each word to be counted as the weight, to generate a total word feature;
splicing the total word quantity feature, the word type quantity feature, the word variance feature, the non-word ratio feature, and the total word feature to generate the text feature vector corresponding to the code fragment to be annotated.
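These statistics can be sketched in a few lines. The dictionary `ENGLISH_WORDS` and the toy vectors in `WORD_VECTORS` are hypothetical stand-ins for a real English word list and for the second preset word feature conversion model; computing the variance over per-word counts and the non-English ratio over occurrences are interpretations of the text.

```python
from collections import Counter

# Hypothetical English dictionary and toy 2-d word vectors.
ENGLISH_WORDS = {"user", "name", "parse", "response"}
WORD_VECTORS = {"user": [1.0, 0.0], "name": [0.0, 1.0]}

def text_features(words):
    """Build a text feature vector from word statistics."""
    counts = Counter(words)
    total = len(words)
    distinct = len(counts)
    # Population variance of the per-word occurrence counts.
    mean = total / distinct
    variance = sum((c - mean) ** 2 for c in counts.values()) / distinct
    # Share of word occurrences that are not in the English dictionary.
    non_english = sum(1 for w in words if w not in ENGLISH_WORDS) / total
    # Frequency-weighted sum of the word vectors.
    weighted = [0.0, 0.0]
    for word, count in counts.items():
        vec = WORD_VECTORS.get(word, [0.0, 0.0])
        weighted = [w + (count / total) * v for w, v in zip(weighted, vec)]
    return [float(total), float(distinct), variance, non_english] + weighted

words = ["user", "user", "user", "name", "xq", "cfg", "cfg", "cfg"]
text_vector = text_features(words)
print(text_vector)
```

The function assumes a non-empty word list; an empty fragment would need a separate guard.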
Optionally, the step of determining the structural feature vector corresponding to the code fragment to be annotated according to the complexity of the code fragment to be annotated includes:
counting the number of lines of code in the code fragment to be annotated to determine a line number feature;
counting the number of nested statements in the code fragment to be annotated to determine a nested statement quantity feature;
counting the maximum nesting level in the code fragment to be annotated to determine a maximum nesting level feature;
counting the number of formal parameters in the code fragment to be annotated to determine a formal parameter quantity feature;
counting, respectively, the word quantity feature of the longest statement in the code fragment to be annotated, the API call quantity feature of the code fragment to be annotated, the variable quantity feature of the code fragment to be annotated, the identifier quantity feature of the code fragment to be annotated, and the internal annotation quantity feature of the code fragment to be annotated, and splicing them in sequence to generate a comprehensive feature;
splicing the line number feature, the nested statement quantity feature, the maximum nesting level feature, the formal parameter quantity feature, and the comprehensive feature to generate the structural feature vector corresponding to the code fragment to be annotated.
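A partial sketch of the structural features, again assuming Python source and the `ast` module. It covers only the line count, nested statement count, maximum nesting level, formal parameter count, and call count; which node types count as "nested statements" (`NESTING_TYPES`) is an assumption, and the remaining components of the comprehensive feature are omitted.

```python
import ast

# Node types treated as nesting constructs (an assumption for illustration).
NESTING_TYPES = (ast.If, ast.For, ast.While, ast.With, ast.Try)

def max_depth(node, depth=0):
    """Return the deepest nesting level reached below this node."""
    if isinstance(node, NESTING_TYPES):
        depth += 1
    child_depths = [max_depth(child, depth) for child in ast.iter_child_nodes(node)]
    return max([depth] + child_depths)

def structural_features(source):
    """Return [lines, nested statements, max nesting, parameters, calls]."""
    func = ast.parse(source).body[0]  # assumes the fragment is one function
    nodes = list(ast.walk(func))
    return [
        len(source.strip().splitlines()),
        sum(isinstance(n, NESTING_TYPES) for n in nodes),
        max_depth(func),
        len(func.args.args),
        sum(isinstance(n, ast.Call) for n in nodes),
    ]

SOURCE = '''
def f(xs, limit):
    total = 0
    for x in xs:
        if x > limit:
            total += abs(x)
    return total
'''

structure_vector = structural_features(SOURCE)
print(structure_vector)
```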
Optionally, the number of function calls includes the number of additional functions called and the number of times called, and the step of determining the relational feature vector corresponding to the code fragment to be annotated based on the number of function calls of the code fragment to be annotated includes:
traversing the to-be-annotated code file to which the code fragment to be annotated belongs, and determining the number of additional functions called by the code fragment to be annotated and the number of times it is called;
splicing the number of additional functions called and the number of times called to generate the relational feature vector corresponding to the code fragment to be annotated.
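The two relational values correspond to an out-degree and an in-degree in the file's call graph. The sketch below, which assumes Python source and considers only direct calls by name to top-level functions in the same file, counts distinct other functions called (out-degree) and the number of call sites that target each function (in-degree). The name `relation_features` is hypothetical.

```python
import ast

def relation_features(source):
    """Map each top-level function to [out-degree, in-degree]."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    in_degree = {name: 0 for name in funcs}
    out_degree = {}
    for name, node in funcs.items():
        callees = set()
        for sub in ast.walk(node):
            # Only direct calls such as helper(x); method calls are ignored.
            if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                callee = sub.func.id
                if callee in funcs and callee != name:
                    callees.add(callee)
                    in_degree[callee] += 1
        out_degree[name] = len(callees)
    return {name: [out_degree[name], in_degree[name]] for name in funcs}

SOURCE = '''
def helper(x):
    return x + 1

def twice(x):
    return helper(helper(x))

def main():
    return twice(1) + helper(2)
'''

relations = relation_features(SOURCE)
print(relations)
```

Here `helper` calls nothing else but is called three times, while `main` calls two other functions and is never called.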
Optionally, the step of inputting the first feature vector into the target classification model and outputting the importance judgment result for the code fragment to be annotated includes:
inputting the first feature vector into the target classification model;
when the output of the target classification model is the first preset label, determining that the importance judgment result of the code fragment to be annotated is important;
when the output of the target classification model is the second preset label, determining that the importance judgment result of the code fragment to be annotated is not important.
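The mapping from label to judgment is simple enough to show directly. In this sketch the label encodings and `toy_model` are hypothetical; `toy_model` is a placeholder threshold rule standing in for a trained target classification model, not a description of how the real model decides.

```python
# Hypothetical encodings of the first and second preset labels.
FIRST_LABEL, SECOND_LABEL = 1, 0

def judge_importance(feature_vector, model):
    """Run the classifier and translate its label into a judgment."""
    label = model(feature_vector)
    if label == FIRST_LABEL:
        return "important"
    if label == SECOND_LABEL:
        return "not important"
    raise ValueError(f"unexpected label: {label!r}")

def toy_model(feature_vector):
    """Placeholder classifier: a trained model would be used instead."""
    return FIRST_LABEL if sum(feature_vector) > 10 else SECOND_LABEL

result_large = judge_importance([6, 2, 2, 2, 1], toy_model)
result_small = judge_importance([1, 0, 0, 0, 0], toy_model)
print(result_large, result_small)
```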
The present invention also provides a device for judging the importance of a code fragment, including:
a code fragment receiving module, used for receiving a code fragment to be annotated;
a first feature vector extraction module, used for extracting a first feature vector of the code fragment to be annotated;
an importance output module, used for inputting the first feature vector into a target classification model and outputting an importance judgment result for the code fragment to be annotated;
wherein the target classification model is generated by a preset classification model training module.
As can be seen from the above technical solutions, the present invention has the following advantages:
The present invention generates a target classification model through a preset classification model training process. When a code fragment to be annotated is received, a first feature vector is extracted from it, and the first feature vector is then input into the target classification model to obtain the importance judgment result of the code fragment to be annotated. This addresses the technical problem that existing techniques for predicting code comment locations treat code text as unstructured plain text and make little use of features in multiple dimensions, which lowers the accuracy of determining comment locations, prevents reasonable suggestions from being given to developers, and reduces the efficiency of software development and maintenance personnel. The importance of the code to be annotated can thus be judged efficiently, which helps improve the commenting behavior of software development and maintenance personnel and keeps the amount of code comments within a more suitable range.
Description of Drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of the steps of a method for judging the importance of a code fragment provided by an embodiment of the present invention;
FIG. 2 is a flowchart of the steps of a method for judging the importance of a code fragment provided by an optional embodiment of the present invention;
FIG. 3 is an example diagram of a nested statement in an embodiment of the present invention;
FIG. 4 is a flowchart of the steps of a method for judging the importance of a code fragment provided by another embodiment of the present invention;
FIG. 5 is a structural block diagram of a device for judging the importance of a code fragment provided by an embodiment of the present invention.
具体实施方式Detailed ways
The embodiments of the present invention provide a method and an apparatus for judging the importance of a code segment. They address a technical problem in existing techniques for predicting where code comments belong: those techniques treat code text as unstructured plain text and make little use of features from multiple dimensions, so the predicted comment positions are inaccurate, developers cannot be given reasonable suggestions, and the efficiency of software development and maintenance personnel suffers.
To make the objectives, features and advantages of the present invention more apparent and easier to understand, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
Please refer to FIG. 1, which is a flowchart of the steps of a method for judging the importance of a code segment according to an embodiment of the present invention.
A method for judging the importance of a code segment provided by the present invention includes:
Step 101: receive a code segment to be annotated;
In this embodiment of the present invention, in order to judge the importance of a code segment more accurately and thereby better support downstream tasks such as deciding where to place code comments, the code segment to be annotated that the user inputs can be received and put through the importance judgment process before the user annotates it.
The code segment to be annotated may be, for example, a Java code segment; the embodiments of the present invention do not limit this.
Step 102: extract a first feature vector of the code segment to be annotated;
After the code segment to be annotated is received, the first feature vector, comprising for example syntax features, text features, structural features and relational features, is extracted from it. This vector serves as the input to the subsequent model, and the importance of the code segment is judged on the basis of these features.
Step 103: input the first feature vector into the target classification model and output an importance judgment result for the code segment to be annotated;
In a specific implementation, the target classification model is generated by a preset classification model training process. Once the target classification model is obtained, the first feature vector is input into it to judge the importance of the code segment to be annotated, and whether the code segment is important is output as the importance judgment result.
The present invention generates the target classification model through a preset classification model training process. When a code segment to be annotated is received, a first feature vector is extracted from it and input into the target classification model to obtain the importance judgment result for the code segment. This solves the technical problem that existing techniques for predicting code comment positions treat code text as unstructured plain text and make little use of multi-dimensional features, which lowers the accuracy of the predicted comment positions, prevents reasonable suggestions from being given to developers, and reduces the efficiency of software development and maintenance personnel. The importance of the code to be annotated can thus be judged efficiently, which helps optimize the commenting behavior of software development and maintenance personnel and keeps the amount of code comments within a more suitable range.
Please refer to FIG. 2, which is a flowchart of the steps of a method for judging the importance of a code segment according to an optional embodiment of the present invention.
A method for judging the importance of a code segment provided by the present invention includes:
Step 201: receive a code segment to be annotated;
Before step 201, so that the importance of code segments to be annotated can later be judged quickly, the target classification model can be generated in advance through a classification model training process, which includes the following steps S1-S6:
S1: obtain annotated code files from a preset software repository;
In this embodiment of the present invention, to obtain a sufficient amount of reliable training data, code files of Java projects that were open-sourced by large international companies or organizations and have a long maintenance history are first obtained from the software project repository GitHub; these are the annotated code files.
S2: divide the annotated code files by function to generate multiple training code segments;
S3: set a first preset label for each training code segment that has a comment of a preset type;
S4: set a second preset label for each training code segment that does not have a comment of the preset type;
In a specific implementation, an annotated code file usually contains multiple code segments, whereas the purpose of the present invention is to judge the importance of individual code segments. The annotated code file can therefore first be divided by function into training code segments, one per function, and each training code segment is then labeled according to whether it has a comment of the preset type.
Specifically, a first preset label, indicating that the code segment is important, can be set for a training code segment that has a comment of the preset type, and a second preset label, indicating that the code segment is not important, for one that does not. The preset comment type may be a function header comment or the like; the first preset label may be 1 and the second preset label may be 0. The embodiments of the present invention do not limit the comment type or the label form.
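The labeling in steps S3 and S4 can be sketched as follows. This is a minimal illustration in Python, not the implementation of the embodiment: it assumes each training code segment is passed in as a string that begins with its header comment when one exists, and the function name `label_segment` is hypothetical.

```python
def label_segment(segment: str) -> int:
    """Return 1 (important) if the segment starts with a header comment, else 0.

    Assumption: the segment text begins with its function header comment,
    if any, written as a Java block comment or line comment.
    """
    stripped = segment.lstrip()
    return 1 if stripped.startswith(("/**", "/*", "//")) else 0


if __name__ == "__main__":
    commented = "/** Adds two numbers. */\nint add(int a, int b) { return a + b; }"
    bare = "int add(int a, int b) { return a + b; }"
    print(label_segment(commented), label_segment(bare))
```

A production pipeline would more likely read the comment attached to each method node by a Java parser than rely on string prefixes.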
S5: extract a second feature vector from each training code segment;
S6: train a preset initial classification model with the multiple second feature vectors to obtain the target classification model.
In this embodiment of the present invention, after the training code segments are obtained, a second feature vector must also be extracted from each of them. The second feature vector is of the same type as the first feature vector: it likewise comprises a syntax feature vector, a text feature vector, a structural feature vector and a relational feature vector, and is extracted in the same way as the first feature vector.
Once the second feature vector of each training code segment has been obtained, the second feature vectors are assembled into a training set, which is used to train the preset initial classification model to obtain the target classification model.
It is worth mentioning that the initial classification model may be a random forest model or another classification model; the embodiments of the present invention do not limit this.
A specific training process may be as follows. The data set is randomly divided into 10 equal parts; each time, one part is taken as the test set and the other nine as the training set. The model is trained on the training set and evaluated on the test set. When the model's performance on the test set has not improved for 20 consecutive iterations, the number of iterations at which performance was best is recorded. This process is repeated 10 times so that each of the 10 parts has served as the test set once, yielding 10 optimal iteration counts. The average of these 10 counts is used as the number of iterations for training the final model. Finally, a random forest model is trained on the full data set, and training is complete when the number of iterations reaches this preset value.
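The fold splitting and the choice of the final iteration count described above can be sketched in plain Python. The sketch leaves the classifier itself out: it assumes that, for each fold, a list of test-set scores (one per training iteration) is already available. The function names and the rounding of the average are assumptions made for illustration.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def best_iteration(scores, patience=20):
    """Return the 1-based iteration with the best score.

    Scanning stops once the score has failed to improve for
    `patience` consecutive iterations.
    """
    best_score, best_iter = float("-inf"), 0
    for i, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            break
    return best_iter


def choose_iteration_count(per_fold_scores, patience=20):
    """Average the best iteration of every fold (rounded to an integer)."""
    return round(mean(best_iteration(s, patience) for s in per_fold_scores))


if __name__ == "__main__":
    folds = k_fold_indices(25, k=10)
    print([len(fold) for fold in folds])
    print(choose_iteration_count([[0.1, 0.5], [0.2, 0.1, 0.3, 0.9]]))
```

Each fold in turn would serve as the test set while the remaining nine form the training set; the averaged count is then used when fitting the final model on the full data set.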
In this embodiment of the present invention, the first feature vector comprises a syntax feature vector, a text feature vector, a structural feature vector and a relational feature vector, and step 102 above may be replaced by the following steps 202-208:
Step 202: convert the code segment to be annotated into an abstract syntax tree;
An abstract syntax tree (AST), or simply syntax tree, is an abstract representation of the syntactic structure of source code. It represents the syntactic structure of a programming language as a tree, in which each node denotes a construct in the source code.
In this embodiment of the present invention, so that the syntactic structure of the code segment to be annotated is represented concretely, the code segment can be converted into an abstract syntax tree, which facilitates the subsequent extraction of the syntax feature vector.
Step 203: extract statement type information of the code segment to be annotated from the abstract syntax tree;
Because the abstract syntax tree reflects every syntactic construct in the code segment to be annotated, and therefore its statement type information, the statement type information can be extracted from the abstract syntax tree. It includes, but is not limited to, IfStmt (if statement), ForStmt (for loop statement), WhileStmt (while loop statement), and so on.
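As an analogue of steps 202 and 203, the sketch below parses a piece of source code into an abstract syntax tree and counts the statement node types. It is an illustration under a stated substitution: the embodiment works on Java code, whereas this sketch uses Python's standard `ast` module on Python source, so the node names (`If`, `For`, `Return`) differ from IfStmt and ForStmt.

```python
import ast
from collections import Counter


def statement_type_counts(source: str) -> Counter:
    """Parse the source into an AST and count each statement node type."""
    tree = ast.parse(source)
    return Counter(
        type(node).__name__
        for node in ast.walk(tree)
        if isinstance(node, ast.stmt)  # keep statements, skip expressions
    )


if __name__ == "__main__":
    example = (
        "def first_truthy(xs):\n"
        "    for x in xs:\n"
        "        if x:\n"
        "            return x\n"
        "    return None\n"
    )
    print(statement_type_counts(example))
```

For Java input, a Java parser that exposes statement node types would take the place of `ast.parse`; the counting logic stays the same.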
Step 204: determine the syntax feature vector corresponding to the code segment to be annotated according to statistics on the statement type information;
The embodiments of the present invention use a syntax feature vector of the code segment to be annotated, whose purpose is to describe the syntactic information of the code segment in its programming language. The syntax feature vector can be determined from statistics on the statement type information.
The syntax feature vector may consist of: a statement frequency distribution feature, describing the frequency distribution of the different statement types; a statement number feature, giving the number of distinct statement types (that is, after deduplicating identical statement types); a total statement number feature, giving the total number of statements; and a total statement type feature, obtained by weighting the statement types.
In this embodiment of the present invention, the statement type information includes the occurrence frequency, number and frequency distribution of multiple statement types, and step 204 may include the following sub-steps:
counting the frequency distribution of the multiple statement types to determine the statement frequency distribution feature;
counting the number of the multiple statement types to determine the statement number feature;
counting the total number of statements corresponding to the multiple statement types to determine the total statement number feature;
converting each of the multiple statement types into a statement type feature using a first preset word feature conversion model;
computing a weighted sum of the statement type features, with the occurrence frequencies as weights, to determine the total statement type feature;
concatenating the statement frequency distribution feature, the statement number feature, the total statement number feature and the total statement type feature to generate the syntax feature vector corresponding to the code segment to be annotated.
In this embodiment of the present invention, the frequency distribution of the statement types is counted to determine the statement frequency distribution feature; the number of distinct statement types is counted to determine the statement number feature; the total number of all statements is counted to determine the total statement number feature; a first preset word feature conversion model, such as a Word2Vec model, converts each statement type into a corresponding statement type feature, and the statement type features are then summed with the occurrence frequency of each statement type as its weight to determine the total statement type feature; finally, the four features are concatenated to obtain the syntax feature vector representing the code segment to be annotated.
It is worth mentioning that all of the above statistics can be computed in parallel.
Word2vec is a group of related models used to produce word features (word embeddings). These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. The network takes a word as input and must predict the words at adjacent positions. After training, a word2vec model can map each word to a feature vector, taken from the hidden layer of the neural network, that can represent the relationships between words.
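The sub-steps of step 204 can be sketched as one function that builds the syntax feature vector from a list of statement types. This is a minimal sketch under assumptions: the fixed vocabulary, the two-dimensional `TYPE_EMBEDDINGS` table (standing in for a trained Word2Vec model) and the function name are all hypothetical.

```python
from collections import Counter

# Hypothetical 2-dimensional embeddings standing in for a trained Word2Vec model.
TYPE_EMBEDDINGS = {
    "IfStmt": [1.0, 0.0],
    "ForStmt": [0.0, 1.0],
    "WhileStmt": [1.0, 1.0],
}


def syntax_feature_vector(statement_types,
                          vocabulary=("IfStmt", "ForStmt", "WhileStmt"),
                          embeddings=TYPE_EMBEDDINGS):
    """Concatenate frequency distribution, distinct count, total count and
    the frequency-weighted sum of the statement type embeddings."""
    counts = Counter(statement_types)
    total = len(statement_types)
    # Frequency distribution over a fixed vocabulary of statement types.
    distribution = [counts[t] / total if total else 0.0 for t in vocabulary]
    dim = len(next(iter(embeddings.values())))
    weighted = [0.0] * dim
    for stmt_type, count in counts.items():
        vector = embeddings.get(stmt_type, [0.0] * dim)  # unknown type -> zeros
        for i in range(dim):
            weighted[i] += (count / total) * vector[i]
    return distribution + [float(len(counts)), float(total)] + weighted


if __name__ == "__main__":
    print(syntax_feature_vector(["IfStmt", "IfStmt", "ForStmt", "WhileStmt"]))
```

Fixing the vocabulary keeps the vector length constant across code segments, which a classifier such as a random forest requires.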
Step 205: extract target words from the code segment to be annotated according to a preset variable word division rule;
The embodiments of the present invention also need to obtain several text features of the code segment, whose purpose is to describe the distribution of the text in the code segment, the meaning of the words, and the context information of the words. Before the text feature vector is obtained, the code segment to be annotated must be preprocessed to obtain the target words.
In a specific implementation, the target words can be extracted from the code segment to be annotated according to the preset variable word division rule.
Optionally, the preset variable word division rule includes a camel case rule or an underscore rule, and step 205 may include the following sub-steps:
extracting words from the code segment to be annotated;
determining words to be processed from the words using the camel case rule or the underscore rule;
deleting preset stop words from the words to be processed to obtain words to be extracted;
extracting the stems of the words to be extracted to generate the target words.
In a specific implementation, code authors tend to name variables by joining several English words in camel case or with underscores. Words can therefore first be extracted from the code segment to be annotated, generally by splitting on delimiters such as spaces, brackets or semicolons, and then divided using the camel case rule or the underscore rule to determine the words to be processed. Preset stop words, that is, function words with little actual meaning such as "the", "is", "at" and "on", are then deleted to obtain the words to be extracted. Because words with the same stem may appear in different forms, the stems of the words to be extracted can also be extracted, which reduces the number of distinct words and yields the target words.
In addition, to facilitate subsequent operations, all target words can be converted to lowercase; the embodiments of the present invention do not limit this.
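The preprocessing of step 205 can be sketched as follows. The sketch makes several simplifying assumptions: identifiers are found with a regular expression rather than by splitting on delimiters, lowercasing is applied before stop-word removal, the stop-word list contains only the four examples named above, and `stem` is a naive suffix stripper standing in for a real stemming algorithm. All function names are hypothetical.

```python
import re

STOP_WORDS = {"the", "is", "at", "on"}  # only the examples named in the text
SUFFIXES = ("ing", "ed", "s")           # naive stand-in for a real stemmer


def split_identifier(identifier):
    """Split an identifier on camel case boundaries and underscores."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", identifier)
    return spaced.replace("_", " ").split()


def stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def target_words(segment):
    """Extract lowercased, stop-word-free, stemmed words from a code segment."""
    words = []
    for identifier in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", segment):
        for part in split_identifier(identifier):
            word = part.lower()
            if word not in STOP_WORDS:
                words.append(stem(word))
    return words


if __name__ == "__main__":
    print(target_words("int isOpened = parse_items(theList);"))
```

The camel case pattern used here does not split runs of capitals such as `HTTPServer`; handling acronyms would need an additional rule.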
Step 206: determine the text feature vector corresponding to the code segment to be annotated according to statistics on the target words;
Further, the target words include multiple words to be counted, and step 206 may include the following sub-steps:
counting the total number of the words to be counted to determine the total word quantity feature;
counting the number of distinct words among the words to be counted to determine the word type quantity feature;
calculating the variance of the occurrence frequencies of the distinct words to be counted to determine the word variance feature;
counting the proportion of non-English words among the words to be counted to determine the non-word proportion feature;
converting each of the words to be counted into a word feature using a second preset word feature conversion model;
computing a weighted sum of all the word features, with the occurrence frequency of each word to be counted as its weight, to generate the total word feature;
concatenating the total word quantity feature, the word type quantity feature, the word variance feature, the non-word proportion feature and the total word feature to generate the text feature vector corresponding to the code segment to be annotated.
In this embodiment of the present invention, the text feature vector comprises: the total word quantity feature, the word type quantity feature (that is, after deduplicating identical words), the word variance feature, the non-word proportion feature, and the total word feature obtained by weighting the words. The total word quantity, word type quantity and word variance features measure the distribution of the target words in the code to be annotated. Non-English words are variable names "made up" by the code author that have no actual meaning, so the non-word proportion feature measures how interpretable the words are. The total word feature carries the context information of the words.
In a specific implementation, the total number of all words to be counted is counted to determine the total word quantity feature; the number of distinct words is counted to determine the word type quantity feature; the variance of the occurrence frequencies of the distinct words is calculated to determine the word variance feature; the proportion of non-English words among the words to be counted is counted to determine the non-word proportion feature; the second preset word feature conversion model converts each word to be counted into a word feature; all the word features are summed with the occurrence frequency of each word as its weight to generate the total word feature; and the five features are concatenated to generate the text feature vector corresponding to the code segment to be annotated.
The above statistics can likewise be computed in parallel, and the second preset word feature conversion model may be a Word2Vec model or the like; the embodiments of the present invention do not limit this.
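The sub-steps of step 206 can be sketched in the same style. The dictionary used to decide whether a word is English and the two-dimensional `WORD_EMBEDDINGS` table are hypothetical placeholders, and the word variance is interpreted here as the population variance of the per-word occurrence counts, which is one reading of the description rather than a confirmed definition.

```python
from collections import Counter
from statistics import pvariance

ENGLISH_WORDS = {"open", "file", "read"}  # hypothetical dictionary
WORD_EMBEDDINGS = {                       # hypothetical Word2Vec stand-in
    "open": [1.0, 0.0],
    "file": [0.0, 1.0],
    "read": [1.0, 1.0],
}


def text_feature_vector(words, dictionary=ENGLISH_WORDS,
                        embeddings=WORD_EMBEDDINGS):
    """Concatenate total count, distinct count, count variance, non-word
    proportion and the frequency-weighted sum of word embeddings."""
    dim = len(next(iter(embeddings.values())))
    if not words:
        return [0.0] * (4 + dim)
    counts = Counter(words)
    total = len(words)
    variance = float(pvariance(list(counts.values())))
    non_word = sum(c for w, c in counts.items() if w not in dictionary) / total
    weighted = [0.0] * dim
    for word, count in counts.items():
        vector = embeddings.get(word, [0.0] * dim)  # unknown word -> zeros
        for i in range(dim):
            weighted[i] += (count / total) * vector[i]
    return [float(total), float(len(counts)), variance, non_word] + weighted


if __name__ == "__main__":
    print(text_feature_vector(["open", "open", "open", "xq"]))
```

In practice the input would be the target words produced by the preprocessing of step 205.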
Step 207: determine the structural feature vector corresponding to the code segment to be annotated according to the complexity of the code segment to be annotated;
In an example of the present invention, step 207 may include the following sub-steps:
counting the number of lines of code in the code segment to be annotated to determine the line number feature;
counting the number of nested statements in the code segment to be annotated to determine the nested statement number feature;
counting the maximum nesting level in the code segment to be annotated to determine the maximum nesting level feature;
counting the number of formal parameters in the code segment to be annotated to determine the formal parameter number feature;
counting the number of words in the longest statement, the number of API calls, the number of variables, the number of identifiers and the number of internal comments of the code segment to be annotated to obtain the corresponding features, and concatenating them in sequence to generate a comprehensive feature;
concatenating the line number feature, the nested statement number feature, the maximum nesting level feature, the formal parameter number feature and the comprehensive feature to generate the structural feature vector corresponding to the code segment to be annotated.
The embodiments of the present invention also need to determine the structural features of the code segment to be annotated in order to determine its complexity. The structural features of a function code segment are features that describe how the segment is composed. They are: the number of lines of code, the number of nested statements, the maximum nesting level, whether the function has formal parameters and how many, the number of words in the longest statement, the number of API calls, the number of variables, the number of identifiers, the number of internal comments, and so on. A nested statement is a statement that contains another statement; FIG. 3 shows an example in which a for loop statement contains an if conditional statement. The complexity of a code segment is positively correlated with its importance.
It is worth mentioning that the above structural features can be computed in parallel and need not all be used; in practice, a technician can choose among them flexibly according to whether they describe the complexity of the code segment. The embodiments of the present invention do not limit this.
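Three of the structural features listed above (line count, maximum nesting level and number of formal parameters) can be approximated with simple text scanning, as sketched below. This is a rough illustration only: it assumes brace-delimited Java code, ignores braces inside strings or comments, does not see bodies written without braces, and takes the first parenthesized group as the parameter list. A parser-based implementation would be more reliable.

```python
import re


def structural_features(segment: str):
    """Return [line count, maximum nesting level, formal parameter count]."""
    line_count = sum(1 for line in segment.splitlines() if line.strip())
    depth = max_depth = 0
    for ch in segment:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    # Depth 1 is the function body itself, so it is not counted as nesting.
    nesting = max(max_depth - 1, 0)
    signature = re.search(r"\(([^)]*)\)", segment)
    params = [p for p in signature.group(1).split(",") if p.strip()] if signature else []
    return [line_count, nesting, len(params)]


if __name__ == "__main__":
    java = (
        "int sum(int[] xs, int limit) {\n"
        "    int total = 0;\n"
        "    for (int x : xs) {\n"
        "        if (x < limit) {\n"
        "            total += x;\n"
        "        }\n"
        "    }\n"
        "    return total;\n"
        "}\n"
    )
    print(structural_features(java))
```

The example mirrors the kind of nesting described for FIG. 3: a for loop that contains an if statement gives a nesting level of 2 inside the function body.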
Step 208: determine the relational feature vector corresponding to the code segment to be annotated based on the number of function calls of the code segment to be annotated.
In another example of the present invention, the number of function calls includes the number of additional functions called and the number of times called, and step 208 may include the following sub-steps:
traversing the code file to be annotated to which the code segment to be annotated belongs to determine the number of additional functions called by the code segment and the number of times it is called;
concatenating the number of additional functions called and the number of times called to generate the relational feature vector corresponding to the code segment to be annotated.
In this embodiment of the present invention, the relationships between different function code segments also need to be analyzed. A representation similar to a social network or a directed graph can be used, in which the out-degree of the current code segment to be annotated defines the number of additional functions it calls and the in-degree defines the number of times it is called. The entire code file to be annotated, to which the code segment belongs, is then scanned to determine the out-degree and in-degree of the code segment, that is, the number of additional functions called and the number of times called, and these two features are concatenated to generate the relational feature vector corresponding to the code segment to be annotated.
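The two relational features can be sketched over a call graph held as a dictionary. The representation is an assumption made for illustration: each key is a function name in the code file and each value lists the names it calls, with one entry per call site, so repeated calls count more than once toward the number of times called. Recursive self-calls are excluded from both counts.

```python
def relation_features(call_graph, function_name):
    """Return [number of additional functions called, number of times called]."""
    callees = call_graph.get(function_name, [])
    # Distinct other functions that this function calls.
    extra_called = len(set(callees) - {function_name})
    # Call sites in other functions that target this function.
    times_called = sum(
        targets.count(function_name)
        for caller, targets in call_graph.items()
        if caller != function_name
    )
    return [extra_called, times_called]


if __name__ == "__main__":
    graph = {"a": ["b", "c", "b"], "b": ["c"], "c": []}
    for name in graph:
        print(name, relation_features(graph, name))
```

Building `call_graph` itself requires scanning every function in the file, which corresponds to the traversal described above.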
It is worth mentioning that steps 202-204 as a whole, steps 205-206 as a whole, step 207 and step 208 can be executed in parallel with one another.
Step 209: input the first feature vector into the target classification model and output an importance judgment result for the code segment to be annotated;
In a specific implementation, step 209 may include the following sub-steps:
inputting the first feature vector into the target classification model;
when the output of the target classification model is the first preset label, determining that the importance judgment result for the code segment to be annotated is important;
when the output of the target classification model is the second preset label, determining that the importance judgment result for the code segment to be annotated is not important.
In this embodiment of the present invention, after the first feature vector is obtained, it can be input into the target classification model, which makes a comprehensive judgment based on the first feature vector to produce the model output. When the output of the target classification model is the first preset label, the importance judgment result for the code segment to be annotated is determined to be important; when the output is the second preset label, the importance judgment result is determined to be not important.
The present invention generates the target classification model through a preset classification model training process. When a code segment to be annotated is received, a first feature vector is extracted from it and input into the target classification model to obtain the importance judgment result for the code segment. This solves the technical problem that existing techniques for predicting code comment positions treat code text as unstructured plain text and make little use of multi-dimensional features, which lowers the accuracy of the predicted comment positions, prevents reasonable suggestions from being given to developers, and reduces the efficiency of software development and maintenance personnel. The importance of the code to be annotated can thus be judged efficiently, which helps optimize the commenting behavior of software development and maintenance personnel and keeps the amount of code comments within a more suitable range.
Referring to FIG. 4, FIG. 4 shows a flowchart of the steps of a method for judging the importance of a code fragment provided by an embodiment of the present invention.
Collect Java project code files from a software repository; divide the project code files into units of functions; label each function code fragment that has a function header comment as 1, and otherwise as 0; extract the required features from the function code fragments. The required features include syntax features, text features, structural features and relational features.
The syntax feature extraction flow includes: preparing to extract syntax features; converting the function code fragment into an abstract syntax tree; obtaining the statement type information of the function code fragment from the abstract syntax tree; counting the frequency distribution of the different statement types; counting the number of different statement types; counting the total number of statements; converting the statement types into features and computing their sum weighted by frequency of occurrence; and concatenating the results to obtain the syntax feature.
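The syntax flow above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the statement type names have already been read off the abstract syntax tree by some Java parser, and the names `syntax_features`, `type_vocab`, `type_embedding` and `dim` are hypothetical. The lookup table `type_embedding` merely stands in for the "first preset word feature conversion model", which this section does not specify.

```python
from collections import Counter


def syntax_features(statement_types, type_vocab, type_embedding, dim):
    """Build a syntax feature vector from AST statement type names.

    statement_types: list of type names, one per statement (assumed input).
    type_vocab:      fixed, ordered list of type names (hypothetical).
    type_embedding:  dict mapping a type name to a vector of length dim
                     (stand-in for the unspecified conversion model).
    """
    counts = Counter(statement_types)
    total = len(statement_types)

    # Frequency distribution over the fixed vocabulary of statement types.
    distribution = [counts[t] / total if total else 0.0 for t in type_vocab]

    # Sum of the per-type vectors, weighted by relative frequency.
    weighted = [0.0] * dim
    for stmt_type, count in counts.items():
        vector = type_embedding.get(stmt_type, [0.0] * dim)
        for i in range(dim):
            weighted[i] += (count / total) * vector[i]

    # Concatenate: distribution, number of distinct types,
    # total number of statements, weighted type feature.
    return distribution + [float(len(counts)), float(total)] + weighted
```

Using a fixed vocabulary keeps the vector length identical for every function, which a downstream classifier generally requires.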
The text feature extraction flow includes: preparing to extract text features; extracting the words in the function code fragment; splitting variable names according to the camel case rule or the underscore rule; converting all words to lowercase; deleting stop words; stemming; counting the total number of words; counting the number of distinct words used; calculating the variance of the occurrence frequencies of the different words; counting the proportion of non-English words; converting the words into features and computing their sum weighted by frequency of occurrence; and concatenating the results to obtain the text feature.
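A rough sketch of the word preparation and counting steps in the text flow is given below. Everything named here is an assumption for illustration: `STOP_WORDS` and `ENGLISH_STEMS` are tiny hypothetical word lists, and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer. The weighted word embedding step is omitted because it mirrors the syntax case.

```python
import re
from collections import Counter
from statistics import pvariance

STOP_WORDS = {"the", "a", "of", "to", "get", "set"}      # hypothetical list
ENGLISH_STEMS = {"parse", "response", "token", "list"}   # hypothetical dictionary


def split_identifier(identifier):
    """Split on underscores and camel case boundaries, then lowercase."""
    words = []
    for part in identifier.split("_"):
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def crude_stem(word):
    """Naive suffix stripping; a stand-in for a proper stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def text_features(identifiers):
    words = [
        crude_stem(w)
        for ident in identifiers
        for w in split_identifier(ident)
        if w not in STOP_WORDS
    ]
    if not words:
        return [0.0, 0.0, 0.0, 0.0]
    counts = Counter(words)
    non_english = sum(1 for w in words if w not in ENGLISH_STEMS)
    return [
        float(len(words)),                        # total number of words
        float(len(counts)),                       # number of distinct words
        float(pvariance(list(counts.values()))),  # variance of word counts
        non_english / len(words),                 # proportion of non-English words
    ]
```

For example, `split_identifier("parseHTTPResponse")` yields `["parse", "http", "response"]`, so acronyms inside camel case names are kept as single words.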
The structural feature extraction flow includes: preparing to extract structural features; counting the number of lines of code in the function code fragment; counting the number of nested statements in the function code fragment; counting the maximum nesting depth of the statements in the function code fragment; counting the number of formal parameters in the function code fragment; counting the number of words in the longest statement of the function code fragment; counting the number of API calls in the function code fragment; counting the number of variables in the function code fragment; counting the number of identifiers in the function code fragment; counting the number of internal comments in the function code fragment; and concatenating the results to obtain the structural feature.
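Some of the structural counts above can be approximated with simple text heuristics, as in the sketch below. This is only an illustration under stated assumptions: a faithful implementation would read these values from the abstract syntax tree, whereas this code counts braces and comment markers directly, so braces or `//` inside string literals and commas inside generic parameter types would be miscounted. The function name and the choice of four features are hypothetical.

```python
import re


def structural_features(java_function):
    """Approximate a few structural counts for one Java function's source text."""
    # Number of non-blank lines of code.
    lines = [ln for ln in java_function.splitlines() if ln.strip()]

    # Maximum brace depth; the function body itself is depth 1.
    depth = max_depth = 0
    for ch in java_function:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1

    # Formal parameters: text between the first parentheses of the header.
    header = java_function[: java_function.index("{")]
    params = re.search(r"\((.*?)\)", header, re.S).group(1).strip()
    n_params = len(params.split(",")) if params else 0

    # Internal comments: line comments and block comment openers.
    n_comments = len(re.findall(r"//|/\*", java_function))

    return [len(lines), max(max_depth - 1, 0), n_params, n_comments]
```

The remaining counts in the flow (API calls, variables, identifiers, longest statement) need real parsing and are left out of this sketch.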
The relational feature extraction flow includes: preparing to extract relational features; defining the concepts of out-degree and in-degree; counting the out-degree and in-degree of each function; and concatenating them to obtain the relational feature.
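The out-degree and in-degree counts can be illustrated with a small call graph. The sketch assumes, consistently with the later description of the relational feature, that out-degree is the number of other functions a function calls and in-degree is the number of times it is called by other functions in the same file; the `call_graph` input format and the function name are hypothetical.

```python
def relation_features(call_graph):
    """Map each function name to [out_degree, in_degree].

    call_graph: dict mapping a function name to the list of function names
    it calls, one entry per call site (assumed to be extracted beforehand).
    """
    # Out-degree: number of distinct other functions called.
    out_degree = {
        func: len(set(callees) - {func}) for func, callees in call_graph.items()
    }

    # In-degree: number of calls received from other functions in the file.
    in_degree = {func: 0 for func in call_graph}
    for caller, callees in call_graph.items():
        for callee in callees:
            if callee in in_degree and callee != caller:
                in_degree[callee] += 1

    return {func: [out_degree[func], in_degree[func]] for func in call_graph}
```

Calls to functions outside the file (for example library methods) are ignored for the in-degree because they do not appear as keys of `call_graph`.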
After the above four extraction flows have been executed in parallel, their outputs are concatenated to obtain the final feature of each function code fragment; a classification model is trained on these features together with the labels; and the target classification model is obtained, where each round of training outputs a result indicating whether the function code fragment is important.
When a new function code fragment is received, the final feature of that function code fragment is extracted; the final feature is input into the target classification model, which outputs whether the function code fragment is important.
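The training and prediction steps can be sketched end to end as follows. This section does not name a specific classifier, so the `NearestCentroid` class here is purely a stand-in chosen to keep the example self-contained; any classifier trained on the concatenated features and the 1/0 labels would fill the same role. All names are hypothetical.

```python
def concat_features(syntax, text, structure, relation):
    """Concatenate the four feature groups into one final feature vector."""
    return list(syntax) + list(text) + list(structure) + list(relation)


class NearestCentroid:
    """Toy classifier: predicts the label whose mean vector is closest."""

    def fit(self, vectors, labels):
        self.centroids = {}
        for label in set(labels):
            rows = [v for v, y in zip(vectors, labels) if y == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, vector):
        def squared_distance(centroid):
            return sum((a - b) ** 2 for a, b in zip(vector, centroid))

        return min(self.centroids, key=lambda lbl: squared_distance(self.centroids[lbl]))


def judge_importance(model, vector):
    """Label 1 (function had a header comment in training) means important."""
    return "important" if model.predict(vector) == 1 else "not important"
```

In use, each training function contributes one vector from `concat_features` and one label, and `judge_importance` is then applied to the vector of a newly received function.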
Referring to FIG. 5, FIG. 5 is a structural block diagram of an apparatus for judging the importance of a code fragment provided by an embodiment of the present invention.
The present invention also provides an apparatus for judging the importance of a code fragment, including:
a code fragment receiving module 501, configured to receive a code fragment to be annotated;
a first feature vector extraction module 502, configured to extract a first feature vector of the code fragment to be annotated;
an importance output module 503, configured to input the first feature vector into the target classification model and output an importance judgment result for the code fragment to be annotated;
wherein the target classification model is generated through a preset classification model training process.
Optionally, the classification model training module includes:
an annotated code file receiving submodule, configured to obtain annotated code files from a preset software repository;
a file division submodule, configured to divide the annotated code files into units of functions to generate a plurality of training code fragments;
a first label setting submodule, configured to set a first preset label for each training code fragment that has a comment of a preset type;
a second label setting submodule, configured to set a second preset label for each training code fragment that does not have a comment of the preset type;
a second feature vector extraction submodule, configured to extract a second feature vector of each training code fragment;
a classification model training submodule, configured to train a preset initial classification model with the plurality of second feature vectors to obtain the target classification model.
Optionally, the first feature vector includes a syntax feature vector, a text feature vector, a structural feature vector and a relational feature vector, and the first feature vector extraction module 502 includes:
a conversion submodule, configured to convert the code fragment to be annotated into an abstract syntax tree;
a statement type information extraction submodule, configured to extract the statement type information of the code fragment to be annotated from the abstract syntax tree;
a syntax feature vector determination submodule, configured to determine the syntax feature vector corresponding to the code fragment to be annotated according to the statistics of the statement type information;
a target word extraction submodule, configured to extract target words from the code fragment to be annotated according to a preset variable name splitting rule;
a text feature vector determination submodule, configured to determine the text feature vector corresponding to the code fragment to be annotated according to the statistics of the target words;
a structural feature vector determination submodule, configured to determine the structural feature vector corresponding to the code fragment to be annotated according to the complexity of the code fragment to be annotated;
a relational feature vector determination submodule, configured to determine the relational feature vector corresponding to the code fragment to be annotated based on the number of function calls of the code fragment to be annotated.
Optionally, the statement type information includes the occurrence frequency, number and frequency distribution of multiple statement types, and the syntax feature vector determination submodule includes:
a statement frequency distribution feature determination unit, configured to count the frequency distribution of the multiple statement types and determine a statement frequency distribution feature;
a statement type count feature determination unit, configured to count the number of the multiple statement types and determine a statement type count feature;
a total statement count feature determination unit, configured to count the total number of statements corresponding to the multiple statement types and determine a total statement count feature;
a statement type feature conversion unit, configured to convert each of the multiple statement types into a statement type feature using a first preset word feature conversion model;
a total statement type feature determination unit, configured to compute a weighted sum of the statement type features, using the occurrence frequencies as weights, to determine a total statement type feature;
a syntax feature vector generation unit, configured to concatenate the statement frequency distribution feature, the statement type count feature, the total statement count feature and the total statement type feature to generate the syntax feature vector corresponding to the code fragment to be annotated.
Optionally, the preset variable name splitting rule includes a camel case rule or an underscore rule, and the target word extraction submodule includes:
a word extraction unit, configured to extract words from the code fragment to be annotated;
a to-be-processed word determination unit, configured to determine to-be-processed words from the extracted words using the camel case rule or the underscore rule;
a to-be-extracted word determination unit, configured to delete preset stop words from the to-be-processed words to obtain to-be-extracted words;
a target word determination unit, configured to extract the stems of the to-be-extracted words to generate the target words.
Optionally, the target words include a plurality of words to be counted, and the text feature vector determination submodule includes:
a total word count feature determination unit, configured to count the total number of the words to be counted and determine a total word count feature;
a distinct word count feature determination unit, configured to count the number of distinct words among the words to be counted and determine a distinct word count feature;
a word variance feature determination unit, configured to calculate the variance of the occurrence frequencies of the distinct words among the words to be counted and determine a word variance feature;
a non-word ratio feature determination unit, configured to count the proportion of non-English words among the words to be counted and determine a non-word ratio feature;
a word feature conversion unit, configured to convert each of the words to be counted into a word feature using a second preset word feature conversion model;
a total word feature generation unit, configured to compute a weighted sum of all the word features, using the occurrence frequency of each word to be counted as its weight, to generate a total word feature;
a text feature vector determination unit, configured to concatenate the total word count feature, the distinct word count feature, the word variance feature, the non-word ratio feature and the total word feature to generate the text feature vector corresponding to the code fragment to be annotated.
Optionally, the structural feature vector determination submodule includes:
a line count feature determination unit, configured to count the number of lines of code in the code fragment to be annotated and determine a line count feature;
a nested statement count feature determination unit, configured to count the number of nested statements in the code fragment to be annotated and determine a nested statement count feature;
a maximum nesting depth feature determination unit, configured to determine the maximum nesting depth in the code fragment to be annotated and determine a maximum nesting depth feature;
a formal parameter count feature determination unit, configured to count the number of formal parameters in the code fragment to be annotated and determine a formal parameter count feature;
a comprehensive feature determination unit, configured to determine, for the code fragment to be annotated, a feature for the number of words in its longest statement, a feature for its number of API calls, a feature for its number of variables, a feature for its number of identifiers and a feature for its number of internal comments, and to concatenate them in that order to generate a comprehensive feature;
a structural feature vector generation unit, configured to concatenate the line count feature, the nested statement count feature, the maximum nesting depth feature, the formal parameter count feature and the comprehensive feature to generate the structural feature vector corresponding to the code fragment to be annotated.
Optionally, the number of function calls includes the number of additional functions called and the number of times the function is called, and the relational feature vector determination submodule includes:
a function call count determination unit, configured to traverse the to-be-annotated code file to which the code fragment to be annotated belongs and determine, for the code fragment to be annotated, the number of additional functions called and the number of times it is called;
a relational feature vector generation unit, configured to concatenate the number of additional functions called and the number of times called to generate the relational feature vector corresponding to the code fragment to be annotated.
Optionally, the importance output module 503 includes:
a feature vector input submodule, configured to input the first feature vector into the target classification model;
an importance determination submodule, configured to determine that the importance judgment result for the code fragment to be annotated is important when the output of the target classification model is the first preset label;
an importance negation submodule, configured to determine that the importance judgment result for the code fragment to be annotated is not important when the output of the target classification model is the second preset label.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses or units, and may be electrical, mechanical or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

  1. A method for judging the importance of a code fragment, comprising:
    receiving a code fragment to be annotated;
    extracting a first feature vector of the code fragment to be annotated;
    inputting the first feature vector into the target classification model, and outputting an importance judgment result for the code fragment to be annotated;
    wherein the target classification model is generated through a preset classification model training process.
  2. The method according to claim 1, wherein the classification model training process comprises:
    obtaining annotated code files from a preset software repository;
    dividing the annotated code files into units of functions to generate a plurality of training code fragments;
    setting a first preset label for each training code fragment that has a comment of a preset type;
    setting a second preset label for each training code fragment that does not have a comment of the preset type;
    extracting a second feature vector of each training code fragment;
    training a preset initial classification model with the plurality of second feature vectors to obtain the target classification model.
  3. The method for judging the importance of a code fragment according to claim 1, wherein the first feature vector comprises a syntax feature vector, a text feature vector, a structural feature vector and a relational feature vector, and the step of extracting the first feature vector of the code fragment to be annotated comprises:
    converting the code fragment to be annotated into an abstract syntax tree;
    extracting the statement type information of the code fragment to be annotated from the abstract syntax tree;
    determining the syntax feature vector corresponding to the code fragment to be annotated according to the statistics of the statement type information;
    extracting target words from the code fragment to be annotated according to a preset variable name splitting rule;
    determining the text feature vector corresponding to the code fragment to be annotated according to the statistics of the target words;
    determining the structural feature vector corresponding to the code fragment to be annotated according to the complexity of the code fragment to be annotated;
    determining the relational feature vector corresponding to the code fragment to be annotated based on the number of function calls of the code fragment to be annotated.
  4. The method for judging the importance of a code fragment according to claim 3, wherein the statement type information comprises the occurrence frequency, number and frequency distribution of multiple statement types, and the step of determining the syntax feature vector corresponding to the code fragment to be annotated according to the statistics of the statement type information comprises:
    counting the frequency distribution of the multiple statement types to determine a statement frequency distribution feature;
    counting the number of the multiple statement types to determine a statement type count feature;
    counting the total number of statements corresponding to the multiple statement types to determine a total statement count feature;
    converting each of the multiple statement types into a statement type feature using a first preset word feature conversion model;
    computing a weighted sum of the statement type features, using the occurrence frequencies as weights, to determine a total statement type feature;
    concatenating the statement frequency distribution feature, the statement type count feature, the total statement count feature and the total statement type feature to generate the syntax feature vector corresponding to the code fragment to be annotated.
  5. The method for judging the importance of a code fragment according to claim 3, wherein the preset variable name splitting rule comprises a camel case rule or an underscore rule, and the step of extracting target words from the code fragment to be annotated according to the preset variable name splitting rule comprises:
    extracting words from the code fragment to be annotated;
    determining to-be-processed words from the extracted words using the camel case rule or the underscore rule;
    deleting preset stop words from the to-be-processed words to obtain to-be-extracted words;
    extracting the stems of the to-be-extracted words to generate the target words.
  6. The method for judging the importance of a code fragment according to claim 3 or 5, wherein the target words comprise a plurality of words to be counted, and the step of determining the text feature vector corresponding to the code fragment to be annotated according to the statistics of the target words comprises:
    counting the total number of the words to be counted to determine a total word count feature;
    counting the number of distinct words among the words to be counted to determine a distinct word count feature;
    calculating the variance of the occurrence frequencies of the distinct words among the words to be counted to determine a word variance feature;
    counting the proportion of non-English words among the words to be counted to determine a non-word ratio feature;
    converting each of the words to be counted into a word feature using a second preset word feature conversion model;
    computing a weighted sum of all the word features, using the occurrence frequency of each word to be counted as its weight, to generate a total word feature;
    concatenating the total word count feature, the distinct word count feature, the word variance feature, the non-word ratio feature and the total word feature to generate the text feature vector corresponding to the code fragment to be annotated.
  7. The method for judging the importance of a code fragment according to claim 3, wherein the step of determining the structural feature vector corresponding to the code fragment to be annotated according to the complexity of the code fragment to be annotated comprises:
    counting the number of lines of code in the code fragment to be annotated to determine a line count feature;
    counting the number of nested statements in the code fragment to be annotated to determine a nested statement count feature;
    determining the maximum nesting depth in the code fragment to be annotated to determine a maximum nesting depth feature;
    counting the number of formal parameters in the code fragment to be annotated to determine a formal parameter count feature;
    determining, for the code fragment to be annotated, a feature for the number of words in its longest statement, a feature for its number of API calls, a feature for its number of variables, a feature for its number of identifiers and a feature for its number of internal comments, and concatenating them in that order to generate a comprehensive feature;
    concatenating the line count feature, the nested statement count feature, the maximum nesting depth feature, the formal parameter count feature and the comprehensive feature to generate the structural feature vector corresponding to the code fragment to be annotated.
  8. The method for judging the importance of a code fragment according to claim 3, wherein the number of function calls comprises the number of additional functions called and the number of times the function is called, and the step of determining the relational feature vector corresponding to the code fragment to be annotated based on the number of function calls of the code fragment to be annotated comprises:
    traversing the to-be-annotated code file to which the code fragment to be annotated belongs, and determining, for the code fragment to be annotated, the number of additional functions called and the number of times it is called;
    concatenating the number of additional functions called and the number of times called to generate the relational feature vector corresponding to the code fragment to be annotated.
  9. The method for judging the importance of a code fragment according to claim 2, wherein the step of inputting the first feature vector into the target classification model and outputting the importance judgment result for the code fragment to be annotated comprises:
    inputting the first feature vector into the target classification model;
    when the output of the target classification model is the first preset label, determining that the importance judgment result for the code fragment to be annotated is important;
    when the output of the target classification model is the second preset label, determining that the importance judgment result for the code fragment to be annotated is not important.
  10. An apparatus for judging the importance of a code fragment, comprising:
    a code fragment receiving module, configured to receive a code fragment to be annotated;
    a first feature vector extraction module, configured to extract a first feature vector of the code fragment to be annotated;
    an importance output module, configured to input the first feature vector into the target classification model and output an importance judgment result for the code fragment to be annotated;
    wherein the target classification model is generated by a preset classification model training module.
PCT/CN2021/081731 2020-12-07 2021-03-19 Method and apparatus for determining importance of code segment WO2022121146A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011418126.2A CN112417852B (en) 2020-12-07 2020-12-07 Method and device for judging importance of code segment
CN202011418126.2 2020-12-07

Publications (1)

Publication Number Publication Date
WO2022121146A1 (en) 2022-06-16

Family

ID=74775399

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/081731 WO2022121146A1 (en) 2020-12-07 2021-03-19 Method and apparatus for determining importance of code segment

Country Status (2)

Country Link
CN (1) CN112417852B (en)
WO (1) WO2022121146A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116302043A (en) * 2023-05-25 2023-06-23 深圳市明源云科技有限公司 Code maintenance problem detection method and device, electronic equipment and readable storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417852B (en) * 2020-12-07 2022-01-25 中山大学 Method and device for judging importance of code segment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021410A (en) * 2016-05-12 2016-10-12 中国科学院软件研究所 Source code annotation quality evaluation method based on machine learning
CN107943514A (en) * 2017-11-01 2018-04-20 北京大学 The method for digging and system of core code element in a kind of software document
CN108170468A (en) * 2017-12-28 2018-06-15 中山大学 The method and its system of a kind of automatic detection annotation and code consistency
CN108734215A (en) * 2018-05-21 2018-11-02 上海戎磐网络科技有限公司 Software classification method and device
CN109213520A (en) * 2018-09-08 2019-01-15 中山大学 A kind of annotation point recommended method and system based on Recognition with Recurrent Neural Network
US20190197119A1 (en) * 2017-12-21 2019-06-27 Facebook, Inc. Language-agnostic understanding
CN111104159A (en) * 2019-12-19 2020-05-05 南京邮电大学 Annotation positioning method based on program analysis and neural network
CN112417852A (en) * 2020-12-07 2021-02-26 中山大学 Method and device for judging importance of code segment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5867088B2 (en) * 2012-01-05 2016-02-24 富士電機株式会社 Software creation support apparatus and program for embedded devices
CN103488518A (en) * 2013-09-11 2014-01-01 上海镜月信息科技有限公司 Code highlighting method using code importance as basis
CN107870853A (en) * 2016-09-27 2018-04-03 北京京东尚科信息技术有限公司 The method and device of test program code path coverage
CN108491208A (en) * 2018-01-31 2018-09-04 中山大学 A kind of code annotation sorting technique based on neural network model
CN108804323A (en) * 2018-06-06 2018-11-13 中国平安人寿保险股份有限公司 Code quality monitoring method, equipment and storage medium
CN109615020A (en) * 2018-12-25 2019-04-12 深圳前海微众银行股份有限公司 Characteristic analysis method, device, equipment and medium based on machine learning model
CN109753286A (en) * 2018-12-28 2019-05-14 四川新网银行股份有限公司 A method of the code method based on functional label counts its call number
CN109656615A (en) * 2018-12-28 2019-04-19 四川新网银行股份有限公司 A method of permission early warning is carried out based on code method significance level
CN110908709B (en) * 2019-11-25 2023-05-02 中山大学 Code submission annotation prediction method based on code modification key class judgment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN YUANTAO; TAO JIAJUN; WANG JIN; LIAO ZHUOFAN; XIONG JIE; WANG LEI: "The Image Annotation Method by Convolutional Features from Intermediate Layer of Deep Learning Based on Internet of Things", 2019 15TH INTERNATIONAL CONFERENCE ON MOBILE AD-HOC AND SENSOR NETWORKS (MSN), IEEE, 11 December 2019 (2019-12-11), pages 315 - 320, XP033756578, DOI: 10.1109/MSN48538.2019.00066 *
HUANG YUAN; JIA NAN; ZHOU QIANG; CHEN XIANG-PING; XIONG YING-FEI; LUO XIAO-NAN: "Method Combining Structural and Semantic Features to Support Code Commenting Decision", JOURNAL OF SOFTWARE, vol. 29, no. 8, 13 March 2018 (2018-03-13), pages 2226 - 2242, XP055940807, ISSN: 1000-9825, DOI: 10.13328/j.cnki.jos.005528 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116302043A (en) * 2023-05-25 2023-06-23 深圳市明源云科技有限公司 Code maintenance problem detection method and device, electronic equipment and readable storage medium
CN116302043B (en) * 2023-05-25 2023-10-10 深圳市明源云科技有限公司 Code maintenance problem detection method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN112417852B (en) 2022-01-25
CN112417852A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
Umer et al. CNN-based automatic prioritization of bug reports
CN108874878A (en) A kind of building system and method for knowledge mapping
Vlas et al. Two rule-based natural language strategies for requirements discovery and classification in open source software development projects
CN108959418A (en) Character relation extraction method and device, computer device and computer readable storage medium
Vlas et al. A rule-based natural language technique for requirements discovery and classification in open-source software development projects
CN110175585B (en) Automatic correcting system and method for simple answer questions
CN109857846B (en) Method and device for matching user question and knowledge point
WO2022121146A1 (en) Method and apparatus for determining importance of code segment
US20220414463A1 (en) Automated troubleshooter
CN111124487A (en) Code clone detection method and device and electronic equipment
Cabrio et al. Abstract dialectical frameworks for text exploration
CN114217766A (en) Semi-automatic demand extraction method based on pre-training language fine-tuning and dependency characteristics
Du et al. SemCluster: a semi-supervised clustering tool for crowdsourced test reports with deep image understanding
Zhang et al. A textcnn based approach for multi-label text classification of power fault data
Vineetha et al. A multinomial naïve Bayes classifier for identifying actors and use cases from software requirement specification documents
Zhang et al. An Accurate Identifier Renaming Prediction and Suggestion Approach
Jadallah et al. CATE: CAusality Tree Extractor from Natural Language Requirements
Kramer et al. Improvement of a naive Bayes sentiment classifier using MRS-based features
Sawant et al. Deriving requirements model from textual use cases
US20230111052A1 (en) Self-learning annotations to generate rules to be utilized by rule-based system
CN114638225A (en) Automatic keyword extraction method based on scientific and technological literature graph network
CN113779256A (en) File auditing method and system
Praveena et al. Chunking based malayalam paraphrase identification using unfolding recursive autoencoders
CN111966579A (en) Self-adaptive text input generation method based on natural language processing and machine learning
Kuttiyapillai et al. Improved text analysis approach for predicting effects of nutrient on human health using machine learning techniques

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21901863

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21901863

Country of ref document: EP

Kind code of ref document: A1