CN112051986A

CN112051986A - Code search recommendation device and method based on open source knowledge

Info

Publication number: CN112051986A
Application number: CN202010872148.XA
Authority: CN
Inventors: 王璐; 李青山; 曹壮; 罗文龙; 吕文琪; 李�昊; 张河
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2020-08-26
Filing date: 2020-08-26
Publication date: 2020-12-08
Anticipated expiration: 2040-08-26
Also published as: CN112051986B

Abstract

The invention discloses a code search recommendation device and method based on open source knowledge. The device comprises a code segment generation module, a code feature extraction module, a text preprocessing module and a code search module. The method comprises the following steps: generating a code segment library; extracting the text features of the code segments; generating the topic features of the code segments; generating the structural features of the code segments; generating the development features of the code segments; establishing a search index; preprocessing the query sentence text; calculating the similarity between the code features and the query sentence; and completing the recommendation according to a comprehensive score. The method uses open source knowledge to construct a code feature measurement system, extracts features of the code segments from multiple angles, obtains a comprehensive score for each code segment through similarity calculation and weighted calculation, and completes the code search recommendation, thereby broadening the angles from which code is measured and improving the accuracy of code search recommendation.

Description

Code search recommendation device and method based on open source knowledge

Technical Field

The invention belongs to the technical field of software engineering, and further relates to a code search recommendation device and method based on open source knowledge within the fields of intelligent software development, artificial intelligence and data mining. The invention is suited to engineering tasks that require efficient development of high-quality software: it analyzes and applies the knowledge held in massive open source communities, measures and searches code segments accurately, and thereby improves the accuracy of code search recommendation results.

Background

As software grows in scale and functional complexity, the code itself becomes larger, more complex and more changeable. Meanwhile, the continuously evolving and growing nature of modern software requires developers to produce high-quality code efficiently to meet changing requirements. Code reuse through code search is an important way to improve software development efficiency. In general, code search technology mines code feature information, builds an index to match against the query text, and retrieves code that satisfies specific constraints. Code search recommendation based on open source knowledge works in two parts. On the one hand, it mines the community features of open source code and the code reuse needs of software developers to extract community knowledge about the code, thereby expanding and extracting code features, building an index and forming a candidate code resource library. On the other hand, it ranks the candidate code segments by weighing the similarity between the query text and the code feature text, and recommends the top-ranked segments to the software developer.

The patent document "A code search method and system" (publication number: CN106294786A; application number: CN201610665959.6; application date: August 12, 2016), filed by Beijing Creative Industry and Knowledge Information Technology Limited, provides a code search method and system. The method disclosed in that application comprises: performing a primary search of the code based on the search words entered by the user; re-ranking the result set of the primary search according to one or more of relevance, code quality, query intention and user feedback; and returning the re-ranked result set to the user as the search result. The system disclosed in that application comprises: a search module for performing the primary search of the code based on the search words entered by the user; a sorting module for re-ranking the result set of the primary search according to one or more of relevance, code quality, query intention and user feedback; and a return module for returning the re-ranked result set to the user as the search result. According to that application, its embodiments improve the effect of code search and provide developers with the results that best meet their needs and have the best quality. However, the method has the following disadvantage: although it considers factors such as the user's query intention and code quality, it still follows the traditional idea of treating software code as plain text and performs the search with a single similarity match. It does not fully consider the characteristics of code, and this lack of depth in code measurement makes the search results inaccurate.

The patent document "A code search method based on constraint solving" (publication number: CN107992324A; application number: CN201711405834.0; application date: December 22, 2017), filed by Nanjing University, discloses a code search method with the following steps: step one, acquiring open source projects from an open source community; step two, analyzing the source code with JPF and JDT and converting it into SSA form; step three, analyzing the SSA-form code with JDT and converting it into constraints; step four, putting the source code and the generated constraints into one-to-one correspondence to construct a code-constraint library; step five, constructing a code search system to help users search for code. The method applies constraint solving to the code search problem and handles loop statements and class member variables. According to that application, this overcomes the shortcomings of source code analysis in existing methods and improves search accuracy, so that programmers can find the code they need during development and use it for reference or reuse, improving development efficiency and quality. However, the method still has the following defect: although open source projects are taken from the open source community as input, the massive code knowledge in the community is discarded. The generated constraints are therefore not comprehensive enough, which leads to incomplete matching of query words and inaccurate search results.

The patent document "A code search method based on function similarity matching" (publication number: CN110716749A; application number: CN201910828507.9; application date: September 3, 2019), filed by Southeast University, addresses the case where the input is a natural language query. It mines function information from the code in a code library and measures the functional similarity between a code segment and the query sentence through two features of the code segment, its function annotations and its function APIs. Combining these with the method name and method body of the code segment, it assigns different weights according to the importance of each feature, calculates a total similarity score between the query and the code segment, and returns the search results to the user ranked from the highest score to the lowest. Because the functional information contained in the source code is mined and weighted by importance, the matching precision is higher. However, the method still has the following defects: the code is measured from a single angle, namely its functional information, and only two features, the annotations and the function APIs of the code segment, are considered. In addition, knowledge such as the user's query intention is not fully used during the search, so the search results may not match the query words.

In summary, existing code search recommendation methods still treat software code as plain text, so their measurement of code features has many shortcomings. They also ignore the query intention and the abundant knowledge of open source communities during the search, which makes it difficult to support rapid development and accurate search.

Summary of the Invention

The invention aims to overcome the defects of the prior art by providing a code search recommendation device and method based on open source knowledge. It addresses the following problems of the prior art: the characteristics of code are not fully considered, so code is measured too shallowly; the massive code knowledge in open source communities is discarded, so the generated constraints are not comprehensive enough; code is measured from a single angle; and knowledge such as the user's query intention is not fully used during the search.

The idea for realizing the purpose of the invention is as follows: open source code and information are collected from open source communities and processed to generate a code segment library; the features of each code segment are extracted along four dimensions, namely text features, structural features, topic features and development features; a mapping between each code segment and its features is established to generate a code segment-feature index library; the query sentence is processed with a text preprocessing method and used as the input of the code search recommendation; and the code segments are ranked and recommended to the querying user according to their comprehensive scores.

The code search recommending device comprises a code segment generating module, a code feature extracting module, a text preprocessing module and a code searching module, wherein,

the code segment generation module is used for dividing the code of at least 100,000 open source projects, taking the attributes of the code in each project as the granularity, to obtain at least 1.5 million code segments that form a code segment library.

The code feature extraction module is used for: representing all text information in each code segment as a tree by using an abstract syntax tree (AST), in which each node represents one feature type in the code segment, to obtain all text features of each code segment, and forming a text feature set from the text features of all code segments; analyzing the description document of each code segment to obtain the latent semantic structure in the description document, and counting the occurrence frequency of the words in each description document; classifying and then reducing the dimension of each latent semantic structure by using a partition-based clustering method, and forming a topic attribute set from the topic attributes obtained after classification and those obtained after dimension reduction; taking the frequency with which the words highly associated with each topic in the topic attribute set appear in the description document of the code segment as the probability of that topic, to obtain the probability distribution of the topic attribute set; traversing all topics in the topic attribute set, and taking all topics whose probability is greater than a given threshold as the topic features of the code segment, the given threshold being the minimum probability value at which the resulting topics accurately describe the topic attributes of the code segment; forming a topic feature set from the topic features of all code segments; selecting, by using a feature word selection algorithm, all words that represent the code framework from the description document of each code segment to form the framework feature words of the code segment, and forming the structural features of the code segment from its feature words and its number of lines; forming a structural feature set from the structural features of all code segments; crawling the utilization rate and maintenance logs of each code segment in the open source community by using crawler technology; traversing each log in the maintenance logs, marking the logs that contain a modification operation, calculating the proportion of marked logs among the maintenance logs, and taking that proportion as the activity of the code segment; forming the development features of the code segment from its utilization rate and activity; forming a development feature set from the development features of all code segments; establishing a mapping between each code segment in the code segment library and its text features, topic features, structural features and development features to construct a code segment-feature index library; and forming a feature set from the text feature set, the topic feature set, the structural feature set and the development feature set.

The text preprocessing module is used for: splitting the long word strings in the query sentence text entered by the user according to their semantic content by using camel-case word segmentation, to generate a word group; removing the stop words from the word group by using a case-based word segmentation algorithm, to generate a short word group; extracting the word stems in the word group by using the Porter Stemmer algorithm, to generate a stem group; extracting the action words in the word group by using the Stanford Parser tool, to generate an action word group; and forming the terms of the query sentence from the short word group, the stem group and the action word group.

The code search module is used for: calculating the similarity between the features of each code segment in the feature set and the query sentence; computing a weighted sum of the feature similarity, code length, code activity and code utilization rate to obtain a comprehensive score for each code segment; and sorting the query results from the highest comprehensive score to the lowest and recommending the top-ranked code segments to the querying user.

The code search recommendation method comprises the following specific steps:

(1) generating a code segment library:

the code segment generation module divides the code of at least 100,000 open source projects, taking the attributes of the code in each project as the granularity, to obtain at least 1.5 million code segments that form a code segment library;

(2) extracting text features of each code segment:

the code feature extraction module uses an abstract syntax tree (AST) to represent all text information in each code segment as a tree, in which each node represents one feature type in the code segment, to obtain all text features of each code segment, and forms a text feature set from the text features of all code segments;

(3) generating the topic features of each code segment by unsupervised learning:

(3a) the code feature extraction module analyzes the description document of each code segment to obtain the latent semantic structure in the description document, and counts the occurrence frequency of the words in each description document;

(3b) the code feature extraction module classifies and then reduces the dimension of each latent semantic structure by using a partition-based clustering method, and forms a topic attribute set from the topic attributes obtained after classification and those obtained after dimension reduction;

(3c) the code feature extraction module takes the frequency with which the words highly associated with each topic in the topic attribute set appear in the description document of the code segment as the probability of that topic, to obtain the probability distribution of the topic attribute set;

(3d) the code feature extraction module traverses all topics in the topic attribute set and takes all topics whose probability is greater than a given threshold as the topic features of the code segment; the given threshold is the minimum probability value at which the resulting topics accurately describe the topic attributes of the code segment;

(3e) the code feature extraction module forms a topic feature set from the topic features of all code segments;

(4) generating the structural features of each code segment:

the code feature extraction module selects, by using a feature word selection algorithm, all words that represent the code framework from the description document of each code segment to form the framework feature words of the code segment, and forms the structural features of the code segment from its feature words and its number of lines; the structural features of all code segments form a structural feature set;

(5) generating the development features of each code segment:

(5a) the code feature extraction module crawls the utilization rate and maintenance logs of each code segment in the open source community by using crawler technology;

(5b) the code feature extraction module traverses each log in the maintenance logs, marks the logs that contain a modification operation, calculates the proportion of marked logs among the maintenance logs, and takes that proportion as the activity of the code segment;

(5c) the code feature extraction module forms the development features of each code segment from its utilization rate and activity;

(5d) the code feature extraction module forms a development feature set from the development features of all code segments;

(6) establishing a search index:

(6a) the code feature extraction module establishes a mapping between each code segment in the code segment library and its text features, topic features, structural features and development features, and constructs a code segment-feature index library;

(6b) the code feature extraction module forms a feature set from the text feature set, the topic feature set, the structural feature set and the development feature set obtained in steps (2), (3), (4) and (5);

(7) preprocessing the query sentence text:

(7a) the text preprocessing module uses camel-case word segmentation to split the long word strings in the query sentence text entered by the user according to their semantic content, generating a word group;

(7b) the text preprocessing module removes the stop words from the word group by using a case-based word segmentation algorithm, generating a short word group;

(7c) the text preprocessing module extracts the word stems in the word group by using the Porter Stemmer algorithm, generating a stem group;

(7d) the text preprocessing module extracts the action words in the word group by using the Stanford Parser tool, generating an action word group;

(7e) the text preprocessing module forms the terms of the query sentence from the short word group, the stem group and the action word group;

(8) calculating the similarity between the code features and the query sentence and the comprehensive score of each code segment to complete the recommendation:

(8a) the code search module calculates the similarity between the features of each code segment in the feature set and the query sentence by using a BM25-based text similarity calculation method;

(8b) the code search module computes a weighted sum of the feature similarity, code length, code activity and code utilization rate to obtain a comprehensive score for each code segment;

(8c) the code search module sorts the query results from the highest comprehensive score to the lowest and recommends the top-ranked code segments to the querying user.

Compared with the prior art, the invention has the following advantages:

First, the code segment generation module of the device takes as input the source code of 100,000 open source projects in open source communities, segments the code of each project according to its structure, method and function attributes, and collects the activity, utilization rate and modification log information of the source code in the open source community. This overcomes the prior art's limited use of knowledge from the open source community in which the code resides, so that the code search recommendation better matches the user's intention and is more accurate.

Second, the method of the invention applies several code segment processing methods and query sentence preprocessing methods. It extracts features of the code segments from four aspects (text features, topic features, structural features and development features), which expands the dimensions along which code is measured, and it calculates the similarity between the code segment features and the query sentence with a BM25-based text similarity calculation method. This overcomes the inaccurate search results caused by incomplete code feature measurement in the prior art, so that the code is measured more completely and the search results are more accurate.

Description of the drawings:

FIG. 1 is a block diagram of the apparatus of the present invention;

FIG. 2 is a flow chart of the method of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

The structure of the device of the present invention will be further described with reference to fig. 1.

The device comprises a code segment generation module, a code feature extraction module, a text preprocessing module and a code search module.

The implementation steps of the method of the invention are further described with reference to fig. 2.

Step 1, generating a code segment library.

The code segment generation module divides the code of at least 100,000 open source projects, taking the attributes of the code in each project as the granularity, to obtain at least 1.5 million code segments that form a code segment library. The attributes of the code comprise structure, method and function.
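As a rough illustration of segmenting code at the granularity of its own attributes, the sketch below splits one Python source file into class-level and function-level segments with the standard `ast` module. The choice of Python, the function name `split_into_segments` and the record fields are assumptions made for illustration; the text does not specify the language of the open source projects or the segmentation tooling.

```python
import ast


def split_into_segments(source: str) -> list[dict]:
    """Split one source file into class- and function-level code segments."""
    tree = ast.parse(source)
    segments = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segments.append({
                "name": node.name,
                "kind": type(node).__name__,          # structural attribute
                "code": ast.get_source_segment(source, node),
                "lines": node.end_lineno - node.lineno + 1,
            })
    return segments


example = '''
class Stack:
    def push(self, x):
        self.items.append(x)

def read_file(path):
    with open(path) as f:
        return f.read()
'''

segments = split_into_segments(example)
for seg in segments:
    print(seg["kind"], seg["name"], seg["lines"])
```

Nested definitions (here `push` inside `Stack`) are emitted as their own segments in addition to the enclosing class; whether the described device does the same is not stated.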

Step 2, extracting the text features of each code segment.

The code feature extraction module uses an abstract syntax tree (AST) to represent all text information in each code segment as a tree, in which each node represents one feature type in the code segment, to obtain all text features of each code segment, and forms a text feature set from the text features of all code segments.
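A minimal sketch of AST-based feature extraction follows, assuming Python source and treating each AST node type as one feature type. Counting node types with `collections.Counter` is an illustrative choice, not something the text prescribes.

```python
import ast
from collections import Counter


def text_features(segment: str) -> Counter:
    """Count AST node types; each node type stands for one feature type."""
    tree = ast.parse(segment)
    return Counter(type(node).__name__ for node in ast.walk(tree))


features = text_features("def add(a, b):\n    return a + b\n")
print(features["FunctionDef"], features["arg"], features["BinOp"])
```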

Step 3, generating the topic features of each code segment by unsupervised learning.

The code feature extraction module analyzes the description document of each code segment to obtain the latent semantic structure in the description document, and counts the occurrence frequency of the words in each description document.

The code feature extraction module classifies and then reduces the dimension of each latent semantic structure by using a partition-based clustering method, and forms a topic attribute set from the topic attributes obtained after classification and those obtained after dimension reduction.

The code feature extraction module takes the frequency with which the words highly associated with each topic in the topic attribute set appear in the description document of the code segment as the probability of that topic, to obtain the probability distribution of the topic attribute set.

The code feature extraction module traverses all topics in the topic attribute set and takes all topics whose probability is greater than a given threshold as the topic features of the code segment; the given threshold is the minimum probability value at which the resulting topics accurately describe the topic attributes of the code segment.

The code feature extraction module forms a topic feature set from the topic features of all code segments.
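The thresholding part of this step can be sketched as follows. The topic-to-word table `TOPIC_WORDS`, the threshold value 0.3 and the normalization of word counts into probabilities are all assumptions for illustration; in the described method the topics come from the clustering step, and the threshold is chosen so that the retained topics describe the segment accurately.

```python
import re
from collections import Counter

# Hypothetical topic -> associated words table (would come from clustering).
TOPIC_WORDS = {
    "io": {"file", "read", "write", "stream"},
    "network": {"http", "socket", "request"},
    "sorting": {"sort", "order", "compare"},
}


def topic_distribution(doc: str) -> dict[str, float]:
    """Turn counts of topic-associated words into a probability per topic."""
    counts = Counter(re.findall(r"[a-z]+", doc.lower()))
    hits = {t: sum(counts[w] for w in ws) for t, ws in TOPIC_WORDS.items()}
    total = sum(hits.values())
    if total == 0:
        return {t: 0.0 for t in TOPIC_WORDS}
    return {t: h / total for t, h in hits.items()}


def topic_features(doc: str, threshold: float = 0.3) -> list[str]:
    """Keep every topic whose probability exceeds the threshold."""
    return [t for t, p in topic_distribution(doc).items() if p > threshold]


doc = "Read a file and write the sorted lines to an output stream."
print(topic_features(doc))
```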

Step 4, generating the structural features of each code segment.

The code feature extraction module selects, by using a feature word selection algorithm, all words that represent the code framework from the description document of each code segment to form the framework feature words of the code segment, and forms the structural features of the code segment from its feature words and its number of lines. The structural features of all code segments form a structural feature set.

Step 5, generating the development features of each code segment.

The code feature extraction module crawls the utilization rate and maintenance logs of each code segment in the open source community by using crawler technology.

The code feature extraction module traverses each log in the maintenance logs, marks the logs that contain a modification operation, calculates the proportion of marked logs among the maintenance logs, and takes that proportion as the activity of the code segment.

The code feature extraction module forms the development features of each code segment from its utilization rate and activity.

The code feature extraction module forms a development feature set from the development features of all code segments.
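The activity calculation can be sketched as below. How a log is recognized as containing a modification operation is not specified in the text, so the keyword list `MODIFY_KEYWORDS` is an assumption, and the utilization rate value in the example is a placeholder rather than crawled data.

```python
from dataclasses import dataclass

# Assumed markers of a modification operation in a maintenance log entry.
MODIFY_KEYWORDS = ("fix", "modify", "update", "refactor", "change")


@dataclass
class DevelopmentFeatures:
    usage_rate: float  # utilization rate crawled from the community
    activity: float    # share of maintenance logs that record a modification


def activity(maintenance_logs: list[str]) -> float:
    """Proportion of maintenance logs that contain a modification operation."""
    if not maintenance_logs:
        return 0.0
    marked = [log for log in maintenance_logs
              if any(k in log.lower() for k in MODIFY_KEYWORDS)]
    return len(marked) / len(maintenance_logs)


logs = ["Fix null check in parser", "Add README",
        "Update build script", "Initial import"]
dev = DevelopmentFeatures(usage_rate=0.42, activity=activity(logs))  # 0.42 is a placeholder
print(dev)
```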

Step 6, establishing a search index.

The code feature extraction module establishes a mapping between each code segment in the code segment library and its text features, topic features, structural features and development features, and constructs a code segment-feature index library.

The code feature extraction module forms a feature set from the text feature set, the topic feature set, the structural feature set and the development feature set obtained in steps 2, 3, 4 and 5.
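One possible in-memory shape for the code segment-feature index library is sketched below. The class names, field names and segment identifier are hypothetical; the text only requires a mapping from each code segment to its four kinds of features.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentFeatures:
    text: dict = field(default_factory=dict)         # AST-derived text features
    topics: list = field(default_factory=list)       # topic features
    structure: dict = field(default_factory=dict)    # framework words, line count
    development: dict = field(default_factory=dict)  # utilization rate, activity


class SegmentFeatureIndex:
    """Maps each code segment identifier to its features."""

    def __init__(self):
        self._entries = {}

    def add(self, segment_id: str, features: SegmentFeatures) -> None:
        self._entries[segment_id] = features

    def get(self, segment_id: str) -> SegmentFeatures:
        return self._entries[segment_id]

    def feature_set(self, dimension: str) -> dict:
        """Collect one feature dimension across all indexed segments."""
        return {sid: getattr(f, dimension) for sid, f in self._entries.items()}


index = SegmentFeatureIndex()
index.add("seg-001", SegmentFeatures(
    text={"FunctionDef": 1},
    topics=["io"],
    structure={"words": ["stream"], "lines": 12},
    development={"usage_rate": 0.4, "activity": 0.5},
))
print(index.feature_set("topics"))
```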

Step 7, preprocessing the query sentence text.

The text preprocessing module uses camel-case word segmentation to split the long word strings in the query sentence text entered by the user according to their semantic content, generating a word group.

The text preprocessing module removes the stop words from the word group by using a case-based word segmentation algorithm, generating a short word group.

The text preprocessing module extracts the word stems in the word group by using the Porter Stemmer algorithm, generating a stem group.

The text preprocessing module extracts the action words in the word group by using the Stanford Parser tool, generating an action word group.

The text preprocessing module forms the terms of the query sentence from the short word group, the stem group and the action word group.
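The first two preprocessing operations can be sketched with the standard library alone. The regular expression used for camel-case splitting and the small `STOP_WORDS` list are assumptions for illustration; stemming and action word extraction are left out here because they rely on the Porter Stemmer algorithm and the Stanford Parser tool named in the text.

```python
import re

# Small illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "how", "from", "and"}


def split_camel_case(token: str) -> list[str]:
    """Split identifiers such as 'readFile' or 'HTTPServer' into words."""
    return re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", token)


def preprocess(query: str) -> list[str]:
    """Camel-case split, lowercase, and drop stop words."""
    words = []
    for token in re.findall(r"[A-Za-z0-9]+", query):
        words.extend(part.lower() for part in split_camel_case(token))
    return [w for w in words if w not in STOP_WORDS]


terms = preprocess("how to readFile from HTTPServer")
print(terms)
```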

Step 8, calculating the similarity between the code features and the query sentence and the comprehensive score of each code segment to complete the recommendation.

The code search module calculates the similarity between the features of each code segment in the feature set and the query sentence by using a BM25-based text similarity calculation method according to the following formula:
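The formula itself is not reproduced in this text. A standard Okapi BM25 form that is consistent with the symbols defined for it would read as follows; this is a reconstruction under that assumption, not the formula as filed, and |D_n| here stands for the total number of feature terms of the n-th code segment.

```latex
\mathrm{sim}(D_n, q) = \sum_{t_i \in q \cap D_n} \mathrm{IDF}(t_i) \cdot
\frac{tf(t_i, D_n) \cdot (k_1 + 1)}
     {tf(t_i, D_n) + k_1 \cdot \left(1 - b + b \cdot \frac{|D_n|}{avgdl}\right)}
```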

where sim(·) denotes the similarity operation; D_n denotes the features of the n-th code segment in the feature set D; q denotes the query sentence; t_i denotes the i-th term in the query sentence, i = 1, 2, 3, …, m, where m is the total number of terms in the query sentence; ∈ denotes set membership; ∩ denotes intersection; IDF(·) denotes the inverse document frequency operation; · denotes multiplication; tf(·) denotes the term frequency operation; |D_n| denotes the total number of feature terms of the n-th code segment in the feature set D; avgdl denotes the average number of feature terms per code segment in the feature set; and the parameters k₁ and b control, respectively, the scaling of the term frequency and the normalization by the number of feature terms.
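A minimal BM25 scoring sketch over tokenized feature terms is given below. The parameter values k1 = 1.2 and b = 0.75 and the smoothed IDF variant are common choices assumed here; the text does not state which values or IDF form the method uses.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of feature terms) against the query terms."""
    if not docs:
        return []
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in set(query_terms):
            if tf[t] == 0:
                continue  # only terms in both the query and the document count
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["read", "file", "line", "stream"],
    ["sort", "list", "compare"],
    ["open", "socket", "http", "request"],
]
scores = bm25_scores(["read", "file"], docs)
print(scores)
```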

The code search module computes a weighted sum of the feature similarity, code length, code activity and code utilization rate according to the following formula to obtain a comprehensive score for each code segment:

s_j = e₁ × sm_j + e₂ × l_j + e₃ × a_j + e₄ × se_j

where s_j denotes the comprehensive score of the j-th code segment; e₁ is a weight with a value of 0.6; e₂ is a weight with a value of 0.1; e₃ is a weight with a value of 0.15; e₄ is a weight with a value of 0.15; sm_j denotes the feature similarity of the j-th code segment; l_j denotes the code length of the j-th code segment; a_j denotes the activity of the j-th code segment; and se_j denotes the utilization rate of the j-th code segment.

The code search module sorts the query results from the highest comprehensive score to the lowest and recommends the top-ranked code segments to the querying user.
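The weighted scoring and ranking can be sketched as follows, using the weights 0.6, 0.1, 0.15 and 0.15 given above. The sketch assumes all four inputs have already been scaled to the range 0 to 1; the text does not say how code length or similarity is normalized, and the candidate values are made up for illustration.

```python
# Weights e1..e4 as stated in the description.
WEIGHTS = {"similarity": 0.6, "length": 0.1, "activity": 0.15, "usage": 0.15}


def composite_score(similarity, length, activity, usage):
    """s_j = e1*sm_j + e2*l_j + e3*a_j + e4*se_j (inputs assumed in [0, 1])."""
    return (WEIGHTS["similarity"] * similarity
            + WEIGHTS["length"] * length
            + WEIGHTS["activity"] * activity
            + WEIGHTS["usage"] * usage)


# Illustrative candidates: (similarity, length, activity, utilization rate).
candidates = {
    "segment_a": (0.9, 0.2, 0.5, 0.4),
    "segment_b": (0.6, 0.8, 0.9, 0.9),
    "segment_c": (0.3, 0.5, 0.1, 0.2),
}

ranked = sorted(candidates,
                key=lambda name: composite_score(*candidates[name]),
                reverse=True)
print(ranked)
```

With these values a segment with lower similarity but high activity and utilization rate can outrank a more similar one, which is the effect the development features are meant to have on the ranking.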

Claims

1. A code search recommendation device based on open source community characteristics comprises a code segment generation module, a code characteristic extraction module, a text preprocessing module and a code search module, wherein,

the code segment generation module is used for segmenting the code of at least 100,000 open source projects, taking the attributes of the code in each project as the granularity, to obtain at least 1.5 million code segments that form a code segment library;

the code feature extraction module is used for: representing all text information in each code segment as a tree by using an abstract syntax tree (AST), in which each node represents one feature type in the code segment, to obtain all text features of each code segment, and forming a text feature set from the text features of all code segments; analyzing the description document of each code segment to obtain the latent semantic structure in the description document, and counting the occurrence frequency of the words in each description document; classifying and then reducing the dimension of each latent semantic structure by using a partition-based clustering method, and forming a topic attribute set from the topic attributes obtained after classification and those obtained after dimension reduction; taking the frequency with which the words highly associated with each topic in the topic attribute set appear in the description document of the code segment as the probability of that topic, to obtain the probability distribution of the topic attribute set; traversing all topics in the topic attribute set, and taking all topics whose probability is greater than a given threshold as the topic features of the code segment, the given threshold being the minimum probability value at which the resulting topics accurately describe the topic attributes of the code segment; forming a topic feature set from the topic features of all code segments; selecting, by using a feature word selection algorithm, all words that represent the code framework from the description document of each code segment to form the framework feature words of the code segment, and forming the structural features of the code segment from its feature words and its number of lines; forming a structural feature set from the structural features of all code segments; crawling the utilization rate and maintenance logs of each code segment in the open source community by using crawler technology; traversing each log in the maintenance logs, marking the logs that contain a modification operation, calculating the proportion of marked logs among the maintenance logs, and taking that proportion as the activity of the code segment; forming the development features of the code segment from its utilization rate and activity; forming a development feature set from the development features of all code segments; establishing a mapping between each code segment in the code segment library and its text features, topic features, structural features and development features to construct a code segment-feature index library; and forming a feature set from the text feature set, the topic feature set, the structural feature set and the development feature set;

the text preprocessing module is used for splitting the long word strings in a query sentence text input by a user according to their semantic content by using camel-case word segmentation to generate a word group; removing the stop words from the word group by using a case-word segmentation algorithm to generate a short word group; extracting the word stems from the word group by using the Porter Stemmer algorithm to generate a stem group; extracting the action words from the word group by using the Stanford Parser tool to generate an action word group; forming the terms of the query sentence from the short word group, the stem group and the action word group;
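The first two preprocessing operations named above (camel-case splitting and stop-word removal) can be illustrated with a minimal Python sketch. The stop-word list and the function names are hypothetical, and stemming and action-word extraction are left out because they rely on external tools (Porter Stemmer, Stanford Parser).

```python
import re

# Hypothetical stop-word list for illustration; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "for", "and", "how", "from"}


def split_camel_case(token):
    """Split an identifier such as 'readFileToString' into lowercase words."""
    # Insert a space before an uppercase letter that follows a lowercase letter
    # or digit, and between an acronym and a following capitalised word.
    spaced = re.sub(
        r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", token
    )
    return [word.lower() for word in spaced.split()]


def preprocess_query(query):
    """Return the query's words after camel-case splitting and stop-word removal."""
    words = []
    for token in re.findall(r"[A-Za-z0-9]+", query):
        words.extend(split_camel_case(token))
    return [word for word in words if word not in STOP_WORDS]


print(preprocess_query("how to readFileToString from XMLParser"))
```

In this sketch the remaining words play the role of the short word group; a stemmer would then be applied to each word to obtain the stem group.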

2. A code search recommendation method based on open source community features, characterized in that development features are added to the text, subject and structural features extracted from open source code so as to measure code segments more comprehensively, and candidate code segments are ranked by calculating the similarity between the query text and the code feature texts to complete the code search; the method comprises the following specific steps:

(1) generating a code segment library:

(2) extracting text features of each code segment:

(4) generating structural features of each code segment:

(5) generating development features of each code segment:

(6) establishing a search index:

(7) preprocessing the query sentence text:
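As an illustration of two of the feature computations described for the code feature extraction module, the following Python sketch computes the activity of a code segment as the share of maintenance logs that contain a modification operation, and selects the subjects whose probability exceeds a threshold. How a log is recognised as containing a modification is not specified in the text; the keyword match below, the keyword list and the function names are assumptions made for the example.

```python
def activity(maintenance_logs, modification_keywords=("modify", "update", "fix", "change")):
    """Share of maintenance log entries that mention a modification operation.

    The keyword test is a hypothetical stand-in for the marking step.
    """
    if not maintenance_logs:
        return 0.0
    marked = sum(
        1
        for entry in maintenance_logs
        if any(keyword in entry.lower() for keyword in modification_keywords)
    )
    return marked / len(maintenance_logs)


def subject_features(subject_probabilities, threshold):
    """Keep the subjects whose probability is strictly greater than the threshold."""
    return [s for s, p in subject_probabilities.items() if p > threshold]


logs = ["Fix null check", "Add README", "Update parser", "Initial import"]
print(activity(logs))
print(subject_features({"io": 0.6, "network": 0.1, "parsing": 0.3}, 0.25))
```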

3. The open source community feature-based code search recommendation method according to claim 2, wherein the attributes of the code itself in step (1) include structure, method and function.
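Segmentation at the granularity of the code's own attributes might be sketched as follows. The example uses Python's standard `ast` module on Python source as a stand-in; the claims do not state which programming languages or parser are used, and the function name `split_into_segments` is hypothetical.

```python
import ast


def split_into_segments(source):
    """Return one (name, kind, code) tuple per top-level function or class."""
    tree = ast.parse(source)
    segments = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue  # imports, assignments, etc. are not treated as segments here
        segments.append((node.name, kind, ast.get_source_segment(source, node)))
    return segments


example = (
    "import os\n\n"
    "def add(a, b):\n    return a + b\n\n"
    "class Box:\n    def get(self):\n        return 1\n"
)
for name, kind, code in split_into_segments(example):
    print(kind, name, len(code.splitlines()))
```

Each resulting tuple corresponds to one entry of a code segment library; the line count printed at the end is the kind of length information the structural features draw on.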

4. The method for recommending code search based on open source community features of claim 2, wherein the similarity between the feature of each code segment in the feature set and the query sentence in step (8a) is calculated by the following formula:

wherein sim(·) represents the similarity operation, D_n represents the features of the nth code segment in the feature set D, q represents the query sentence, t_i represents the ith term in the query sentence, i = 1, 2, 3, ..., m, m represents the total number of terms of the query sentence, ∈ represents the belongs-to symbol, ∩ represents the intersection symbol, IDF(·) represents the operation of calculating the inverse document frequency, · represents the multiplication operation, tf(·) represents the operation of calculating the term frequency, |D_n| denotes the total number of feature terms of the nth code segment in the feature set D, avgdl denotes the average number of feature terms of the code segments in the feature set, and the parameters k₁ and b are used to control the range of term frequencies and the number of terms in the feature set, respectively.
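The formula itself is not reproduced in this text. The symbols it defines (IDF, tf, avgdl, k₁, b, and a sum over the terms shared by the query and the code segment's features) match the well-known Okapi BM25 ranking function, so the following Python sketch assumes that form. The IDF variant and the default values k1 = 1.2 and b = 0.75 are common choices, not values taken from the claims.

```python
import math
from collections import Counter


def bm25_similarity(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one feature-term list against a query (assumed form)."""
    tf = Counter(doc_terms)
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs  # average number of feature terms
    score = 0.0
    # Only terms that occur in both the query and the document contribute.
    for term in set(query_terms) & set(doc_terms):
        df = sum(1 for d in corpus if term in d)
        # Smoothed IDF that stays positive (one common variant; an assumption).
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf[term] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


corpus = [["read", "file", "string"], ["parse", "xml", "file"], ["open", "socket"]]
query = ["read", "file"]
for doc in corpus:
    print(doc, round(bm25_similarity(query, doc, corpus), 3))
```

Here each inner list of `corpus` stands for the feature terms of one code segment, and `query` for the terms produced by query preprocessing.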

5. The open source community feature-based code search recommendation method according to claim 2, wherein in step (8b) the comprehensive score of each code segment is obtained by weighting the feature similarity, code length, code activity and code usage rate according to the following formula:

s_j = e₁ × sm_j + e₂ × l_j + e₃ × a_j + e₄ × se_j
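A minimal Python sketch of this weighted sum and the resulting ranking is given below. It assumes, from the order in which the claim lists the quantities, that sm_j is the feature similarity, l_j the code length, a_j the activity and se_j the usage rate, and that all four have been normalised to the range [0, 1]. The weight values and the candidate data are hypothetical.

```python
def composite_score(sm, l, a, se, weights=(0.4, 0.2, 0.2, 0.2)):
    """Weighted sum s = e1*sm + e2*l + e3*a + e4*se (weights are illustrative)."""
    e1, e2, e3, e4 = weights
    return e1 * sm + e2 * l + e3 * a + e4 * se


def rank(candidates, weights=(0.4, 0.2, 0.2, 0.2)):
    """Sort candidate code segments by composite score, highest first."""
    scored = [
        (name, composite_score(*features, weights=weights))
        for name, features in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidates: (similarity, length, activity, usage rate), all in [0, 1].
candidates = {"A": (0.9, 0.5, 0.2, 0.1), "B": (0.4, 0.5, 0.9, 0.8)}
for name, score in rank(candidates):
    print(name, round(score, 2))
```

With these illustrative weights, segment B outranks segment A despite its lower textual similarity, because its activity and usage rate are higher.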