CN112051986A - Code search recommendation device and method based on open source knowledge - Google Patents

Info

Publication number
CN112051986A
Authority
CN
China
Prior art keywords
code
code segment
feature
text
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010872148.XA
Other languages
Chinese (zh)
Other versions
CN112051986B (en)
Inventor
王璐
李青山
曹壮
罗文龙
吕文琪
李�昊
张河
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202010872148.XA priority Critical patent/CN112051986B/en
Publication of CN112051986A publication Critical patent/CN112051986A/en
Application granted granted Critical
Publication of CN112051986B publication Critical patent/CN112051986B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/20 Software design
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a code search recommendation device and method based on open source knowledge. The device comprises a code segment generation module, a code feature extraction module, a text preprocessing module and a code search module. The method comprises the following steps: generating a code segment library; extracting text features of the code segments; generating topic features of the code segments; generating structural features of the code segments; generating development features of the code segments; establishing a search index; preprocessing the query sentence text; calculating the similarity between the code features and the query sentence; and completing the recommendation according to a comprehensive score. The method uses open source knowledge to build a code feature measurement system, extracts the features of code segments from multiple angles, obtains a comprehensive score for each code segment through similarity calculation and weighted calculation, and completes code search recommendation on that basis, which broadens the angles of code measurement and improves the accuracy of code search recommendation.

Description

Code search recommendation device and method based on open source knowledge
Technical Field
The invention belongs to the technical field of software engineering, and further relates to a code search recommendation device and method based on open source knowledge in the technical fields of intelligent software development, artificial intelligence and data mining. The invention is suited to engineering tasks that require efficient development of high-quality software: it analyzes and applies the knowledge in massive open source communities, achieves accurate measurement and accurate search of code segments, and improves the accuracy of code search recommendation results.
Background
As software grows larger and its functions more complex, the software code itself becomes larger, more complex and more changeable. Meanwhile, the continuously constructed and evolving nature of modern software requires developers to produce high-quality code efficiently to meet constantly changing requirements. Code reuse through code search is an important way to improve software development efficiency. In general, code search technology mines code feature information, builds an index, matches the index against the query text, and returns code that satisfies specific constraints. Code search recommendation based on open source knowledge works in two parts. On one hand, it mines the community characteristics of open source code and the code reuse needs of software developers to extract the community knowledge around open source code, expands and extracts code features, builds an index and forms a candidate code repository. On the other hand, it ranks the candidate code segments by measuring the similarity between the query text and the code feature text, and recommends the top-ranked segments to software developers.
The patent application "A code search method and system" (publication number: CN106294786A, application number: CN201610665959.6, application date: August 12, 2016), filed by Beijing Creative Industry and Knowledge Information Technology Limited, provides a code search method and system. The method disclosed in that application comprises: performing a primary search of the code based on the search terms input by the user; re-ranking the result set of the primary search according to one or more of relevance, code quality, query intention and user feedback; and returning the re-ranked result set to the user as the search result. The system disclosed in that application comprises: a search module for performing the primary search of the code based on the search terms input by the user; a ranking module for re-ranking the result set of the primary search according to one or more of relevance, code quality, query intention and user feedback; and a return module for returning the re-ranked result set to the user as the search result. According to that application, its embodiments effectively improve code search and give developers the results that best fit their needs and have the best quality. However, the method has the following drawback: although it considers factors such as the user's query intention and code quality, it still follows the traditional idea of treating software code simply as plain text and performs a single similarity match to carry out the search. The characteristics of code are not fully considered, so the code measurement lacks depth and the search results are inaccurate.
The patent application "A code search method based on constraint solving" (publication number: CN107992324A, application number: CN201711405834.0, application date: December 22, 2017), filed by Nanjing University, discloses a code search method with the following steps: step one, acquiring open source projects from an open source community; step two, analyzing the source code by using JPF and JDT and converting it into SSA format; step three, analyzing the SSA-format code by using JDT and converting it into constraints; step four, matching the source code with the generated constraints one to one to build a code-constraint library; step five, building a code search system to help users search for code. The method is characterized in that it solves the code search problem with constraint solving and handles loop statements and class member variables, which overcomes the shortcomings of source code analysis in existing methods, greatly improves code search accuracy, lets programmers find the code they need during development for reference or reuse, and improves software development efficiency and quality. However, the method still has the following drawback: although open source projects are taken from the open source community as input, the massive code knowledge in the open source community is discarded, so the generated constraints are not comprehensive enough, which in turn leads to incomplete matching of query terms and inaccurate search results.
The patent application "A code search method based on function similarity matching" (publication number: CN110716749A, application number: CN201910828507.9, application date: September 3, 2019), filed by Southeast University, addresses the case where the input is a natural language query. It mines functional information from the code in a code library and measures the functional similarity between a code segment and the query sentence by mining two features of the code segment, its function comments and its function APIs. Combined with the method name and method body of the code segment, it assigns different weights according to the importance of the different features, calculates the total similarity score between the query and the code segment, and returns the search results to the user ranked from high to low by score. Because the functional information contained in the source code is fully considered and mined, and weights are assigned according to importance, the matching precision is higher. However, the method still has the following drawbacks: the code is measured from a single angle, namely its functional information, and only two features of the code segment, its comments and its function APIs, are considered. In addition, knowledge such as the user's query intention is not fully used during the search, so the code search results do not match the query terms.
In summary, existing code search recommendation methods still treat software code simply as plain text, so their measurement of code features has many shortcomings, and they ignore the query intention and the abundant knowledge of open source communities during code search, which makes it difficult to support rapid development and accurate search.
Summary of the Invention
The object of the invention is to provide a code search recommendation device and method based on open source knowledge that overcome the shortcomings of the prior art, namely: the characteristics of code are not fully considered, so code measurement lacks depth; the massive code knowledge in open source communities is discarded, so the generated constraints are not comprehensive enough; the code is measured from a single angle; and knowledge such as the user's query intention is not fully used during the search.
The idea for achieving this object is as follows: collect open source code and related information from open source communities and process them to generate a code segment library; extract the features of each code segment along four dimensions (text features, structural features, topic features and development features); establish a mapping between each code segment and its features to generate a code segment-feature index library; process the query sentence with text preprocessing methods to obtain the input for code search recommendation; and compute a comprehensive score for each code segment to rank the code segments and recommend them to the querying user.
The code search recommendation device comprises a code segment generation module, a code feature extraction module, a text preprocessing module and a code search module, wherein:
The code segment generation module is used for dividing the code of at least 100,000 open source projects into segments, with the attributes of the code in each project as the granularity, to obtain at least 1.5 million code segments that form a code segment library.
The code feature extraction module is used for: representing all text information in each code segment as a tree by using an abstract syntax tree (AST), where each node of the tree represents one feature type in the code segment, to obtain all text features of each code segment, and forming a text feature set from the text features of all code segments; analyzing the description document of each code segment to obtain the latent semantic structures in the description document, and counting the frequency of occurrence of the words in each description document; classifying and then reducing the dimensionality of each latent semantic structure by using a partition-based clustering method, and forming a topic attribute set from the topic attributes obtained after classification and those obtained after dimensionality reduction; taking the frequency with which words highly relevant to each topic in the topic attribute set appear in the description document of the code segment as the probability of that topic, to obtain the probability distribution over the topic attribute set; traversing all topics in the topic attribute set, and taking all topics whose probability is greater than a given threshold as the topic features of the code segment, where the given threshold is the minimum probability value at which the resulting topics accurately describe the topic attributes of the code segment; forming a topic feature set from the topic features of all code segments; selecting, by using a feature word selection algorithm, all words that represent code frameworks from the description document of each code segment to form the framework feature words of the code segment, and forming the structural features of the code segment from its feature words and its number of lines; forming a structural feature set from the structural features of all code segments; crawling the utilization rate and maintenance logs of each code segment in the open source community by using crawler technology; traversing each log in the maintenance logs, marking the logs that contain a modification operation, calculating the proportion of marked logs among all maintenance logs, and taking that proportion as the activity of the code segment; forming the development features of the code segment from its utilization rate and activity; forming a development feature set from the development features of all code segments; establishing a mapping between each code segment in the code segment library and its text, topic, structural and development features to construct a code segment-feature index library; and forming a feature set from the text feature set, the topic feature set, the structural feature set and the development feature set.
The text preprocessing module is used for: splitting the long word strings in the query sentence text input by the user according to their semantic content by using a camel-case word segmentation method, to generate a word group; removing the stop words from the word group by using a letter-case-based word segmentation algorithm, to generate a short word group; extracting the stems of the words in the word group by using the Porter Stemmer algorithm, to generate a stem group; extracting the action words in the word group by using the Stanford Parser tool, to generate an action word group; and forming the terms of the query sentence from the short word group, the stem group and the action word group.
The code search module is used for: calculating the similarity between the features of each code segment in the feature set and the query sentence; computing a weighted combination of the feature similarities, the code length, the code activity and the code utilization rate to obtain a comprehensive score for each code segment; and sorting the query results from high to low by comprehensive score and recommending the top-ranked code segments to the querying user.
The code search recommendation method comprises the following specific steps:
(1) generating a code segment library:
the code segment generation module divides the code of at least 100,000 open source projects into segments, with the attributes of the code in each project as the granularity, to obtain at least 1.5 million code segments that form a code segment library;
(2) extracting text features of each code segment:
the code feature extraction module uses an abstract syntax tree (AST) to represent all text information in each code segment as a tree, in which each node represents one feature type in the code segment, to obtain all text features of each code segment, and forms a text feature set from the text features of all code segments;
(3) generating the topic features of each code segment by unsupervised learning:
(3a) the code feature extraction module analyzes the description document of each code segment to obtain the latent semantic structures in the description document, and counts the frequency of occurrence of the words in each description document;
(3b) the code feature extraction module classifies and then reduces the dimensionality of each latent semantic structure by using a partition-based clustering method, and forms a topic attribute set from the topic attributes obtained after classification and those obtained after dimensionality reduction;
(3c) the code feature extraction module takes the frequency with which words highly relevant to each topic in the topic attribute set appear in the description document of the code segment as the probability of that topic, to obtain the probability distribution over the topic attribute set;
(3d) the code feature extraction module traverses all topics in the topic attribute set and takes all topics whose probability is greater than a given threshold as the topic features of the code segment; the given threshold is the minimum probability value at which the resulting topics accurately describe the topic attributes of the code segment;
(3e) the code feature extraction module forms a topic feature set from the topic features of all code segments;
(4) generating structural features of each code segment:
the code feature extraction module selects, by using a feature word selection algorithm, all words that represent code frameworks from the description document of each code segment to form the framework feature words of the code segment, and forms the structural features of the code segment from its feature words and its number of lines; the structural features of all code segments form a structural feature set;
(5) generating development features of each code segment:
(5a) the code feature extraction module crawls the utilization rate and maintenance logs of each code segment in the open source community by using crawler technology;
(5b) the code feature extraction module traverses each log in the maintenance logs, marks the logs that contain a modification operation, calculates the proportion of marked logs among all maintenance logs, and takes that proportion as the activity of the code segment;
(5c) the code feature extraction module forms the development features of the code segment from its utilization rate and activity;
(5d) the code feature extraction module forms a development feature set from the development features of all code segments;
(6) establishing a search index:
(6a) the code feature extraction module establishes a mapping between each code segment in the code segment library and its text, topic, structural and development features, and constructs a code segment-feature index library;
(6b) the code feature extraction module forms a feature set from the text feature set, the topic feature set, the structural feature set and the development feature set obtained in steps (2), (3), (4) and (5);
(7) preprocessing the query sentence text:
(7a) the text preprocessing module splits the long word strings in the query sentence text input by the user according to their semantic content by using a camel-case word segmentation method, to generate a word group;
(7b) the text preprocessing module removes the stop words from the word group by using a letter-case-based word segmentation algorithm, to generate a short word group;
(7c) the text preprocessing module extracts the stems of the words in the word group by using the Porter Stemmer algorithm, to generate a stem group;
(7d) the text preprocessing module extracts the action words in the word group by using the Stanford Parser tool, to generate an action word group;
(7e) the text preprocessing module forms the terms of the query sentence from the short word group, the stem group and the action word group;
(8) calculating the similarity between the code features and the query sentence and the comprehensive score of the code segments to complete the recommendation:
(8a) the code search module calculates the similarity between the features of each code segment in the feature set and the query sentence by using a BM25-based text similarity calculation method;
(8b) the code search module computes a weighted combination of the feature similarities, the code length, the code activity and the code utilization rate to obtain a comprehensive score for each code segment;
(8c) the code search module sorts the query results from high to low by the comprehensive score of the code segments and recommends the top-ranked code segments to the querying user.
Compared with the prior art, the invention has the following advantages:
First, the code segment generation module in the device of the invention can take the source code of 100,000 open source projects from open source communities as input, can segment the code of each project according to the structure, method and function attributes of the code, and collects the activity, utilization rate and modification log information of the source code in the open source community. This overcomes the problem that the prior art makes little use of the knowledge in the open source community where the code resides, so that the code search recommendation better matches the user's intention and the accuracy of code search recommendation is improved.
Second, because the method of the invention adopts multiple code segment processing methods and query sentence preprocessing methods and extracts the features of the code segments from four aspects (text features, topic features, structural features and development features), the dimensions of code measurement are expanded. In addition, the similarity between the features of the code segments and the query sentence is calculated with a BM25-based text similarity calculation method. This overcomes the problem of inaccurate search caused by incomplete measurement of code features in the prior art, so that the invention measures code more completely and produces more accurate search results.
Description of the drawings:
FIG. 1 is a block diagram of the apparatus of the present invention;
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The structure of the device of the present invention will be further described with reference to FIG. 1.
The device comprises a code segment generation module, a code feature extraction module, a text preprocessing module and a code search module.
The code segment generation module is used for dividing the code of at least 100,000 open source projects into segments, with the attributes of the code in each project as the granularity, to obtain at least 1.5 million code segments that form a code segment library.
The code feature extraction module is used for: representing all text information in each code segment as a tree by using an abstract syntax tree (AST), where each node of the tree represents one feature type in the code segment, to obtain all text features of each code segment, and forming a text feature set from the text features of all code segments; analyzing the description document of each code segment to obtain the latent semantic structures in the description document, and counting the frequency of occurrence of the words in each description document; classifying and then reducing the dimensionality of each latent semantic structure by using a partition-based clustering method, and forming a topic attribute set from the topic attributes obtained after classification and those obtained after dimensionality reduction; taking the frequency with which words highly relevant to each topic in the topic attribute set appear in the description document of the code segment as the probability of that topic, to obtain the probability distribution over the topic attribute set; traversing all topics in the topic attribute set, and taking all topics whose probability is greater than a given threshold as the topic features of the code segment, where the given threshold is the minimum probability value at which the resulting topics accurately describe the topic attributes of the code segment; forming a topic feature set from the topic features of all code segments; selecting, by using a feature word selection algorithm, all words that represent code frameworks from the description document of each code segment to form the framework feature words of the code segment, and forming the structural features of the code segment from its feature words and its number of lines; forming a structural feature set from the structural features of all code segments; crawling the utilization rate and maintenance logs of each code segment in the open source community by using crawler technology; traversing each log in the maintenance logs, marking the logs that contain a modification operation, calculating the proportion of marked logs among all maintenance logs, and taking that proportion as the activity of the code segment; forming the development features of the code segment from its utilization rate and activity; forming a development feature set from the development features of all code segments; establishing a mapping between each code segment in the code segment library and its text, topic, structural and development features to construct a code segment-feature index library; and forming a feature set from the text feature set, the topic feature set, the structural feature set and the development feature set.
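As an illustration only, the activity measure described above (the share of maintenance logs that contain a modification operation) could be computed as in the following Python sketch. The function names `activity` and `development_features`, the keyword list used to recognize a modification, and the idea of matching on log text are all assumptions made for this example; the text does not specify how a modification operation is detected.

```python
from typing import Dict, Iterable, Sequence

# Hypothetical markers for "modification" logs; not taken from the patent.
DEFAULT_MARKERS: Sequence[str] = ("modify", "update", "fix", "change")


def activity(logs: Iterable[str], markers: Sequence[str] = DEFAULT_MARKERS) -> float:
    """Return the proportion of maintenance logs that contain a modification."""
    logs = list(logs)
    if not logs:
        return 0.0  # no logs: treat the segment as inactive
    marked = sum(
        1 for log in logs if any(marker in log.lower() for marker in markers)
    )
    return marked / len(logs)


def development_features(usage_rate: float, logs: Iterable[str]) -> Dict[str, float]:
    """Combine the utilization rate and the activity into one feature record."""
    return {"usage_rate": usage_rate, "activity": activity(logs)}
```

For example, a segment with four maintenance logs of which two mention a fix or an update would get an activity of 0.5 under these assumed markers.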
The text preprocessing module is used for: splitting the long word strings in the query sentence text input by the user according to their semantic content by using a camel-case word segmentation method, to generate a word group; removing the stop words from the word group by using a letter-case-based word segmentation algorithm, to generate a short word group; extracting the stems of the words in the word group by using the Porter Stemmer algorithm, to generate a stem group; extracting the action words in the word group by using the Stanford Parser tool, to generate an action word group; and forming the terms of the query sentence from the short word group, the stem group and the action word group.
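The first two preprocessing operations can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the stop word list, the regular expression used for camel-case splitting, and the function names `split_camel_case` and `preprocess_query` are illustrative choices, not details given in the text. Stemming with the Porter Stemmer and action word extraction with the Stanford Parser rely on external tools and are left out here.

```python
import re
from typing import FrozenSet, List

# Hypothetical stop word list, kept deliberately small for the example.
STOP_WORDS: FrozenSet[str] = frozenset(
    {"a", "an", "the", "how", "to", "in", "of", "for", "and", "or", "is", "with"}
)

# Matches, in order: an acronym followed by a capitalized word ("XML" in
# "XMLFile"), a lower-case or capitalized word, a trailing acronym, digits.
_CAMEL_PARTS = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_camel_case(token: str) -> List[str]:
    """Split an identifier such as 'readFileToString' into its words."""
    return _CAMEL_PARTS.findall(token)


def preprocess_query(text: str, stop_words: FrozenSet[str] = STOP_WORDS) -> List[str]:
    """Split long word strings, lower-case the parts and drop stop words."""
    words: List[str] = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        words.extend(part.lower() for part in split_camel_case(token))
    return [word for word in words if word not in stop_words]
```

With these assumptions, `preprocess_query("how to readFileToString in Java")` yields `["read", "file", "string", "java"]`, which would then be passed on to stemming and action word extraction.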
The code search module is used for: calculating the similarity between the features of each code segment in the feature set and the query sentence; computing a weighted combination of the feature similarities, the code length, the code activity and the code utilization rate to obtain a comprehensive score for each code segment; and sorting the query results from high to low by comprehensive score and recommending the top-ranked code segments to the querying user.
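The scoring performed by the code search module can be illustrated with a small Python sketch. The method steps name BM25 as the basis of the similarity calculation; the particular BM25 variant below (with parameters `k1` and `b` and a smoothed IDF), the weight names, and the assumption that code length, activity and utilization rate are already normalized to the range 0 to 1 are choices made for this example and are not specified in the text.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def bm25_scores(
    query_terms: Sequence[str],
    docs: Sequence[Sequence[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Score each tokenized feature text against the query terms with BM25."""
    n = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n if n else 0.0
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count documents containing each term
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def composite_score(
    similarities: Dict[str, float],
    length_score: float,
    activity: float,
    usage_rate: float,
    weights: Dict[str, float],
) -> float:
    """Weighted sum of feature similarities and the three scalar measures."""
    score = sum(weights[name] * sim for name, sim in similarities.items())
    score += weights["length"] * length_score
    score += weights["activity"] * activity
    score += weights["usage"] * usage_rate
    return score
```

Candidate segments would then be sorted by `composite_score` in descending order and the top-ranked ones returned; the weights themselves are hypothetical and would have to be tuned.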
The implementation steps of the method of the invention are further described with reference to FIG. 2.
Step 1, generating a code segment library.
The code segment generation module divides the code of at least 100,000 open source projects into segments, with the attributes of the code in each project as the granularity, to obtain at least 1.5 million code segments that form a code segment library. The attributes of the code comprise structure, method and function.
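To make the idea of segmenting code at the level of methods and structures concrete, the following Python sketch splits one source file into class-level and function-level segments. It is only an illustration: it handles Python source using the standard `ast` module, whereas the text does not name the programming languages or tools involved, and the function name `split_into_segments` and the record layout are assumptions.

```python
import ast
from typing import Dict, List


def split_into_segments(source: str) -> List[Dict[str, str]]:
    """Return one record per class or function definition found in `source`."""
    tree = ast.parse(source)
    segments = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            kind = "class"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        else:
            continue
        segments.append(
            {
                "kind": kind,
                "name": node.name,
                # Exact source text of the definition (Python 3.8+).
                "code": ast.get_source_segment(source, node) or "",
            }
        )
    return segments
```

Running this over every file of every collected project would produce the records that make up a segment library in this simplified setting.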
Step 2, extracting the text features of each code segment.
The code feature extraction module uses an abstract syntax tree (AST) to represent all text information in each code segment as a tree, in which each node represents one feature type in the code segment, to obtain all text features of each code segment, and forms a text feature set from the text features of all code segments.
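A minimal sketch of AST-based text feature extraction follows, again using Python's standard `ast` module on Python code purely for illustration. Which node types count as feature types is not stated in the text, so the choice made here (a count of every node type plus the set of identifiers) and the function name `extract_text_features` are assumptions.

```python
import ast
from collections import Counter
from typing import Any, Dict


def extract_text_features(code: str) -> Dict[str, Any]:
    """Collect node-type counts and identifiers from the AST of a code segment."""
    tree = ast.parse(code)
    node_types: Counter = Counter()
    identifiers = set()
    for node in ast.walk(tree):
        node_types[type(node).__name__] += 1
        if isinstance(node, ast.Name):
            identifiers.add(node.id)  # variable references
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            identifiers.add(node.name)  # definition names
        elif isinstance(node, ast.arg):
            identifiers.add(node.arg)  # parameter names
    return {"node_types": dict(node_types), "identifiers": sorted(identifiers)}
```

The identifiers gathered this way are the kind of text that can later be compared with the query terms.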
Step 3, generating the topic features of each code segment by unsupervised learning.
The code feature extraction module analyzes the description document of each code segment to obtain the latent semantic structure of the document, and counts the frequency of occurrence of the words in each description document.
The code feature extraction module classifies each latent semantic structure and then reduces its dimensionality by using a partition-based clustering method, and forms the topic attribute set from the topic attributes obtained after classification and those obtained after dimensionality reduction.
The code feature extraction module takes the frequency with which the words highly relevant to each topic in the topic attribute set occur in the description document of the code segment as the probability of that topic, to obtain the probability distribution over the topic attribute set.
The code feature extraction module traverses all topics in the topic attribute set, and the topics whose probability is greater than a given threshold form the topic features of the code segment; the given threshold is the minimum probability value at which the resulting topics accurately describe the topic attributes of the code segment.
The code feature extraction module combines the topic features of all code segments into the topic feature set.
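The last two operations of Step 3 (turning topic-related word frequencies into topic probabilities, then keeping topics above a threshold) can be sketched as below. This is only an illustration under stated assumptions: the topic-to-word mapping is taken as given rather than learned by clustering, probabilities are obtained by normalizing the counts across topics, and the names `topic_features` and `topic_words` are hypothetical.

```python
from collections import Counter


def topic_features(doc_tokens, topic_words, threshold):
    """Return {topic: probability} for topics whose probability exceeds threshold.

    doc_tokens:  tokens of one description document
    topic_words: {topic: set of words highly relevant to that topic} (assumed given)
    """
    counts = Counter(doc_tokens)
    # Frequency of topic-relevant words in the document, per topic
    hits = {topic: sum(counts[w] for w in words) for topic, words in topic_words.items()}
    total = sum(hits.values())
    if total == 0:
        return {}
    # Assumption: normalize the frequencies so they form a distribution
    probabilities = {topic: h / total for topic, h in hits.items()}
    return {topic: p for topic, p in probabilities.items() if p > threshold}


selected = topic_features(
    ["read", "file", "file", "socket"],
    {"io": {"read", "file"}, "network": {"socket"}},
    threshold=0.3,
)
```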
Step 4, generating the structural features of each code segment.
The code feature extraction module uses a feature word selection algorithm to select, from the description document of each code segment, all the words that denote code frameworks, which form the framework feature words of the code segment; the framework feature words of each code segment together with the number of lines of the code segment form the structural features of the code segment; the structural features of all code segments form the structural feature set.
Step 5, generating the development features of each code segment.
The code feature extraction module uses a web crawler to collect the utilization rate and the maintenance log of each code segment in the open source community.
The code feature extraction module traverses each entry in the maintenance log, marks the entries that contain a modification operation, calculates the proportion of marked entries among all entries in the maintenance log, and takes this proportion as the activity of the code segment.
The code feature extraction module combines the utilization rate and the activity of each code segment into the development features of the code segment.
The code feature extraction module combines the development features of all code segments into the development feature set.
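The activity computation in Step 5 reduces to a simple ratio, sketched below. How a log entry is recognized as containing a modification operation is not specified in the text, so the keyword match used here (`MODIFICATION_KEYWORDS`) is an assumption made only for illustration.

```python
# Assumption: an entry counts as a modification if it contains one of these words
MODIFICATION_KEYWORDS = ("modify", "fix", "update", "change")


def activity(maintenance_log):
    """Proportion of log entries that contain a modification operation."""
    if not maintenance_log:
        return 0.0
    marked = sum(
        1
        for entry in maintenance_log
        if any(keyword in entry.lower() for keyword in MODIFICATION_KEYWORDS)
    )
    return marked / len(maintenance_log)


def development_features(utilization_rate, maintenance_log):
    """Combine the crawled utilization rate with the computed activity."""
    return {"utilization_rate": utilization_rate, "activity": activity(maintenance_log)}


log = ["Fix null check", "Add README", "Update parser", "Initial import"]
dev = development_features(0.4, log)
```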
Step 6, establishing a search index.
The code feature extraction module establishes a mapping between each code segment in the code segment library and the text features, topic features, structural features and development features of that code segment, and builds the code segment-feature index library.
The code feature extraction module combines the text feature set, the topic feature set, the structural feature set and the development feature set obtained in steps 2, 3, 4 and 5 into the feature set.
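One possible shape for the code segment-feature index of Step 6 is a forward mapping from segment to features plus an inverted mapping from feature term to segments. The dictionary layout and the name `build_index` below are assumptions; the patent only requires that the mapping exist.

```python
from collections import defaultdict


def build_index(segment_features):
    """Build an inverted index {term: set of segment ids}.

    segment_features: {segment_id: {"text": [...], "topic": [...],
                                    "structure": [...], "development": {...}}}
    Development features are numeric, so they stay in the forward mapping only.
    """
    inverted = defaultdict(set)
    for segment_id, features in segment_features.items():
        for kind in ("text", "topic", "structure"):
            for term in features.get(kind, []):
                inverted[term].add(segment_id)
    return dict(inverted)


forward = {
    "seg1": {"text": ["read", "file"], "topic": ["io"], "structure": ["spring"],
             "development": {"utilization_rate": 0.4, "activity": 0.5}},
    "seg2": {"text": ["open", "socket"], "topic": ["network"], "structure": [],
             "development": {"utilization_rate": 0.1, "activity": 0.2}},
}
index = build_index(forward)
```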
Step 7, preprocessing the query sentence text.
The text preprocessing module splits the long word strings in the query sentence text input by the user according to their semantic content, using camel-case word segmentation, to generate a word group.
The text preprocessing module removes the stop words in the word group by using a case-based word segmentation algorithm to generate a short word group.
The text preprocessing module extracts the word stems in the word group by using the Porter Stemmer algorithm to generate a word stem group.
The text preprocessing module extracts the action words in the word group by using the Stanford Parser tool to generate an action word group.
The text preprocessing module forms the terms of the query sentence from the short word group, the word stem group and the action word group.
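The first two preprocessing operations of Step 7 can be sketched with the standard library alone. The stop-word list and function names are illustrative assumptions; stemming and action-word extraction are omitted here because they depend on external tools (a Porter stemmer implementation and the Stanford Parser).

```python
import re

# Assumption: a tiny illustrative stop-word list, not the one used by the patent
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "for", "how", "and", "from"}


def split_camel_case(token):
    """Split a camel-case token, keeping acronyms together.

    "readFileToString"  -> ["read", "File", "To", "String"]
    "parseHTTPResponse" -> ["parse", "HTTP", "Response"]
    """
    return re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", token)


def preprocess(query):
    """Camel-case split, lowercase, and remove stop words."""
    words = []
    for token in re.findall(r"[A-Za-z0-9]+", query):
        words.extend(part.lower() for part in split_camel_case(token))
    return [word for word in words if word not in STOP_WORDS]


terms = preprocess("how to readFileToString in Java")
```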
Step 8, calculating the similarity between the code features and the query sentence and the composite score of each code segment to complete the recommendation.
The code search module calculates the similarity between the features of each code segment in the feature set and the query sentence by using a BM25-based text similarity calculation method, according to the following formula:
Figure BDA0002651457170000091
where $\mathrm{sim}(\cdot)$ denotes the similarity operation, $D_n$ denotes the features of the $n$-th code segment in the feature set $D$, $q$ denotes the query sentence, $t_i$ denotes the $i$-th term in the query sentence, $i = 1, 2, 3, \ldots, m$, $m$ denotes the total number of terms in the query sentence, $\in$ denotes set membership, $\cap$ denotes intersection, $\mathrm{IDF}(\cdot)$ denotes the inverse document frequency operation, $\cdot$ denotes multiplication, $\mathrm{tf}(\cdot)$ denotes the term frequency operation, $|D_n|$ denotes the total number of feature terms of the $n$-th code segment in the feature set $D$, $\mathrm{avgdl}$ denotes the average number of feature terms per code segment in the feature set, and the parameters $k_1$ and $b$ control the range of the term frequency and the influence of the number of terms in the feature set, respectively.
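The formula itself is only available as a figure. Assuming it follows the standard Okapi BM25 form suggested by the symbol definitions, the similarity can be sketched as below. The IDF variant (the non-negative form with `+ 1` inside the logarithm) and the default values `k1 = 1.2` and `b = 0.75` are common conventions chosen for illustration, not values stated in the patent.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each document (a list of feature terms) against the query.

    Assumed form, summed over terms t in both the query and the document:
        IDF(t) * tf(t, D) * (k1 + 1) / (tf(t, D) + k1 * (1 - b + b * |D| / avgdl))
    """
    n_docs = len(documents)
    if n_docs == 0:
        return []
    avgdl = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_terms):
            if term not in tf:
                continue  # only terms in the intersection of query and document
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [["read", "file", "string"], ["open", "socket"], ["read", "socket", "read"]]
scores = bm25_scores(["read", "file"], docs)
```

In this toy corpus the first document matches both query terms, including the rarer term `file`, so it ranks above the third document, which only repeats `read`.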
The code search module computes a weighted sum of the feature similarity, code length, code activity and code utilization rate to obtain the composite score of each code segment, using the following formula:
$$s_j = e_1 \times sm_j + e_2 \times l_j + e_3 \times a_j + e_4 \times se_j$$
where $s_j$ denotes the composite score of the $j$-th code segment; $e_1$, $e_2$, $e_3$ and $e_4$ denote weights with values 0.6, 0.1, 0.15 and 0.15, respectively; $sm_j$ denotes the feature similarity of the $j$-th code segment; $l_j$ denotes the code length of the $j$-th code segment; $a_j$ denotes the activity of the $j$-th code segment; and $se_j$ denotes the utilization rate of the $j$-th code segment.
The code search module sorts the query results from high to low by the composite score of each code segment and recommends the top-ranked code segments to the querying user.
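The weighted scoring and ranking can be sketched directly from the formula, using the stated weights 0.6, 0.1, 0.15 and 0.15. The sketch assumes that similarity, length, activity and utilization rate have already been scaled to comparable ranges (the text does not say how), and the dictionary keys and `top_k` parameter are hypothetical.

```python
WEIGHTS = (0.6, 0.1, 0.15, 0.15)  # e1..e4 as stated in the description


def composite_score(similarity, length, activity, utilization, weights=WEIGHTS):
    """s_j = e1*sm_j + e2*l_j + e3*a_j + e4*se_j."""
    e1, e2, e3, e4 = weights
    return e1 * similarity + e2 * length + e3 * activity + e4 * utilization


def recommend(candidates, top_k=3):
    """Sort candidates by composite score, high to low, and return the top ids."""
    ranked = sorted(
        candidates,
        key=lambda c: composite_score(
            c["similarity"], c["length"], c["activity"], c["utilization"]
        ),
        reverse=True,
    )
    return [c["id"] for c in ranked[:top_k]]


candidates = [
    {"id": "b", "similarity": 0.5, "length": 1.0, "activity": 1.0, "utilization": 1.0},
    {"id": "c", "similarity": 0.1, "length": 0.1, "activity": 0.1, "utilization": 0.1},
    {"id": "a", "similarity": 0.9, "length": 0.2, "activity": 0.5, "utilization": 0.5},
]
top = recommend(candidates, top_k=2)
```

Because similarity carries a weight of 0.6, candidate `a` (score 0.71) edges out candidate `b` (score 0.70) even though `b` is maximal on the three other factors.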

Claims (5)

1. A code search recommendation device based on open source community features, comprising a code segment generation module, a code feature extraction module, a text preprocessing module and a code search module, wherein:
the code segment generation module is used for splitting the code of at least 100,000 open source projects, using the attributes of the code in each project as the granularity, to obtain at least 1.5 million code segments, which form a code segment library;
the code feature extraction module is used for: representing all the text information in each code segment as a tree by using an abstract syntax tree (AST), in which each node represents one feature type in the code segment, to obtain all the text features of each code segment, the text features of all code segments forming a text feature set; analyzing the description document of each code segment to obtain the latent semantic structure of the document, and counting the frequency of occurrence of the words in each description document; classifying each latent semantic structure and then reducing its dimensionality by using a partition-based clustering method, and forming a topic attribute set from the topic attributes obtained after classification and those obtained after dimensionality reduction; taking the frequency with which the words highly relevant to each topic in the topic attribute set occur in the description document of the code segment as the probability of that topic, to obtain the probability distribution over the topic attribute set; traversing all topics in the topic attribute set, the topics whose probability is greater than a given threshold forming the topic features of the code segment, the given threshold being the minimum probability value at which the resulting topics accurately describe the topic attributes of the code segment; combining the topic features of all code segments into a topic feature set; selecting, by using a feature word selection algorithm, all the words that denote code frameworks from the description document of each code segment to form the framework feature words of the code segment, the framework feature words of each code segment together with the number of lines of the code segment forming the structural features of the code segment; combining the structural features of all code segments into a structural feature set; collecting, by using a web crawler, the utilization rate and the maintenance log of each code segment in the open source community; traversing each entry in the maintenance log, marking the entries that contain a modification operation, calculating the proportion of marked entries among all entries in the maintenance log, and taking this proportion as the activity of the code segment; combining the utilization rate and the activity of each code segment into the development features of the code segment; combining the development features of all code segments into a development feature set; establishing a mapping between each code segment in the code segment library and the text features, topic features, structural features and development features of that code segment to build a code segment-feature index library; and combining the text feature set, the topic feature set, the structural feature set and the development feature set into a feature set;
the text preprocessing module is used for: splitting the long word strings in the query sentence text input by the user according to their semantic content, using camel-case word segmentation, to generate a word group; removing the stop words in the word group by using a case-based word segmentation algorithm to generate a short word group; extracting the word stems in the word group by using the Porter Stemmer algorithm to generate a word stem group; extracting the action words in the word group by using the Stanford Parser tool to generate an action word group; and forming the terms of the query sentence from the short word group, the word stem group and the action word group;
the code search module is used for: calculating the similarity between the features of each code segment in the feature set and the query sentence; computing a weighted sum of the feature similarity, code length, code activity and code utilization rate to obtain a composite score for each code segment; and sorting the query results from high to low by composite score and recommending the top-ranked code segments to the querying user.
2. A code search recommendation method based on open source community features, characterized in that development features are added to the text, topic and structural features extracted from open source code so as to characterize code segments more comprehensively, and candidate code segments are ranked by calculating the similarity between the query text and the code feature text to complete the code search; the specific steps of the method are as follows:
(1) generating a code segment library:
the code segment generation module splits the code of at least 100,000 open source projects, using the attributes of the code in each project as the granularity, to obtain at least 1.5 million code segments, which form a code segment library;
(2) extracting the text features of each code segment:
the code feature extraction module uses an abstract syntax tree (AST) to represent all the text information in each code segment as a tree, in which each node represents one feature type in the code segment, to obtain all the text features of each code segment; the text features of all code segments form a text feature set;
(3) generating the topic features of each code segment by unsupervised learning:
(3a) the code feature extraction module analyzes the description document of each code segment to obtain the latent semantic structure of the document, and counts the frequency of occurrence of the words in each description document;
(3b) the code feature extraction module classifies each latent semantic structure and then reduces its dimensionality by using a partition-based clustering method, and forms a topic attribute set from the topic attributes obtained after classification and those obtained after dimensionality reduction;
(3c) the code feature extraction module takes the frequency with which the words highly relevant to each topic in the topic attribute set occur in the description document of the code segment as the probability of that topic, to obtain the probability distribution over the topic attribute set;
(3d) the code feature extraction module traverses all topics in the topic attribute set, and the topics whose probability is greater than a given threshold form the topic features of the code segment; the given threshold is the minimum probability value at which the resulting topics accurately describe the topic attributes of the code segment;
(3e) the code feature extraction module combines the topic features of all code segments into a topic feature set;
(4) generating the structural features of each code segment:
the code feature extraction module uses a feature word selection algorithm to select, from the description document of each code segment, all the words that denote code frameworks, which form the framework feature words of the code segment; the framework feature words of each code segment together with the number of lines of the code segment form the structural features of the code segment; the structural features of all code segments form a structural feature set;
(5) generating the development features of each code segment:
(5a) the code feature extraction module uses a web crawler to collect the utilization rate and the maintenance log of each code segment in the open source community;
(5b) the code feature extraction module traverses each entry in the maintenance log, marks the entries that contain a modification operation, calculates the proportion of marked entries among all entries in the maintenance log, and takes this proportion as the activity of the code segment;
(5d) the code feature extraction module combines the utilization rate and the activity of each code segment into the development features of the code segment;
(5e) the code feature extraction module combines the development features of all code segments into a development feature set;
(6) establishing a search index:
(6a) the code feature extraction module establishes a mapping between each code segment in the code segment library and the text features, topic features, structural features and development features of that code segment, and builds a code segment-feature index library;
(6b) the code feature extraction module combines the text feature set, the topic feature set, the structural feature set and the development feature set obtained in steps (2), (3), (4) and (5) into a feature set;
(7) preprocessing the query sentence text:
(7a) the text preprocessing module splits the long word strings in the query sentence text input by the user according to their semantic content, using camel-case word segmentation, to generate a word group;
(7b) the text preprocessing module removes the stop words in the word group by using a case-based word segmentation algorithm to generate a short word group;
(7c) the text preprocessing module extracts the word stems in the word group by using the Porter Stemmer algorithm to generate a word stem group;
(7d) the text preprocessing module extracts the action words in the word group by using the Stanford Parser tool to generate an action word group;
(7e) the text preprocessing module forms the terms of the query sentence from the short word group, the word stem group and the action word group;
(8) calculating the similarity between the code features and the query sentence and the composite score of each code segment to complete the recommendation:
(8a) the code search module calculates the similarity between the features of each code segment in the feature set and the query sentence by using a BM25-based text similarity calculation method;
(8b) the code search module computes a weighted sum of the feature similarity, code length, code activity and code utilization rate to obtain the composite score of each code segment;
(8c) the code search module sorts the query results from high to low by the composite score of each code segment and recommends the top-ranked code segments to the querying user.
3. The open source community feature-based code search recommendation method according to claim 2, wherein the attributes of the code itself in step (1) include structure, method and function.
4. The method for recommending code search based on open source community features of claim 2, wherein the similarity between the feature of each code segment in the feature set and the query sentence in step (8a) is calculated by the following formula:
Figure FDA0002651457160000041
where $\mathrm{sim}(\cdot)$ denotes the similarity operation, $D_n$ denotes the features of the $n$-th code segment in the feature set $D$, $q$ denotes the query sentence, $t_i$ denotes the $i$-th term in the query sentence, $i = 1, 2, 3, \ldots, m$, $m$ denotes the total number of terms in the query sentence, $\in$ denotes set membership, $\cap$ denotes intersection, $\mathrm{IDF}(\cdot)$ denotes the inverse document frequency operation, $\cdot$ denotes multiplication, $\mathrm{tf}(\cdot)$ denotes the term frequency operation, $|D_n|$ denotes the total number of feature terms of the $n$-th code segment in the feature set $D$, $\mathrm{avgdl}$ denotes the average number of feature terms per code segment in the feature set, and the parameters $k_1$ and $b$ control the range of the term frequency and the influence of the number of terms in the feature set, respectively.
5. The open source community feature-based code search recommendation method according to claim 2, wherein the weighting calculation of each feature similarity, code length, code activity and code usage rate in step (8b) to obtain the comprehensive score of each code segment is obtained by the following formula:
$$s_j = e_1 \times sm_j + e_2 \times l_j + e_3 \times a_j + e_4 \times se_j$$
where $s_j$ denotes the composite score of the $j$-th code segment; $e_1$, $e_2$, $e_3$ and $e_4$ denote weights with values 0.6, 0.1, 0.15 and 0.15, respectively; $sm_j$ denotes the feature similarity of the $j$-th code segment; $l_j$ denotes the code length of the $j$-th code segment; $a_j$ denotes the activity of the $j$-th code segment; and $se_j$ denotes the utilization rate of the $j$-th code segment.
CN202010872148.XA 2020-08-26 2020-08-26 Code search recommendation device and method based on open source knowledge Active CN112051986B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010872148.XA CN112051986B (en) 2020-08-26 2020-08-26 Code search recommendation device and method based on open source knowledge

Publications (2)

Publication Number Publication Date
CN112051986A (en) 2020-12-08
CN112051986B (en) 2021-07-27

Family

ID=73599375

Country Status (1)

Country Link
CN (1) CN112051986B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927177A (en) * 2014-04-18 2014-07-16 扬州大学 Characteristic-interface digraph establishment method based on LDA model and PageRank algorithm
CN106407113A (en) * 2016-09-09 2017-02-15 扬州大学 Bug positioning method based on Stack Overflow and commit libraries
KR20170134191A (en) * 2016-05-26 2017-12-06 연세대학교 원주산학협력단 Software domain topics extraction system using PageRank and topic modeling
CN108717470A (en) * 2018-06-14 2018-10-30 南京航空航天大学 A kind of code snippet recommendation method with high accuracy
US20190303141A1 (en) * 2018-03-29 2019-10-03 Elasticsearch B.V. Syntax Based Source Code Search
CN110750240A (en) * 2019-08-28 2020-02-04 南京航空航天大学 Code segment recommendation method based on sequence-to-sequence model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李阵: ""基于多特征权重分配的源代码搜索优化"", 《计算机软件技术》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023065638A1 (en) * 2021-10-22 2023-04-27 平安科技(深圳)有限公司 Data retrieval method and apparatus, and electronic device and storage medium
CN113901177A (en) * 2021-10-27 2022-01-07 电子科技大学 Code searching method based on multi-mode attribute decision
CN113901177B (en) * 2021-10-27 2023-08-08 电子科技大学 Code searching method based on multi-mode attribute decision
CN116185379A (en) * 2022-11-17 2023-05-30 北京东方通科技股份有限公司 Method for optimizing code hosting platform
CN116185379B (en) * 2022-11-17 2023-09-22 北京东方通科技股份有限公司 Method for optimizing code hosting platform
CN116974619A (en) * 2023-09-22 2023-10-31 国网电商科技有限公司 Method, device and equipment for constructing software bill of materials library and readable medium
CN116974619B (en) * 2023-09-22 2024-01-12 国网电商科技有限公司 Method, device and equipment for constructing software bill of materials library and readable medium

Also Published As

Publication number Publication date
CN112051986B (en) 2021-07-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant