CN106649557B - Semantic association mining method for defect report and mail list - Google Patents

Info

Publication number
CN106649557B
Authority
CN
China
Prior art keywords
mail
defect report
association
text
defect
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610984538.XA
Other languages
Chinese (zh)
Other versions
CN106649557A (en)
Inventor
赵俊峰
陈秀招
曹英魁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Information Technology Institute (tianjin Binhai)
Original Assignee
Peking University Information Technology Institute (tianjin Binhai)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Information Technology Institute (tianjin Binhai)
Priority to CN201610984538.XA
Publication of CN106649557A
Application granted
Publication of CN106649557B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a semantic association mining method for defect reports and mailing lists. The method comprises the following steps: 1) parsing the acquired defect reports and mailing list of a target project to obtain the stack information, code segments, and body text of the defect reports and the stack information, code segments, and body text of the mailing list; 2) a document explicit semantic association mining unit identifies, from the parsing result, explicit semantic associations between the defect reports and the mailing list, the explicit semantic associations comprising reference associations and common code element associations; 3) a document implicit semantic association mining unit identifies, from the parsing result, implicit semantic associations between the defect reports and the mailing list, including similar associations and potential semantic associations. The invention helps to locate related defect reports and mailing list documents efficiently and helps developers to better reuse software resources.

Description

Semantic association mining method for defect report and mail list
Technical Field
The method is used for mining the semantic associations between defect reports and mailing lists during software reuse, reducing the burden on developers of searching, reading, and learning.
Background
In software development, software reuse aims to make full use of the knowledge and experience accumulated in developing earlier application systems, avoiding duplicated work, focusing development effort on the parts specific to the application, and improving the efficiency and quality of software development.
In recent years, a large number of open-source project hosting sites have appeared on the internet, and more and more high-quality software has become available to developers. These projects provide not only rich code resources; mature open-source projects also produce a large number of documents in a variety of forms. Defect reports and mailing lists are two of them. These documents are not only the development records and communication channels of project developers, but also a way for other developers to learn and reuse the projects.
As two common kinds of project document, defect reports and mailing lists record and discuss the problems developers encounter and provide valuable reuse resources for software developers. However, as open-source projects grow, the problems developers encounter become increasingly complex, and answers can no longer be found in a single type of document. For example, a developer learns the problem scenario and the suggestions of community members from the discussion in the mailing list, and checks how the problem was solved and which code modules were affected from the patches in the defect report.
The defect report and the mail list contain rich software project information, and great convenience is brought to developers. However, some characteristics of these documents also bring obstacles to the developers for learning and reuse, mainly including:
First, the number of defect reports and mailing list messages is enormous.
Second, the content of defect reports and mailing lists is complicated.
Third, the associations between defect reports and mailing lists are complex.
Therefore, in order to help developers to better understand and reuse software resources, the association relationship between the defect report and the mailing list needs to be mined, and a document retrieval interface based on the association relationship between the defect report and the mailing list needs to be provided.
In summary, defect reports and mailing lists contain rich software project information, but the information is huge in quantity, complicated in content, and complex in its associations. Organizing it manually consumes a great deal of time and energy, and it is very difficult to locate the information a developer cares about across documents. Therefore, to help developers better reuse open-source project resources, it is desirable to provide a semantic association mining tool for defect reports and mailing lists.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention aims to provide a semantic association mining method for a defect report and a mail list.
The technical scheme of the invention is as follows:
A semantic association mining method for a defect report and a mail list comprises the following steps:
1) parsing the acquired defect reports and mailing list of the target project to obtain the stack information, code segments, and body text of the defect reports and the stack information, code segments, and body text of the mailing list;
2) the document explicit semantic association mining unit identifies, from the parsing result, explicit semantic associations between the defect reports and the mailing list, wherein the explicit semantic associations comprise reference associations and common code element associations;
3) the document implicit semantic association mining unit identifies, from the parsing result, implicit semantic associations between the defect reports and the mailing list, including similar associations and potential semantic associations.
Further, the reference association includes an association of a defect report reference and an association of a mail reference.
Further, the method for obtaining the association of the defect report reference includes: performing pattern matching on a text of the mail list, and judging whether reference links for the defect report or key names of the defect report are contained; if yes, identifying key names or extracting key name information in the reference links; and then positioning a corresponding defect report according to the key name and establishing a reference association.
Further, the method for obtaining the association of the mail reference includes: performing pattern matching on the text of the defect report, and judging whether reference information of the mail is contained; if yes, extracting Message-ID information in the reference link; and then positioning the corresponding mail according to the Message-ID and establishing reference association.
Further, if the same code element exists in the body text of a mail and the body text of a defect report, the mail and the defect report are considered to have the common code element association.
Further, the common code element association is mined according to the source of the code elements. The code elements are first analyzed. If a code element comes from the target project and is a long code element: 1) the long code element is parsed into an AST; 2) the AST nodes are traversed and the elements on the nodes are read; 3) for each node, the name of the package in which the node is located is extracted and connected to the node, giving a long code element set. If a code element comes from the target project and is a short code element: 1) the short code element is parsed into an AST; 2) the AST nodes are traversed and the elements on the nodes are read, giving an initial code element set; 3) duplicates are removed from the initial code element set and stop words are filtered out. If a code element comes from another project, it is analyzed by a naming rule method. Whether a common code element association exists between the body text of the mail and the body text of the defect report is then judged from the analysis result.
Further, the method for mining the implicit semantic association comprises the following steps:
1) calculating the similarity SIM1 of the mail and the defect report according to the stack information in the mail and the defect report;
2) calculating the similarity SIM2 of the mail and the defect report according to the text of the mail and the defect report;
3) obtaining a comprehensive similarity SIM of the mail and the defect report based on the similarity SIM1 and the SIM 2; then determining the mails and defect reports with similar associations according to the comprehensive similarity SIM;
4) acquiring a query vector of each document, wherein the query vector of the i-th document is $V_i = \langle W_{i,1}, W_{i,2}, \ldots, W_{i,k}, \ldots, W_{i,n} \rangle$,
Figure BDA0001148825690000032
where $n$ is the total number of terms appearing in all documents and $W_{i,k}$ is the number of times the k-th term occurs in document i; the documents include mails and defect reports;
5) calculating the cosine similarity between document i and every other document from the query vector of document i, and sorting the results in descending order; the top-ranked documents of the sorted result are then taken as the documents having a potential semantic association with document i.
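Further, the cosine similarity $\cos(V_i, V_j)$ between the query vector $V_i = \langle W_{i,1}, W_{i,2}, \ldots, W_{i,n} \rangle$ of document i and the query vector $V_j = \langle W_{j,1}, W_{j,2}, \ldots, W_{j,n} \rangle$ of document j is computed using the formula
$$\cos(V_i, V_j) = \frac{\sum_{k=1}^{n} W_{i,k} W_{j,k}}{\sqrt{\sum_{k=1}^{n} W_{i,k}^{2}}\,\sqrt{\sum_{k=1}^{n} W_{j,k}^{2}}}$$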
Further, the graph database Neo4j is used to represent the mined semantic associations between defect reports and mailing lists.
Further, the method for analyzing the defect report and the mail list comprises the following steps:
21) first, filtering redundant text content from the defect report and the mailing list;
22) extracting stack information from the defect report and the mailing list processed in step 21) according to the characteristics of stack information;
23) extracting code segments from the defect report and the mailing list processed in step 22); the remaining text is the body text.
The invention relates to the following main contents:
First, acquiring and parsing the defect reports and the mailing list;
Second, mining the explicit semantic associations between the defect reports and the mailing list;
Third, mining the implicit semantic associations between the defect reports and the mailing list.
The semantic association mining method for the defect report and the mailing list comprises four subunits: a document acquisition and analysis unit, a document explicit semantic association mining unit, a document implicit semantic association mining unit, and a document retrieval unit.
In a preferred form of the semantic association mining method for the defect report and the mail list, the document acquisition and analysis unit works as follows:
Step 1: Determine the network addresses of the document resources, and acquire the defect report and mailing list documents by targeted crawling.
Step 2: Determine the data format of the acquired documents: the defect reports are in JSON format; the mailing list is in mbox format.
Step 3: Parse the documents according to their format. Defect reports in JSON format are parsed using the tool org.json; individual mails are processed using the MimeStreamParser in the Mime4J class library of Apache James (Java Apache Mail Enterprise Server). The parsed text is stored by type (mail or defect report).
Step 4: Identify the following in the two types of text (mail and defect report) parsed in step 3:
a) Code fragments. The code fragments in defect reports and mailing lists are often not complete compilable units, which makes code identification difficult. The method implements a rule-based code fragment recognition tool for Java syntax; the recognition process is as follows:
1. Keywords with program characteristics, such as those of class definitions, method definitions, conditional statements, and assignment statements, are collected into a set K = {class, if, else, for, while, switch, "=", ...}, and a regular expression is used to describe the basic syntax format F of a method definition.
2. Each text paragraph is analyzed line by line; if a line contains a Java keyword or a symbol specific to the Java language (set K), or conforms to the method definition format F, the line is judged to be a code line.
3. If a line begins with "//", or begins with "/*" and ends with "*/", it is judged to be a comment line. The number of comment lines is counted.
4. If the number of code lines plus the number of comment lines exceeds half of the total number of lines in the paragraph, and an opening or closing brace appears in the paragraph, the paragraph is judged to be a code fragment.
b) Stack information. This is the call stack reported at the point of failure when a program encounters an error. Stack information is identified mainly by the following characteristic lines:
1. Exception line: located at the beginning of the text paragraph; includes words "java.
2. At line: begins with "at" and indicates the location where the error occurred; located in the body of the text paragraph.
3. Omitted line: some stack traces are particularly long, while only the top few methods actually cause the error, so an overly long trace is usually abbreviated with an ellipsis "..." and ends with "more"; this line follows the "at" lines.
4. Caused-by line: begins with "Caused by:" and is followed by a series of "at" lines.
If the number of characteristic lines in a text paragraph exceeds a certain threshold, the paragraph is judged to be stack information.
c) Redundant text. In mails this mainly consists of the greetings and thanks at the beginning and end of the mail and the mail signature. If the mail is a reply, the quoted content of the original mail is also redundant text.
d) Body text. Everything other than the three types above is regarded as body text.
The recognition order is redundant text → stack information → code fragment → body text. That is, redundant text is filtered first to reduce the effect of noise; stack information is then extracted according to its characteristics; code fragments are extracted next; and the remaining text is the body text. Stack information, code fragments, and body text are the three kinds of effective text analyzed below.
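The recognition order above can be sketched in Python as a small paragraph classifier. This is a minimal illustration, not the patent's tool: the keyword set covers only the members of K that the text lists, `METHOD_DEF_RE` is a hypothetical stand-in for the format F, the quoted-reply test (lines starting with ">") is an assumed proxy for redundant text, and the 0.5 stack threshold is an assumed value for the unspecified "certain threshold".

```python
import re

# Members of set K that the text lists (the original set is truncated).
KEYWORD_RE = re.compile(r"\b(?:class|if|else|for|while|switch)\b|=")
# Hypothetical stand-in for the method-definition format F.
METHOD_DEF_RE = re.compile(
    r"^\s*(?:(?:public|private|protected|static)\s+)*"
    r"[\w<>\[\]]+\s+\w+\s*\([^)]*\)\s*\{?\s*$"
)
# Characteristic stack lines: exception, "at", omitted, and "Caused by:" lines.
STACK_RES = [
    re.compile(r"\b[\w.]*(?:Exception|Error)\b"),
    re.compile(r"^\s*at\s+[\w.$<>]+\("),
    re.compile(r"^\s*\.\.\.\s*\d+\s+more"),
    re.compile(r"^\s*Caused by:"),
]


def is_comment_line(line):
    s = line.strip()
    return s.startswith("//") or (s.startswith("/*") and s.endswith("*/"))


def is_code_line(line):
    return bool(KEYWORD_RE.search(line) or METHOD_DEF_RE.match(line))


def classify(paragraph, stack_threshold=0.5):
    """Return 'redundant', 'stack', 'code', or 'body' for one paragraph."""
    lines = [ln for ln in paragraph.splitlines() if ln.strip()]
    if not lines:
        return "body"
    # Assumed proxy for redundant text: a fully quoted reply block.
    if all(ln.lstrip().startswith(">") for ln in lines):
        return "redundant"
    stack_hits = sum(any(p.search(ln) for p in STACK_RES) for ln in lines)
    if stack_hits / len(lines) > stack_threshold:
        return "stack"
    comments = sum(is_comment_line(ln) for ln in lines)
    code = sum(is_code_line(ln) for ln in lines if not is_comment_line(ln))
    # Code plus comment lines exceed half of the lines, and a brace is present.
    if 2 * (code + comments) > len(lines) and re.search(r"[{}]", paragraph):
        return "code"
    return "body"
```

Checking stack information before code fragments matters because "at" lines contain parentheses and dots that a looser code rule could otherwise claim.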
In a preferred form of the semantic association mining method for the defect report and the mailing list, the document explicit semantic association mining unit identifies explicit semantic associations between defect reports and the mailing list, including reference associations and common code element associations. The workflow is as follows:
Step 1: Mine reference associations. Defect reports and mailing lists often cross-reference each other to supplement the description of a problem. Mining reference associations means mining the reference relationships between defect reports and mails. These are divided into references to defect reports and references to mails, and the associations are established as follows:
a) References to defect reports. Body text can reference a defect report in two ways: by key name and by link. First, the key name of a defect report consists of the capitalized open-source project name, a hyphen "-", and a number, concatenated, e.g., "LUCENE-352". Second, each defect report in the defect tracking system has a specific link, whose pattern is as follows:
[network protocol]://[report type].[community site]/[defect tracking system name]/[view mode]/[key name]
For example, a link referencing a Lucene defect report is:
http://issues.apache.org/jira/browse/LUCENE-352
Based on the above pattern characteristics, association mining of defect report references proceeds in the following steps: 1) use pattern matching to judge whether the body text of the mailing list contains a reference link to a defect report or the key name of a defect report; 2) if so, identify the key name directly, or extract the key name from the reference link; 3) locate the corresponding defect report by the key name and establish the reference association.
b) References to mails. The unique identifier of a mail, the Message-ID, has the form <local-part@domain>, where local-part is the mail ID and domain is the domain name of the host that assigns this ID. Uniqueness of the ID is the responsibility of the assigning host, and there is no rule for how the ID is generated. It can be observed that the Message-ID of a mail is not as intuitive and concise as the key name of a defect report, for example "<c68e39170909240919g11492b52qc65667ac4c5431d5@mail. Thus, references to mails typically use only links. Each mail has a specific link, whose pattern is as follows:
[network protocol]://[archive server name].[community site]/[archive format]/[mailing list name]/[archive file name]/[mail ID]
For example, a link referencing a Lucene mail is:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200808.mbox/<48a3076a.2679420a.1c53.ffffa5c4%40mx.google.com>
Based on the above pattern characteristics, association mining of mail references proceeds in the following steps: 1) use pattern matching to judge whether the body text of the defect report contains reference information for a mail, namely a reference link; 2) if so, extract the Message-ID from the reference link; 3) locate the corresponding mail by the Message-ID and establish the reference association.
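The two pattern-matching procedures of Step 1 might look as follows in Python. This is a sketch under assumptions: the regular expressions are modeled only on the two example links quoted above, the function names are hypothetical, and `mails` and `reports` are assumed to be plain dictionaries keyed by Message-ID and by key name.

```python
import re
from urllib.parse import unquote

# Archive links shaped like .../mod_mbox/<list name>/<file>.mbox/<mail ID>.
MAIL_LINK_RE = re.compile(r"https?://\S+?/mod_mbox/[^/\s]+/[^/\s]+\.mbox/(\S+)")


def find_defect_keys(text, project="LUCENE"):
    """Key names such as LUCENE-352, written bare or embedded in a link."""
    return set(re.findall(rf"\b{re.escape(project)}-\d+\b", text))


def find_message_ids(text):
    """Message-IDs recovered from mail archive links, percent-decoded."""
    ids = set()
    for raw in MAIL_LINK_RE.findall(text):
        ids.add("<" + unquote(raw).strip("<>") + ">")
    return ids


def reference_associations(mails, reports, project="LUCENE"):
    """mails: Message-ID -> body text; reports: key name -> body text."""
    links = set()
    for mail_id, text in mails.items():
        for key in find_defect_keys(text, project):
            if key in reports:          # locate the defect report by key name
                links.add((mail_id, key))
    for key, text in reports.items():
        for mail_id in find_message_ids(text):
            if mail_id in mails:        # locate the mail by Message-ID
                links.add((mail_id, key))
    return links
```

Because a defect report link ends in the key name, a single key-name pattern covers both the bare key and the link form in this sketch.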
Step 2: Mine common code element associations. If the same code element appears in the body text of a mail and of a defect report, the two are considered to have a common code element association. Depending on the source of the associated code element, two categories can be distinguished: code elements of this project (Lucene) and code elements of other projects (non-Lucene). The two cases are analyzed by source code parsing and by naming rules, respectively, as follows:
a) Source code parsing, for code elements of this project. Taking Lucene 4.0 as an example, many class and method names in the source code are common natural-language words, which produces noise in association mining. To avoid this, the method divides the code elements into long code elements and short code elements, analyzed as follows:
A long code element includes not only the name of the class or method but also the full name of the package in which the class or method is located, such as org.apache.lucene.analysis.CachingTokenFilter. The steps are: 1) parse the Java source code into an AST using the ASTParser provided by Eclipse JDT; 2) traverse the AST nodes and read the elements on them, such as classes, methods, and interfaces; 3) extract the package name of each node and join it to the node name with "." to obtain the long code element set.
A short code element contains only the name of the current AST node, such as CachingTokenFilter. Further filtering is required, because many method names are common natural-language words. The steps are: 1) parse the Java source code into an AST using the ASTParser provided by Eclipse JDT; 2) traverse the AST nodes and read the elements on them, such as classes and methods, to obtain an initial code element set; 3) remove duplicates from the set and filter out stop words; 4) manually screen out and delete words that carry little meaning, such as get, size, and value.
b) Naming rule identification, for code elements of other projects. In Java, variable names follow camel-case naming rules to improve readability. There are three forms:
1. Lower camel case. The first word begins with a lowercase letter and each following word begins with an uppercase letter, such as firstName.
2. Upper camel case. The first letter of every word, including the first word, is capitalized, such as FirstName.
3. Non-standard camel-case strings, i.e., those with consecutive capital letters and digits, such as UPPER2000UPPER, hasabbreviationembedded, Client2Server 2012.
For the camel-case naming convention of Java variables, the regular expression "(?<!(^|[A-Z0-9]))(?=[A-Z0-9])|(?<!(^|[^A-Z]))(?=[0-9])|(?<!(^|[^0-9]))(?=[A-Za-z])|(?<!^)(?=[A-Z][a-z])" is used to identify all variable names that follow the convention.
When the same code element is extracted from the two documents, the two documents are judged to have a common code element association.
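A minimal Python sketch of the naming-rule branch and the final intersection test follows. It deliberately swaps in a simplified camel-case check (`CAMEL_HINT_RE`, an assumption made for this example) rather than the regular expression quoted above, and the function names are hypothetical.

```python
import re

IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Simplified camel-case hint: a lower-to-upper transition, an acronym
# followed by a lowercase letter, or a letter next to a digit.
CAMEL_HINT_RE = re.compile(r"[a-z][A-Z]|[A-Z]{2,}[a-z]|[A-Za-z][0-9]|[0-9][A-Za-z]")


def code_elements(text, stop_words=frozenset()):
    """Identifiers in the text that look like camel-case code elements."""
    return {
        token
        for token in IDENTIFIER_RE.findall(text)
        if CAMEL_HINT_RE.search(token) and token not in stop_words
    }


def common_code_elements(mail_text, report_text, stop_words=frozenset()):
    """Code elements shared by a mail and a defect report."""
    return code_elements(mail_text, stop_words) & code_elements(report_text, stop_words)
```

A non-empty result from `common_code_elements` corresponds to a common code element association between the two documents.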
In a preferred form of the semantic association mining method for the defect report and the mailing list, the document implicit semantic association mining unit identifies implicit semantic associations between defect reports and the mailing list, in the following steps:
Step 1: Mine similar associations between mails and defect reports.
Step 1.1: Extract the stack information in the mail and in the defect report, including the exception type name and the function signature at the error position, and calculate the similarity SIM1.
Step 1.2: Taking into account the diversity of natural language, calculate the similarity SIM2 of the body text.
Step 1.3: Calculate the comprehensive similarity SIM of the documents from the two similarities SIM1 and SIM2 using a heuristic rule; then determine from the comprehensive similarity SIM which mails and defect reports have a similar association.
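The text does not give the formulas for SIM1 and SIM2 or the heuristic rule that combines them. The Python sketch below is therefore purely illustrative: Jaccard overlap for both similarities, a weighted sum as the combining rule, and the weight 0.6 are all assumptions introduced for this example.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(stack_a, stack_b, text_a, text_b, w_stack=0.6):
    """Assumed combination of SIM1 (stack) and SIM2 (body text).

    stack_a and stack_b are lists of exception type names and function
    signatures; text_a and text_b are body texts.
    """
    sim2 = jaccard(text_a.lower().split(), text_b.lower().split())
    if not stack_a or not stack_b:
        # Assumed fallback: without stack information on both sides,
        # only the body text similarity is available.
        return sim2
    sim1 = jaccard(stack_a, stack_b)
    return w_stack * sim1 + (1 - w_stack) * sim2
```

A pair of documents would then be linked by a similar association when the returned value exceeds a chosen cut-off, which the text likewise leaves unspecified.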
Step 2: potential semantic association is mined. And mining the contribution degree of the patch information of the solved defect report to semantic association.
Step 2.1: Build an index. An inverted index is built over all defect reports and mailing lists.
Step 2.2: Obtain the query vectors. Suppose n distinct terms appear across all documents, and number them. The processed query vector of the i-th document (mail or defect report) is $V_i = \langle W_{i,1}, W_{i,2}, \ldots, W_{i,n} \rangle$,
Figure BDA0001148825690000072
where $W_{i,k}$ is the number of times the k-th term appears in document i. The summary of a defect report highlights the central idea of the document, and the patch file contains the code that had to be modified to fix the defect. Therefore, the method combines the summary of the defect report with the code elements in the patch file, and applies word segmentation, stop-word filtering, and stemming to obtain the query vector of the defect report.
Step 2.3: Establish potential associations. The cosine similarity between a document D and every other document is calculated from the corresponding query vectors, and the results are sorted in descending order. The top 3 documents of the sorted result are judged to have a potential semantic association with document D. This is repeated to establish the potential semantic associations of all documents. The cosine similarity of the query vectors $V_i = \langle W_{i,1}, W_{i,2}, \ldots, W_{i,n} \rangle$ and $V_j = \langle W_{j,1}, W_{j,2}, \ldots, W_{j,n} \rangle$ is calculated as follows:
$$\cos(V_i, V_j) = \frac{\sum_{k=1}^{n} W_{i,k} W_{j,k}}{\sqrt{\sum_{k=1}^{n} W_{i,k}^{2}}\,\sqrt{\sum_{k=1}^{n} W_{j,k}^{2}}}$$
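Steps 2.2 and 2.3 can be sketched in Python with term-count vectors and cosine similarity. The whitespace tokenizer is a simplification (the text also applies stop-word filtering and stemming, omitted here), and the function names and sample document identifiers are hypothetical.

```python
import math
from collections import Counter


def term_counts(text):
    """Query vector as a sparse term -> count mapping (W_{i,k})."""
    return Counter(text.lower().split())


def cosine(vi, vj):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(count * vj[term] for term, count in vi.items())
    norm_i = math.sqrt(sum(c * c for c in vi.values()))
    norm_j = math.sqrt(sum(c * c for c in vj.values()))
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0


def potential_associations(docs, top_k=3):
    """docs: document id -> text. Returns id -> ids of the top_k most similar."""
    vectors = {doc_id: term_counts(text) for doc_id, text in docs.items()}
    result = {}
    for doc_id, vec in vectors.items():
        others = [other for other in vectors if other != doc_id]
        others.sort(key=lambda other: cosine(vec, vectors[other]), reverse=True)
        result[doc_id] = others[:top_k]
    return result
```

Storing the vectors as sparse counters avoids materializing all n dimensions, since most terms do not occur in any given document.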
Preferably, in the semantic association mining method for the defect report and the mailing list, the document retrieval unit uses the graph database Neo4j to represent visually the mined semantic associations between defect reports and mailing lists. It also exposes a retrieval interface through which defect reports and mailing list documents can be obtained conveniently. The method adopts the PageRank algorithm used by search engines: document nodes in the graph data structure are treated as analogous to web pages and semantic association edges as analogous to hyperlinks, and the retrieval results are reordered according to the semantic associations among documents. Because the number of defect reports and mail documents is huge, only the first 100 results of information retrieval are reordered, to keep reordering efficient.
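The reordering idea can be illustrated with a small PageRank iteration in Python. This is a sketch, not the patented implementation: treating associations as undirected edges, the damping factor 0.85, the fixed iteration count, and the function names are assumptions made for this example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """PageRank over association edges, treated here as undirected."""
    nodes = sorted({node for edge in edges for node in edge})
    neighbours = {node: set() for node in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    rank = {node: 1 / len(nodes) for node in nodes}
    for _ in range(iterations):
        rank = {
            node: (1 - damping) / len(nodes)
            + damping * sum(rank[m] / len(neighbours[m]) for m in neighbours[node])
            for node in nodes
        }
    return rank


def rerank(results, edges, limit=100):
    """Reorder only the first `limit` retrieval results by PageRank score."""
    rank = pagerank(edges)
    head = sorted(results[:limit], key=lambda doc: rank.get(doc, 0.0), reverse=True)
    return head + results[limit:]
```

Because Python's sort is stable, documents with equal scores keep their original retrieval order.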
Compared with the prior art, the invention has the following positive effects:
Taking the project Lucene as an example, the following experiments show that the method improves search efficiency compared with the prior art.
First, a question set Sq related to Lucene is selected from Stack Overflow. The selection criteria are: the question carries the Lucene tag, the question has a positive vote count, and the answers to the question contain links to defect reports or the mailing list.
Then, according to the semantic association graph established by the method, the set Ba of relevant defect reports and mailing list documents is returned as the search result.
Finally, two values are calculated to evaluate the method: recall and mean reciprocal rank (MRR). They are calculated as follows:
1. Calculation of recall
MC evaluates whether the search result based on semantic association contains the document related to the best answer:
Figure BDA0001148825690000081
Recall is defined as follows:
Figure BDA0001148825690000082
where $MC_i$ is the MC value for the i-th document.
2. Calculation of mean reciprocal rank
The mean reciprocal rank (MRR) measures retrieval quality. For each question, the score is the reciprocal of the rank in the retrieval system of the defect report or mailing list document related to the best answer, and the scores of all questions are averaged. The MRR value is calculated as follows:
$$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$$
where $|Q|$ is the number of search result documents and $rank_i$ is the rank of document i.
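The two measures might be computed as in the Python sketch below. It rests on assumptions: recall is taken as the fraction of questions whose result list contains at least one relevant document, MRR is averaged over the questions, a question with no relevant document in its results contributes 0, and the function names and data layout are hypothetical.

```python
def recall(results_per_question, relevant_per_question):
    """Fraction of questions whose results contain a relevant document."""
    hits = sum(
        1
        for question, results in results_per_question.items()
        if set(results) & relevant_per_question[question]
    )
    return hits / len(results_per_question)


def mean_reciprocal_rank(results_per_question, relevant_per_question):
    """Average over questions of 1 / rank of the first relevant document."""
    total = 0.0
    for question, results in results_per_question.items():
        for position, doc in enumerate(results, start=1):
            if doc in relevant_per_question[question]:
                total += 1 / position
                break
    return total / len(results_per_question)
```

For instance, with two questions where one finds its relevant document at rank 2 and the other finds none, recall is 0.5 and MRR is 0.25.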
The experiments comprised three groups: one control group and two experimental groups. The control group is the Base group, which uses the SVM text retrieval method. The first experimental group selects candidate answers using the SVM retrieval method on classified documents; the second experimental group, building on the first, reorders the results by taking into account the influence of semantic associations on the candidate answers. The two experimental groups are the Base+Classification group and the Base+Classification+Relationship group, respectively. All groups have the same task: for the questions in the question set Sq, retrieve the related documents from all defect reports and mail documents. The results are shown in Table 1:
Table 1. Experimental results
Method                              Recall    MRR
Base                                22.22%    8.05%
Base+Classification                 26.98%    13.27%
Base+Classification+Relationship    39.68%    19.95%
The experimental results show that introducing semantic associations raises recall from 22.22% for the Base method to 39.68%. In the ranking of retrieval results, the MRR of the Base method is low (8.05%), meaning that relevant documents are ranked too far down to help developers much. After text classification and semantic association mining, the MRR rises to 19.95%, which helps to locate related defect reports and mailing list documents efficiently and helps developers to better reuse software resources.
Drawings
FIG. 1 is a process flow for mining similar associations.
FIG. 2 is a workflow for mining potential semantic associations.
Detailed Description
In the embodiment, semantic association between the defect report of the item Lucene and the mailing list is mined, and the association is stored and inquired by using the graph database Neo4j so as to verify the effect of the method.
As previously described, the method will be performed in the order of acquiring resources → mining explicit semantics → mining implicit semantics → querying the document.
The document acquisition and analysis unit executes the following steps:
Step 1: Construct the file addresses, such as http://mail-archives.apache.org/mod_mbox/lucene-general/201509.mbox, and crawl the two types of documents from their respective addresses.
Step 2: Parse the project's defect reports in JSON format and store the parsed text in the Neo4j database by category.
Step 3: Parse the project's mailing list according to the MIME4J format and store the parsed text in the Neo4j database by category.
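The address construction in Step 1 can be sketched as follows. This is a minimal illustration that assumes the monthly archive naming visible in the example address (list name followed by a YYYYMM.mbox file); the function name `monthly_archive_urls` and the date range are hypothetical and not part of the described method.

```python
from datetime import date

# Base address and list name follow the example URL given in Step 1.
BASE = "http://mail-archives.apache.org/mod_mbox"


def monthly_archive_urls(list_name, start, end):
    """Return one .mbox URL per month from start to end (inclusive).

    Assumes one archive file per month named YYYYMM.mbox, as in the example.
    """
    urls = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        urls.append(f"{BASE}/{list_name}/{year:04d}{month:02d}.mbox")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls


# Hypothetical crawl window of three months around the example address.
urls = monthly_archive_urls("lucene-general", date(2015, 8, 1), date(2015, 10, 1))
```

The resulting list would then be handed to whatever crawler downloads the archives; the crawling itself is outside this sketch.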
The document explicit semantic association mining unit executes the following steps:
Step 1: Identify the reference links in the documents and mine the reference associations. These include defect reports that reference the mailing list and mailing-list messages that reference defect reports.
Step 2: Identify the code element associations in the documents. Code elements of the Lucene project are identified by source code analysis; non-Lucene code elements are identified by a naming rule method. The process is as follows:
Step 2.1: Source code analysis. This is mainly used to identify the code elements of the Lucene project, including long code elements and short code elements.
Long code elements. The project source code is parsed with Eclipse JDT AST to identify the long code elements in the source code (containing the full package name where the class or method is located), such as org.
Short code elements. The project source code is parsed with Eclipse JDT AST to identify the short code elements in the source code (containing only the class or method name), such as CachingTokenFilter.
Step 2.2: Identify the code elements of non-Lucene projects with the naming rule method: names that conform to the camel-case naming rule are identified.
Step 2.3: If the texts of two documents contain the same code element, a code association exists between the two documents, and the association between the corresponding document nodes is established in the Neo4j database.
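The explicit associations above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the key pattern `LUCENE-<number>`, the camel-case regular expression, the sample texts, and the `project_elements` set (standing in for the names that the Eclipse JDT AST analysis would produce) are all hypothetical.

```python
import re

# Assumed patterns, not taken from the source text:
ISSUE_KEY = re.compile(r"\bLUCENE-\d+\b")            # defect report key names
TOKEN = re.compile(r"[A-Za-z_][\w.]*")               # candidate identifiers
CAMEL_CASE = re.compile(r"[A-Za-z][a-z0-9]+(?:[A-Z][a-z0-9]+)+")


def referenced_issue_keys(text):
    """Step 1: key names of defect reports referenced in a mail text."""
    return set(ISSUE_KEY.findall(text))


def code_elements(text, project_elements):
    """Steps 2.1 and 2.2: project elements plus camel-case names in text."""
    tokens = {t.rstrip(".") for t in TOKEN.findall(text)}
    from_project = tokens & project_elements
    by_naming_rule = {t for t in tokens if CAMEL_CASE.fullmatch(t)}
    return from_project | by_naming_rule


def shared_code_elements(mail_text, report_text, project_elements):
    """Step 2.3: code elements that occur in both documents."""
    return (code_elements(mail_text, project_elements)
            & code_elements(report_text, project_elements))


# Hypothetical inputs for illustration only.
project_elements = {"CachingTokenFilter"}
mail = "I hit a problem with CachingTokenFilter, see LUCENE-1234 for details."
report = "CachingTokenFilter does not reset its state."

keys = referenced_issue_keys(mail)
shared = shared_code_elements(mail, report, project_elements)
```

A non-empty `shared` set would correspond to a common code element association, and each key in `keys` to a reference association to be stored as an edge in the graph database.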
The document implicit semantic association mining unit mainly performs the following:
First, mining similar associations, with the following main steps (the workflow is shown in FIG. 1):
Step 1: Calculate the stack information similarity. Extract the exception type names and the function signatures of the error positions from the stack information of document i to form a set ST_i. The similarity of the stack information of document i and document j (one a mail and the other a defect report) is then:
Figure BDA0001148825690000101
Step 2: Calculate the text similarity. The text content is first segmented into words, filtered for stop words, and stemmed. Then, according to the formula:
Figure BDA0001148825690000102
the similarity of documents D and Q is calculated, where q_1 to q_n are the keywords of document Q after processing; f(q_i, D) is the frequency in document D of the i-th term q_i of document Q; avgdl is the average document length in the document set;
Figure BDA0001148825690000103
where N is the total number of documents, n(q_i) is the number of documents containing the keyword q_i, and k_1 and b are tuning parameters.
Step 3: Calculate the comprehensive similarity. Take a document D, calculate its stack information similarity and its text similarity with every other document using the methods of Step 1 and Step 2, take the average of the two as the comprehensive similarity between D and each other document, and sort in descending order. The top 10 documents are judged to have a similar association with document D, and the associations between the corresponding document nodes are established in the Neo4j database.
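The three steps above can be sketched as follows. Several parts are assumptions: the stack similarity formula is only available as a figure, so a Jaccard overlap of the sets ST_i and ST_j is used in its place; the text similarity is written as Okapi BM25 because the symbols f(q_i, D), avgdl, k_1, b, N and n(q_i) match that scoring function; the parameter defaults (k1 = 1.2, b = 0.75), the function names, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def stack_similarity(st_i, st_j):
    """Jaccard overlap of two stack-information sets (assumed formula)."""
    if not st_i or not st_j:
        return 0.0
    return len(st_i & st_j) / len(st_i | st_j)


def bm25(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """BM25 score of doc_terms for query_terms; corpus is a list of term lists."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    freqs = Counter(doc_terms)
    score = 0.0
    for q in dict.fromkeys(query_terms):        # unique terms, stable order
        n_q = sum(1 for d in corpus if q in d)
        # The +1 inside the log is a common variant that keeps IDF
        # non-negative; it is an assumption of this sketch.
        idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5) + 1.0)
        f = freqs[q]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score


def top_similar(target, docs, k=10):
    """docs maps id -> (terms, stack_set); return the k ids most similar to target."""
    corpus = [terms for terms, _ in docs.values()]
    q_terms, q_stack = docs[target]
    scored = []
    for doc_id, (terms, stack) in docs.items():
        if doc_id == target:
            continue
        # Comprehensive similarity: average of the two similarities (Step 3).
        sim = (stack_similarity(q_stack, stack) + bm25(q_terms, terms, corpus)) / 2
        scored.append((sim, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical, already preprocessed documents.
docs = {
    "mail-1": (["index", "writer", "close", "exception"],
               {"NullPointerException", "IndexWriter.close"}),
    "bug-1": (["index", "writer", "close", "fails"],
              {"NullPointerException", "IndexWriter.close"}),
    "bug-2": (["query", "parser", "syntax"], set()),
}
best = top_similar("mail-1", docs, k=1)
```

The sketch averages the two values directly, as Step 3 describes; how the two scales are reconciled is not stated in the text, and the sketch also does not restrict pairs to one mail and one defect report.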
Second, mining potential semantic associations, with the following main steps (the workflow is shown in FIG. 2):
Step 1: Build an inverted index for the defect reports and the mailing list of the Lucene project.
Step 2: Construct queries from the summary of the defect report and the code of the patch file, and retrieve documents.
Step 3: Calculate the similarity between document D and all other documents. The calculation formula is as follows:
$$\mathrm{Similarity}(V_i, V_j) = \frac{V_i \cdot V_j}{\lVert V_i \rVert \, \lVert V_j \rVert}$$
where V_i and V_j are the query vectors of documents i and j, respectively.
The documents are then sorted in descending order of similarity; the top 3 documents are judged to have a potential semantic association with document D, and the associations between the corresponding document nodes are established in the Neo4j database.
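The ranking in Step 3 can be sketched as follows, assuming that each query vector holds raw term counts over a shared vocabulary, as the claims describe for W_i,k. The inverted index and the query construction of Steps 1 and 2 are not reproduced; the function names and sample term lists are hypothetical.

```python
import math
from collections import Counter


def query_vector(terms, vocabulary):
    """Term-count vector of a document over a fixed vocabulary order."""
    counts = Counter(terms)
    return [counts[w] for w in vocabulary]


def cosine_similarity(v_i, v_j):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm = math.sqrt(sum(a * a for a in v_i)) * math.sqrt(sum(b * b for b in v_j))
    return dot / norm if norm else 0.0


def top_latent(target, term_lists, k=3):
    """Return the k documents with the highest cosine similarity to target."""
    vocabulary = sorted({w for terms in term_lists.values() for w in terms})
    vectors = {d: query_vector(t, vocabulary) for d, t in term_lists.items()}
    others = [d for d in vectors if d != target]
    others.sort(key=lambda d: cosine_similarity(vectors[target], vectors[d]),
                reverse=True)
    return others[:k]


# Hypothetical preprocessed documents.
term_lists = {
    "bug-1": ["token", "filter", "reset"],
    "mail-1": ["token", "filter", "reset", "bug"],
    "mail-2": ["query", "parser"],
}
closest = top_latent("bug-1", term_lists, k=1)
```

With k=3, as in the embodiment, the returned identifiers would be the documents linked to the target by a potential semantic association.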
The document retrieval unit is used to verify the effect of the associations established by the method. The verification method is as follows: select relevant questions from Stack Overflow, retrieve documents using Neo4j, and compare the results against the original answers to the questions to determine whether the questions can be solved. The main steps are:
Step 1: Screen the question set. The screening criteria are: the question carries the Lucene tag, the question has a positive vote count, and the answers to the question contain links to defect reports or mailing-list messages.
Step 2: Using the Neo4j database, construct a query from each question and obtain the results.
Step 3: From the retrieval results, calculate the recall rate (Recall) and the mean reciprocal rank (MRR).
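The two metrics of Step 3 can be sketched as follows. The text does not define them precisely, so this sketch assumes that Recall is the fraction of questions for which at least one relevant document appears in the results, and that MRR averages the reciprocal rank of the first relevant document (counting 0 when none is found). The function name and the sample data are hypothetical.

```python
def recall_and_mrr(results, relevant):
    """results: question -> ranked doc ids; relevant: question -> set of ids.

    Returns (recall, mrr) under the assumptions stated in the lead-in.
    """
    if not results:
        return 0.0, 0.0
    hits, rr_sum = 0, 0.0
    for question, ranked in results.items():
        # Rank (1-based) of the first relevant document, if any.
        rank = next((i for i, d in enumerate(ranked, start=1)
                     if d in relevant[question]), None)
        if rank is not None:
            hits += 1
            rr_sum += 1.0 / rank
    n = len(results)
    return hits / n, rr_sum / n


# Hypothetical retrieval results for three questions.
results = {"q1": ["a", "b"], "q2": ["c", "d"], "q3": ["e"]}
relevant = {"q1": {"b"}, "q2": {"c"}, "q3": {"x"}}
recall, mrr = recall_and_mrr(results, relevant)
```

In this example two of three questions find a relevant document, at ranks 2 and 1, which gives a recall of 2/3 and an MRR of (1/2 + 1 + 0) / 3.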

Claims (9)

1. A semantic association mining method for defect reports and mailing lists, comprising the following steps:
1) parsing the acquired defect reports and mailing list of the target project to obtain the stack information, code segments, and body text of the defect reports and the stack information, code segments, and body text of the mailing list;
2) a document explicit semantic association mining unit identifies the explicit semantic associations between the defect reports and the mailing list according to the parsing result, wherein the explicit semantic associations comprise reference associations and common code element associations;
3) a document implicit semantic association mining unit identifies the implicit semantic associations between the defect reports and the mailing list according to the parsing result, wherein the implicit semantic associations comprise similar associations and potential semantic associations;
the method for mining the implicit semantic association comprises the following steps:
31) calculating the similarity SIM1 of the mail and the defect report according to the stack information in the mail and the defect report;
32) calculating the similarity SIM2 of the mail and the defect report according to the text of the mail and the defect report;
33) obtaining a comprehensive similarity SIM of the mail and the defect report based on the similarities SIM1 and SIM2; then determining the mails and defect reports having similar associations according to the comprehensive similarity SIM;
34) acquiring a query vector of each document, wherein the query vector of the i-th document is V_i = <W_i,1, W_i,2, ..., W_i,k, ..., W_i,n>,
Figure FDA0002367374380000011
n is the total number of words appearing in all documents, and W_i,k is the number of occurrences of the k-th word in document i; the documents include mails and defect reports;
35) calculating the cosine similarity between document i and all other documents according to the query vector of document i, and sorting in descending order; then taking the top-ranked documents of the sorted result as the documents having a potential semantic association with document i.
2. The method of claim 1, wherein the reference associations comprise an association of a defect report reference and an association of a mail reference.
3. The method of claim 2, wherein the association of the defect report reference is obtained by: performing pattern matching on a text of the mail list, and judging whether reference links for the defect report or key names of the defect report are contained; if yes, identifying key names or extracting key name information in the reference links; and then positioning a corresponding defect report according to the key name and establishing a reference association.
4. The method of claim 2, wherein the association of the mail reference is obtained by: performing pattern matching on the text of the defect report, and judging whether reference information of the mail is contained; if yes, extracting Message-ID information in the reference link; and then positioning the corresponding mail according to the Message-ID and establishing reference association.
5. The method of claim 1, wherein if the same code element is present in the body text of a mail and in the body text of a defect report, the two are considered to have the common code element association.
6. The method of claim 5, wherein the common code element associations are mined according to the source of the code elements; the code elements are first analyzed, wherein if a code element comes from the target project and is a long code element, then: 1) the long code element is parsed into an AST; 2) the AST nodes are traversed and the elements on the nodes are read; 3) for each node, the name of the package in which the node is located is extracted and concatenated to obtain a long code element set; if a code element comes from the target project and is a short code element, then: 1) the short code element is parsed into an AST; 2) the AST nodes are traversed and the elements on the nodes are read to obtain an initial code element set; 3) the elements of the initial code element set are deduplicated and stop words are filtered out; if a code element comes from another project, it is analyzed with a naming rule method; whether a common code element association exists between the body text of the mail and the body text of the defect report is then judged according to the analysis result.
7. The method of claim 1, wherein a formula is utilized
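$$\mathrm{Similarity}(V_i, V_j) = \frac{\sum_{k=1}^{n} W_{i,k}\, W_{j,k}}{\sqrt{\sum_{k=1}^{n} W_{i,k}^{2}}\;\sqrt{\sum_{k=1}^{n} W_{j,k}^{2}}}$$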
to compute the cosine similarity Similarity(V_i, V_j) of the query vector V_i = <W_i,1, W_i,2, ..., W_i,n> of document i and the query vector V_j = <W_j,1, W_j,2, ..., W_j,n> of document j.
8. The method of claim 1, wherein the graph database Neo4j is used to represent the semantic associations mined between the defect reports and the mailing list.
9. The method of claim 1, wherein parsing the defect reports and the mailing list comprises:
21) first filtering out redundant text content from the defect reports and the mailing list;
22) extracting stack information from the defect reports and the mailing list processed in step 21) according to the characteristics of stack information;
23) extracting code segments from the defect reports and the mailing list processed in step 22); the remaining text is the body text.
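The stack information extraction of step 22) can be sketched as follows. The text only says that extraction relies on "the characteristics of the stack information", so this sketch assumes the usual shape of a Java stack trace: an exception line with a qualified type name, followed by frame lines beginning with "at". The regular expressions, the function name, and the sample message are hypothetical; code segment extraction in step 23) is not covered.

```python
import re

# Assumed shapes of a Java stack trace (not specified in the source):
#   java.lang.NullPointerException: message
#       at org.example.Writer.close(Writer.java:10)
EXCEPTION = re.compile(
    r"^\s*(?:Caused by:\s*)?((?:\w+\.)+\w*(?:Exception|Error))\b", re.MULTILINE)
FRAME = re.compile(r"^\s*at\s+([\w.$<>]+)\(", re.MULTILINE)


def split_stack_info(text):
    """Return (stack_set, remaining_text) for one mail or defect report.

    stack_set holds exception type names and frame signatures; the
    remaining text is everything that is not a stack-trace line.
    """
    stack_set = set(EXCEPTION.findall(text)) | set(FRAME.findall(text))
    rest = [line for line in text.splitlines()
            if not EXCEPTION.match(line) and not FRAME.match(line)]
    return stack_set, "\n".join(rest).strip()


# Hypothetical message for illustration.
message = """Closing the writer fails.
java.lang.NullPointerException: boom
    at org.example.Writer.close(Writer.java:10)
    at org.example.Main.main(Main.java:5)
Any ideas?"""
stack_set, body = split_stack_info(message)
```

The resulting `stack_set` plays the role of the set ST_i used for the stack information similarity, and `body` is the text left over for the later steps.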
CN201610984538.XA 2016-11-09 2016-11-09 Semantic association mining method for defect report and mail list Active CN106649557B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610984538.XA CN106649557B (en) 2016-11-09 2016-11-09 Semantic association mining method for defect report and mail list


Publications (2)

Publication Number Publication Date
CN106649557A CN106649557A (en) 2017-05-10
CN106649557B CN106649557B (en) 2020-10-20

Family

ID=58806806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610984538.XA Active CN106649557B (en) 2016-11-09 2016-11-09 Semantic association mining method for defect report and mail list

Country Status (1)

Country Link
CN (1) CN106649557B (en)



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant