CN106649557B

CN106649557B - Semantic association mining method for defect report and mail list

Info

Publication number: CN106649557B
Application number: CN201610984538.XA
Authority: CN
Inventors: 赵俊峰; 陈秀招; 曹英魁
Original assignee: Peking University Information Technology Institute (Tianjin Binhai)
Current assignee: Peking University Information Technology Institute (Tianjin Binhai)
Priority date: 2016-11-09
Filing date: 2016-11-09
Publication date: 2020-10-20
Anticipated expiration: 2036-11-09
Also published as: CN106649557A

Abstract

The invention discloses a semantic association mining method for defect reports and mailing lists. The method comprises the following steps: 1) parsing the acquired defect reports and mailing list of a target project to obtain the stack information, code segments and body text of the defect reports and the stack information, code segments and body text of the mailing list; 2) a document explicit semantic association mining unit identifies, from the parsing result, explicit semantic associations between the defect reports and the mailing list, the explicit semantic associations comprising reference associations and common code element associations; 3) a document implicit semantic association mining unit identifies, from the parsing result, implicit semantic associations between the defect reports and the mailing list, including similar associations and potential semantic associations. The invention helps locate related defect reports and mailing list messages efficiently and helps developers better reuse software resources.

Description

Semantic association mining method for defect report and mail list

Technical Field

The method is used to mine semantic associations between defect reports and mailing lists during software reuse, reducing the burden on developers of searching, reading and learning.

Background

In software development, software reuse aims to make full use of the knowledge and experience accumulated in developing earlier application systems, to avoid duplicated work, to focus development effort on the parts specific to the application, and to improve the efficiency and quality of software development.

In recent years, a large number of open source project hosting sites have appeared on the Internet, and more and more high-quality software has become available to developers. These projects provide rich code resources, and the more mature open source projects also produce a large number of documents in a variety of forms. Defect reports and mailing lists are two of them. These documents are not only the development records and communication channels of the project developers; they also give other developers a way to learn and reuse the projects.

As two common kinds of project document, defect reports and mailing lists provide valuable reuse resources for software developers and record the discussion of the problems developers encounter. However, as open source projects grow, the problems developers encounter become increasingly complex, and the answer to a problem often cannot be found in a single type of document. For example, the problem scenario and the suggestions of community members are learned from the discussion in the mailing list, while the fix and the affected code modules are found in the patches attached to the defect report.

The defect report and the mail list contain rich software project information, and great convenience is brought to developers. However, some characteristics of these documents also bring obstacles to the developers for learning and reuse, mainly including:

First, the number of defect reports and mailing list messages is enormous.

Second, the content of defect reports and mailing lists is complicated.

Third, the associations between defect reports and mailing lists are complex.

Therefore, in order to help developers to better understand and reuse software resources, the association relationship between the defect report and the mailing list needs to be mined, and a document retrieval interface based on the association relationship between the defect report and the mailing list needs to be provided.

In summary, defect reports and mailing lists contain rich software project information, but the information is huge in quantity, complicated in content and complex in its associations. Organizing it manually consumes a great deal of time and energy, and it is very difficult to locate the information a developer cares about across documents. Therefore, to help developers better reuse open source project resources, it is desirable to provide a semantic association mining tool for defect reports and mailing lists.

Disclosure of Invention

Aiming at the technical problems in the prior art, the invention aims to provide a semantic association mining method for a defect report and a mail list.

The technical scheme of the invention is as follows:

A semantic association mining method for defect reports and mailing lists comprises the following steps:

1) parsing the acquired defect reports and mailing list of a target project to obtain the stack information, code segments and body text of the defect reports and the stack information, code segments and body text of the mailing list;

2) a document explicit semantic association mining unit identifies, from the parsing result, explicit semantic associations between the defect reports and the mailing list, the explicit semantic associations comprising reference associations and common code element associations;

3) a document implicit semantic association mining unit identifies, from the parsing result, implicit semantic associations between the defect reports and the mailing list, including similar associations and potential semantic associations.

Further, the reference association includes an association of a defect report reference and an association of a mail reference.

Further, the method for obtaining the association of the defect report reference includes: performing pattern matching on a text of the mail list, and judging whether reference links for the defect report or key names of the defect report are contained; if yes, identifying key names or extracting key name information in the reference links; and then positioning a corresponding defect report according to the key name and establishing a reference association.

Further, the method for obtaining the association of the mail reference includes: performing pattern matching on the text of the defect report, and judging whether reference information of the mail is contained; if yes, extracting Message-ID information in the reference link; and then positioning the corresponding mail according to the Message-ID and establishing reference association.

Further, if the same code element exists in the body text of a mail and the body text of a defect report, the mail and the defect report are considered to have the common code element association.

Further, the common code element association is mined according to the source of the code elements. The code elements are first parsed. If a code element comes from the target project and is a long code element: 1) the long code element is parsed into an AST; 2) the AST nodes are traversed and the elements on the nodes are read; 3) for each node, the name of the package containing the node is extracted and joined to the node name, giving a long code element set. If a code element comes from the target project and is a short code element: 1) the short code element is parsed into an AST; 2) the AST nodes are traversed and the elements on the nodes are read, giving an initial code element set; 3) duplicates are removed from the initial code element set and stop words are filtered out. If a code element comes from another project, it is parsed using the naming rule method. Whether a common code element association exists between the body text of the mail and the body text of the defect report is then determined from the parsing result.

Further, the method for mining the implicit semantic association comprises the following steps:

1) calculating the similarity SIM1 of the mail and the defect report according to the stack information in the mail and the defect report;

2) calculating the similarity SIM2 of the mail and the defect report according to the text of the mail and the defect report;

3) obtaining a comprehensive similarity SIM of the mail and the defect report based on the similarities SIM1 and SIM2; then determining the mails and defect reports with similar associations according to the comprehensive similarity SIM;

4) acquiring a query vector for each document, wherein the query vector of the i-th document is $V_i = \langle W_{i,1}, W_{i,2}, \ldots, W_{i,k}, \ldots, W_{i,n} \rangle$, n is the total number of distinct terms appearing in all documents, and $W_{i,k}$ is the number of times the k-th term occurs in document i; the documents include mails and defect reports;

5) calculating the cosine similarity between document i and every other document according to the query vector of document i, and sorting in descending order; then taking the top-ranked documents of the sorted result as the documents having potential semantic association with document i.

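Further, the cosine similarity between the query vector $V_i = \langle W_{i,1}, W_{i,2}, \ldots, W_{i,n} \rangle$ of document i and the query vector $V_j = \langle W_{j,1}, W_{j,2}, \ldots, W_{j,n} \rangle$ of document j is computed using the formula

$$\operatorname{Cosine}(V_i, V_j) = \frac{\sum_{k=1}^{n} W_{i,k}\, W_{j,k}}{\sqrt{\sum_{k=1}^{n} W_{i,k}^{2}}\; \sqrt{\sum_{k=1}^{n} W_{j,k}^{2}}}$$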

Further, the graph database Neo4j is used to represent the semantic associations mined between defect reports and mailing lists.

Further, the method for analyzing the defect report and the mail list comprises the following steps:

21) firstly, filtering redundant text contents in a defect report and a mail list;

22) extracting stack information from the defect report and the mailing list processed in the step 21) according to the characteristics of the stack information;

23) extracting code segments from the defect report and the mailing list processed in step 22); the remaining text is then the body text.

The invention relates to the following main contents:

First, acquiring and parsing the defect reports and the mailing list;

Second, mining the explicit semantic associations between the defect reports and the mailing list;

Third, mining the implicit semantic associations between the defect reports and the mailing list.

The semantic association mining method for the defect report and the mailing list comprises four subunits: a document acquisition and analysis unit, a document explicit semantic association mining unit, a document implicit semantic association mining unit and a document retrieval unit.

Preferably, in the semantic association mining method for the defect report and the mailing list, the document acquisition and analysis unit mainly follows this workflow:

Step 1: Determine the network addresses of the document resources, and acquire the defect report and mailing list documents by targeted crawling.

Step 2: Determine the data format of the acquired documents. The defect reports are in JSON format; the mailing list is in mbox format.

Step 3: Parse each document according to its format. Defect reports in JSON format are parsed using the tool org.json; individual mails are processed with the MIME stream parser in the Mime4J class library of Apache James (Java Apache Mail Enterprise Server). The parsed text is stored by type (mail or defect report).

Step 4: From the mail and defect report text parsed in step 3, identify the following:

a) Code fragments. The code fragments in defect reports and mailing lists are often not complete compilable units, which makes them difficult to identify. The method implements a rule-based code fragment recognition tool for Java syntax. The recognition process is as follows:

1. Keywords with program characteristics, such as those of class definitions, method definitions, conditional statements and assignment statements, are collected into a set K = {class, if, else, for, while, switch, "=", ……}, and a regular expression is used to describe the basic grammar format F of a method definition.

2. Each text segment is analyzed line by line. If a line contains a Java keyword or a symbol specific to the Java language (set K), or conforms to the method definition format F, the line is judged to be a code line.

3. If a line starts with "//", or starts with "/*" and ends with "*/", it is judged to be a comment line. The number of comment lines is counted.

4. If the number of code lines plus the number of comment lines exceeds half of the total number of lines in the text segment, and a brace appears in the segment, the segment is judged to be a code fragment.

b) Stack information. This is the stack call information reported at the point of failure when a program runs into an error. Stack information is identified mainly by the following characteristic text lines:

1. Exception declaration line: located at the beginning of the text segment, includes words "java.

2. "at" line: begins with "at", indicates the location of the breakpoint where the error occurred, and is located in the body of the text segment.

3. Ellipsis line: since some stack information is particularly long while only the top few methods actually cause the error, overly long stack information is usually abbreviated with the ellipsis "…" and ends with "more"; this line follows the "at" lines.

4. "Caused by" line: begins with "Caused by:" and is followed by a series of "at" lines.

If the number of characteristic lines in a text paragraph exceeds a certain threshold, the paragraph is judged to be stack information.

c) Redundant text. In mails this mainly consists of the greetings, thanks and mail signature at the beginning and end of the mail. If the mail is a reply, it also includes the quoted content of the original mail.

d) Body text. Everything other than the three types above is treated as body text.

Recognition proceeds in the order redundant text → stack information → code fragment → body text. That is, redundant text is filtered out first to reduce the effect of noise. Stack information is then extracted according to its characteristics, and after that the code fragments are extracted. What remains is the body text. Stack information, code fragments and body text are the three kinds of effective text analyzed in what follows.
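The recognition order above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented tool: the method-definition pattern standing in for F, the stack-trace patterns, and the 0.5 threshold are all hypothetical, and the redundant-text filter is omitted.

```python
import re

# Keyword set K: the listed keywords come from the text; "=" is checked separately.
KEYWORDS = {"class", "if", "else", "for", "while", "switch"}
# Hypothetical stand-in for the method-definition format F.
METHOD_DEF = re.compile(
    r"^\s*(?:(?:public|protected|private|static|final)\s+)*"
    r"[\w<>\[\]]+\s+\w+\s*\([^)]*\)\s*\{?\s*$"
)
WORD = re.compile(r"[A-Za-z_]\w*")

# Characteristic stack-trace lines; these patterns are assumptions.
STACK_FEATURES = [
    re.compile(r"\b(?:[\w$]+\.)+[\w$]*(?:Exception|Error)\b"),  # exception declaration
    re.compile(r"^\s*at\s+[\w.$<>]+\(.*\)\s*$"),                # "at" line
    re.compile(r"^\s*\.\.\.\s*\d+\s+more\s*$"),                 # "... N more" line
    re.compile(r"^\s*Caused by:"),                              # "Caused by:" line
]

def _lines(paragraph):
    """Non-empty lines of a paragraph."""
    return [ln for ln in paragraph.splitlines() if ln.strip()]

def is_comment_line(line):
    s = line.strip()
    return s.startswith("//") or (s.startswith("/*") and s.endswith("*/"))

def is_code_line(line):
    if is_comment_line(line):
        return False
    words = set(WORD.findall(line))
    return bool(words & KEYWORDS) or "=" in line or bool(METHOD_DEF.match(line))

def is_code_fragment(paragraph):
    """Code plus comment lines exceed half of all lines, and a brace is present."""
    lines = _lines(paragraph)
    hits = sum(is_code_line(ln) or is_comment_line(ln) for ln in lines)
    return hits * 2 > len(lines) and ("{" in paragraph or "}" in paragraph)

def is_stack_info(paragraph, threshold=0.5):
    """Share of characteristic lines exceeds the (assumed) threshold."""
    lines = _lines(paragraph)
    if not lines:
        return False
    hits = sum(any(p.search(ln) for p in STACK_FEATURES) for ln in lines)
    return hits / len(lines) > threshold

def classify(paragraph):
    """Apply the recognition order: stack information, then code, then body text."""
    if is_stack_info(paragraph):
        return "stack"
    if is_code_fragment(paragraph):
        return "code"
    return "text"
```

Because common English words such as "for" and "if" are also Java keywords, the half-of-lines rule and the brace requirement do most of the work of keeping ordinary prose out of the code class.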

Preferably, in the semantic association mining method for the defect report and the mailing list, the document explicit semantic association mining unit is mainly used to identify explicit semantic associations between defect reports and the mailing list, including reference associations and common code element associations. The workflow mainly comprises the following steps:

Step 1: Mine reference associations. Defect reports and mailing lists often reference each other's content to supplement the description of a problem. Mining reference associations means mining the reference relationships between defect reports and mails. These are divided into references to defect reports and references to mails, and the associations are established as follows:

a) References to defect reports. Body text can reference a defect report in two ways: by key name and by link. First, the key name of a defect report consists of the capitalized open source project name, a hyphen "-" and a number, concatenated, e.g., "LUCENE-352". Second, each defect report in the defect tracking system corresponds to a specific link, whose pattern is as follows:

[network protocol]://[report type].[community site]/[defect tracking system name]/[view mode]/[key name]

For example, references to a defect report by Lucene are linked as follows:

http://issues.apache.org/jira/browse/LUCENE-352

based on the above mode characteristics, the association mining of the defect report references is divided into the following steps: 1) judging whether a reference link for a defect report or a key name of the defect report is contained in a text of the mail list in a mode of pattern matching; 2) if yes, directly identifying the key name, or extracting key name information in the reference link; 3) and positioning a corresponding defect report according to the key name, and establishing reference association.
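The three steps for defect report references can be sketched as follows. The regular expressions are an assumption modeled on the Lucene example link and key name above; `referenced_issue_keys` and `link_references` are hypothetical helper names.

```python
import re

# Link pattern modeled on the example http://issues.apache.org/jira/browse/LUCENE-352.
ISSUE_LINK = re.compile(
    r"https?://issues\.apache\.org/jira/browse/([A-Z][A-Z0-9]*-\d+)"
)
# Bare key name: capitalized project name, hyphen, number.
ISSUE_KEY = re.compile(r"\b([A-Z][A-Z0-9]*-\d+)\b")

def referenced_issue_keys(mail_body):
    """Steps 1 and 2: find candidate key names, either bare or inside links."""
    keys = set(ISSUE_LINK.findall(mail_body))
    keys.update(ISSUE_KEY.findall(mail_body))
    return keys

def link_references(mail_id, mail_body, reports):
    """Step 3: keep only keys that locate a known report, as (mail, key) edges."""
    keys = sorted(referenced_issue_keys(mail_body))
    return [(mail_id, key) for key in keys if key in reports]

body = "See http://issues.apache.org/jira/browse/LUCENE-352, encoded as UTF-8."
edges = link_references("mail-1", body, {"LUCENE-352": "report text"})
```

Pattern matching alone over-generates (here "UTF-8" has the shape of a key name), which is why the final lookup against the set of known defect reports matters.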

b) References to mails. The unique identifier of a mail, its Message-ID, has the form <local-part@domain>, where local-part is the mail ID and domain is the domain name of the host that assigned the ID. The uniqueness of the ID is the responsibility of the assigning host, and there is no rule on how it is generated. It can be observed that the Message-ID of a mail is not as intuitive and concise as the key name of a defect report, such as "< c68e39170909240919g11492b52qc65667ac4c5431d5@ mail. Thus, references to mails typically use only links. Each mail corresponds to a specific link, whose pattern is as follows:

[network protocol]://[archive server name].[community site]/[archive format]/[mailing list name]/[archive file name]/[mail ID]

For example, a reference to Lucene's mail is linked as follows:

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200808.mbox/<48a3076a.2679420a.1c53.ffffa5c4%40mx.google.com>

based on the above mode characteristics, the association mining of the mail references is divided into the following steps: 1) judging whether reference information of the mail is contained in a text of the defect report in a mode of pattern matching, namely reference link; 2) if yes, extracting Message-ID information in the reference link; 3) and positioning the corresponding mail according to the Message-ID, and establishing reference association.
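A minimal sketch of the mail reference steps follows. The link pattern is an assumption modeled on the Lucene archive example above, and `referenced_message_ids` is a hypothetical helper name; angle brackets are accepted either literally or percent-encoded.

```python
import re
from urllib.parse import unquote

# Modeled on .../mod_mbox/<list name>/<YYYYMM>.mbox/<Message-ID>
MAIL_LINK = re.compile(
    r"https?://mail-archives\.apache\.org/mod_mbox/"
    r"(?P<list>[^/\s]+)/(?P<archive>\d{6}\.mbox)/"
    r"(?P<mid>(?:<|%3[Cc])[^\s>]+?(?:>|%3[Ee]))"
)

def referenced_message_ids(report_text):
    """Steps 1 and 2: find mail links and decode the Message-ID they carry."""
    return {unquote(m.group("mid")) for m in MAIL_LINK.finditer(report_text)}

def link_references(report_key, report_text, mails):
    """Step 3: keep only Message-IDs that locate a known mail."""
    ids = sorted(referenced_message_ids(report_text))
    return [(report_key, mid) for mid in ids if mid in mails]

text = ("Discussed in http://mail-archives.apache.org/mod_mbox/"
        "lucene-java-user/200808.mbox/"
        "<48a3076a.2679420a.1c53.ffffa5c4%40mx.google.com>.")
ids = referenced_message_ids(text)
```

Decoding with `unquote` turns `%40` back into `@`, so the extracted value can be compared directly with the Message-ID header stored for each mail.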

Step 2: Mine common code element associations. If the same code element appears in the body text of a mail and of a defect report, the two are considered to have a common code element association. Depending on the source of the associated code element, two categories are distinguished: code elements of this project (Lucene) and code elements of other projects (non-Lucene). Source code parsing is used for the first category and naming rules for the second. The process is as follows:

a) Source code parsing, used to parse the code elements of this project. Taking Lucene 4.0 as an example, many class names and method names in the source code are common natural language words, which introduces noise into association mining. To avoid this, the method divides code elements into long code elements and short code elements for analysis, as follows:

A long code element includes not only the class name or method name but also the full name of the package containing the class or method, such as org.apache.lucene.analysis.CachingTokenFilter. The steps are: 1) parse the Java source code into an AST using the ASTParser provided by Eclipse JDT; 2) traverse the AST nodes and read elements such as classes, methods and interfaces on the nodes; 3) extract the package name of each node and join it to the node name with "." to obtain a long code element set.

A short code element contains only the name of the current AST node, such as CachingTokenFilter. Further filtering is required because many method names are common natural language words. The specific steps are: 1) parse the Java source code into an AST using the ASTParser provided by Eclipse JDT; 2) traverse the AST nodes and read elements such as classes and methods on the nodes to obtain an initial code element set; 3) remove duplicates from the set and filter out stop words; 4) manually screen out and delete words that carry little meaning, such as get, size and value.
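The post-processing of the parsed elements can be sketched as below. The AST traversal itself is done with Eclipse JDT in Java and is not reproduced here; the sketch assumes it has already produced (package name, node name) pairs, and the package names, the stop word list and both function names are hypothetical.

```python
# Illustrative stop words; the text does not list the ones actually used.
STOP_WORDS = {"the", "and", "of", "to", "in", "is"}
# Low-value names taken from the manual screening examples in the text.
LOW_VALUE = {"get", "size", "value"}

def long_elements(nodes):
    """Join each node name to its package name with '.'."""
    return {f"{package}.{name}" for package, name in nodes}

def short_elements(nodes):
    """Deduplicate node names, then drop stop words and low-value names."""
    names = {name for _, name in nodes}
    return {
        n for n in names
        if n.lower() not in STOP_WORDS and n.lower() not in LOW_VALUE
    }

# Hypothetical traversal output.
nodes = [
    ("org.example.analysis", "CachingTokenFilter"),
    ("org.example.util", "size"),
    ("org.example.util", "size"),
    ("org.example.util", "get"),
]
```

Long elements are specific enough to be matched as they are, whereas short elements need the extra filtering because a bare name like `size` would match a great deal of ordinary prose.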

b) Naming rule identification, used to parse the code elements of other projects. In the Java language, variable names follow camel case naming rules to improve readability. There are three forms:

1. Lower camel case. The first word starts with a lowercase letter and each following word starts with an uppercase letter, for example: firstName.

2. Upper camel case. The first word and each following word start with an uppercase letter, for example: FirstName.

3. Non-standard camel case strings, i.e. strings with consecutive capital letters and digits, such as UPPER2000UPPER, hasabbreviationembedded, Client2Server2012.

For the camel case naming convention of Java variables, the regular expression "(?<!(^|[A-Z0-9]))(?=[A-Z0-9])|(?<!(^|[^A-Z]))(?=[0-9])|(?<!(^|[^0-9]))(?=[A-Za-z])|(?<!^)(?=[A-Z][a-z])" is used, with which the names of all variables defined in this way can be identified.

When the same code element is extracted from the two documents, the two documents are judged to have a common code element association.
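A simplified sketch of naming-rule identification and the resulting association test follows. The pattern `CAMEL` is an assumption written for Python's `re` module and covers only lower and upper camel case; it is not the regular expression given in the text.

```python
import re

# Lower camel case (firstName) or upper camel case with two or more words (FirstName).
CAMEL = re.compile(
    r"\b(?:[a-z]+(?:[A-Z][a-z0-9]*)+|(?:[A-Z][a-z0-9]+){2,})\b"
)

def code_elements(text):
    """Collect tokens that follow the camel case naming rule."""
    return set(CAMEL.findall(text))

def common_code_elements(mail_text, report_text):
    """Two documents are associated if this intersection is non-empty."""
    return code_elements(mail_text) & code_elements(report_text)

mail = "I call indexWriter.addDocument and then the IndexSearcher returns nothing"
report = "IndexSearcher misses documents added via addDocument"
shared = common_code_elements(mail, report)
```

Requiring at least two words for upper camel case keeps ordinary capitalized words such as "Lucene" or a sentence-initial "The" out of the element set.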

Preferably, in the semantic association mining method for the defect report and the mailing list, the document implicit semantic association mining unit is mainly used to identify implicit semantic associations between defect reports and the mailing list. It mainly comprises the following steps:

Step 1: Mine similar associations between mails and defect reports.

Step 1.1: Extract the stack information in the mail and in the defect report, including the exception type names and the function signatures at the error locations, and calculate the similarity SIM1.

Step 1.2: Taking the diversity of natural language into account, calculate the body text similarity SIM2.

Step 1.3: Based on the two similarities SIM1 and SIM2, calculate the comprehensive similarity SIM of the documents using a heuristic rule; then determine the mails and defect reports that have similar associations according to the comprehensive similarity SIM.

Step 2: Mine potential semantic associations. This step exploits the contribution that the patch information of resolved defect reports makes to semantic association.

Step 2.1: and establishing an index. An inverted index is built for all defect reports and mailing lists.

Step 2.2: Obtain the query vectors. Suppose that n distinct terms appear across all documents, and number them. The processed query vector of the i-th document (mail or defect report) is $V_i = \langle W_{i,1}, W_{i,2}, \ldots, W_{i,n} \rangle$, where $W_{i,k}$ is the number of times the k-th term appears in document i. The summary of a defect report highlights the central idea of the document, and the patch file contains the code that had to be modified to resolve the defect. Therefore, the method combines the summary of the defect report with the code elements in the patch file and applies word segmentation, stop word filtering and stemming to obtain the query vector of the defect report.

Step 2.3: Establish potential associations. For a document D, calculate the cosine similarity between D and every other document from the corresponding query vectors, and sort in descending order. The top 3 documents of the sorted result are judged to have a potential semantic association with D. This is repeated to establish the potential semantic associations of all documents. The cosine similarity of the query vectors $V_i = \langle W_{i,1}, W_{i,2}, \ldots, W_{i,n} \rangle$ and $V_j = \langle W_{j,1}, W_{j,2}, \ldots, W_{j,n} \rangle$ is calculated as follows:

$$\operatorname{Cosine}(V_i, V_j) = \frac{\sum_{k=1}^{n} W_{i,k}\, W_{j,k}}{\sqrt{\sum_{k=1}^{n} W_{i,k}^{2}}\; \sqrt{\sum_{k=1}^{n} W_{j,k}^{2}}}$$
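Steps 2.2 and 2.3 can be sketched as follows. This is a minimal illustration that stores each query vector sparsely as a term-count mapping; tokenization, stop word filtering and stemming are assumed to have been done already, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two term-count vectors stored as Counters."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_related(doc_id, vectors, k=3):
    """Rank all other documents by cosine similarity and keep the first k."""
    scores = [
        (other, cosine(vectors[doc_id], vec))
        for other, vec in vectors.items()
        if other != doc_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

vectors = {
    "bug1": Counter("index writer close lock".split()),
    "mail1": Counter("index writer lock timeout".split()),
    "mail2": Counter("query parser syntax".split()),
    "mail3": Counter("index search".split()),
}
related = top_related("bug1", vectors, k=2)
```

Storing only the non-zero counts gives the same result as the dense n-dimensional vector in the formula, since terms absent from either document contribute nothing to the dot product.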

Preferably, in the semantic association mining method for the defect report and the mailing list, the document retrieval unit uses the graph database Neo4j to represent visually the semantic associations mined between defect reports and the mailing list. It also exposes a retrieval interface so that defect reports and mailing list messages can be obtained conveniently. The method borrows the PageRank algorithm of search engines: document nodes in the graph data structure are treated as analogous to web pages and semantic association edges as analogous to hyperlinks, and the retrieval results are reordered according to the semantic associations among documents. Because the number of defect reports and mail documents is huge, only the first 100 results of information retrieval are reordered, to keep reordering efficient.
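The reordering step might look like the sketch below. It is an illustration under assumptions, not the patented procedure: the text does not say how PageRank scores are combined with the original ranking, so here the first `limit` results are simply sorted by their PageRank on the association subgraph, with the original position as tie-breaker. The damping factor 0.85 and the iteration count are conventional defaults, not values from the text.

```python
def pagerank(neighbors, damping=0.85, iterations=50):
    """Power iteration on a graph given as node -> set of linked nodes."""
    nodes = list(neighbors)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = neighbors[v]
            if out:
                share = damping * rank[v] / len(out)
                for w in out:
                    new[w] += share
            else:
                # A node without edges spreads its rank evenly.
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank

def rerank(results, edges, limit=100):
    """Reorder only the first `limit` results by PageRank on their subgraph."""
    head, tail = results[:limit], results[limit:]
    neighbors = {doc: set() for doc in head}
    for a, b in edges:
        if a in neighbors and b in neighbors and a != b:
            neighbors[a].add(b)  # association edges are treated as undirected
            neighbors[b].add(a)
    rank = pagerank(neighbors)
    position = {doc: i for i, doc in enumerate(head)}
    head = sorted(head, key=lambda doc: (-rank[doc], position[doc]))
    return head + tail

results = ["d1", "d2", "d3", "d4"]
edges = [("d3", "d1"), ("d3", "d2"), ("d3", "d4")]
reordered = rerank(results, edges)
```

A document that is associated with many other retrieved documents, like `d3` here, moves toward the top even if plain text retrieval ranked it low.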

Compared with the prior art, the invention has the following positive effects:

Taking the Lucene project as an example, the following experiment shows that the method improves search efficiency compared with the prior art.

First, a question set Sq related to Lucene is selected from Stack Overflow. The selection criteria are: the question carries the Lucene tag, the question has a positive number of votes, and the answer to the question contains a link to a defect report or a mailing list message.

Then, using the semantic association graph established by the method, a set Ba of related defect reports and mailing list messages is returned as the search result.

Finally, two values are calculated to measure the quality of the method: recall (Recall) and mean reciprocal rank (MRR). They are calculated as follows:

1. Calculation of recall

Recall evaluates whether the search results based on semantic association contain the document related to the best answer, denoted MC.

Recall is defined as follows:

where $MC_i$ refers to the i-th document.

2. Calculation of mean reciprocal rank

The mean reciprocal rank (MRR) is used to measure retrieval quality. For each question, the accuracy is taken as the reciprocal of the rank at which the defect report or mailing list message related to the best answer appears in the retrieval system, and the results of all questions are averaged. The MRR value is calculated as follows:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$$

where $|Q|$ is the number of search result documents and $rank_i$ is the rank of document i.
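The MRR computation can be sketched as follows, together with one possible reading of recall. The recall function is an assumption (the share of questions whose related document is retrieved at all), since the recall formula itself is not reproduced in the text; a missing document is represented by `None` and contributes zero to MRR.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the related document per question, None if absent."""
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks if r) / len(ranks)

def hit_recall(ranks):
    """Assumed reading of recall: share of questions with the document retrieved."""
    if not ranks:
        return 0.0
    return sum(1 for r in ranks if r) / len(ranks)

ranks = [1, 2, None, 4]
mrr = mean_reciprocal_rank(ranks)
recall = hit_recall(ranks)
```

For the four sample questions the reciprocal ranks are 1, 0.5, 0 and 0.25, so a related document ranked lower pulls MRR down even when recall is unchanged.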

The experiment had three groups: one control group and two experimental groups. The control group (Base) uses the SVM text retrieval method. The first experimental group (Base+Classification) selects candidate answers using the SVM retrieval method based on classified documents. The second experimental group (Base+Classification+Relationship) builds on the first and additionally reorders the results by taking into account the influence of semantic associations on the candidate answers. All groups perform the same task: for each question in the question set Sq, retrieve the related documents from all defect reports and mail documents. The results are shown in Table 1:

Table 1. Experimental results

Method	Recall	MRR
Base	22.22%	8.05%
Base+Classification	26.98%	13.27%
Base+Classification+Relationship	39.68%	19.95%

The experimental results show that, after semantic associations are introduced, recall rises from 22.22% for the Base method to 39.68%. In the ranking of retrieval results, the MRR of the Base method is low (8.05%), meaning that the relevant documents are ranked too far down to be of much help to developers. After text classification and semantic association mining, the MRR rises to 19.95%, which helps locate related defect reports and mailing list messages efficiently and helps developers better reuse software resources.

Drawings

FIG. 1 is a process flow for mining similar associations.

FIG. 2 is a workflow for mining potential semantic associations.

Detailed Description

In this embodiment, the semantic associations between the defect reports and the mailing list of the Lucene project are mined, and the associations are stored and queried using the graph database Neo4j in order to verify the effect of the method.

As previously described, the method will be performed in the order of acquiring resources → mining explicit semantics → mining implicit semantics → querying the document.

The document acquisition and analysis unit executes the following steps:

Step 1: Construct the file addresses, such as http://mail-archives.apache.org/mod_mbox/lucene-general/201509.mbox, and crawl the two types of documents from their respective addresses.

Step 2: Parse the project defect reports according to the JSON format, and store the parsed text in the Neo4j database by category.

Step 3: Parse the project mailing list using Mime4J, and store the parsed text in the Neo4j database by category.
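The parsing in steps 2 and 3 can be illustrated with Python's standard library in place of org.json and Mime4J; this swap is made only for the sketch. The JSON field names (`key`, `fields`, `summary`, `description`) are an assumption about the layout of the crawled defect reports, and storage in Neo4j is left out.

```python
import json
from email import message_from_string

def parse_defect_report(raw_json):
    """Turn one crawled JSON defect report into a flat record."""
    issue = json.loads(raw_json)
    fields = issue.get("fields", {})
    return {
        "type": "defect_report",
        "key": issue.get("key"),
        "summary": fields.get("summary", ""),
        "body": fields.get("description") or "",
    }

def parse_mail(raw_mail):
    """Turn one raw mail (a single message from an mbox file) into a flat record."""
    msg = message_from_string(raw_mail)
    if msg.is_multipart():
        parts = [p for p in msg.walk() if p.get_content_type() == "text/plain"]
        body = "\n".join(p.get_payload() for p in parts)
    else:
        body = msg.get_payload()
    return {
        "type": "mail",
        "message_id": msg["Message-ID"],
        "subject": msg["Subject"],
        "body": body,
    }

report = parse_defect_report(
    '{"key": "LUCENE-352", "fields": {"summary": "s", "description": "d"}}'
)
mail = parse_mail("Message-ID: <abc@example.org>\nSubject: Test\n\nHello body\n")
```

Keeping a `type` field on every record mirrors the requirement that parsed text be stored by category, so that later steps can tell mails and defect reports apart.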

The document explicit semantic association mining unit executes the following steps:

Step 1: Identify the reference links in the documents and mine reference associations. These include defect reports that reference mailing list messages and mailing list messages that reference defect reports.

Step 2: Identify code element associations in the documents. Code elements of the Lucene project are identified by source code parsing; non-Lucene code elements are identified by the naming rule method. The process is as follows:

Step 2.1: Source code parsing. This is mainly used to identify the code elements of the Lucene project, and covers the identification of long code elements and short code elements.

Long code elements. The project source code is parsed with the Eclipse JDT AST to identify long code elements in the source code (containing the full name of the package in which the class or method is located), such as org.

Short code elements. The project source code is parsed with the Eclipse JDT AST to identify short code elements in the source code (containing only class or method names), such as CachingTokenFilter.

Step 2.2: Identify the code elements of non-Lucene projects using the naming rule method. Variable names that conform to the camel case naming rule are identified.

Step 2.3: If the body texts of two documents contain the same code element, a code association exists between the two documents, and an association between the corresponding document nodes is established in the Neo4j database.

The document implicit semantic association mining unit mainly performs the following:

First, mining similar associations, which mainly comprises the following steps (the workflow is shown in FIG. 1):

Step 1: Calculate the similarity of the stack information. The exception type names and the function signatures at the error locations are extracted from the stack information of document i to form a set $ST_i$. The similarity of the stack information of document i and document j (one being a mail and the other a defect report) is then:

Step 2: Calculate the body text similarity. The body text is first processed by word segmentation, stop word filtering and stemming. The similarity of documents D and Q is then calculated according to the formula:

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$

where $q_1$ to $q_n$ are the keywords obtained from document Q after processing; $f(q_i, D)$ is the frequency in document D of the i-th term $q_i$ of document Q; $|D|$ is the length of document D; avgdl is the average document length in the document set; and

$$\mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$

where N is the number of all documents, $n(q_i)$ is the number of documents containing the keyword $q_i$, and $k_1$ and b are adjustment parameters.

Step 3: Calculate the comprehensive similarity. For a document D, the stack information similarity and the text similarity between D and every other document are calculated according to the methods of Step 1 and Step 2, the average of the two is taken as the comprehensive similarity between D and each other document, and the documents are sorted in descending order. The top 10 documents are judged to have a similar association with document D, and associations between the corresponding document nodes are established in the Neo4j database.
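The three steps can be combined as in the sketch below. The stack similarity formula is not reproduced in this text, so the set overlap ratio (Jaccard coefficient) used here is an assumption. The function names and document identifiers are likewise hypothetical.

```python
def stack_similarity(st_i, st_j):
    """Assumed measure: Jaccard overlap of the sets ST_i and ST_j."""
    if not st_i or not st_j:
        return 0.0
    return len(st_i & st_j) / len(st_i | st_j)


def rank_similar(stack_sim, text_sim, top_k=10):
    """Average the two similarities per candidate and return the top_k ids.

    stack_sim and text_sim map each other document's id to its similarity
    with document D (the results of Step 1 and Step 2).
    """
    combined = {
        doc_id: (stack_sim[doc_id] + text_sim[doc_id]) / 2
        for doc_id in stack_sim
    }
    return sorted(combined, key=combined.get, reverse=True)[:top_k]


st_mail = {"NullPointerException", "a.b()"}
st_report = {"NullPointerException", "c.d()"}
print(stack_similarity(st_mail, st_report))

stack_scores = {"m1": 0.5, "m2": 0.0, "m3": 1.0}
text_scores = {"m1": 0.5, "m2": 0.4, "m3": 0.2}
print(rank_similar(stack_scores, text_scores, top_k=2))
```

Because a BM25-style text score is not bounded by 1 while a set overlap ratio is, an implementation may want to normalize the text scores before averaging; the source does not say whether this is done.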

Second, mining potential semantic associations, which mainly comprises the following steps (the workflow is shown in Fig. 2):

Step 1: Build an inverted index over the defect reports and the mail list of the Lucene project.

Step 2: Retrieve documents using queries built from the summary of the defect report and the code of the patch file.

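Step 3: Calculate the similarity between document D and every other document. The calculation formula is as follows:

$$\mathrm{Similarity}(V_i, V_j) = \frac{V_i \cdot V_j}{\lVert V_i \rVert \, \lVert V_j \rVert}$$

where V_i and V_j are the query vectors of documents i and j, respectively.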

The documents are then sorted in descending order of similarity, and the top 3 documents are judged to have a potential semantic association with document D; associations between the corresponding document nodes are established in the Neo4j database.
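The ranking in Step 3 can be sketched as follows, assuming the query vectors are word-count vectors as defined in claim 1 and that the similarity is the cosine similarity named there. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(terms_i, terms_j):
    """Cosine similarity of two word-count vectors built from token lists."""
    v_i, v_j = Counter(terms_i), Counter(terms_j)
    # Only words present in both documents contribute to the dot product.
    dot = sum(v_i[t] * v_j[t] for t in v_i.keys() & v_j.keys())
    norm_i = math.sqrt(sum(c * c for c in v_i.values()))
    norm_j = math.sqrt(sum(c * c for c in v_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)


def top_related(doc_id, docs, k=3):
    """Return the ids of the k documents most similar to docs[doc_id]."""
    scores = {
        other: cosine_similarity(docs[doc_id], terms)
        for other, terms in docs.items()
        if other != doc_id
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


docs = {
    "bug": ["token", "filter", "reset"],
    "m1": ["token", "filter"],
    "m2": ["index", "writer"],
    "m3": ["reset"],
}
print(top_related("bug", docs, k=2))
```

Storing each vector as a sparse counter gives the same result as the dense n-dimensional vector, because words absent from a document contribute zero to both the dot product and the norms.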

The document retrieval unit is used to verify the effect of the associations established by the method. The verification method is as follows: related questions are selected from Stack Overflow, documents are retrieved using Neo4j, and the results are compared with the original answers to the questions to determine whether the questions can be solved. The main steps are:

Step 1: Screen the question set. The screening criteria are: the question carries the Lucene tag, the question has a positive vote count, and the answers to the question contain links to defect reports or to the mail list.

Step 2: Using the Neo4j database, construct a query from each question and obtain the results.

Step 3: Calculate the recall and the mean reciprocal rank (MRR) from the retrieval results.
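The two metrics in Step 3 can be computed as in the sketch below. The source does not define them, so the sketch assumes the usual definitions: recall is the share of relevant documents that were retrieved, and the reciprocal rank is 1 divided by the rank of the first relevant result (0 if none is found). The document identifiers are hypothetical.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that appear in the retrieved list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def mean_reciprocal_rank(results):
    """results is a list of (ranked_retrieved_ids, relevant_ids) per question."""
    if not results:
        return 0.0
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank   # only the first relevant hit counts
                break
    return total / len(results)


results = [
    (["LUCENE-1", "LUCENE-2"], {"LUCENE-2"}),   # first hit at rank 2
    (["LUCENE-3"], {"LUCENE-3"}),               # first hit at rank 1
    (["LUCENE-4"], {"LUCENE-9"}),               # no hit
]
print(mean_reciprocal_rank(results))
print(recall(["LUCENE-1", "LUCENE-2"], {"LUCENE-2", "LUCENE-5"}))
```

Here the relevant set for a question would be the defect reports or mails linked from its original answers, per the screening criteria in Step 1.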

Claims

1. A semantic association mining method for a defect report and a mail list, comprising the following steps:

3) the document implicit semantic association mining unit identifies the implicit semantic associations between the defect report and the mail list according to the analysis result, wherein the implicit semantic associations comprise similar associations and potential semantic associations;

the method for mining the implicit semantic association comprises the following steps:

31) calculating the similarity SIM1 of the mail and the defect report according to the stack information in the mail and the defect report;

32) calculating the similarity SIM2 of the mail and the defect report according to the text of the mail and the defect report;

33) obtaining a comprehensive similarity SIM of the mail and the defect report based on the similarities SIM1 and SIM2; then determining the mails and defect reports having similar associations according to the comprehensive similarity SIM;

34) acquiring a query vector of each document, wherein the query vector of the i-th document is V_i = <W_i,1, W_i,2, ..., W_i,k, ..., W_i,n>, n is the total number of words appearing in all documents, and W_i,k is the number of times the k-th word occurs in document i; the documents include mails and defect reports;

35) calculating the cosine similarity between document i and all other documents according to the query vector corresponding to document i, and sorting in descending order; and then taking the top-ranked documents of the sorted result as documents having a potential semantic association with document i.

2. The method of claim 1, wherein the reference associations comprise an association of a defect report reference and an association of a mail reference.

3. The method of claim 2, wherein the association of the defect report reference is obtained by: performing pattern matching on a text of the mail list, and judging whether reference links for the defect report or key names of the defect report are contained; if yes, identifying key names or extracting key name information in the reference links; and then positioning a corresponding defect report according to the key name and establishing a reference association.

4. The method of claim 2, wherein the association of the mail reference is obtained by: performing pattern matching on the text of the defect report, and judging whether reference information of the mail is contained; if yes, extracting Message-ID information in the reference link; and then positioning the corresponding mail according to the Message-ID and establishing reference association.

5. The method of claim 1, wherein if the same code element is present in the body text of a mail and in the body text of a defect report, the two are considered to have the common code element association.

6. The method of claim 5, wherein the common code element associations are mined based on the source of the code elements; the code elements are first parsed, wherein if the code element comes from the target project and is a long code element, the parsing comprises: 1) parsing the long code element into an AST; 2) traversing the AST nodes and reading the elements on the nodes; 3) for each node, extracting the name of the package in which the node is located and concatenating it to obtain a long code element set; if the code element comes from the target project and is a short code element, the parsing comprises: 1) parsing the short code element into an AST; 2) traversing the AST nodes and reading the elements on the nodes to obtain an initial code element set; 3) deduplicating the elements in the initial code element set and filtering stop words; if the code element comes from another project, it is parsed using a naming-rule method; and then whether a common code element association exists between the text of the mail and the text of the defect report is judged according to the parsing result.

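7. The method of claim 1, wherein the formula

$$\mathrm{Similarity}(V_i, V_j) = \frac{V_i \cdot V_j}{\lVert V_i \rVert \, \lVert V_j \rVert}$$

is used to compute the cosine similarity Similarity(V_i, V_j) of the query vector V_i = <W_i,1, W_i,2, ..., W_i,n> and the query vector V_j = <W_j,1, W_j,2, ..., W_j,n> of document j.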

8. The method of claim 1, wherein the graph database Neo4j is used to represent the mined semantic associations between defect reports and mail lists.

9. The method of claim 1, wherein the parsing the defect report and mailing list comprises:

23) extracting code segments from the defect report and the mailing list processed in the step 22); the remaining text is then body text.