CN108182181A

CN108182181A - Duplicate detection method for public contribution merge requests based on hybrid similarity

Info

Publication number: CN108182181A
Application number: CN201810100193.6A
Authority: CN
Inventors: 余跃; 李志星; 尹刚; 王涛; 王怀民; 范强; 於杰; 张迅晖; 胡东阳
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2018-02-01
Filing date: 2018-02-01
Publication date: 2018-06-19
Anticipated expiration: 2038-02-01
Also published as: CN108182181B

Abstract

The invention belongs to the field of collaborative software development and discloses a method for detecting duplicate public contribution merge requests based on hybrid similarity. The method comprises the following steps: for a newly submitted public contribution merge request, first calculating the text similarity between it and each historical public contribution merge request; then calculating the change similarity between it and each historical public contribution merge request; further collecting a data set of historical duplicate contributions from a popular collaborative development platform and, using this data set for training, combining the two similarities with a weight calculation method based on a greedy search strategy to obtain the hybrid similarity between public contributions; and finally, according to the hybrid similarity values, obtaining a list of the historical public contribution merge requests most likely to duplicate the given one. The invention can detect duplicate public contributions in a timely manner, avoid repeated manual code review work, and improve the efficiency of reviewing public contributions.

Description

A method for detecting duplicate public contribution merge requests based on hybrid similarity

Technical field

The invention belongs to the field of collaborative software development and relates to a method for detecting duplicate public contribution merge requests based on hybrid similarity.

Background technology

In open source communities such as GitHub, the software development model based on large-scale crowd collaboration has greatly increased the efficiency of software innovation and has motivated more and more developers to take part in producing open source software. However, this development model is a parallel process without unified coordination. When multiple developers spontaneously contribute code to the same open source project and want to achieve the same goal, they may submit duplicate contribution merge requests (i.e., pull requests). Popular projects that attract a large number of peripheral developers and continuously receive community contributions are especially prone to this problem. As shown in Fig. 2, two developers, Bob and Alice, clone (fork) the same main repository, and each then makes modifications in their own local clone. When they want to implement the same feature or fix the same code defect, neither is aware of the work the other is doing, so both may make corresponding modifications and then submit merge requests to the main repository. The two submitted merge requests then separately go through contribution review and update operations until some developer notices that the two public contribution merge requests are duplicates.

Duplicate public contribution merge requests waste platform resources and increase the platform's maintenance cost. They also cause the contribution review process to be carried out repeatedly, which costs reviewers additional time and effort. During the life cycle of a public contribution merge request (the period from its submission to the platform until it is accepted or rejected), a duplicate may be identified at any point in time, and the later it is identified, the more serious the resulting waste of resources and effort. In addition, while a public contribution merge request is under review, the contributor often updates and improves it in response to reviewer feedback. Therefore, if duplicate public contribution merge requests cannot be identified as early as possible, the two contributors may also do redundant work and may come to doubt the competence of the project's management team. The negative effect on a contributor is especially serious if their merge request is judged to be a duplicate of one submitted later and is closed by the reviewer.

At present, the mechanism for identifying duplicate contribution merge requests on the GitHub platform (GitHub is a hosting platform for open source and private software projects; it is named GitHub because it supports only Git as its repository format) relies on reviewers finding them manually. However, for popular projects, public developers continuously submit code contributions to the main repository, and a large number of contribution merge requests need code review. It is unrealistic to expect a reviewer to remember the information of all historical contribution merge requests, compare it with each newly submitted one, and then judge whether the new one is a duplicate. Under the current mechanism, a duplicate is found only when some developer happens to notice that two duplicate contribution merge requests exist, so most duplicate contribution merge requests cannot be identified in time. Given this situation, a tool that can automatically detect duplicates at the stage when a contribution merge request is submitted is very necessary. First, an automatic detection tool can assist reviewers and spare them redundant, repeated work. Second, automatically detecting duplicate contribution merge requests at the earliest moment lets the contributors on both sides get in touch and cooperate as soon as possible, so that they do not each continue doing duplicate work.

Summary of the invention

To solve the above technical problems, the present invention proposes a detection method based on hybrid similarity for duplicate contributions that may exist on open source software project hosting platforms. The specific technical solution is as follows.

A method for detecting duplicate public contribution merge requests based on hybrid similarity, comprising the following steps:

S1. Calculate the text similarity between the newly submitted public contribution merge request and a historical public contribution merge request, the text similarity comprising a title text similarity and a description text similarity;

S2. Calculate the change similarity between the newly submitted public contribution merge request and the historical public contribution merge request, the change similarity being the path similarity between the files changed by the public contribution merge requests;

S3. Collect a data set of historical duplicate contribution merge requests on a collaborative development platform, train on this data set with a weight search algorithm based on a greedy strategy to obtain the weights of the text similarity and the change similarity, and then calculate the hybrid similarity between public contribution merge requests according to the weights;

S4. Following steps S1 to S3, obtain a hybrid similarity for each historical public contribution merge request, sort the historical merge requests by hybrid similarity value, and obtain a list of historical public contribution merge requests that duplicate the newly submitted public contribution merge request.

Further, the detailed process of step S1 is:

S11. Extract the title text and the description text from the newly submitted public contribution merge request and from the historical public contribution merge request, obtaining two title texts and two description texts;

S12. Preprocess the title texts and the description texts;

S13. Convert the preprocessed title texts and description texts into multi-dimensional vectors, obtaining two title text vectors and two description text vectors;

S14. Calculate the similarity between the two title text vectors with the cosine formula, which is the title text similarity between the newly submitted public contribution merge request and the historical public contribution merge request; calculate the similarity between the two description text vectors with the cosine formula, which is the description text similarity between the newly submitted public contribution merge request and the historical public contribution merge request.

Further, the detailed process of step S2 is:

S21. Extract the files actually modified by the newly submitted public contribution merge request and by the historical public contribution merge request, obtaining two file sets;

S22. Calculate the path similarity between pairs of files drawn from the two file sets, which gives the change similarity between the newly submitted public contribution merge request and the historical public contribution merge request.

Further, the collaborative development platform is the GitHub platform.

Further, the preprocessing of the title text and the description text in step S12 specifically comprises tokenization, stemming, and stop-word removal.

Compared with the prior art, the present invention has the following advantages: 1. For duplicate contributions that may exist on open source software project hosting platforms, the present invention proposes a detection method based on hybrid similarity. The method is a key link in improving code review efficiency: it spares reviewers duplicate review work, helps core developers organize the code review process more efficiently, and improves the efficiency with which public contributions are merged. 2. The present invention proposes calculating the similarity between public contribution merge requests by jointly using text similarity, covering the title and body text of a merge request, and change similarity derived from the changed files, which better reveals duplication between merge requests. 3. Through a combination of automatic identification and manual inspection, the present invention collects a data set of historical duplicate public contribution merge requests from the GitHub platform; the data set can be used to optimize the automatic duplicate detection model and improve its detection performance. 4. The present invention proposes a strategy based on greedy search to combine the two kinds of similarity effectively, producing a hybrid similarity value that better reflects the similarity between public contribution merge requests.

Description of the drawings

Fig. 1 is a flow chart of the method of the present invention;

Fig. 2 is a flow chart of the parallel contributions by multiple developers described in the background;

Fig. 3 is the program code listing of the algorithm for calculating the change similarity of public contribution merge requests in the present invention;

Fig. 4 is the program code listing of the algorithm for calculating the path similarity of two files;

Fig. 5 is the program code listing of the weight calculation algorithm based on a greedy search strategy in the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

Fig. 1 shows the flow chart of the method of the present invention. The specific steps are as follows:

S1. Calculate the text similarity between the newly submitted public contribution merge request and a historical public contribution merge request, the text similarity comprising a title text similarity and a description text similarity.

For the text extracted from the title and description of a public contribution merge request, a standard preprocessing procedure is applied first, consisting of tokenization, stemming, and stop-word removal. Many existing strategies can be used to split a sentence into words, depending on the type of data and the application domain. Some text that would ordinarily be split into several words should instead be treated as a single word. For example, in the context of public contribution merge requests, text representing code paths and hyperlinks is generally long, yet each refers to one complete concept and therefore should not be split. For this reason, a regular-expression tokenizer is used to parse the raw text. Some of the regular expressions and the text they match are listed below.

Code path:

– \w+(?:\:\:\w+)*

– “ActionDispatch::Http::URL”

Number of a public contribution merge request on the GitHub platform:

– \#\d+

– “#10319”
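The tokenization idea above can be illustrated with a short Python sketch. It is not the tokenizer used in the patent: the function name `tokenize`, the hyperlink pattern, and the plain-word fallback are assumptions made for illustration, and only the code-path and request-number patterns follow the examples in the text.

```python
import re

# Hypothetical token patterns, tried in order. The code-path and request-number
# patterns follow the examples in the text; the hyperlink and plain-word
# patterns are assumptions added so the sketch can tokenize whole sentences.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+              # hyperlink kept as a single token (assumed pattern)
    | \w+(?:::\w+)+           # code path such as ActionDispatch::Http::URL
    | \#\d+                   # merge request number such as #10319
    | \w+                     # ordinary word
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens without breaking code paths, numbers, or links."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Fix ActionDispatch::Http::URL, see #10319"))
```

Listing the more specific patterns before the plain-word pattern is what keeps a code path or a hyperlink from being cut into several short tokens.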

After tokenization, each word is converted to its root form (for example, "was" is converted to "be" and "errors" to "error"); this conversion is performed by the Porter stemming algorithm. Finally, stop words (such as "the" and "a"), which occur frequently but contribute little to distinguishing one sentence from another, are removed.
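A minimal Python sketch of the stemming and stop-word step is given below. It makes several assumptions: `STOP_WORDS` is a tiny illustrative list, and `naive_stem` is a deliberately simple stand-in, not the Porter algorithm named in the text. A real Porter implementation from a library could be passed in through the `stem` parameter instead.

```python
# A tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is"}


def naive_stem(word):
    """Stand-in for a real stemmer (NOT the Porter algorithm): strips a
    trailing plural "s" from words longer than three characters."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(tokens, stem=naive_stem):
    """Lowercase tokens, reduce them with `stem`, and drop stop words."""
    result = []
    for token in tokens:
        word = stem(token.lower())
        if word not in STOP_WORDS:
            result.append(word)
    return result


print(normalize(["Fix", "the", "errors", "in", "a", "parser"]))
```

Keeping the stemmer as a parameter means the rest of the pipeline does not depend on which stemming algorithm is chosen.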

The preprocessed text is then converted, according to the TF-IDF (Term Frequency-Inverse Document Frequency) model, into a multi-dimensional vector that can be processed in the vector space model (VSM). The vectorized text i can be expressed as $TextVec_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,v})$. Each dimension of the vector corresponds to one word, and v is the total number of words in the whole text corpus. $w_{i,k}$ is the weight of the k-th element in the vector for text i, computed by the TF-IDF model:

$$w_{i,k} = tf_{i,k} \times idf_{i,k}$$

In the formula above, $tf_{i,k}$ denotes the term frequency, i.e., the frequency with which the k-th word occurs in text i, and $idf_{i,k}$ denotes the inverse document frequency, which measures how well the word discriminates between documents.
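The TF-IDF weighting can be sketched in Python as follows. The text does not state which TF-IDF variant is used, so the choices here are assumptions: term frequency is the word count divided by document length, and inverse document frequency is log(N / df). The function name `tfidf_vectors` is also hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return the sorted vocabulary and one TF-IDF vector per tokenized doc.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[word] / length) * math.log(n_docs / df[word])
            for word in vocab
        ])
    return vocab, vectors


vocab, vectors = tfidf_vectors([["fix", "error"], ["fix", "typo"]])
print(vocab, vectors)
```

Under this weighting a word that appears in every document receives weight zero, which matches the role of idf as a measure of how discriminating a word is.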

After the text is vectorized, the similarity SimText(i, j) of two text vectors $TextVec_i$ and $TextVec_j$ is calculated with the cosine formula:

$$SimText(i, j) = \frac{TextVec_i \cdot TextVec_j}{|TextVec_i| \times |TextVec_j|}$$

Using the cosine formula, the title text similarity $SimText_{title}(i, j)$ and the description text similarity $SimText_{desc}(i, j)$ between two public contribution merge requests are obtained. Here i and j denote texts, and $|\cdot|$ denotes the norm of a vector.
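The cosine calculation can be written in a few lines of Python. This is a generic sketch of cosine similarity rather than code taken from the patent; returning 0.0 when either vector is all zeros is an assumed convention.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
```

The same function would be applied once to the two title vectors and once to the two description vectors.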

Collaboration on the GitHub platform relies on the Git tool. After a contributor submits a contribution merge request on GitHub, the modifications involved are displayed in diff form. To calculate the similarity of two public contribution merge requests from diff information, the raw diff data is first parsed into structured data, from which the modules and files modified by a merge request are extracted. The algorithm shown in Fig. 3 calculates the change similarity between two public contribution merge requests. Its input is the two sets of files, $files_i$ and $files_j$, modified by the two merge requests (denoted PR in the algorithm). In the algorithm of Fig. 3, line 1 initializes a list for storing intermediate results. Lines 2 to 5 calculate the file path similarity of every pair of files drawn from the two file sets and store the two files and their similarity in the list; the path similarity of two files is calculated by the algorithm shown in Fig. 4. Line 6 sorts the elements of the list by similarity value, and line 7 determines how many file pairs' similarities are finally retained. Line 8 initializes a new list for storing the file pairs finally retained and their similarity values. Lines 9 to 13 repeatedly take the file pair with the highest similarity from the temporary list and put its similarity into the final list. Because a file may appear only once in the final list, i.e., a file can have a maximum similarity with only one other file, line 12 deletes from the intermediate list all pairs that contain either file of the selected pair. Finally, the similarity values in the final list are summed and divided by the larger of the two modified-file-set sizes, which gives the change similarity of the two public contribution merge requests.

The algorithm shown in Fig. 4 calculates the path similarity of two files. First, the function splits the two file paths at the path separator, obtaining two sets of directory names. Lines 3 to 7 then calculate the depth of the longest common subdirectory of the two file paths. Finally, this depth divided by the larger of the two path depths is the path similarity of the two files.
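The two algorithms described for Fig. 3 and Fig. 4 can be approximated in Python as below. This is a sketch written from the prose description, not a transcription of the figures. Two points are assumptions: the file name is counted as the last path component, so identical paths score 1.0, and marking files as used replaces the explicit deletion of pairs from the intermediate list, which selects the same pairs.

```python
def path_similarity(path_a, path_b):
    """Depth of the longest common leading component sequence divided by
    the depth of the deeper path (the idea described for Fig. 4)."""
    parts_a = path_a.strip("/").split("/")
    parts_b = path_b.strip("/").split("/")
    common = 0
    for a, b in zip(parts_a, parts_b):
        if a != b:
            break
        common += 1
    return common / max(len(parts_a), len(parts_b))


def change_similarity(files_i, files_j):
    """Greedy best-pair matching of changed files (the idea described for Fig. 3)."""
    files_i, files_j = set(files_i), set(files_j)
    if not files_i or not files_j:
        return 0.0
    # Score every cross pair of files and visit the best pairs first.
    pairs = sorted(
        ((path_similarity(a, b), a, b) for a in files_i for b in files_j),
        reverse=True,
    )
    used_i, used_j, total = set(), set(), 0.0
    for score, a, b in pairs:
        # Each file may be matched with at most one file of the other set.
        if a in used_i or b in used_j:
            continue
        used_i.add(a)
        used_j.add(b)
        total += score
    return total / max(len(files_i), len(files_j))


print(change_similarity(["a/b/c.py", "x/y.py"], ["a/b/c.py"]))
```

Dividing by the size of the larger file set penalizes a merge request that touches many files the other one does not.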

In this embodiment, the detailed process of collecting the data set of historical duplicate public contribution merge requests from the GitHub platform is as follows:

(1) Random sampling: 26 popular projects on the GitHub platform are selected; for each project, a subset of all its public contribution merge requests is drawn at random.

(2) Manual screening: for each selected public contribution merge request, every comment that contains a reference to another public contribution merge request is manually inspected, and the comments that concern duplication of public contribution merge requests are picked out. The present invention calls such comments indicative comments.

(3) Rule extraction: from the set of indicative comments collected in the previous step, it is found that when commenters point out that one public contribution merge request duplicates another, certain words or phrases are used frequently. For example, "dup of", "closed by", and "addressed in" in the following comments are phrases that reviewers often use to point out duplication.

– “dup of #xxxx”

– “Closed by https://github.com/rails/rails/pull/13867”

– “This has been addressed in #27768.”

Regular expressions are therefore extracted from these indicative comments and used as rules for automatically matching indicative comments. An example of one such rule is listed below:

clos(e|ed|ing)(\w+){,5}(by|of)(\w+:){,5}#\d+

(4) Automatic identification: using the above regular-expression rules, indicative comments can be identified automatically, and pairs of mutually duplicate public contribution merge requests can thereby be found. If a comment is identified as an indicative comment, the number of the referenced public contribution merge request is extracted from the comment; together with the merge request to which the indicative comment belongs, it forms a pair of duplicate public contribution merge requests.

(5) Manual inspection: rule-based automatic identification can introduce erroneous data, i.e., some identified pairs of public contribution merge requests are not actually duplicates of each other. The automatically identified data therefore needs manual inspection. The inspection criteria are:

1) The author of the duplicate public contribution merge request was unaware of the existence of the original public contribution merge request. This is judged by examining the comments of both merge requests.

2) The reviewers agree that the public contribution merge requests are duplicates. That is, after one reviewer states that a merge request duplicates another, no other reviewer objects; instead they indicate agreement and close one of the merge requests.
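The automatic identification in step (4) can be sketched in Python as follows. The three patterns in `INDICATIVE_RULES` are simplified, hypothetical rules modeled on the example comments in step (3); they are not the rule set used in the patent, and the function name `referenced_duplicate` is likewise an assumption.

```python
import re

# Hypothetical, simplified rules modeled on the example comments; each rule
# captures the number of the referenced merge request in group 1.
INDICATIVE_RULES = [
    re.compile(r"dup(?:licate)?\s+of\s*#(\d+)", re.IGNORECASE),
    re.compile(r"clos(?:e|ed|ing)\s+(?:by|of)\s+\S*?(?:#|/pull/)(\d+)", re.IGNORECASE),
    re.compile(r"addressed\s+in\s*#(\d+)", re.IGNORECASE),
]


def referenced_duplicate(comment):
    """Return the referenced merge request number, or None if the comment
    does not match any indicative rule."""
    for rule in INDICATIVE_RULES:
        match = rule.search(comment)
        if match:
            return int(match.group(1))
    return None


print(referenced_duplicate("This has been addressed in #27768."))
```

A comment that matches would pair the returned number with the merge request the comment belongs to, and the resulting pair would then go to the manual inspection of step (5).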

Once the various types of similarity between a given public contribution merge request and the historical public contribution merge requests have been calculated, these similarities are used to find the list of merge requests most similar to the given one. To make full use of these types of similarity, the present invention uses a hybrid similarity as the final similarity between two public contribution merge requests. The final similarity Sim(i, j) is calculated as follows:

$$Sim(i, j) = a \times SimText_{title}(i, j) + b \times SimText_{desc}(i, j) + c \times SimDiff_{file}(i, j)$$

In the formula above, Sim(i, j) is the hybrid similarity obtained as a weighted combination of several similarities: the title text similarity $SimText_{title}(i, j)$, the description text similarity $SimText_{desc}(i, j)$, and the change similarity $SimDiff_{file}(i, j)$, with corresponding weights a, b, and c. To choose good weights, as shown in Fig. 5, their values are determined automatically by a greedy search algorithm using the collected GitHub data set of duplicate public contribution merge requests. The inputs of the algorithm are a set of duplicate public contribution merge requests, the maximum number of iterations, and the step size tried in each search. The algorithm returns a locally optimal set of weights. In the algorithm shown in Fig. 5, the first three lines (lines 1 to 3) initialize the three weights, group them into a vector, and use the initialized weight vector to obtain the initial value of the evaluation function. The evaluation function assesses whether a weight vector reflects well how much each type of similarity contributes to the actual similarity of public contribution merge requests, i.e., whether the weight vector yields similarity proportions that better match reality. For a given public contribution merge request, the higher its duplicate ranks in the returned list, the better. The evaluation function fitness is therefore defined as:

In the formula above, DupPR denotes the data set of historical duplicate public contribution merge requests, wts denotes the current weight vector, and $\langle pr_e, pr_l \rangle$ denotes a pair of duplicate public contribution merge requests. $SimPRs(pr_l)$ returns the list of public contribution merge requests most similar to $pr_l$, and $Rank(pr_e, SimPRs(pr_l))$ returns the rank of $pr_e$ in that list.

Lines 4 to 21 of Fig. 5 iteratively search for better-performing weight parameters until the number of iterations reaches the maximum specified in the algorithm's input. Line 5 first creates a list that stores the search history of each iteration. In each iteration, the weight vector is modified in two directions: forward search (lines 7 to 10) and backward search (lines 11 to 14). At the start of each search, the current optimal weight vector is first saved (lines 7 and 11). In the forward search the weight under observation is increased by one step (line 8), and in the backward search it is decreased by one step (line 12). The updated weight vector is used to calculate a new evaluation function value (lines 9 and 13), and the new weight vector is also recorded in the search history search_history (lines 10 and 14). After all weights have been observed, i.e., a, b, and c have each been observed once, the highest evaluation function value is taken from the search history (line 15). If this value is higher than the evaluation function value of the current optimal weight vector, the current optimal weight vector and the optimal evaluation function value are updated accordingly (lines 16 to 19); otherwise no update is made. The next iteration then begins. Finally, the algorithm shown in Fig. 5 outputs the best-performing weight vector (line 23).
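The search loop described above can be sketched in Python as follows. This is written from the prose, not from Fig. 5, so line-level details may differ. The evaluation function is passed in as a callable. The helper `reciprocal_rank_fitness` shows one possible way to build such a function from duplicate pairs: summing reciprocal ranks is an assumption made here for illustration, and `rank_of` is a hypothetical caller-supplied function.

```python
def greedy_weight_search(fitness, initial, step, max_iterations):
    """Hill-climb a weight vector toward a local optimum of `fitness`."""
    best = list(initial)
    best_score = fitness(best)
    for _ in range(max_iterations):
        history = []  # (score, candidate) pairs tried in this iteration
        for index in range(len(best)):
            for delta in (step, -step):  # forward search, then backward search
                candidate = list(best)
                candidate[index] += delta
                history.append((fitness(candidate), candidate))
        top_score, top_candidate = max(history, key=lambda item: item[0])
        if top_score > best_score:
            best, best_score = top_candidate, top_score
        # Otherwise keep the current weights and start the next iteration.
    return best, best_score


def reciprocal_rank_fitness(duplicate_pairs, rank_of):
    """Build a fitness function from known duplicate pairs.

    `rank_of(weights, earlier, later)` is a hypothetical caller-supplied
    function returning the 1-based rank of `earlier` in the list retrieved
    for `later`. Summing reciprocal ranks is an ASSUMED scoring rule.
    """
    def fitness(weights):
        return sum(1.0 / rank_of(weights, earlier, later)
                   for earlier, later in duplicate_pairs)
    return fitness


# Toy usage: the maximum of this stand-in fitness is at (0.5, 0.25, 0.25).
target = (0.5, 0.25, 0.25)
toy_fitness = lambda w: -sum((x - t) ** 2 for x, t in zip(w, target))
print(greedy_weight_search(toy_fitness, [0.0, 0.0, 0.0], 0.25, 10))
```

Because only one weight changes by one step per candidate, the search can stop at a local optimum, which is consistent with the text's statement that the returned weights are locally optimal.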

S4. Following steps S1 to S3, a hybrid similarity is obtained for each historical public contribution merge request. The historical merge requests are sorted by hybrid similarity value, giving a list of historical public contribution merge requests that duplicate the newly submitted one. In an embodiment, a value top-k can be preset, and the first top-k historical public contribution merge requests in the list are taken as the merge requests most similar to the newly submitted public contribution merge request.
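The weighted combination and the top-k selection of step S4 can be sketched in Python as follows. The function names, the dictionary layout of `similarities`, and the example weights are assumptions made for illustration; in the method described above the weights would come from the greedy search.

```python
def hybrid_similarity(sim_title, sim_desc, sim_file, weights):
    """Weighted sum of the title, description, and change similarities."""
    a, b, c = weights
    return a * sim_title + b * sim_desc + c * sim_file


def top_k_candidates(similarities, weights, k):
    """Rank historical requests by hybrid similarity and keep the first k.

    `similarities` maps a historical request id to its
    (title, description, change) similarity triple.
    """
    scored = [
        (hybrid_similarity(*sims, weights), pr_id)
        for pr_id, sims in similarities.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pr_id for _, pr_id in scored[:k]]


# Made-up similarity triples for three historical requests.
example = {101: (0.9, 0.8, 1.0), 102: (0.1, 0.2, 0.0), 103: (0.5, 0.5, 0.5)}
print(top_k_candidates(example, (0.5, 0.25, 0.25), 2))
```

The returned list is what a reviewer would be shown as the candidates most likely to duplicate the new merge request.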

In summary, the method proposed by the present invention for detecting duplicate public contribution merge requests based on hybrid similarity can detect duplicate public contributions in a timely manner, avoid repeated manual code review work, and improve the efficiency of reviewing public contributions.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the present invention. The scope of the present invention is defined by the appended claims.

Claims

1. A method for detecting duplicate public contribution merge requests based on hybrid similarity, characterized by comprising the following steps:

S1. Calculating the text similarity between the newly submitted public contribution merge request and a historical public contribution merge request, the text similarity comprising a title text similarity and a description text similarity;

S2. Calculating the change similarity between the newly submitted public contribution merge request and the historical public contribution merge request, the change similarity being the path similarity between the files changed by the public contribution merge requests;

S3. Collecting a data set of historical duplicate contribution merge requests on a collaborative development platform, training on this data set with a weight search algorithm based on a greedy strategy to obtain the weights of the text similarity and the change similarity, and then calculating the hybrid similarity between public contribution merge requests according to the weights;

S4. Following steps S1 to S3, obtaining a hybrid similarity for each historical public contribution merge request, sorting by hybrid similarity value, and obtaining a list of historical public contribution merge requests that duplicate the newly submitted public contribution merge request.

2. The method for detecting duplicate public contribution merge requests based on hybrid similarity according to claim 1, characterized in that the detailed process of step S1 is:

S11. Extracting the title text and the description text from the newly submitted public contribution merge request and from the historical public contribution merge request, obtaining two title texts and two description texts;

S12. Preprocessing the title texts and the description texts;

S13. Converting the preprocessed title texts and description texts into multi-dimensional vectors, obtaining two title text vectors and two description text vectors;

S14. Calculating the similarity between the two title text vectors with the cosine formula, which is the title text similarity between the newly submitted public contribution merge request and the historical public contribution merge request; and calculating the similarity between the two description text vectors with the cosine formula, which is the description text similarity between the newly submitted public contribution merge request and the historical public contribution merge request.

3. The method for detecting duplicate public contribution merge requests based on hybrid similarity according to claim 1, characterized in that the detailed process of step S2 is:

S21. Extracting the files actually modified by the newly submitted public contribution merge request and by the historical public contribution merge request, obtaining two file sets;

S22. Calculating the path similarity between pairs of files drawn from the two file sets, which gives the change similarity between the newly submitted public contribution merge request and the historical public contribution merge request.

4. The method for detecting duplicate public contribution merge requests based on hybrid similarity according to claim 1, characterized in that the collaborative development platform is the GitHub platform.

5. The method for detecting duplicate public contribution merge requests based on hybrid similarity according to claim 2, characterized in that the preprocessing of the title text and the description text in step S12 specifically comprises tokenization, stemming, and stop-word removal.