
CN111915216A - Open-source software project developer recommendation method based on secondary attention mechanism

Info

Publication number: CN111915216A
Application number: CN202010818089.8A
Authority: CN
Inventors: 潘国盛; 姚远; 徐锋
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2020-08-14
Filing date: 2020-08-14
Publication date: 2020-11-10
Anticipated expiration: 2040-08-14
Also published as: CN111915216B

Abstract

The invention discloses an open source software project developer recommendation method based on a secondary attention mechanism. The method models project team features and project features based on the relationships among project team members, the relationship between project teams and projects, and the text description information of the projects. First, the features of developers and the text features of projects are obtained through network representation learning and text representation learning. Then, a first-layer attention mechanism learns the relative weight of each existing team developer with respect to the project team, yielding the features of the project team. Next, a second-layer attention mechanism learns the relative weights of the project team and the project document with respect to the project, yielding the overall features of the project. Finally, the similarity between the overall project features and the developer features is calculated, and suitable developers are recommended for the open source software project in order of similarity.

Description

Open-source software project developer recommendation method based on secondary attention mechanism

Technical Field

The invention relates to an open source software project developer recommendation method based on a secondary attention mechanism.

Background

At present, more and more software projects are developed on open source platforms, more and more developers are paying attention to the software projects hosted there, and many high-quality, large-scale open source software projects such as VSCode, Flutter and TensorFlow have appeared. According to the 2019 annual report issued by GitHub, there are more than 40 million developer users around the world, 10 million new developers joined in 2019 alone, and 1.3 million developers made their first contribution to open source.

During the development of an open source software project, some existing team members leave the project for various reasons, and as the technologies the project depends on evolve, developers familiar with the related fields also need to join. Continuously bringing in new developers who suit the current project and team is therefore important for the healthy continuation of the project. However, recommending suitable new developers to an open source software project is challenging, because the open source community contains a very large number of developers whose domains of expertise, programming languages, and project experience vary widely. Much research has therefore been devoted to tools that recommend new members to open source software project teams.

Research on developer recommendation for open source software projects needs to address the following two problems:

(1) modeling the interaction between the project team and the candidate developer;

(2) modeling the interaction between the project document and the candidate developer.

Open source software projects are developed by teams. When selecting a candidate developer, the match between the project tasks and the developer's attributes must therefore be modeled. In addition, to reduce the onboarding cost and communication difficulty after the developer joins, the similarity between the candidate developer and the existing team members must also be modeled. When recommending candidate developers, the project document and the project team may carry different influence weights. Similarly, when the project team is used as the basis for selection, individual team members contribute with different weights to the similarity analysis of a candidate developer. Existing approaches handle both types of interaction with relatively fixed patterns. When combining the project team and the project document, they usually concatenate the two, which omits any modeling of the influence weights between the project document and the team. When relating the project team to the candidate developer, they typically take one team member as a representative, ignoring the modeling of all team members and of the relative weights among them.

Disclosure of Invention

The purpose of the invention is as follows: when existing developer recommendation methods model the interaction between a candidate developer and an existing open source software project, they do not consider, or do not fully consider, the relative weights between the candidate developer and the existing team members, or the relative relationship between the project document and the project team. To address these problems and defects in the prior art, the invention provides an open source software project developer recommendation method based on a secondary attention mechanism. The method comprehensively considers the relationships among project team members, the relationship between project teams and projects, and the text description information of the projects, and solves the developer recommendation problem according to the relationship structure of an open source platform. The technical scheme needs to consider the following problems:

(1) how to consider the relative relationships between project team members when building project team features;

(2) how to consider the relative relationship between project documents and project teams when building project population features;

For problem (1), since members of a single project team may have different roles, the candidate developer does not need to match every developer in the team closely. The degree of match with some of the team members determines the similarity between the team and the candidate developer, and in turn the match between the project and the candidate developer. The present invention uses a first-layer attention mechanism to automatically learn a weight for each developer in the team when interacting with the current candidate developer, and uses these weights to compute the feature representation of the project team.

For problem (2), the feature representation of the candidate developer needs to interact simultaneously with the project's own document feature representation and with the project team feature representation obtained in (1). Because both the project and the project team have an important influence on the selection of candidate developers, the invention adopts a second-layer attention mechanism to learn a weight coefficient for the project feature representation and for the team feature representation, and computes their weighted sum. The resulting feature representation is used as the final representation that interacts with the candidate developer to obtain the matching score.

Using a multi-layer attention mechanism, the invention provides a recommendation model based on a secondary attention mechanism, called DETEX (Development Team Expansion model). The model builds project team features and project features from the relationships among project team members, the relationship between the project team and the project, and the text description information of the project, and uses them to assess the similarity between the candidate developer and the open source software project. Based on the assumption that current team members are more suitable for the current development project than other developers, team members of the current project are regarded as positive examples and other developers as negative examples. The problem is thus converted into a prediction problem over existing team members, the DETEX model parameters are trained on this basis, and the model output obtained with these parameters is used as the matching degree for ranking.

Experiments on real data show that the proposed method significantly improves the accuracy of matching candidate developers for team expansion compared with existing methods.

The technical scheme is as follows: an open source software project developer recommendation method based on a secondary attention mechanism trains a project developer recommendation model using attributed network data composed of the projects and developers of an existing open source platform, and ranks given candidate developers for recommendation according to project document information and project member information. Using a secondary attention mechanism, the method models project team features and project features based on the relationships among project team members, the relationship between project teams and projects, and the text description information of the projects. Recommendations are finally made in order of the calculated matching degree between developers and projects. The recommendation model adopted by the method mainly comprises the following:

1) modeling a relative weight relationship between existing members of a project team through a first layer attention mechanism to obtain project team characteristics;

2) modeling the relative weight relationship between the project team and the project document through a second layer of attention mechanism to obtain the project overall characteristics.

Let $P$ denote the set of open source software projects and $D$ the set of developers, and let $p \in P$ denote a project to be expanded. For project $p$, $T_p$ denotes the current set of development team members of the software project. The method aims to find developers $d \in D$ who are suitable to join the open source software project $p$.

(1) modeling relative relationships between existing members of a team through a first level attention mechanism to obtain characteristics of a project team;

The feature representation of the candidate developer, the feature representation of the open source software project, and the skill attribute feature representations of the current project team members are input into the DETEX model simultaneously. First, the feature representation of the developer and the feature representation of the software project are each input into a non-linear layer of the DETEX model. For the developer, the formula is:

$v_d = \mu(W_d d + b_d)$

where $W_d$ and $b_d$ are the network weight parameter and bias term, respectively, $d$ denotes the feature representation obtained by network representation learning, $v_d$ denotes the output of the non-linear layer, and $\mu$ denotes the activation function. A leaky rectified linear unit (LeakyReLU) with a negative slope of 0.01 is used here; the activation function curve is shown in FIG. 1.
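The non-linear layer above can be illustrated with a minimal NumPy sketch. The dimensions (an 8-dimensional input embedding mapped to 4 hidden units), the random initialization, and the function names are illustrative assumptions and are not specified in the source; only the form $v_d = \mu(W_d d + b_d)$ and the negative slope of 0.01 come from the text.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    """LeakyReLU: identity for x >= 0, negative_slope * x otherwise."""
    return np.where(x >= 0, x, negative_slope * x)

def nonlinear_layer(x, W, b):
    """Compute mu(W x + b) with mu = LeakyReLU (slope 0.01, as in the text)."""
    return leaky_relu(W @ x + b)

# Hypothetical sizes: 8-dim pre-trained developer embedding, 4 hidden units.
rng = np.random.default_rng(0)
d = rng.normal(size=8)          # developer embedding from network representation learning
W_d = rng.normal(size=(4, 8))   # weight parameter W_d
b_d = np.zeros(4)               # bias term b_d
v_d = nonlinear_layer(d, W_d, b_d)
```

The same function would serve for the project document layer by substituting the document vector and its own weight and bias.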

For the team feature representation, the feature representations of the developers in the team are aggregated by a pooling operation. In this operation, $a^T$ is a network parameter and $t_i$ denotes a developer member of the current software project team $T_p$. The feature output of developer $t_i$ from the non-linear layer is combined, by element-wise multiplication, with $v_d$, the feature output of the candidate developer from the non-linear layer. The operation thus models the interaction of the candidate developer with each member of the existing team of the open source software project, and the attention mechanism assigns different influence to different team members.

That is, each team member's degree of influence differs from one candidate developer to another.
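The following sketch shows one plausible form of this first-layer attention. The exact formula is not reproduced in the text, so the scoring function (the parameter vector `a` applied to the element-wise product of each member vector and the candidate vector), the softmax normalization, and the weighted sum of member vectors are all assumptions, as are the function and variable names.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def team_attention(V_team, v_d, a):
    """Assumed first-layer attention over team members.

    V_team: (n, k) non-linear-layer outputs of the n team members
    v_d:    (k,)   non-linear-layer output of the candidate developer
    a:      (k,)   attention parameter vector (a^T in the text)
    Returns the pooled team vector v_t and the member weights.
    """
    interactions = V_team * v_d   # element-wise product, broadcast over members
    scores = interactions @ a     # one scalar score per team member
    alpha = softmax(scores)       # relative weights of the members
    v_t = alpha @ V_team          # weighted sum of member vectors
    return v_t, alpha

V_team = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
v_d = np.array([1.0, 0.0])
a = np.array([1.0, 1.0])
v_t, alpha = team_attention(V_team, v_d, a)
```

Because the scores depend on `v_d`, the same team produces different member weights for different candidate developers, which is the behavior the text describes.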

(2) The relative relationship between the team and the project document is modeled by a second layer of attention mechanism to obtain the overall characteristics of the project.

For the project documents, obtaining feature representation of the project documents by a word vector mean method, and inputting the features of the project documents into a non-linear layer:

$v_p = \mu(W_p p + b_p)$

where $W_p$ and $b_p$ are the weight parameter and bias term, and $p$ denotes the feature representation of the project document obtained by the word vector mean. Next, the interactions between the candidate developer and the project and project team are modeled. To model how different candidate developers weigh the open source project document against the project's current team at selection time, an attention mechanism is applied to $v_p$ and $v_t$, each with its own neural network layer parameters. Here $\alpha_p$ and $\alpha_t$ denote the corresponding weights of the open source project and the team. This yields a feature representation $v_{comb}$ of the project and the team as a whole:

$v_{comb} = \alpha_p v_p + \alpha_t v_t$
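A minimal sketch of this second-layer attention follows. Only the weighted sum $v_{comb} = \alpha_p v_p + \alpha_t v_t$ is taken from the text. The way the two weights are scored (separate parameter vectors `h_p` and `h_t` applied to element-wise interactions with the candidate vector) and the softmax normalization that makes the weights sum to 1 are assumptions made for illustration.

```python
import numpy as np

def combine_project_and_team(v_p, v_t, v_d, h_p, h_t):
    """Assumed second-layer attention between project document and team.

    v_p, v_t, v_d: (k,) document, team, and candidate developer vectors
    h_p, h_t:      (k,) hypothetical scoring parameters, one per branch
    Returns v_comb and the two attention weights.
    """
    s_p = h_p @ (v_p * v_d)            # score of the project document branch
    s_t = h_t @ (v_t * v_d)            # score of the team branch
    scores = np.array([s_p, s_t], dtype=float)
    scores -= scores.max()             # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum()
    alpha_p, alpha_t = weights
    v_comb = alpha_p * v_p + alpha_t * v_t   # weighted sum from the text
    return v_comb, alpha_p, alpha_t

v_p = np.array([1.0, 0.0])
v_t = np.array([0.0, 1.0])
v_d = np.array([1.0, 1.0])
v_comb, alpha_p, alpha_t = combine_project_and_team(
    v_p, v_t, v_d, h_p=np.array([2.0, 0.0]), h_t=np.array([0.0, 0.0])
)
```

In this toy input the document branch scores higher, so `alpha_p` exceeds `alpha_t` and `v_comb` leans toward the document vector.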

Based on $v_{comb}$, the overall similarity between the candidate developer and the open source software project is predicted through a multi-layer perceptron (MLP) structure. After the last MLP layer, a sigmoid operation is applied once so that the final output is a similarity in the range [0, 1].
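The prediction step can be sketched as below. The text states only that an MLP is applied based on $v_{comb}$ and that a final sigmoid maps the output into [0, 1]. The MLP input used here (the element-wise product of `v_comb` and `v_d`), the LeakyReLU hidden activation, and the layer sizes are assumptions.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    return np.where(x >= 0, x, negative_slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_score(v_comb, v_d, layers):
    """Score a (project, candidate) pair with an MLP followed by a sigmoid.

    layers: list of (W, b) pairs; the last pair must map to a single output.
    The input construction (v_comb * v_d) is an assumption, not from the source.
    """
    h = v_comb * v_d
    for W, b in layers[:-1]:
        h = leaky_relu(W @ h + b)      # hidden layers
    W_out, b_out = layers[-1]
    logit = W_out @ h + b_out          # shape (1,)
    return float(sigmoid(logit)[0])    # similarity in [0, 1]

# Hypothetical two-layer MLP with all-zero parameters, which yields sigmoid(0).
layers = [(np.zeros((3, 2)), np.zeros(3)), (np.zeros((1, 3)), np.zeros(1))]
score = mlp_score(np.ones(2), np.ones(2), layers)
```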

Drawings

FIG. 1 is a graph of an activation function;

FIG. 2 is a block diagram of the DETEX model in an embodiment of the present invention;

FIG. 3 is the result of DETEX under the HR index;

FIG. 4 is the result of DETEX at the nDCG index;

FIG. 5 is a flow chart of a method implementation of the present invention.

Detailed Description

The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.

The open-source software project developer recommendation method based on the secondary attention mechanism models the relative relationships among project team members and between project teams and project documents on the basis of the feature representation of a pre-trained developer and the feature representation of the project documents. Wherein the first layer of attention mechanism models the relative relationship between members of the project team to form a feature representation of the project team, and the second layer of attention mechanism models the relative relationship between the project team and the project document to form a feature representation of the project. And finally, obtaining final similarity and recommendation results through the interaction between the project overall feature representation and the candidate developers.

As shown in FIG. 2, to model the existing relationships and calculate the similarity between a candidate developer and a given open source software project, the representation of the candidate developer, the representation of the open source software project, and the skill attribute representations of the current project team members are input to the DETEX model simultaneously. To give DETEX the capacity to express complex interactions and encode non-linear relationships, the feature representations of the developer and of the software project are first each input into a non-linear layer ($E_p$ and $E_d$ in FIG. 2). For the developer, the formula is:

$v_d = \mu(W_d d + b_d)$

Where d represents the characterization obtained by network representation learning and μ represents the activation function, we use here a LeakyReLU with a negative slope of 0.01.

For the team feature representation, since each open source project team comprises multiple developers, the feature representations of the developers in the team are aggregated by a pooling operation ($E_t$ in FIG. 2). In this operation, $t_i$ denotes a developer member of the current software project team $T_p$, and the feature output of developer $t_i$ from the non-linear layer is combined with that of the candidate developer by element-wise multiplication. The operation models the interaction of the candidate developer with each member of the existing team of the open source software project, and the attention mechanism assigns different influence to different team members. That is, how strongly the match with a given team member affects the final team-level matching degree differs from one candidate developer to another.

For the project documents, the feature representation of the project documents is obtained by a word vector mean method, and the features of the project documents are input into a non-linear layer:

$v_p = \mu(W_p p + b_p)$

where $p$ denotes the project document representation obtained by the word vector mean. Next, we model the interaction between candidate developers and the project and project team. To model how different candidate developers weigh the open source project document against the project's current team at selection time, we apply an attention mechanism in which $\alpha_p$ and $\alpha_t$ denote the corresponding weights of the open source project and the team. This yields a representation $v_{comb}$ of the project and the team as a whole:

$v_{comb} = \alpha_p v_p + \alpha_t v_t$

Based on $v_{comb}$, the overall similarity between the candidate developer and the open source software project can be predicted through a multi-layer perceptron (MLP) structure.

The specific implementation flow of the method is shown in FIG. 5. First, the project-developer graph network of the open source software projects is taken as input to train the open source software project developer recommendation model. After training is finished, the document information of the target project and the information of its existing team members are input into the model for prediction. The model computes the relative relationships among the project members and the relative relationship between the project team and the project document according to the secondary attention mechanism provided by the invention. The candidate developers are then ranked by matching degree with respect to the target project according to the weight coefficients of these relative relationships.
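The final ranking step of this flow can be sketched as follows. The helper `recommend` and the stand-in scoring function are hypothetical; in practice the score would come from the trained model's output for each candidate given the project document and current team.

```python
def recommend(candidates, score_fn, top_k=10):
    """Rank candidate developers by predicted matching degree, highest first."""
    scored = [(score_fn(c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

# Stand-in scores for illustration only; a trained model would supply these.
toy_scores = {"alice": 0.2, "bob": 0.9, "carol": 0.5}
top2 = recommend(list(toy_scores), toy_scores.get, top_k=2)
```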

DETEX model training method: based on the assumption that current team members are better suited to the current development project than other developers, the training method treats members of a project's current team as positive examples and developers outside the team as negative examples. To make this assumption better match the reality of the data, the initial screening of the training data retains only software projects with at least 5 stars and at least 5 developers, and only developers who have participated in at least 5 projects. This ensures the quality of the retained projects and developers, so that the trained model parameters are more reasonable.

For model training, the optimization problem is cast as a binary classification problem, and a cross-entropy objective function is optimized. In this objective, $P$ is the set of software projects in the training set, $T_p$ is the set of developers in the current team of project $p$, $s_{p,d}$ is the true matching label, which is compared with the predicted similarity, and $\sigma$ denotes the sigmoid function. Training adjusts the network parameters to optimize this objective, that is, to maximize the log-likelihood of the labels or, equivalently, to minimize the cross-entropy loss.

To train the DETEX model with this objective, each existing developer in each project development team is treated as a positive example for that project, with label $s_{p,d} = 1$. For each positive example, one negative example is randomly selected and its label is set to $s_{p,d} = 0$.
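The labeling scheme and objective can be sketched as follows. The one-to-one pairing of positives with random negatives follows the text. The averaged binary cross-entropy below is the standard form and is an assumption about the exact objective, which is not reproduced in the source. Function names and the toy data are illustrative.

```python
import math
import random

def build_training_pairs(teams, all_developers, seed=0):
    """Return (project, developer, label) triples: each team member is a
    positive (label 1) paired with one random non-member negative (label 0)."""
    rng = random.Random(seed)
    pairs = []
    for project, members in teams.items():
        outsiders = sorted(set(all_developers) - set(members))
        for dev in sorted(members):
            pairs.append((project, dev, 1))
            pairs.append((project, rng.choice(outsiders), 0))
    return pairs

def binary_cross_entropy(labels, predictions, eps=1e-12):
    """Mean binary cross-entropy loss (to be minimized)."""
    total = 0.0
    for s, s_hat in zip(labels, predictions):
        s_hat = min(max(s_hat, eps), 1.0 - eps)   # avoid log(0)
        total -= s * math.log(s_hat) + (1 - s) * math.log(1.0 - s_hat)
    return total / len(labels)

teams = {"p1": {"a", "b"}, "p2": {"c"}}
pairs = build_training_pairs(teams, ["a", "b", "c", "d", "e"])
loss = binary_cross_entropy([1, 0], [0.9, 0.1])
```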

Experimental setup: the test set is generated with a leave-one-out approach, that is, one developer is removed from the development team member set of each open source software project to serve as a test positive example. Because ranking against all developers is very time-consuming, 100 developers who are not in the project development team are randomly selected for each positive example as negative examples, without loss of generality. The remaining team members form the training set used to train the model.
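A minimal sketch of this leave-one-out split is given below. Choosing the held-out member uniformly at random and the function name are assumptions; the text specifies only that one member per project is held out and that 100 non-members are sampled as test negatives.

```python
import random

def leave_one_out_split(teams, all_developers, num_negatives=100, seed=0):
    """Hold out one member per project as the test positive and sample
    non-member developers as test negatives; the rest is training data."""
    rng = random.Random(seed)
    train_teams, test_cases = {}, []
    for project, members in teams.items():
        members = sorted(members)
        held_out = rng.choice(members)                      # test positive
        outsiders = sorted(set(all_developers) - set(members))
        negatives = rng.sample(outsiders, min(num_negatives, len(outsiders)))
        train_teams[project] = [m for m in members if m != held_out]
        test_cases.append((project, held_out, negatives))
    return train_teams, test_cases

developers = [f"dev{i}" for i in range(200)]
teams = {"p1": developers[:6], "p2": developers[10:15]}
train_teams, test_cases = leave_one_out_split(teams, developers)
```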

Evaluation metrics: two metrics, Hit Rate (HR) and nDCG, are used to evaluate DETEX. In their formulas, $hit_t \in \{0, 1\}$ is 1 when the similarity rank of the positive-example developer in the test set is less than or equal to $K$ and 0 when the rank is greater than $K$, and $r_t \in \{1, 2, \ldots, K\}$ denotes the rank of the positive-example developer in the test set. The higher the positive-example developer is ranked, the larger both metrics are, indicating better performance of the tested method. $K$ was set to 1, 5, 10, and 20 in the experiments.
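The two metrics can be sketched as follows using the definitions commonly applied to leave-one-out evaluation with a single positive per test case. Since the source formulas are not reproduced in the text, the exact normalization (averaging over test cases, gain of $1/\log_2(r_t + 1)$ when the positive falls within the top $K$) is an assumption.

```python
import math

def rank_of_positive(pos_score, neg_scores):
    """1-based rank of the positive among the sampled negatives."""
    return 1 + sum(1 for s in neg_scores if s > pos_score)

def hit_rate_at_k(ranks, k):
    """Fraction of test cases whose positive is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def ndcg_at_k(ranks, k):
    """Mean of 1 / log2(rank + 1) over test cases, counting only hits."""
    return sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / len(ranks)

ranks = [1, 3, 12]                 # hypothetical ranks of three test positives
hr10 = hit_rate_at_k(ranks, 10)    # two of three positives are in the top 10
ndcg10 = ndcg_at_k(ranks, 10)      # (1/log2(2) + 1/log2(4) + 0) / 3
```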

Experimental data: data from the open source software platform GitHub and from the programming question-and-answer community StackOverflow are used. For the GitHub data, to satisfy the experimental assumptions and ensure training quality, software projects with fewer than 5 stars or fewer than 5 developers, and developers who participated in fewer than 5 projects, were filtered out. Duplicate projects created by forking were also removed. For the StackOverflow data, 400 frequently occurring skill tags were manually selected. Other data statistics are shown in Table 1.
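The GitHub filtering step can be sketched as below. The thresholds come from the text. Applying the developer and project filters repeatedly until nothing changes is an assumption, since the source does not say whether filtering is done once or iteratively, and the function name and toy data are illustrative.

```python
def filter_dataset(stars, members, min_stars=5, min_devs=5, min_projects=5):
    """Keep projects with enough stars and developers, and developers with
    enough projects, repeating until the result is stable (an assumption)."""
    kept = {p: set(devs) for p, devs in members.items()
            if stars.get(p, 0) >= min_stars}
    while True:
        counts = {}
        for devs in kept.values():
            for dev in devs:
                counts[dev] = counts.get(dev, 0) + 1
        active = {dev for dev, c in counts.items() if c >= min_projects}
        pruned = {p: devs & active for p, devs in kept.items()}
        pruned = {p: devs for p, devs in pruned.items() if len(devs) >= min_devs}
        if pruned == kept:
            return pruned
        kept = pruned

# Toy data with lowered thresholds so the effect is visible.
stars = {"p1": 10, "p2": 1, "p3": 7}
members = {"p1": {"a", "b"}, "p2": {"a", "b"}, "p3": {"a", "c"}}
loose = filter_dataset(stars, members, min_stars=5, min_devs=2, min_projects=1)
strict = filter_dataset(stars, members, min_stars=5, min_devs=2, min_projects=2)
```

In the toy data, the strict setting removes everything: once the low-star project is dropped, only one developer still has two projects, so no project keeps two qualifying developers.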

Comparison methods: under the modeling of the invention, expanding a software development team is a one-class recommendation problem, so the following four methods are selected for comparison with the method of the invention:

(1) BPR, a pair-wise recommendation method;

(2) NCF, a method of modeling the interaction between an item and a user using a neural network;

(3) TECE, which adds modeling of the interaction between the candidate developer and the team leader on top of NCF;

(4) tBPR, based on BPR, incorporates modeling for project teams.

BPR and NCF apply general recommender system methods to team expansion, while TECE and tBPR add modeling of the current project team on top of traditional user-item interaction.

Experimental results: we first compared our method, using the DETEX model directly, with the comparison methods. FIG. 3 and FIG. 4 show the experimental results for HR and nDCG, respectively. The method of the invention improves significantly on the comparison methods under both metrics. For example, relative to TECE, the best-performing comparison method, HR and nDCG improve by 20.2%-77.2% and 43.0%-77.2%, respectively. There are two main reasons for this improvement. First, from the perspective of network representation learning, unlike existing approaches we build and use a representation containing skill information for each developer. Second, at the model level, the model of the invention models the interaction of the candidate developer with each member of the software project's existing development team, and uses attention mechanisms to assign relative weights among developers as well as between the project and the team.

We also studied how much of the final improvement comes from the DETEX model alone. Because the models of the existing methods have no step that obtains pre-trained feature representations through network representation learning, the skill representations obtained by the network representation learning method were fed into the models of the existing methods. The resulting HR@10 and nDCG@10 are shown in Table 2. First, with the skill representation added, the results of all comparison methods improve significantly. NCF, for example, improves by 14.4% and 19.3% over its original HR@10 and nDCG@10, respectively. Second, the DETEX model used in the invention still performs better than the other comparison methods. For example, it still achieves an 11.6% improvement in nDCG@10 over TECE with the skill representation added. These results indicate that both the skill information representation and the team modeling in DETEX contribute useful improvements to the final recommendation results.

Table 1. Data statistics for the embodiment

Statistic	Value
Number of open source software projects	6599
Number of developers	11931
Average number of developers per project	8.94
Average number of projects per developer	5.10
Number of StackOverflow developers	123214
Number of common developers	879
Number of StackOverflow question skill tags	400
Number of StackOverflow questions	53566

Table 2. Experimental results after adding the learned skill representation

Claims

1. An open source software project developer recommendation method based on a secondary attention mechanism, characterized in that: a project developer recommendation model is trained using attributed network data composed of the projects and developers of an existing open source platform; given candidate developers are ranked for recommendation according to project document information and project member information; project team features and project features are modeled, using a secondary attention mechanism, based on the relationships among project team members, the relationship between the project team and the projects, and the text description information of the projects; and recommendations are finally made in order of the calculated matching degree between developers and projects, wherein the recommendation model adopted by the method mainly comprises:

1) modeling the relative weight relationship among the existing members of the project team through a first-layer attention mechanism to obtain project team features;

2) modeling the relative weight relationship between the project team and the project document through a second-layer attention mechanism to obtain the overall project features.

2. The secondary attention mechanism-based open-source software project developer recommendation method of claim 1, characterized in that: introducing a developer characteristic representation by using a network representation learning method, then learning a relative weight relationship between the existing members of the project team by using a first layer attention mechanism, and obtaining a representation of the project team characteristic by using the weight; firstly, inputting the characteristic expression of a developer and a software project into a nonlinear layer, wherein the formula is as follows:

$v_d = \mu(W_d d + b_d)$

wherein $W_d$ and $b_d$ respectively denote a network weight parameter and a bias term, $d$ denotes the feature representation obtained by network representation learning, and $\mu$ denotes the activation function;

in the pooling operation that forms the team feature representation, $t_i$ denotes a developer member of the current software project team $T_p$, $v_d$ denotes the feature representation output by the candidate developer in the non-linear layer, and the feature output of developer $t_i$ from the non-linear layer is combined with $v_d$ by element-wise multiplication; the operation models the interaction of the candidate developer with each member of the existing team of the open source software project, and the attention mechanism assigns different influence to different team members, that is, the degree of influence of the team members differs for different candidate developers.

3. The secondary attention mechanism-based open-source software project developer recommendation method of claim 1, characterized in that: modeling a relative weight relationship between the team and the project document through a second layer of attention mechanism to obtain overall characteristics of the project;

$v_p = \mu(W_p p + b_p)$

wherein $W_p$ and $b_p$ respectively denote the weight parameter and the bias term of the network, and $p$ denotes the feature representation of the project document obtained by the word vector mean; the interactions between the candidate developer and the project and project team are then modeled; to model how different candidate developers weigh the open source project document against the project's current team at selection time, an attention mechanism is employed in which $\alpha_p$ and $\alpha_t$ denote the corresponding weights of the open source project and the team, yielding a feature representation $v_{comb}$ of the project and the team as a whole:

$v_{comb} = \alpha_p v_p + \alpha_t v_t$

where MLP denotes the multi-layer perceptron.