CN108647863B

CN108647863B - Project popularity analysis method based on mixed effect linear regression model

Info

Publication number: CN108647863B
Application number: CN201810377403.6A
Authority: CN
Inventors: 常俊胜; 胡东阳; 王涛; 余跃; 王怀民; 尹刚; 李耀宗
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2018-04-23
Filing date: 2018-04-25
Publication date: 2020-10-27
Anticipated expiration: 2038-04-25
Also published as: CN108647863A

Abstract

Existing research examines defect reports (bug issues) and feature reports (feature issues) separately, which gives a one-sided picture of project popularity. To address this, the invention provides a project popularity analysis method based on a mixed effect linear regression model. Project data are collected from GitHub, and statistical analysis and regression modeling are used to determine how the number of defect reports and the number of feature reports in a project influence the project's popularity; comparing the influence of the two report types shows how each relates to an increase in popularity. Further, a four-dimensional analysis of the description diversity of defect reports and feature reports identifies how the two differ in description diversity. Because the method studies popularity by comparing the numbers of defect reports and feature reports within a project, it allows the popularity of the project to be evaluated comprehensively.

Description

Project popularity analysis method based on mixed effect linear regression model

Technical Field

The invention belongs to the field of computer open source software analysis, and particularly relates to a method for analyzing the influence of defect reports (bug issues) and feature reports (feature issues) on project popularity during project development.

Background

Software development is a complex process involving many steps and many developers. During development, code defects (bugs) commonly occur and new functions (features) are proposed, so the defect report (bug issue) and the feature report (feature issue) are two very important factors in the software project development process.

The numbers of bug issues and feature issues differ between projects with different goals and requirements, and this difference can influence project development, for example the popularity of the project. Existing research mainly examines bug issues and feature issues separately, so its assessment of project popularity is one-sided.

GitHub is an open source code hosting website with millions of users. It allows developers to create and manage projects and hosts information on a very large number of software projects. GitHub provides features such as followers, feeds, network graphs, and issues, which help developers manage code repositories. A wide variety of labels can be attached to GitHub issues, and GitHub does not itself define or distinguish defect labels and feature labels. By checking the labels of an issue, bug issues and feature issues can nevertheless be identified automatically. GitHub projects can be retrieved through REST APIs, and they span many application fields (such as game software, web applications, and operating systems). The projects also vary widely in programming language and in number of developers. These characteristics make GitHub a very attractive open source platform for collecting data for empirical studies. In previous work, GitHub data has been used to study programming languages, project popularity at large scale, and software testing.

The mixed effect linear regression model differs from the ordinary linear model in that it has random effects in addition to fixed effects, which makes the regression more comprehensive and more resistant to noise. The mixed effect linear regression model is sometimes also referred to as a multi-level linear model or a hierarchical linear model.

The mixed effect linear regression model formula is:

$$Y = X\beta + ZU + \varepsilon$$

wherein $Y$ is the dependent variable vector, $X$ is the independent variable matrix, $\beta$ is the fixed effect parameter vector corresponding to $X$, $Z$ is the random effect variable matrix, whose structure is the same as that of $X$, $U$ is the random effect parameter vector corresponding to $Z$, and $\varepsilon$ is the noise vector.
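To make the roles of the fixed effect, random effect, and noise terms concrete, the following is a minimal sketch that simulates data from the simplest special case of the model, a random intercept per group. The function name, the parameter names, and the choice of a single predictor are illustrative assumptions and are not taken from the patent.

```python
import random


def simulate_random_intercept(n_groups, n_per_group, beta0, beta1,
                              sd_group, sd_noise, seed=0):
    """Draw rows from y = beta0 + beta1 * x + u_group + eps (hypothetical helper)."""
    rng = random.Random(seed)
    rows = []
    for group in range(n_groups):
        # One random-effect draw per group: an entry of U, shared by every
        # observation that belongs to this group.
        u_group = rng.gauss(0.0, sd_group)
        for _ in range(n_per_group):
            x = rng.uniform(0.0, 10.0)       # fixed-effect predictor (a column of X)
            eps = rng.gauss(0.0, sd_noise)   # observation-level noise term
            y = beta0 + beta1 * x + u_group + eps
            rows.append({"group": group, "x": x, "u": u_group, "y": y})
    return rows


rows = simulate_random_intercept(n_groups=3, n_per_group=4,
                                 beta0=1.0, beta1=0.5,
                                 sd_group=2.0, sd_noise=0.1)
```

Observations in the same group share one random-effect value, which is what separates this model from an ordinary linear regression with independent errors.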

At present, no method for analyzing the popularity of the project by utilizing a mixed effect linear regression model exists.

Disclosure of Invention

Most existing research examines bug issues and feature issues separately, and little of it studies the effect on project popularity by comparing the numbers of bug issues and feature issues within a project, so the popularity of a project cannot be evaluated comprehensively. To address this, the invention provides a project popularity analysis method based on a mixed effect linear regression model. Further, by analyzing the description diversity of bug issues and feature issues in four dimensions, the difference between the two in description diversity is identified.

The technical scheme is as follows:

the project popularity analysis method based on the mixed effect linear regression model comprises the following steps:

step one, collecting project data from GitHub to establish a data set; the specific process is as follows:

1.1 randomly selecting F projects from GitHub, wherein F is a natural number whose value is set according to the required accuracy of the result;

1.2 selecting data in the projects: selecting all the issues in the F projects, recording the number of issues as S, wherein S is a natural number, and then collecting the statistical data of the S issues;

step two, constructing a mixed effect linear regression model, the construction method being as follows:

2.1 defining the dependent and independent variables of the mixed effect linear regression model:

nStars: the total number of Stars (likes) of the project, an indicator of the project's popularity.

AVG.timeLatency_bug: the average resolution time of bug issues in the project, representing how quickly bug issues are resolved, in minutes.

AVG.timeLatency_feature: the average resolution time of feature issues in the project, representing how quickly feature issues are resolved, in minutes.

AVG.comments_bug: the average number of comments on bug issues in the project.

AVG.comments_feature: the average number of comments on feature issues in the project.

nIssueBef: the number of issues created in the project in the 3 months before this issue was opened, representing the workload of the project.

nMembers: the total number of project members, representing the team size of the project.

hasAssignee: binary; the value is 1 if the issue has at least one assignee.

textLen: the total number of words in the issue text, representing the complexity of the issue.

issueType = bug: the issue is a bug issue.

issueType = feature: the issue is a feature issue.

wherein nStars is the dependent variable and the other variables listed above are independent variables.

2.2 obtaining the data for the dependent and independent variables of the mixed effect linear regression model defined in step 2.1 through the official application programming interface (API) provided by GitHub;
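As an illustration of step 2.2, the sketch below derives a few issue-level variables from one issue object in the shape returned by the GitHub REST API (fields `title`, `body`, `created_at`, `closed_at`, `comments`, `labels`, `assignees`). The helper names, the choice to count words over title plus body, and the use of `closed_at` as the closing time are assumptions made for illustration; an issue that was reopened may need its event history to recover the first close, which is not shown here.

```python
from datetime import datetime


def parse_time(stamp):
    """Parse a GitHub timestamp such as '2018-04-23T10:00:00Z'."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")


def issue_variables(issue):
    """Map one issue JSON object to issue-level variables (hypothetical helper)."""
    title = issue.get("title") or ""
    body = issue.get("body") or ""
    created = parse_time(issue["created_at"])
    closed_at = issue.get("closed_at")
    latency_minutes = None  # stays None while the issue is still open
    if closed_at:
        latency_minutes = (parse_time(closed_at) - created).total_seconds() / 60.0
    return {
        "labels": [label["name"] for label in issue.get("labels", [])],
        "nComments": issue.get("comments", 0),
        "hasAssignee": 1 if issue.get("assignees") else 0,
        # Assumption: textLen counts whitespace-separated words in title + body.
        "textLen": len((title + " " + body).split()),
        "timeLatency": latency_minutes,
    }


sample_issue = {
    "title": "Crash on start",
    "body": "App crashes when opened",
    "created_at": "2018-04-23T10:00:00Z",
    "closed_at": "2018-04-23T11:30:00Z",
    "comments": 3,
    "labels": [{"name": "bug"}],
    "assignees": [{"login": "alice"}],
}
variables = issue_variables(sample_issue)
```

Averaging `timeLatency` and `nComments` over the bug issues or feature issues of a project would then give the project-level AVG variables used in the model.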

2.3 fitting a mixed effect linear regression model to the dependent and independent variable data obtained in step 2.2 by means of lmer (provided by the lme4 package) in the R language, to obtain a model; the lmer call takes the following form:

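# lmer() is provided by the lme4 package: library(lme4)
# Note: lmer() also requires at least one random-effects term, e.g. (1 | group);
# the source does not show one, so none is added here.
model <- lmer(scale(log(nStars + 0.5)) ~
    scale(log(AVG.timeLatency_bug. + 0.5))
  + scale(log(AVG.timeLatency_feature. + 0.5))
  + scale(log(AVG.comments_bug. + 0.5))
  + scale(log(AVG.comments_feature. + 0.5))
  + scale(log(nIssueBef + 0.5))
  + scale(log(nMembers + 0.5))
  + scale(log(hasAssignee + 0.5))
  + scale(log(textLen + 0.5))
  + factor(issueType == "bug")
  + factor(issueType == "feature"),
  data = data, REML = FALSE)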

The use of lmer in the R language is common general knowledge in the art.
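Every numeric variable in the model formula is passed through `scale(log(x + 0.5))`. The following is a minimal sketch of that transformation, written under the assumption that `scale` is used with its defaults (centering on the mean and dividing by the sample standard deviation); the function name `scale_log` is hypothetical.

```python
import math
import statistics


def scale_log(values, shift=0.5):
    """Log-transform with a small shift, then standardize to mean 0 and sd 1."""
    # The shift keeps the logarithm defined for zero counts (e.g. 0 Stars).
    logged = [math.log(v + shift) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / sd for v in logged]


z_scores = scale_log([0, 10, 100, 1000])
```

The logarithm compresses heavily skewed counts such as Stars, and the standardization puts all predictors on a comparable scale, so the fitted coefficients can be compared with one another.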

step three, performing analysis of variance on the model to obtain the multiple regression analysis result, including the estimated coefficients with their standard errors and the sums of squares, and then calculating the variance contribution rates of the number of bug issues and the number of feature issues in the project, which represent their degree of influence on project popularity. The variance contribution rate of the number of bug issues is calculated as the sum of squares for "issueType = bug" divided by the total of the sums of squares of all independent variables;

the variance contribution rate of the number of feature issues is calculated as the sum of squares for "issueType = feature" divided by the total of the sums of squares of all independent variables.

If the variance contribution rate of the number of bug issues is greater than that of the number of feature issues, the number of bug issues in the project has the greater influence on the popularity of the project; otherwise, the number of feature issues has the greater influence.
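The calculation described in step three can be written compactly as follows. The symbols are notational assumptions introduced here: $SS_j$ denotes the sum of squares reported for the $j$-th of the $p$ independent variables, and $r_k$ denotes the variance contribution rate of variable $k$.

```latex
r_k = \frac{SS_k}{\sum_{j=1}^{p} SS_j},
\qquad
r_{\text{bug}} = \frac{SS_{\text{issueType = bug}}}{\sum_{j=1}^{p} SS_j},
\qquad
r_{\text{feature}} = \frac{SS_{\text{issueType = feature}}}{\sum_{j=1}^{p} SS_j}
```

The decision rule then compares $r_{\text{bug}}$ with $r_{\text{feature}}$; the larger rate identifies the issue type whose count has the greater influence on popularity.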

As a further improvement of the technical scheme of the invention, project developers continue to analyze the description diversity of the bug issue and the feature issue and find out the difference between the bug issue and the feature issue in the description diversity. The process is as follows:

4.1 randomly extracting M bug issues and N feature issues from the S issues, wherein M and N are natural numbers and the sum of M and N does not exceed S.

4.2 project developers read the web page content of each issue and mark the keywords and sentences of the web page content.

4.3 comparing the difference in description diversity between bug issues and feature issues in terms of four attributes: code segment (code), link (https), @, and picture (picture). The method comprises the following steps:

4.3.1 opening the web page of each sampled issue and recording the information of the four attributes: if the issue page contains a code segment, the code segment label is marked as 1, otherwise it is marked as 0. If the issue page contains an https link, the https link label is marked as 1, otherwise it is marked as 0. If the issue page contains @, the @ label is marked as 1, otherwise it is marked as 0. If the issue page contains picture content, the picture label is marked as 1, otherwise it is marked as 0.

4.3.2 calculating, for each of the four attributes, the proportion of issues whose label equals 1 among the M bug issues and among the N feature issues, respectively;

4.4 if the proportions of the four attributes marked 1 are higher in the bug issues than in the feature issues, the description diversity of bug issues is higher than that of feature issues, and bug issues have the greater influence on project popularity; if instead the proportions of the four attributes marked 1 are higher in the feature issues than in the bug issues, the description diversity of feature issues is higher than that of bug issues, and feature issues have the greater influence on project popularity; otherwise, bug issues and feature issues have a similar influence on project popularity.
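Steps 4.3.1 and 4.3.2 can be sketched in code. The patent describes reading the issue web page; the sketch below instead assumes that the raw Markdown body of the issue is available and uses simple textual markers for the four attributes, which is a simplification. All names are hypothetical.

```python
import re

# Markdown code-fence marker, built this way so the literal does not appear
# inside this example block.
FENCE = "`" * 3


def diversity_flags(body):
    """Return the four 0/1 attribute labels for one issue body."""
    return {
        "code": 1 if FENCE in body else 0,
        "link": 1 if "https://" in body else 0,
        # '@' not preceded by a word character, so e-mail addresses do not count.
        "mention": 1 if re.search(r"(?<!\w)@\w", body) else 0,
        "picture": 1 if ("![" in body or "<img" in body) else 0,
    }


def proportions(flag_rows):
    """Share of issues whose label equals 1, per attribute."""
    n = len(flag_rows)
    return {key: sum(row[key] for row in flag_rows) / n for key in flag_rows[0]}


first = diversity_flags(
    "Crash, see https://example.com cc @alice\n" + FENCE + "\ntrace\n" + FENCE
)
second = diversity_flags("mail me at a@b.com")
shares = proportions([first, second])
```

Calling `proportions` once on the sampled bug issues and once on the sampled feature issues gives the two columns that step 4.4 compares.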

As a further improvement of the technical solution of the present invention, in step 1.1, to ensure the generality of the experimental data, the following restriction is set on the selection of projects: each selected project includes at least 10 bug issues and at least 10 feature issues.

As a further improvement of the technical solution of the present invention, the statistical data of the issues in step 1.2 include:

(1) key indicators of the project to which each issue belongs, including the project language, the number of project forks, the number of project Stars, the number of project members, and the number of issues created in the project in the three months before the current issue was opened;

(2) key indicators of each issue, including the length of the issue's title and body, the number of comments, the creation time, the closing time, and whether the issue is assigned;

(3) for each issue, information about the developer who submitted it, including whether that developer had submitted any issue before this one: if so, the mark is 1; otherwise, the mark is 0.
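Two of the indicators above depend on the history of the project rather than on the issue alone. The following is a minimal sketch of how they might be computed, assuming that "three months" is approximated by a 90-day window and that issue creation times are already parsed into `datetime` objects; both function names are hypothetical.

```python
from datetime import datetime, timedelta


def n_issue_bef(created, project_issue_times, window_days=90):
    """Count issues of the same project created in the window before `created`."""
    window_start = created - timedelta(days=window_days)
    return sum(1 for t in project_issue_times if window_start <= t < created)


def submitted_before(author, created, project_issues):
    """Return 1 if `author` opened any issue strictly before `created`, else 0.

    `project_issues` is a list of (author, creation time) pairs.
    """
    return 1 if any(a == author and t < created for a, t in project_issues) else 0


current = datetime(2018, 4, 1)
times = [datetime(2018, 3, 1), datetime(2018, 1, 15), datetime(2017, 12, 1), current]
workload = n_issue_bef(current, times)

history = [("alice", datetime(2018, 1, 15)), ("bob", datetime(2018, 3, 1)),
           ("alice", current)]
alice_flag = submitted_before("alice", current, history)
```

The current issue is excluded from both counts because the comparisons use a strict "before" test.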

As a further improvement of the technical solution of the present invention, in step 1.2, to ensure the reliability of the result of the analysis method, the following restriction is set on the selection of issue data: for the processing time of an issue, only the time difference between the creation time of the issue and its first closing time is counted.

Compared with the prior art, the invention has the beneficial effects that:

● The influence of bug issues and feature issues on project popularity is analyzed through a mixed effect linear regression model; by comparing the numbers of bug issues and feature issues within a project, the influence on popularity is studied comprehensively, so that the popularity of the project can be evaluated comprehensively.

● By analyzing the description diversity of bug issues and feature issues in four dimensions, the difference between the two in description diversity is identified; this in turn makes it possible to advise project developers to improve a project's popularity by increasing the description diversity of its bug issues or feature issues.

Drawings

FIG. 1 is a general flow diagram of the present invention.

Detailed Description

1.1 selecting items

The data for this study come from GitHub projects: 272 GitHub projects were randomly selected. To ensure the reliability of the experimental results and the generality of the experimental data, each selected project contains at least 10 bug issues and at least 10 feature issues. Table 1 lists example labels for bug issues and feature issues; issues carrying these labels are treated as bug issues or feature issues, respectively.

TABLE 1 bug issue and feature issue tags

Issue type	Example labels
bug issue	Bug; defect; type is bug; Browser Bug; bugfix; etc.
feature issue	feature; request; proposal; featreq; feautre; etc.
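The label-based identification and the project selection rule can be sketched as follows. The label sets are a small illustrative subset in the spirit of Table 1, and the matching rule (case-insensitive exact match, with bug labels taking precedence when both kinds are present) is an assumption; the patent does not specify it. The threshold is read here as "at least 10".

```python
# Illustrative label sets; the full lists used by the inventors are not given.
BUG_LABELS = {"bug", "defect", "bugfix", "browser bug"}
FEATURE_LABELS = {"feature", "request", "featreq"}


def classify_issue(labels):
    """Return 'bug', 'feature', or None for a list of label names."""
    names = {label.strip().lower() for label in labels}
    if names & BUG_LABELS:       # assumption: bug labels win if both kinds appear
        return "bug"
    if names & FEATURE_LABELS:
        return "feature"
    return None


def eligible_projects(issues_by_project, min_each=10):
    """Keep projects with at least `min_each` bug issues and feature issues."""
    keep = []
    for project, issues in issues_by_project.items():
        kinds = [classify_issue(labels) for labels in issues]
        if kinds.count("bug") >= min_each and kinds.count("feature") >= min_each:
            keep.append(project)
    return keep


projects = {
    "a": [["bug"]] * 10 + [["feature"]] * 10,
    "b": [["Bug"]] * 12 + [["feature"]] * 3,
}
selected = eligible_projects(projects)
```

Issues whose labels match neither set are left unclassified and do not count toward either threshold.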

1.2 selecting data

To ensure the reliability of the experimental results, for the processing time of an issue we consider only the time difference between its creation time and its first closing time. On this basis, we selected 287,703 issues from the 272 selected projects. We collected key indicators of the projects to which the issues belong, including the project language, the number of forks, the number of Stars, and the number of members. We also collected key indicators of the 287,703 issues themselves, including the length of the title and body of each issue, the number of comments, the creation time, the closing time, and whether the issue is assigned. Finally, we collected information about the developer who submitted each issue, including whether that developer had submitted any issue before this one (marked 1 if so, otherwise 0). Table 2 shows summary statistics for the 287,703 issues.

TABLE 2 Summary statistics of the 287,703 issues

Statistic	Mean	Standard deviation	Minimum	Median	Maximum
Number of project members	13.4	30.3	0.0	4.0	175.0
Number of project forks	1,609	3,709.0	0.0	463.0	49,657.0
Number of project Stars	5,839.0	10,446.4	0.0	1,251.0	69,834.0
Total number of issues	1,062.0	1,038.3	1.0	764.0	7,910.0
Total number of bug issues	962.8	986.2	1.0	704.0	7,910.0
Total number of feature issues	239.2	309.5	1.0	160.0	2,139.0
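Each row of a summary table of this kind reports the same five statistics. The following is a minimal sketch of how one row could be computed from a list of per-project values; whether the inventors used the sample or the population standard deviation is not stated, so the sample version is assumed here, and the input values are made up for illustration.

```python
import statistics


def summarize(values):
    """Return mean, standard deviation, minimum, median, and maximum."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),   # assumption: sample standard deviation
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


# Hypothetical per-project member counts, not data from the patent.
row = summarize([1, 2, 3, 4, 10])
```

A median well below the mean, as in several rows of the table, indicates a right-skewed distribution, which is consistent with the log transformation applied to the variables before fitting.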

2.1 Defining the dependent and independent variables of the regression model. Our independent variables come from three different levels: the project level, the developer level, and the issue level:

nStars: the total number of Stars of the project, an indicator of the project's popularity.

AVG.comments_bug: the average number of comments on bug issues in the project.

AVG.comments_feature: the average number of comments on feature issues in the project.

hasAssignee: binary; the value is 1 if the issue has at least one assignee.

issueType = bug: the issue is of the bug type.

issueType = feature: the issue is of the feature type.

wherein nStars is the dependent variable and the other variables are independent variables.

2.2 Because GitHub provides an official API, experimenters can conveniently obtain historical data on project development activity on GitHub. All variable data in this experiment, that is, the dependent and independent variable data of the mixed effect linear regression model defined in step 2.1, were therefore obtained through the official GitHub application programming interface (API).

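# lmer() is provided by the lme4 package: library(lme4)
# Note: lmer() also requires at least one random-effects term, e.g. (1 | group);
# the source does not show one, so none is added here.
model <- lmer(scale(log(nStars + 0.5)) ~
    scale(log(AVG.timeLatency_bug. + 0.5))
  + scale(log(AVG.timeLatency_feature. + 0.5))
  + scale(log(AVG.comments_bug. + 0.5))
  + scale(log(AVG.comments_feature. + 0.5))
  + scale(log(nIssueBef + 0.5))
  + scale(log(nMembers + 0.5))
  + scale(log(hasAssignee + 0.5))
  + scale(log(textLen + 0.5))
  + factor(issueType == "bug")
  + factor(issueType == "feature"),
  data = data, REML = FALSE)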

wherein nStars is the dependent variable, issueType = bug and issueType = feature are the main independent variables under analysis, and the other variables are all random effect variables.

Step three, performing analysis of variance on the model to obtain the multiple regression analysis result, including the estimated coefficients with their standard errors and the sums of squares. Table 3 shows the results of the multiple regression analysis:

TABLE 3 multiple regression analysis results

Variable	Estimate (standard error)	Sum of squares
log(AVG.timeLatency_bug.+0.5)	-0.2173 (0.0020)	4,827.8
log(AVG.timeLatency_feature.+0.5)	-0.1838 (0.0019)	3,443.4
log(AVG.comments_bug.+0.5)	0.1747 (0.0025)	3,112.2
log(AVG.comments_feature.+0.5)	0.1025 (0.0037)	2,413.4
log(nIssueBef+0.5)	0.0403 (0.0075)	23.6
log(nMembers+0.5)	0.0633 (0.0052)	121.2
log(hasAssignee+0.5)	0.0690 (0.0023)	765.6
log(textLen+0.5)	0.0149 (0.0019)	51.8
issueType = bug	0.4292 (0.0063)	12,784.5
issueType = feature	0.1107 (0.0024)	2,023.8

Statistical analysis shows that nStars is positively correlated with both the bug type and the feature type: the more bug issues and feature issues a project has, the higher its nStars and the higher its popularity. The variance contribution rate of issueType = bug is 12,784.5/(4,827.8 + 3,443.4 + … + 12,784.5 + 2,023.8) = 43.2%, and the variance contribution rate of issueType = feature is 2,023.8/(4,827.8 + 3,443.4 + … + 12,784.5 + 2,023.8) = 6.8%. Since 43.2% is much higher than 6.8%, the number of bug issues in a project has a greater influence on the popularity of the project than the number of feature issues.
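The two contribution rates can be reproduced directly from the sums of squares in Table 3. The short sketch below does this; the dictionary keys and the function name are illustrative, while the numeric values are copied from the table.

```python
# Sums of squares as reported in Table 3.
sum_sq = {
    "log(AVG.timeLatency_bug.+0.5)": 4827.8,
    "log(AVG.timeLatency_feature.+0.5)": 3443.4,
    "log(AVG.comments_bug.+0.5)": 3112.2,
    "log(AVG.comments_feature.+0.5)": 2413.4,
    "log(nIssueBef+0.5)": 23.6,
    "log(nMembers+0.5)": 121.2,
    "log(hasAssignee+0.5)": 765.6,
    "log(textLen+0.5)": 51.8,
    "issueType=bug": 12784.5,
    "issueType=feature": 2023.8,
}


def variance_contribution(sums_of_squares):
    """Divide each sum of squares by the total over all independent variables."""
    total = sum(sums_of_squares.values())
    return {name: value / total for name, value in sums_of_squares.items()}


rates = variance_contribution(sum_sq)
bug_pct = round(rates["issueType=bug"] * 100, 1)          # 43.2
feature_pct = round(rates["issueType=feature"] * 100, 1)  # 6.8
```

The total of the ten sums of squares is 29,567.3, so the two rates are 12,784.5/29,567.3 and 2,023.8/29,567.3.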

Conclusion: the study of the influence of bug issues and feature issues on project popularity, based on the mixed effect linear regression model, shows that the more bug issues and feature issues a project has, the higher its number of Stars (nStars) and the higher its popularity. Moreover, the number of bug issues in a project has a greater influence on the popularity of the project than the number of feature issues.

Step four, based on the conclusion obtained from the mixed effect linear regression model in step three, project developers continue by analyzing the description diversity of bug issues and feature issues to identify the difference between the two in description diversity. The process is as follows:

4.1 First, we randomly drew 10,000 issues from the data set: 5,000 bug issues and 5,000 feature issues.

4.3.1 Opening the web page of each sampled issue and recording its related information. For example, if the issue page contains a code segment, the code segment label is marked as 1, otherwise it is marked as 0. If the issue page contains an https link, the https link label is marked as 1, otherwise it is marked as 0. If the issue page contains @, the @ label is marked as 1, otherwise it is marked as 0. If the issue page contains picture content, the picture label is marked as 1, otherwise it is marked as 0.

4.3.2 calculate the ratio of the four attributes with a label equal to 1 in 5000 bug issue and 5000 feature issue samples, respectively; the statistical results are shown in table 4.

TABLE 4 Description diversity

Attribute	Proportion in bug issues	Proportion in feature issues
Code segment (code)	34.0%	13%
Link (https)	52.0%	24.0%
@	16.0%	6.0%
Picture (picture)	14.0%	2.0%
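The comparison rule of step 4.4 can be applied to the proportions in Table 4 as sketched below. The rule is interpreted here as requiring all four proportions to be higher on one side, which is an assumption about the intended reading; the function and key names are hypothetical, and the numbers are taken from the table.

```python
def compare_diversity(bug_ratio, feature_ratio):
    """Return which issue type has the higher description diversity."""
    attributes = list(bug_ratio)
    if all(bug_ratio[a] > feature_ratio[a] for a in attributes):
        return "bug"
    if all(feature_ratio[a] > bug_ratio[a] for a in attributes):
        return "feature"
    return "similar"   # mixed result: neither type dominates on all four


# Proportions from Table 4.
bug_ratio = {"code": 0.34, "link": 0.52, "mention": 0.16, "picture": 0.14}
feature_ratio = {"code": 0.13, "link": 0.24, "mention": 0.06, "picture": 0.02}
verdict = compare_diversity(bug_ratio, feature_ratio)
```

With these values the verdict is `"bug"`, since every attribute is more frequent in the bug issue sample.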

4.4 Table 4 shows the differences between bug issues and feature issues in the four dimensions of description diversity: bug issues have more description diversity than feature issues in every dimension. The conclusion is that one reason bug issues influence project popularity more than feature issues is that bug issues have greater description diversity. The project popularity analysis method based on the regression model provided by the invention therefore suggests that project developers increase the description diversity of feature issues to improve the popularity of their projects.

Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. The project popularity analysis method based on the mixed effect linear regression model is characterized by comprising the following steps of:

1.1 randomly selecting F items from the GitHub, wherein F is a natural number, and setting the value of F according to the requirement of result accuracy;

nStars: the total number of Stars (likes) of the project;

AVG.timeLatency_bug: the average resolution time of bug issues, that is, defect reports, in the project, in minutes;

AVG.timeLatency_feature: the average resolution time of feature issues, that is, feature reports, in the project, in minutes;

AVG.comments_bug: the average number of comments on bug issues in the project;

AVG.comments_feature: the average number of comments on feature issues in the project;

nIssueBef: the number of issues created in the project in the 3 months before this issue was opened;

nMembers: the total number of project members;

hasAssignee: binary; the value is 1 if the issue has at least one assignee;

textLen: the total number of words in the issue text;

issueType = bug: the issue is a bug issue;

issueType = feature: the issue is a feature issue;

2.2 acquiring the dependent variable and independent variable data of the mixed effect linear regression model defined in step 2.1 through the official application programming interface (API) provided by GitHub;

2.3, constructing a mixed effect linear regression model by using an lmer package in the R language for the dependent variable and independent variable data of the mixed effect linear regression model obtained in the step 2.2 to obtain a model;

step three, performing analysis of variance on the model to obtain the multiple regression analysis result, and calculating the variance contribution rates of the number of bug issues and the number of feature issues in the project, which represent their degree of influence on project popularity; if the variance contribution rate of the number of bug issues is greater than that of the number of feature issues, the number of bug issues in the project has the greater influence on the popularity of the project; otherwise, the number of feature issues has the greater influence.

2. The project popularity analysis method based on a mixed effect linear regression model according to claim 1, wherein the selection of projects in step 1.1 is restricted as follows: each selected project includes at least 10 bug issues and at least 10 feature issues.

3. The method for item popularity analysis based on a mixed-effect linear regression model as claimed in claim 1, wherein the statistical data of issue in step 1.2 includes:

(1) key indexes of projects to which each issue belongs;

(2) key indicators for each issue;

(3) for each issue, developer information for that issue is submitted.

4. The method for analyzing popularity of projects based on the mixed-effect linear regression model as claimed in claim 1, wherein in the step 1.2, the following limits are set for the data selection in issue: for the processing time of issue, only the time difference from the creation time of issue to the first closing time of issue is counted.

5. The method for analyzing the popularity of the project based on the mixed-effect linear regression model as claimed in any one of claims 1 to 4, wherein the project developer continues to analyze the descriptive diversity of the bug issue and the feature issue to find out the difference between the descriptive diversity of the bug issue and the feature issue; the process is as follows:

4.1 randomly extracting M bug issues and N feature issues from S issues, wherein M, N are natural numbers, and the sum of M, N is not more than S;

4.2 the project developer reads the webpage content of each issue and marks the keywords and sentences of the webpage content;

4.3 comparing the differences in description diversity between bug issues and feature issues in terms of four attributes: code segment, link, @, and picture; the method comprises the following steps:

4.3.1 logging in the webpage interface of issue, counting the information of four attributes of the sample issue: if the issue interface contains a code segment, recording a code segment label as 1, otherwise, recording as 0; if the issue interface contains https links, the https link label is marked as 1, otherwise, the https link label is marked as 0; if the issue interface contains @, then the @ label is marked as 1, otherwise, the @ label is marked as 0; if the issue interface contains picture content, the picture label is marked as 1, otherwise, the picture label is marked as 0;

4.4 if the proportions of the four attributes marked 1 are higher in the bug issues than in the feature issues, the description diversity of bug issues is higher than that of feature issues, and bug issues have the greater influence on project popularity; if instead the proportions of the four attributes marked 1 are higher in the feature issues than in the bug issues, the description diversity of feature issues is higher than that of bug issues, and feature issues have the greater influence on project popularity; otherwise, bug issues and feature issues have a similar influence on project popularity.