CN112114795B - Method and device for predicting deactivation of auxiliary tool in open source community

Info

Publication number
CN112114795B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010989416.6A
Other languages
Chinese (zh)
Other versions
CN112114795A (en)
Inventor
蒋竞
刘征宇
王鑫
张莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/33 Intelligent editors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention relates to a method and a device for predicting the deactivation of an auxiliary tool in an open source community, belongs to the technical field of computer science, and solves the problem that existing prediction methods cannot predict the deactivation of an auxiliary tool accurately and reasonably because the definition of auxiliary tool use/deactivation is vague and few features are collected. The method comprises the following steps: acquiring project data and data on the projects' use of auxiliary tools to obtain a historical data set; extracting effective features of the projects' use of auxiliary tools based on the historical data set, generating feature vectors, and obtaining an input matrix based on the feature vectors; constructing an auxiliary tool deactivation prediction model PATpredict based on the input matrix and an XGBoost classifier; and using the auxiliary tool deactivation prediction model PATpredict to perform deactivation prediction on the target auxiliary tool used by the target project to obtain a deactivation prediction result. The prediction result can be obtained quickly and efficiently, and its accuracy is improved.

Description

Method and device for predicting deactivation of auxiliary tool in open source community
Technical Field
The invention relates to the technical field of computer science, and in particular to a method and a device for predicting the deactivation of an auxiliary tool in an open source community.
Background
An open source community, also called an open source code community, is a platform on which software source code is published under a corresponding open source software license, and is also a space where developers can freely learn and communicate. Typical open source software communities include GitHub and Open Source China, among which GitHub is the largest open source software project hosting platform in the world.
Existing research shows that the use of auxiliary tools by projects in open source communities is a common phenomenon and that the deactivation of auxiliary tools is also an important phenomenon. However, the prior art still has the following drawbacks: little relevant information is selected for predicting the deactivation of auxiliary tools in open source communities, so the accuracy of the prediction results obtained for a large number of auxiliary tools is low; and because the conditions under which an open source community project uses or deactivates an auxiliary tool are vaguely defined, the models provided by the prior art cannot predict the deactivation of auxiliary tools accurately and reasonably.
Disclosure of Invention
In view of the foregoing analysis, embodiments of the present invention aim to provide a method for predicting the deactivation of an auxiliary tool in an open source community, so as to solve the problem that existing methods cannot predict the deactivation of an auxiliary tool accurately and reasonably because few features are collected.
In one aspect, an embodiment of the present invention provides a method for predicting the deactivation of an auxiliary tool in an open source community, including the following steps:
acquiring project data and data on the projects' use of auxiliary tools to obtain a historical data set;
extracting effective features of the projects' use of auxiliary tools based on the historical data set, generating feature vectors, and obtaining an input matrix based on the feature vectors;
constructing an auxiliary tool deactivation prediction model PATpredict based on the input matrix and an XGBoost classifier;
and performing deactivation prediction on the target auxiliary tool used by the target project by using the auxiliary tool deactivation prediction model PATpredict to obtain a deactivation prediction result.
Further, the effective features of the historical data set are extracted from four dimensions: project attributes, the effect of the project's use of the target auxiliary tool, auxiliary tool attributes, and the characteristics of the project's use of auxiliary tools.
Further, the effective features extracted based on the project attribute dimension include: the programming language used by the project, whether the project is an organization project, whether the project has a wiki introduction website, whether the project has an official website, whether the project has a homepage on GitHub, the age of the project, the year in which the project was created, the open source license contained in the project, the maximum text similarity to the descriptions of projects that have not deactivated the auxiliary tool, the average text similarity to the descriptions of projects that have not deactivated the auxiliary tool, the maximum text similarity to the descriptions of projects that have deactivated the auxiliary tool, and the average text similarity to the descriptions of projects that have deactivated the auxiliary tool;
the effective features extracted based on the dimension of the effect of the project's use of the target auxiliary tool include: the success ratio, failure ratio, error ratio and pending ratio of the auxiliary tool's execution results relative to the number of tasks, the longest task time executed by the auxiliary tool, the average task time, the number of commits for which the project used the auxiliary tool, the number of contribution requests of the project that contain the auxiliary tool name keyword, and the number of contributors to the project;
the effective features extracted based on the auxiliary tool attribute dimension include: the auxiliary tool name, the auxiliary tool category, and whether the auxiliary tool is registered in the GitHub store;
the effective features extracted based on the dimension of the characteristics of the project's use of auxiliary tools include: the number of auxiliary tools used by the project and the number of auxiliary tools deactivated by the project.
Further, constructing the auxiliary tool deactivation prediction model PATpredict based on the input matrix and the XGBoost algorithm comprises the following steps:
adding to the input matrix labels indicating whether a project has deactivated an auxiliary tool, the labels including deactivated and not deactivated;
and inputting the input matrix and the corresponding labels into the XGBoost classifier for model training to obtain the auxiliary tool deactivation prediction model PATpredict.
Further, performing deactivation prediction on the auxiliary tool used by the target project by using the auxiliary tool deactivation prediction model PATpredict to obtain a deactivation prediction result comprises the following steps:
acquiring project data corresponding to the target project and data on the target project's use of auxiliary tools to obtain a historical data set to be predicted;
extracting effective features of the target project's use of auxiliary tools based on the historical data set to be predicted, generating a feature vector to be predicted, and obtaining an input matrix to be predicted based on the feature vector to be predicted;
and inputting the input matrix to be predicted into the auxiliary tool deactivation prediction model PATpredict to obtain a prediction result.
In another aspect, an embodiment of the present invention provides a device for predicting the deactivation of an auxiliary tool in an open source community, including:
a training data acquisition module, used for acquiring project data and data on the projects' use of auxiliary tools to obtain a historical data set;
an effective feature extraction module, used for extracting effective features of the projects' use of auxiliary tools based on the historical data set, generating feature vectors, and obtaining an input matrix based on the feature vectors;
a prediction model obtaining module, used for constructing an auxiliary tool deactivation prediction model PATpredict based on the input matrix and an XGBoost classifier;
and a deactivation prediction module, used for performing deactivation prediction on the target auxiliary tool used by the target project by using the auxiliary tool deactivation prediction model PATpredict to obtain a deactivation prediction result.
Further, the effective feature extraction module extracts the effective features of the historical data set from four dimensions: project attributes, the effect of the project's use of the target auxiliary tool, auxiliary tool attributes, and the characteristics of the project's use of auxiliary tools.
Further, the effective features that the effective feature extraction module extracts based on the project attribute dimension include: the programming language used by the project, whether the project is an organization project, whether the project has a wiki introduction website, whether the project has an official website, whether the project has a homepage on GitHub, the age of the project, the year in which the project was created, the open source license contained in the project, the maximum text similarity to the descriptions of projects that have not deactivated the auxiliary tool, the average text similarity to the descriptions of projects that have not deactivated the auxiliary tool, the maximum text similarity to the descriptions of projects that have deactivated the auxiliary tool, and the average text similarity to the descriptions of projects that have deactivated the auxiliary tool;
the effective features extracted based on the dimension of the effect of the project's use of the target auxiliary tool include: the success ratio, failure ratio, error ratio and pending ratio of the auxiliary tool's execution results relative to the number of tasks, the longest task time executed by the auxiliary tool, the average task time, the number of commits for which the project used the auxiliary tool, the number of contribution requests of the project that contain the auxiliary tool name keyword, and the number of contributors to the project;
the effective features extracted based on the auxiliary tool attribute dimension include: the auxiliary tool name, the auxiliary tool category, and whether the auxiliary tool is registered in the GitHub store;
the effective features extracted based on the dimension of the characteristics of the project's use of auxiliary tools include: the number of auxiliary tools used by the project and the number of auxiliary tools deactivated by the project.
Further, the prediction model obtaining module executes the following process:
adding to the input matrix labels indicating whether a project has deactivated an auxiliary tool, the labels including deactivated and not deactivated;
and inputting the input matrix and the corresponding labels into the XGBoost classifier for model training to obtain the auxiliary tool deactivation prediction model PATpredict.
Further, the deactivation prediction module performs the following process:
acquiring project data corresponding to the target project and data on the target project's use of auxiliary tools to obtain a historical data set to be predicted;
extracting effective features of the target project's use of auxiliary tools based on the historical data set to be predicted, generating a feature vector to be predicted, and obtaining an input matrix to be predicted based on the feature vector to be predicted;
and inputting the input matrix to be predicted into the auxiliary tool deactivation prediction model PATpredict to obtain a prediction result.
Compared with the prior art, the invention can realize at least one of the following beneficial effects:
1. In the method for predicting the deactivation of an auxiliary tool in an open source community, project data and data on the projects' use of auxiliary tools are obtained from different dimensions to form a historical data set, an auxiliary tool deactivation prediction model is constructed based on the historical data set, and deactivation prediction is finally performed on the target auxiliary tool used by the target project.
2. The attribute data acquired by crawler technology for predicting whether a project will stop using a given auxiliary tool is comprehensive. Effective features are obtained from four dimensions (project attributes, the effect of the project's use of the target auxiliary tool, auxiliary tool attributes, and the characteristics of the project's use of auxiliary tools), which provides technical support and a basis for the later construction of the auxiliary tool deactivation prediction model PATpredict; at the same time, the complete and comprehensive effective features improve the accuracy of auxiliary tool deactivation prediction.
3. The auxiliary tool deactivation prediction model PATpredict is constructed from the input matrix and the labels, is simple and easy to implement, provides technical support for later predicting whether the target project will deactivate the target auxiliary tool, and offers high accuracy and high prediction speed.
In the invention, the technical schemes can be combined with each other to realize more preferable combination schemes. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a flow diagram of a method for predicting the deactivation of an auxiliary tool in an open source community according to an embodiment;
FIG. 2 is a model framework of the method for predicting the deactivation of an auxiliary tool in an open source community according to an embodiment;
FIG. 3 is a block diagram of a device for predicting the deactivation of an auxiliary tool in an open source community according to another embodiment;
Reference numerals:
100 - training data acquisition module; 200 - effective feature extraction module; 300 - prediction model obtaining module; 400 - deactivation prediction module.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
Existing methods for predicting the deactivation of auxiliary tools in open source communities select little relevant information, so the accuracy of the prediction results obtained for a large number of auxiliary tools is low; moreover, the conditions under which open source community projects use or deactivate auxiliary tools are vaguely defined, so the models provided by the prior art cannot predict the deactivation of auxiliary tools accurately and reasonably. The present application therefore provides a method and a device for predicting the deactivation of an auxiliary tool in an open source community, in which deactivation means that a project that has used the auxiliary tool did not use it during the last 90 days of the project's activity. Project data and data on the projects' use of auxiliary tools are obtained from different dimensions to form a historical data set, an auxiliary tool deactivation prediction model is constructed based on the historical data set, and deactivation prediction is finally performed on the target auxiliary tool used by the target project, which solves the problem of low accuracy in existing prediction methods and improves the accuracy of the prediction result.
In one embodiment of the present invention, a method for predicting the deactivation of an auxiliary tool in an open source community is disclosed, where deactivation means that a project that has used the auxiliary tool did not use it during the last 90 days of the project's activity. As shown in FIG. 1, the method includes the following steps S1 to S4.
Step S1: acquire project data and data on the projects' use of auxiliary tools to obtain a historical data set. On the GitHub platform, submitting a contribution request (a pull request on GitHub) triggers the use of auxiliary tools. Therefore, in order to capture as much tool usage as possible, the IDs of the top 10,000 projects ranked by number of contribution requests are selected, and project data and data on the projects' use of auxiliary tools are acquired according to the selected project IDs; crawler technology can be used for the data acquisition. The crawled project data and tool usage data constitute the historical data set.
Preferably, the effective features of the historical data set are extracted from four dimensions: project attributes, the effect of the project's use of the target auxiliary tool, auxiliary tool attributes, and the characteristics of the project's use of auxiliary tools. In the present application, the factors that influence whether a project deactivates an auxiliary tool are summarized comprehensively from these four dimensions, and include the basic attributes of the project, the basic attributes of the auxiliary tool, the project's habits in using auxiliary tools, and the like.
The effective features extracted based on the project attribute dimension include: the programming language used by the project, whether the project is an organization project, whether the project has a wiki introduction website, whether the project has an official website, whether the project has a homepage on GitHub, the age of the project, the year in which the project was created, the open source license contained in the project, the maximum text similarity to the descriptions of projects that have not deactivated the auxiliary tool, the average text similarity to the descriptions of projects that have not deactivated the auxiliary tool, the maximum text similarity to the descriptions of projects that have deactivated the auxiliary tool, and the average text similarity to the descriptions of projects that have deactivated the auxiliary tool. Projects can be divided into personal projects and organization projects. An organization project is a project created and managed by an organization; organization projects are generally considered to be more standardized, which facilitates the later construction of the prediction model.
Specifically, the maximum and average text similarity to the descriptions of projects that have not deactivated the auxiliary tool are obtained by calculating the text similarity between the project's description and the description of each project that has not deactivated the auxiliary tool, and then taking the maximum and the average of all these text similarities. The maximum and average text similarity to the descriptions of projects that have deactivated the auxiliary tool are calculated in the same way. The text similarity of two project descriptions is calculated as follows. First, natural language processing implemented with the NLTK toolkit is used to perform word segmentation, stemming and stop word removal on the project descriptions: word segmentation splits a project description text into individual word tokens; stemming mainly removes plural forms and verb tenses, that is, it converts each word to its base form; stop word removal reduces the data dimension by removing frequently used words without specific meaning, such as 'i', 'a' and 'is'. The project description text is then converted into a vector using term frequency-inverse document frequency (TF-IDF). Based on the vocabulary, a text feature vector is obtained for each of the two project descriptions, in which each dimension represents the TF-IDF value of the corresponding word in the text. The cosine similarity of the two vectors is then used to represent the text similarity TextsSim(D_a, D_b) between the two project descriptions, as shown in the following formula:

$$\mathrm{TextsSim}(D_a, D_b) = \frac{\alpha \cdot \beta}{\lVert \alpha \rVert \, \lVert \beta \rVert}$$

where $D_a$ and $D_b$ are the two project description texts, $\alpha$ and $\beta$ are the TF-IDF vectors corresponding to $D_a$ and $D_b$ respectively, and $\lVert \alpha \rVert$ denotes the 2-norm of the vector $\alpha$.
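As a rough illustration of the similarity calculation described above, the following Python sketch computes TF-IDF vectors and their cosine similarity using only the standard library. It is not the patented implementation: the regular-expression tokenizer, the tiny stop-word list, the omission of stemming, and the smoothed IDF formula are simplifying assumptions that stand in for the NLTK-based preprocessing, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; the text uses NLTK resources).
STOP_WORDS = {"i", "a", "is", "the", "an", "of", "for", "and", "to", "in"}


def preprocess(text):
    """Lower-case, split into word tokens and drop stop words (no stemming)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one {word: tf-idf} dict per document, using a smoothed IDF."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            w: (c / total) * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0)
            for w, c in counts.items()
        })
    return vectors


def cosine_similarity(alpha, beta):
    """Dot product of the two sparse vectors divided by their 2-norms."""
    dot = sum(v * beta.get(w, 0.0) for w, v in alpha.items())
    norm_a = math.sqrt(sum(v * v for v in alpha.values()))
    norm_b = math.sqrt(sum(v * v for v in beta.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_features(target, others):
    """Maximum and average similarity of `target` to each text in `others`."""
    vectors = tfidf_vectors([target] + list(others))
    sims = [cosine_similarity(vectors[0], v) for v in vectors[1:]]
    if not sims:
        return 0.0, 0.0
    return max(sims), sum(sims) / len(sims)
```

Calling `similarity_features` once with the descriptions of projects that have deactivated the tool and once with those that have not would yield the four similarity features of the project attribute dimension.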
The effective features extracted based on the dimension of the effect of the project's use of the target auxiliary tool include: the success ratio, failure ratio, error ratio and pending ratio of the auxiliary tool's execution results in the project relative to the number of tasks, the longest task time executed by the auxiliary tool, the average task time, the number of commits for which the project used the auxiliary tool, the number of contribution requests of the project that contain the auxiliary tool name keyword, and the number of contributors to the project.
Specifically, on the GitHub platform, each time a developer submits a contribution request, the execution of the auxiliary tools is triggered, and the GitHub platform records the execution results. The data set on the project's use of the target auxiliary tool can be crawled using crawler technology. It contains the result of each execution of the auxiliary tool, and by counting all the execution results one can obtain the success ratio, failure ratio, error ratio and pending ratio of the execution states relative to the number of tasks, the longest task time executed by the auxiliary tool, the average task time, the number of commits for which the project used the auxiliary tool, the number of contribution requests of the project that contain the auxiliary tool name keyword, and the number of contributors to the project. Since the execution results include the three outcomes success, failure and error, the success ratio, failure ratio and error ratio of the project's use of the target auxiliary tool can be obtained by counting the execution results directly.
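The counting described above can be sketched in a few lines of Python. The record format (a list of state and duration pairs), the state strings, and the function name are assumptions made for this example; the text does not specify how the crawled execution results are stored.

```python
from collections import Counter

STATES = ("success", "failure", "error", "pending")


def effect_features(executions):
    """Summarise one project's executions of one auxiliary tool.

    `executions` is a list of (state, duration_in_seconds) pairs; this
    record format is an assumption made for the example.
    """
    total = len(executions)
    if total == 0:
        features = {f"{state}_ratio": 0.0 for state in STATES}
        features.update(longest_task_time=0.0, average_task_time=0.0)
        return features
    counts = Counter(state for state, _ in executions)
    durations = [duration for _, duration in executions]
    # Each ratio is the share of executions that ended in the given state.
    features = {f"{state}_ratio": counts[state] / total for state in STATES}
    features["longest_task_time"] = max(durations)
    features["average_task_time"] = sum(durations) / total
    return features
```

For instance, four executions of which two succeeded would give a `success_ratio` of 0.5.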
The effective features extracted based on the auxiliary tool attribute dimension include: the auxiliary tool name, the auxiliary tool category, and whether the auxiliary tool is registered in the GitHub store. Specifically, the name of the auxiliary tool, its category, and whether it is registered in the GitHub store are basic information about the auxiliary tool used by the project, and the icon of the auxiliary tool is recorded when the project uses the tool.
The effective features extracted based on the dimension of the characteristics of the project's use of auxiliary tools include: the number of auxiliary tools used by the project and the number of auxiliary tools deactivated by the project. The GitHub platform keeps detailed records of the auxiliary tools used by projects, so the names of the auxiliary tools used by each project can be obtained through crawler technology together with a manually labeled mapping between tool icons and auxiliary tool names; after deduplication, the number of auxiliary tools used by the project and the number of auxiliary tools deactivated by the project are obtained.
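A minimal sketch of the deduplicated counting follows, assuming each crawled record is an (icon, deactivated) pair and that the manually labeled icon-to-name mapping is available as a dictionary. Both the record format and the names used here are hypothetical.

```python
def tool_count_features(records, icon_to_name):
    """Count distinct auxiliary tools used and deactivated by one project.

    `records` is an iterable of (icon, is_deactivated) pairs and
    `icon_to_name` maps a tool icon to a tool name; both formats are
    assumptions made for this example.
    """
    used = set()
    deactivated = set()
    for icon, is_deactivated in records:
        name = icon_to_name.get(icon)
        if name is None:
            continue  # icon missing from the manual mapping: skip the record
        used.add(name)  # sets remove duplicate records of the same tool
        if is_deactivated:
            deactivated.add(name)
    return {"tools_used": len(used), "tools_deactivated": len(deactivated)}
```

Using sets keeps the count at one per tool even when the same tool appears in many crawled records.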
The attribute data acquired by crawler technology for predicting whether a project will stop using a given auxiliary tool is comprehensive. 27 effective features are extracted from four dimensions (project attributes, the effect of the project's use of the target auxiliary tool, auxiliary tool attributes, and the characteristics of the project's use of auxiliary tools), which provides technical support and a basis for the later construction of the auxiliary tool deactivation prediction model PATpredict; at the same time, the complete and comprehensive effective features improve the accuracy of auxiliary tool deactivation prediction.
Step S2: extract effective features of the projects' use of auxiliary tools based on the historical data set, generate feature vectors, and obtain an input matrix based on the feature vectors. Specifically, after the 27 effective features of the projects' use of auxiliary tools are obtained based on step S1, the effective features are each converted into vector form to generate a plurality of rows of feature vectors, and the feature vectors are combined to form an input matrix including m rows of feature vectors, where m is the number of effective features, that is, m is 27. Obtaining the input matrix from the effective features of the projects' use of auxiliary tools provides a basis for the later construction of the auxiliary tool deactivation prediction model PATpredict.
Step S3: construct the auxiliary tool deactivation prediction model PATpredict based on the input matrix and an XGBoost classifier. Constructing the auxiliary tool deactivation prediction model PATpredict based on the input matrix and the XGBoost algorithm comprises the following steps:
Step S301: add to the input matrix labels indicating whether the auxiliary tool has been deactivated. After the input matrix is obtained, labels are added to it according to whether the project has deactivated the auxiliary tool; the labels include deactivated and not deactivated.
Step S302: input the input matrix and the corresponding labels into the XGBoost classifier for model training to obtain the auxiliary tool deactivation prediction model PATpredict. Specifically, the XGBoost algorithm is an ensemble learning algorithm based on gradient boosted trees with a good classification effect, and it essentially consists of a plurality of decision trees. During training, the decision parameters are fitted to the residual between the result obtained from the input matrix and the true label, the model performance is improved through successive iterations, and the auxiliary tool deactivation prediction model PATpredict is finally obtained by training on all the training set data. In the present application, the XGBoost algorithm can be used to construct the prediction model: the feature vectors generated in step S2 and the corresponding labels are used as the model input, and the model is trained to obtain the final auxiliary tool deactivation prediction model PATpredict.
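The labelling of step S301 can be sketched directly from the 90-day definition of deactivation used in this document. The function name, the use of `datetime.date` values, and the treatment of the exact 90-day boundary are assumptions made for illustration.

```python
from datetime import date, timedelta

# Window taken from the definition of deactivation in this document.
DEACTIVATION_WINDOW = timedelta(days=90)


def deactivation_label(last_project_activity, last_tool_use):
    """Return 1 (deactivated) or 0 (not deactivated) for a project-tool pair.

    The pair is labelled deactivated when the tool was last used more than
    90 days before the project's last activity. Treating exactly 90 days as
    "not deactivated" is an assumption of this sketch.
    """
    gap = last_project_activity - last_tool_use
    return 1 if gap > DEACTIVATION_WINDOW else 0
```

A project last active on 1 June 2020 whose tool last ran on 1 January 2020 would therefore be labelled 1, while a last run on 1 May 2020 would be labelled 0.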
The auxiliary tool deactivation prediction model PATpredict is constructed from the input matrix and the labels, is simple and easy to implement, provides technical support for later predicting whether the target project will deactivate the target auxiliary tool, and offers high accuracy and high prediction speed.
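To make the idea of fitting trees to residuals over successive iterations concrete, here is a deliberately small gradient boosting sketch in pure Python. It is not XGBoost and not PATpredict: it uses one-split trees (stumps), a logistic loss and a fixed learning rate, and it omits the regularisation and second-order gradient information that XGBoost adds. The layout of one feature row per project-tool pair, the toy data in the usage note, and all names are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Least-squares fit of a one-split tree (stump) to the residuals."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    return None if best is None else best[1:]


def train(X, y, n_rounds=20, learning_rate=0.5):
    """X: feature rows; y: labels, 1 = deactivated, 0 = not deactivated."""
    stumps = []
    scores = [0.0] * len(X)
    for _ in range(n_rounds):
        # Residual between the true label and the current predicted probability.
        residuals = [label - sigmoid(s) for label, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        stumps.append(stump)
        j, threshold, left_value, right_value = stump
        for i, row in enumerate(X):
            step = left_value if row[j] <= threshold else right_value
            scores[i] += learning_rate * step
    return {"stumps": stumps, "learning_rate": learning_rate}


def predict_proba(model, row):
    """Return (probability of continued use, probability of deactivation)."""
    score = sum(
        model["learning_rate"] * (lv if row[j] <= t else rv)
        for j, t, lv, rv in model["stumps"]
    )
    p_deactivation = sigmoid(score)
    return 1.0 - p_deactivation, p_deactivation
```

With toy rows such as `[[0.9, 1.0], [0.8, 2.0], [0.2, 5.0], [0.1, 6.0]]` and labels `[0, 0, 1, 1]`, a new row close to the last two rows receives a higher deactivation probability than use probability. In practice a library implementation of XGBoost would replace this sketch.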
Step S4: use the auxiliary tool deactivation prediction model PATpredict to perform deactivation prediction on the target auxiliary tool used by the target project and obtain a deactivation prediction result. After the auxiliary tool deactivation prediction model PATpredict is obtained in step S3, deactivation prediction can be performed on the target project's use of the target auxiliary tool.
Preferably, performing deactivation prediction on the auxiliary tool used by the target project by using the auxiliary tool deactivation prediction model PATpredict to obtain a deactivation prediction result comprises the following steps:
Step S401: acquire project data corresponding to the target project and data on the target project's use of auxiliary tools to obtain a historical data set to be predicted;
Step S402: extract effective features of the target project's use of auxiliary tools based on the historical data set to be predicted, generate a feature vector to be predicted, and obtain an input matrix to be predicted based on the feature vector to be predicted;
Step S403: input the input matrix to be predicted into the auxiliary tool deactivation prediction model PATpredict to obtain a prediction result.
Specifically, before predicting whether the target project will deactivate the target auxiliary tool, the project data corresponding to the target project and the data on the target project's use of the target auxiliary tool are crawled and combined to obtain the historical data set to be predicted. The 27 effective features are then extracted from the historical data set to be predicted to generate a feature vector, and the feature vector is input into the auxiliary tool deactivation prediction model PATpredict to obtain a prediction result. The prediction result given by PATpredict consists of the probability that the target project keeps using the target auxiliary tool and the probability that it deactivates the tool; the two probabilities are compared, and the outcome with the greater probability is taken as the deactivation prediction result.
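The final comparison of the two probabilities amounts to a one-line decision rule, sketched below. The function name, the returned strings, and the choice to report "not deactivated" when the two probabilities are equal are assumptions; the text does not say how ties are handled.

```python
def deactivation_result(p_use, p_deactivation):
    """Pick the outcome with the greater predicted probability.

    Ties are resolved as "not deactivated", which is an assumption of
    this sketch rather than something stated in the text.
    """
    return "deactivated" if p_deactivation > p_use else "not deactivated"
```

For example, predicted probabilities of 0.3 for continued use and 0.7 for deactivation would be reported as "deactivated".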
Compared with the prior art, the method for predicting the deactivation of an auxiliary tool in an open source community provided by this embodiment clearly defines what it means for a project to use or deactivate an auxiliary tool, obtains project data and data on projects' use of auxiliary tools from different dimensions to build the historical data set, constructs the auxiliary tool deactivation prediction model based on the historical data set, and finally performs deactivation prediction for the auxiliary tool used by the target project. This addresses the low accuracy of existing prediction methods and improves the accuracy of the prediction result.
As shown in fig. 2, effective features are extracted from four dimensions and the auxiliary tool deactivation prediction model PATpredict is constructed; the model can then be used to predict whether the target project will deactivate the target auxiliary tool.
In another embodiment of the present invention, a device for predicting the deactivation of an auxiliary tool in an open source community is disclosed, as shown in fig. 3, where deactivation means that a project using the auxiliary tool does not use the auxiliary tool within 90 days of the last activity. The device comprises: a training data acquisition module 100, used for acquiring project data and data on projects' use of auxiliary tools to obtain a historical data set; an effective feature extraction module 200, used for extracting effective features of projects' use of auxiliary tools based on the historical data set, generating feature vectors, and obtaining an input matrix based on the feature vectors; a prediction model obtaining module 300, used for constructing the auxiliary tool deactivation prediction model PATpredict according to the input matrix and the XGBoost classifier; and a deactivation prediction module 400, used for performing deactivation prediction on the target auxiliary tool used by the target project by using the auxiliary tool deactivation prediction model PATpredict to obtain a deactivation prediction result.
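The 90-day definition of deactivation can be illustrated with a short sketch. It rests on one reading of the definition, namely that a tool counts as deactivated when its last recorded use lies more than 90 days before the project's last activity; the function name `is_deactivated`, its parameters, and the treatment of a gap of exactly 90 days as "not deactivated" are assumptions.

```python
from datetime import date, timedelta

# Window taken from the definition of deactivation in the text.
DEACTIVATION_WINDOW = timedelta(days=90)


def is_deactivated(last_project_activity: date, last_tool_use: date) -> bool:
    """True if the tool's last use is more than 90 days before the project's last activity."""
    return last_project_activity - last_tool_use > DEACTIVATION_WINDOW


# Gap of 138 days between last tool use and last project activity.
print(is_deactivated(date(2020, 6, 1), date(2020, 1, 15)))
# Gap of 31 days.
print(is_deactivated(date(2020, 6, 1), date(2020, 5, 1)))
```

Labels produced this way would serve as the deactivated / not deactivated targets when the historical data set is assembled.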
The prediction device in the present application further comprises a front-end page display module, which displays the deactivation prediction results on a page so that a tool developer or a project manager can conveniently view and manage them. The front-end page display module is implemented with the React framework and is mainly responsible for rendering pages and for part of the page interaction logic; a user interacts with the system front end through a browser. According to user requirements, the system pages are divided into three parts: a project information page, a tool information page, and a deactivation prediction page. The project information part has 1 page, which shows a list of projects; a user can click a project to select it and see which auxiliary tools the project uses. The tool information part has 1 page, which shows the 76 auxiliary tools in the open source community and their official websites, together with the corresponding number of projects using each tool and the status counts. The deactivation prediction part has 1 page, which shows entries pairing projects with tools; a user can click the button in front of an entry to predict whether the project will deactivate the auxiliary tool, and after the prediction completes a window pops up showing the prediction result and the actual result.
Preferably, the effective feature extraction module extracts effective features of the historical data set from four dimensions: project attributes, the effect of the project using the target auxiliary tool, auxiliary tool attributes, and characteristics of the project's use of auxiliary tools. The effective features extracted based on the project attribute dimension comprise: the programming language used by the project, whether the project is an organization project, whether the project has a wiki introduction website, whether the project has an official website, whether the project has a homepage on GitHub, the age of the project, the year of creation of the project, the open source license included in the project, the maximum text similarity with the descriptions of projects that have not deactivated the auxiliary tool, the average text similarity with the descriptions of projects that have not deactivated the auxiliary tool, the maximum text similarity with the descriptions of projects that have deactivated the auxiliary tool, and the average text similarity with the descriptions of projects that have deactivated the auxiliary tool.
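The maximum and average text-similarity features can be sketched as below. The passage does not name the similarity measure, so Jaccard similarity over lowercase word sets is used purely as an assumed stand-in; the function names and the example descriptions are hypothetical.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase word sets of two descriptions."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def similarity_features(target: str, others: List[str]) -> Tuple[float, float]:
    """Return (maximum, average) similarity of `target` against `others`.

    Returns (0.0, 0.0) when there is nothing to compare against.
    """
    if not others:
        return 0.0, 0.0
    scores = [jaccard(target, description) for description in others]
    return max(scores), sum(scores) / len(scores)


# Hypothetical descriptions of projects that have deactivated the tool.
deactivated_descriptions = ["json parser library", "web framework"]
print(similarity_features("fast json parser", deactivated_descriptions))
```

Calling the same function once with the descriptions of projects that deactivated the tool and once with those that did not yields the four similarity features listed above.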
The effective features extracted based on the dimension of the effect of the project using the target auxiliary tool include: the success ratio, failure ratio, and error ratio among the execution results of the auxiliary tool used by the project, the proportion of tasks whose state is pending relative to the number of tasks, the longest task time and the average task time of the auxiliary tool's executions, the number of commits on which the project uses the auxiliary tool, the number of contribution requests in the project that contain the auxiliary tool's name as a keyword, and the number of contributions in the project. The effective features extracted based on the auxiliary tool attribute dimension comprise: the auxiliary tool name, the auxiliary tool category, and whether the auxiliary tool is registered in the GitHub store. The effective features extracted based on the dimension of the characteristics of the project's use of auxiliary tools comprise: the number of auxiliary tools used by the project and the number of auxiliary tools deactivated by the project.
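The ratio and task-time features of the effect dimension can be computed along the following lines. This is a sketch under assumptions: each task record is taken to be a `(state, duration_in_seconds)` pair with one of four state names, and the function and key names are hypothetical rather than taken from the patent.

```python
from collections import Counter

# Assumed set of task states reported by an auxiliary tool.
STATES = ("success", "failure", "error", "pending")


def effect_features(tasks):
    """Compute per-state ratios plus longest and average task time.

    `tasks` is a list of (state, duration_in_seconds) pairs.
    """
    total = len(tasks)
    features = {f"{state}_ratio": 0.0 for state in STATES}
    features["longest_task_time"] = 0.0
    features["average_task_time"] = 0.0
    if total == 0:
        return features  # no executions recorded for this project/tool pair

    counts = Counter(state for state, _ in tasks)
    durations = [duration for _, duration in tasks]
    for state in STATES:
        features[f"{state}_ratio"] = counts[state] / total
    features["longest_task_time"] = max(durations)
    features["average_task_time"] = sum(durations) / total
    return features


example = [("success", 60), ("success", 120), ("failure", 30), ("pending", 30)]
print(effect_features(example))
```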
The attribute data crawled by the effective feature extraction module for predicting whether a project will stop using a given auxiliary tool in the open source community is comprehensive: 27 effective features are extracted across the project attributes, the effect of the project using the target auxiliary tool, the auxiliary tool attributes, and the characteristics of the project's use of auxiliary tools. These features provide technical support and a basis for the later construction of the auxiliary tool deactivation prediction model PATpredict, and their completeness helps improve the accuracy of the deactivation prediction.
Preferably, the prediction model obtaining module performs the following procedure:
adding to the input matrix labels indicating whether the project has deactivated the auxiliary tool, the labels including deactivated and not deactivated;
and inputting the input matrix and the corresponding labels into the XGBoost classifier for model training to obtain the auxiliary tool deactivation prediction model PATpredict.
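Attaching labels to the input matrix can be sketched as follows. The function name, the 0/1 label encoding, and the sample feature names are assumptions for illustration, and the sketch assumes every feature has already been converted to a number (categorical features such as the programming language or the license would need an encoding step first).

```python
LABEL_IN_USE = 0       # assumed encoding for "not deactivated"
LABEL_DEACTIVATED = 1  # assumed encoding for "deactivated"


def build_training_set(samples, feature_order):
    """Turn (feature_dict, deactivated_flag) pairs into a matrix X and label list y.

    `feature_order` fixes the column order so every row lines up.
    """
    X, y = [], []
    for features, deactivated in samples:
        X.append([float(features[name]) for name in feature_order])
        y.append(LABEL_DEACTIVATED if deactivated else LABEL_IN_USE)
    return X, y


# Two hypothetical project/tool samples with two of the features.
samples = [
    ({"success_ratio": 0.9, "tools_used": 3}, False),
    ({"success_ratio": 0.2, "tools_used": 1}, True),
]
X, y = build_training_set(samples, ["success_ratio", "tools_used"])
print(X, y)
```

The resulting `X` and `y` are in the shape that gradient-boosting classifiers accept; for example, the `xgboost` package's `XGBClassifier` could be trained with `fit(X, y)`, although the patent text does not state which XGBoost interface was used.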
Preferably, the deactivation prediction module performs the following process:
acquiring project data corresponding to a target project and data on the target project's use of the auxiliary tool to obtain a historical data set to be predicted;
extracting, based on the historical data set to be predicted, effective features of the target project's use of the auxiliary tool, generating a feature vector to be predicted, and obtaining an input matrix to be predicted based on the feature vector to be predicted;
and inputting the input matrix to be predicted into the auxiliary tool deactivation prediction model PATpredict to obtain a prediction result.
The device for predicting the deactivation of an auxiliary tool in an open source community clearly defines what it means for a project to use or deactivate an auxiliary tool, obtains project data and data on projects' use of auxiliary tools from different dimensions to build a historical data set, constructs the auxiliary tool deactivation prediction model based on the historical data set, and finally performs deactivation prediction for the auxiliary tool used by the target project.
Those skilled in the art will appreciate that all or part of the flow of the method in the above embodiments may be implemented by a computer program that instructs the related hardware, the program being stored in a computer readable storage medium. The computer readable storage medium is a magnetic disk, an optical disk, a read-only memory, or a random access memory.
The above describes only preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto; any change or substitution that those skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.

Claims (6)

1. A method for predicting the deactivation of an auxiliary tool in an open source community, characterized by comprising the following steps:
acquiring project data in the open source community and data on projects' use of auxiliary tools to obtain a historical data set; extracting effective features of the historical data set from four dimensions: project attributes, the effect of the project using the target auxiliary tool, auxiliary tool attributes, and characteristics of the project's use of auxiliary tools;
the effective features extracted based on the project attribute dimension include: the programming language used by the project, whether the project is an organization project, whether the project has a wiki introduction website, whether the project has an official website, whether the project has a homepage on GitHub, the project age, the project creation year, the open source license contained in the project, the maximum text similarity with the descriptions of projects that have not deactivated the auxiliary tool, the average text similarity with the descriptions of projects that have not deactivated the auxiliary tool, the maximum text similarity with the descriptions of projects that have deactivated the auxiliary tool, and the average text similarity with the descriptions of projects that have deactivated the auxiliary tool;
the effective features extracted based on the dimension of the effect of the project using the target auxiliary tool include: the success ratio, the failure ratio, and the error ratio among the execution results of the auxiliary tool used by the project, the proportion of tasks whose execution state is pending relative to the number of tasks, the longest task time and the average task time of the auxiliary tool, the number of commits and the number of contribution requests in which the project uses the auxiliary tool, and the number of contribution requests and the number of contributions in the project that contain the auxiliary tool's name as a keyword;
the effective features extracted based on the auxiliary tool attribute dimension comprise: the auxiliary tool name, the auxiliary tool category, and whether the auxiliary tool is registered in the GitHub store;
the effective features extracted based on the dimension of the characteristics of the project's use of auxiliary tools include: the number of auxiliary tools used by the project and the number of auxiliary tools deactivated by the project;
the deactivation means that a project using the auxiliary tool does not use the auxiliary tool within 90 days of the last activity;
extracting effective features of projects' use of auxiliary tools based on the historical data set, generating feature vectors, and obtaining an input matrix based on the feature vectors;
constructing an auxiliary tool deactivation prediction model PATpredict based on the input matrix and the XGBoost classifier;
and performing deactivation prediction on the target auxiliary tool used by the target project by using the auxiliary tool deactivation prediction model PATpredict to obtain a deactivation prediction result.
2. The method for predicting the deactivation of an auxiliary tool in an open source community according to claim 1, wherein constructing the auxiliary tool deactivation prediction model PATpredict based on the input matrix and the XGBoost algorithm comprises the following steps:
adding to the input matrix labels indicating whether a project has deactivated the auxiliary tool, the labels including deactivated and not deactivated;
and inputting the input matrix and the corresponding labels into the XGBoost classifier for model training to obtain the auxiliary tool deactivation prediction model PATpredict.
3. The method for predicting the deactivation of an auxiliary tool in an open source community according to claim 1, wherein performing deactivation prediction on the auxiliary tool used by the target project by using the auxiliary tool deactivation prediction model PATpredict to obtain a deactivation prediction result comprises the following steps:
acquiring project data corresponding to the target project and data on the target project's use of the auxiliary tool to obtain a historical data set to be predicted;
extracting, based on the historical data set to be predicted, effective features of the target project's use of the auxiliary tool, generating a feature vector to be predicted, and obtaining an input matrix to be predicted based on the feature vector to be predicted;
and inputting the input matrix to be predicted into the auxiliary tool deactivation prediction model PATpredict to obtain a prediction result.
4. A device for predicting the deactivation of an auxiliary tool in an open source community, comprising:
a training data acquisition module, used for acquiring project data and data on projects' use of auxiliary tools to obtain a historical data set;
an effective feature extraction module, used for extracting effective features of projects' use of auxiliary tools based on the historical data set, generating feature vectors, and obtaining an input matrix based on the feature vectors; the effective feature extraction module extracts effective features of the historical data set from four dimensions: project attributes, the effect of the project using the target auxiliary tool, auxiliary tool attributes, and characteristics of the project's use of auxiliary tools;
the effective features extracted based on the project attribute dimension include: the programming language used by the project, whether the project is an organization project, whether the project has a wiki introduction website, whether the project has an official website, whether the project has a homepage on GitHub, the project age, the project creation year, the open source license contained in the project, the maximum text similarity with the descriptions of projects that have not deactivated the auxiliary tool, the average text similarity with the descriptions of projects that have not deactivated the auxiliary tool, the maximum text similarity with the descriptions of projects that have deactivated the auxiliary tool, and the average text similarity with the descriptions of projects that have deactivated the auxiliary tool;
the effective features extracted based on the dimension of the effect of the project using the target auxiliary tool include: the success ratio, the failure ratio, and the error ratio among the execution results of the auxiliary tool used by the project, the proportion of tasks whose execution state is pending relative to the number of tasks, the longest task time and the average task time of the auxiliary tool, the number of commits and the number of contribution requests in which the project uses the auxiliary tool, and the number of contribution requests and the number of contributions in the project that contain the auxiliary tool's name as a keyword;
the effective features extracted based on the auxiliary tool attribute dimension comprise: the auxiliary tool name, the auxiliary tool category, and whether the auxiliary tool is registered in the GitHub store;
the effective features extracted based on the dimension of the characteristics of the project's use of auxiliary tools include: the number of auxiliary tools used by the project and the number of auxiliary tools deactivated by the project;
the deactivation means that a project using the auxiliary tool does not use the auxiliary tool within 90 days of the last activity;
a prediction model obtaining module, used for constructing an auxiliary tool deactivation prediction model PATpredict according to the input matrix and the XGBoost classifier;
and a deactivation prediction module, used for performing deactivation prediction on the target auxiliary tool used by the target project by using the auxiliary tool deactivation prediction model PATpredict to obtain a deactivation prediction result.
5. The device for predicting the deactivation of an auxiliary tool in an open source community according to claim 4, wherein the prediction model obtaining module performs the following process:
adding to the input matrix labels indicating whether a project has deactivated the auxiliary tool, the labels including deactivated and not deactivated;
and inputting the input matrix and the corresponding labels into the XGBoost classifier for model training to obtain the auxiliary tool deactivation prediction model PATpredict.
6. The device for predicting the deactivation of an auxiliary tool in an open source community according to claim 5, wherein the deactivation prediction module performs the following process:
acquiring project data corresponding to the target project and data on the target project's use of the auxiliary tool to obtain a historical data set to be predicted;
extracting, based on the historical data set to be predicted, effective features of the target project's use of the auxiliary tool, generating a feature vector to be predicted, and obtaining an input matrix to be predicted based on the feature vector to be predicted;
and inputting the input matrix to be predicted into the auxiliary tool deactivation prediction model PATpredict to obtain a prediction result.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010989416.6A CN112114795B (en) 2020-09-18 2020-09-18 Method and device for predicting deactivation of auxiliary tool in open source community

Publications (2)

Publication Number Publication Date
CN112114795A CN112114795A (en) 2020-12-22
CN112114795B true CN112114795B (en) 2022-02-11

Family

ID=73801375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010989416.6A Active CN112114795B (en) 2020-09-18 2020-09-18 Method and device for predicting deactivation of auxiliary tool in open source community

Country Status (1)

Country Link
CN (1) CN112114795B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107808278A (en) * 2017-10-11 2018-03-16 河海大学 A kind of Github open source projects based on sparse self-encoding encoder recommend method
CN109165163A (en) * 2018-08-31 2019-01-08 北京航空航天大学 A method of prediction open source community contribution request review result
CN109522011A (en) * 2018-10-17 2019-03-26 南京航空航天大学 A kind of code line recommended method of context depth perception live based on programming
CN110162634A (en) * 2019-05-21 2019-08-23 北京鸿联九五信息产业有限公司 A kind of text handling method based on machine learning
CN110688303A (en) * 2019-08-28 2020-01-14 武汉大学 Software workpiece relation mining method based on integrated development platform

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10755590B2 (en) * 2015-06-18 2020-08-25 The Joan and Irwin Jacobs Technion-Cornell Institute Method and system for automatically providing graphical user interfaces for computational algorithms described in printed publications
US10671355B2 (en) * 2018-01-21 2020-06-02 Microsoft Technology Licensing, Llc. Code completion with machine learning
CN110110087A (en) * 2019-05-15 2019-08-09 济南浪潮高新科技投资发展有限公司 A kind of Feature Engineering method for Law Text classification based on two classifiers
CN110222181B (en) * 2019-06-06 2021-08-31 福州大学 Python-based film evaluation emotion analysis method
CN110688491B (en) * 2019-09-25 2022-05-10 暨南大学 Machine reading understanding method, system, device and medium based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
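What are the Characteristics of Reopened Pull Requests? A Case Study on Open Source Projects in GitHub; Jing Jiang et al.; IEEE Access (Volume: 7); 2019-07-15; vol. 7; pp. 102751-102761 *
A survey of software data mining techniques for the open source ecosystem (面向开源生态的软件数据挖掘技术研究综述); Yin Gang et al.; Journal of Software (软件学报); 2018-03-13; vol. 29, no. 8; pp. 2258-2271 *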

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant