CN111061864A

CN111061864A - Automatic open source community Fork abstract generation method, system and medium based on feature extraction

Info

Publication number: CN111061864A
Application number: CN201911338392.1A
Authority: CN
Inventors: 毛新军; 张超; 卢遥
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2019-12-23
Filing date: 2019-12-23
Publication date: 2020-04-24
Anticipated expiration: 2039-12-23
Also published as: CN111061864B

Abstract

The invention discloses a method, system and medium for automatically generating open source community Fork abstracts based on feature extraction. For input submission data, the invention obtains the corresponding feature classification through a pre-trained machine learning classification model and generates the corresponding submission content from the submission data; generates a submission summary from the feature classification and the generated submission content; and generates a natural language Fork summary from the submission summary. Based on a large amount of open source community project data, the invention can extract Fork-related data from open source projects, filter and refine it to extract project contribution features, and automatically generate natural language Fork summaries through machine learning algorithms.

Description

Automatic open source community Fork abstract generation method, system and medium based on feature extraction

Technical Field

The invention relates to the field of open source software development, and in particular to a method, a system and a medium for automatically generating an open source community Fork abstract based on feature extraction. To address the opacity of Fork information in current open source communities, the invention extracts project contribution features from a large amount of open source community project data and automatically generates a natural language Fork abstract through a machine learning algorithm.

Background

In Open Source Software (OSS) development, Fork-based development (forking, derivation and branching) has become an important component of collaborative development. The purpose of a Fork is to make a full copy of a code repository, and the Fork mechanism allows developers to copy a code repository without the original author's consent. Developers are free to fork public repositories and make changes in their own Forks. A Fork is also a way of starting a new project.

However, the rapid development of the OSS community also presents challenges to Fork-based development. On the one hand, the rapid growth in the number of contributors has produced a large number of branches and contributions, especially in popular projects, which has enriched the ecological diversity of open source communities. On the other hand, as the number of Forks increases, existing Fork visualization tools are unable to maintain a good overview of Fork information, especially of the changes made in individual Forks. As a result, the development of an open source project cannot draw on the large amount of Fork data as a reference, and because existing tools do not meet developers' need for transparent Fork information, developers must inspect Forks manually. In addition, because developers differ widely in experience and habits, a large number of Forks contain incomplete comments, unclear features and opaque information. Such Forks cost developers time and effort and prevent them from efficiently understanding the goals and characteristics of other developers' Fork-based contributions. Thus, opaque Fork information and the lack of suitable tools make it difficult to identify the many Forks effectively by manual methods and for core developers to make sound decisions.

Disclosure of Invention

The technical problems to be solved by the invention are as follows: aiming at the problems in the prior art, the invention provides a method, a system and a medium for automatically generating the Fork abstract of the open source community based on feature extraction.

In order to solve the technical problems, the invention adopts the technical scheme that:

an automatic open source community Fork abstract generation method based on feature extraction comprises the following implementation steps:

1) acquiring input submission data;

2) obtaining the corresponding feature classification of the submission data through a pre-trained machine learning classification model, and generating the corresponding submission content from the submission data;

3) generating a submission summary according to the feature classification of the submission data and the generated submission content;

4) generating a Fork abstract in natural language form according to the submission summary.

Optionally, step 2) is preceded by a step of training a machine learning classification model, and the detailed steps include:

S1) data preprocessing: first, cleaning link data, duplicate issue data and non-standard-format data out of the issue data, marking issue data that contains specified special fields, and removing stop words; then labeling the remaining issue data with three feature classification labels: feature, bug and contribution;

S2) converting the preprocessed issue data into a multidimensional vector;

S3) training the machine learning classification model with the converted multidimensional vectors and their corresponding feature classification labels.
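As a rough illustration of step S1), the cleaning could look like the following minimal sketch. The stop-word list, the regular expressions and the function names `clean_issue` and `preprocess` are assumptions made for this example; the invention does not prescribe them.

```python
import re

# Hypothetical stop-word list; the description only names "the" and "a" as examples.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}
URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[a-z][a-z0-9_]*")


def clean_issue(text):
    """Strip links, lowercase, tokenize, and drop stop words."""
    text = URL_RE.sub(" ", text.lower())
    return [tok for tok in TOKEN_RE.findall(text) if tok not in STOP_WORDS]


def preprocess(issues):
    """issues: list of (text, label) pairs. Returns cleaned, de-duplicated pairs."""
    seen = set()
    cleaned = []
    for text, label in issues:
        tokens = clean_issue(text)
        key = " ".join(tokens)
        if not tokens or key in seen:  # skip empty and repeated issues
            continue
        seen.add(key)
        cleaned.append((tokens, label))
    return cleaned


raw = [
    ("Fix the crash in parser, see https://example.com/1", "bug"),
    ("Fix the crash in parser, see https://example.com/2", "bug"),
    ("Add a dark mode to the settings page", "feature"),
]
print(preprocess(raw))
```

The second issue differs from the first only in its link, so it is treated as a duplicate once links are stripped.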

Optionally, the detailed steps of converting the preprocessed issue data into a multidimensional vector in step S2) include:

S2.1) extracting text features from the preprocessed issue data to obtain a word frequency count matrix of the words in the data;

S2.2) evaluating the weight of each word in the word frequency count matrix using the TF-IDF word frequency statistical method, and applying the weights to the word frequency count matrix to obtain a multidimensional vector in the form of a TF-IDF matrix.
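Steps S2.1) and S2.2) can be sketched in plain Python as follows. The text does not fix a particular TF-IDF formula, so the weighting used here (idf = ln(N / df) + 1 followed by L2 row normalization) is one common variant chosen as an assumption; the function names are likewise illustrative.

```python
import math
from collections import Counter


def count_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, matrix of raw counts)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def tfidf_matrix(counts):
    """Weight counts by idf = ln(N / df) + 1, then L2-normalize each row."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Document frequency: in how many documents each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) + 1.0 if d else 0.0 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


docs = [["fix", "crash", "parser"], ["add", "parser", "option"]]
vocab, counts = count_matrix(docs)
weights = tfidf_matrix(counts)
```

In this toy input, "parser" occurs in both documents and therefore receives a lower weight than words that occur in only one.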

Optionally, the machine learning classification model is a random forest based machine learning classification model.

Optionally, generating the submission content for the submission data in step 2) specifically means generating the corresponding submission content from the submission data using a keyword extraction algorithm.

Optionally, generating the submission summary according to the feature classification and the generated submission content in step 3) specifically means using a specified template to generate a submission summary that contains the feature classification and the generated submission content, where the specified template includes the following information: @commit_i represents the i-th commit in the Fork; @author represents the submitter; @feature is the obtained feature classification, which is one of the three feature classification labels feature, bug and contribution; @content is the generated submission content; @status is the status information extracted from the submission; @change is the change information extracted from the submission.
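The template fields listed above can be filled in as in the sketch below. The sentence wording in `SUBMISSION_TEMPLATE` is hypothetical (the actual template is the one drawn in FIG. 3), and the `Submission` class and `submission_summary` function are names invented for this example.

```python
from dataclasses import asdict, dataclass

LABELS = {"feature", "bug", "contribution"}

# Hypothetical wording: the real template is the one shown in FIG. 3.
SUBMISSION_TEMPLATE = (
    "Commit {commit_i} by {author} [{feature}]: {content} "
    "(status: {status}; change: {change})"
)


@dataclass
class Submission:
    commit_i: int   # @commit_i: index of the commit in the Fork
    author: str     # @author: the submitter
    feature: str    # @feature: predicted feature classification label
    content: str    # @content: generated submission content
    status: str     # @status: status information from the submission
    change: str     # @change: change information from the submission


def submission_summary(sub):
    """Fill the template with the fields of one submission."""
    if sub.feature not in LABELS:
        raise ValueError(f"unknown feature label: {sub.feature!r}")
    return SUBMISSION_TEMPLATE.format(**asdict(sub))


example = Submission(1, "alice", "bug", "fix parser crash", "merged", "2 files changed")
print(submission_summary(example))
```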

Optionally, the detailed steps of step 4) include:

3.1) breaking up, classifying and re-counting the data of multiple submission summaries to obtain the content and number of submission summaries corresponding to each of the three feature classification labels feature, bug and contribution;

3.2) placing the submission summaries corresponding to the three feature classification labels feature, bug and contribution at the corresponding positions of the Fork abstract template according to preset rules, and obtaining the final Fork abstract.
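The counting in step 3.1) and the preparation of template slots for step 3.2) might be sketched as follows. The data layout (pairs of label and content) and the slot names such as `num_bug` and `content_bug` are assumptions for illustration only.

```python
LABELS = ("feature", "bug", "contribution")


def count_by_label(summaries):
    """Step 3.1: group submission summaries by label and count them."""
    stats = {label: {"num": 0, "content": []} for label in LABELS}
    for label, content in summaries:
        if label not in stats:
            raise ValueError(f"unknown feature label: {label!r}")
        stats[label]["num"] += 1
        stats[label]["content"].append(content)
    return stats


def template_fields(stats):
    """Flatten the statistics into named slots ready for a Fork abstract template."""
    fields = {}
    for label in LABELS:
        fields[f"num_{label}"] = stats[label]["num"]
        fields[f"content_{label}"] = "; ".join(stats[label]["content"])
    return fields


summaries = [
    ("bug", "fix parser crash"),
    ("feature", "add dark mode"),
    ("bug", "handle empty input"),
]
stats = count_by_label(summaries)
fields = template_fields(stats)
print(fields)
```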

In addition, the invention also provides an open source community Fork abstract automatic generation system based on feature extraction, comprising:

an input program unit for acquiring input submission data;

an input processing program unit for obtaining the corresponding feature classification of the submission data through a pre-trained machine learning classification model, and generating the corresponding submission content from the submission data;

a submission summary generating program unit for generating a submission summary according to the feature classification of the submission data and the generated submission content;

and a Fork abstract generating program unit for generating a Fork abstract in natural language form according to the submission summary.

In addition, the invention also provides an open source community Fork abstract automatic generation system based on feature extraction, which comprises a computer device, wherein the computer device is programmed or configured to execute the steps of the open source community Fork abstract automatic generation method based on feature extraction, or a computer program which is programmed or configured to execute the open source community Fork abstract automatic generation method based on feature extraction is stored on a memory of the computer device.

In addition, the invention also provides a computer readable storage medium, which stores a computer program programmed or configured to execute the method for automatically generating the Fork abstract of the open source community based on the feature extraction.

Compared with the prior art, the invention has the following advantages: for input submission data, the invention obtains the corresponding feature classification through a pre-trained machine learning classification model and generates the corresponding submission content from the submission data; generates a submission summary from the feature classification and the generated submission content; and generates a natural language Fork abstract from the submission summary. To address the opacity of Fork information in current open source communities, the invention can thus extract Fork-related data from open source community projects based on a large amount of open source community project data, filter and refine it to extract project contribution features, and automatically generate a natural language Fork abstract through a machine learning algorithm.

Drawings

FIG. 1 is a schematic diagram of the basic principle of the method according to the embodiment of the present invention.

FIG. 2 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.

FIG. 3 is a schematic diagram of the submission summary template in the embodiment of the present invention.

FIG. 4 is a schematic flow chart of step 4) in the embodiment of the present invention.

FIG. 5 is a diagram of a Fork abstract template in an embodiment of the present invention.

FIG. 6 is a diagram of rules for matching different default data according to an embodiment of the present invention.

FIG. 7 shows the test results for submission data classification accuracy and Fork abstract accuracy in the embodiment of the present invention.

Detailed Description

The method, system and medium for automatically generating the open source community Fork abstract based on feature extraction according to the present invention are described in further detail below, taking the Python programming language as an example. Naturally, on this basis a person skilled in the art can also port the embodiment to other programming languages and likewise implement the method, system and medium.

As shown in fig. 1 and fig. 2, the implementation steps of the method for automatically generating the open source community Fork abstract based on feature extraction in this embodiment include:

1) acquiring input submission data;

The machine learning classification model is trained so that the feature relationship between data carrying feature classification labels in GitHub projects and the input submission data can be used to classify submission features. To this end, issue data is preprocessed and used as training input for the machine learning classification model. Finally, the submission data input by the user is fed to the trained machine learning classification model to predict its feature classification. In this embodiment, step 2) is preceded by a step of training the machine learning classification model, and the detailed steps include:

S1) data preprocessing: first, cleaning link data, duplicate issue data and non-standard-format data out of the issue data, marking issue data that contains specified special fields, and removing stop words; then labeling the remaining issue data with three feature classification labels: feature, bug and contribution. In this embodiment, the three feature classification labels are represented by the tags "feature", "bug" and "contribution" respectively. Common stop words (e.g., "the" and "a") are removed because they occur frequently but contribute little to distinguishing between documents.

S2) converting the preprocessed problem data into a multidimensional vector;

In this embodiment, the detailed steps of converting the preprocessed issue data into a multidimensional vector in step S2) include:

S2.1) extracting text features from the preprocessed issue data to obtain a word frequency count matrix of the words in the data. Any known text feature extraction algorithm can be used as required; in this embodiment, a CountVectorizer model is used to convert the words in the text into a word frequency count matrix, that is, a matrix whose element text[i][j] represents the frequency of word j in text i;

S2.2) evaluating the weight of each word in the word frequency count matrix using the TF-IDF (Term Frequency-Inverse Document Frequency) word frequency statistical method, and applying the weights to convert the count matrix produced by the CountVectorizer into a normalized TF-IDF matrix, that is, a multidimensional vector in the form of a TF-IDF matrix.

The data to which the invention applies is mostly text and short in length. Given these characteristics, the machine learning classification model in this embodiment is based on a random forest (RandomForest), a choice supported by the experimental results. As an optional implementation, this embodiment uses a pipeline to integrate vectorization, weight computation and classification model training into a single unit that is executed repeatedly while the parameters are tuned iteratively, finally yielding a complete classification model that automatically classifies input submission data.
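A pipeline of this kind could be assembled with scikit-learn roughly as follows. This is a minimal sketch, assuming scikit-learn is installed; the toy training texts, the step names and the hyperparameters (`n_estimators=50`, `random_state=0`) are illustrative assumptions and are not taken from the embodiment.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy labeled issue texts standing in for the preprocessed GitHub issue data.
texts = [
    "fix crash when parsing empty file",
    "fix null pointer error in parser",
    "add dark mode option to settings",
    "add export to csv feature",
    "update readme and contributor guide",
    "improve documentation for build scripts",
]
labels = ["bug", "bug", "feature", "feature", "contribution", "contribution"]

# Vectorization, TF-IDF weighting and the classifier chained into one estimator,
# so the whole chain can be refit as a unit while parameters are tuned.
model = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(texts, labels)

# Classify a new piece of submission text.
predicted = model.predict(["fix crash in settings parser"])
print(predicted[0])
```

Because the three stages form one estimator, a parameter search can tune vectorizer and classifier settings together instead of re-running each stage by hand.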

In this embodiment, generating the submission content for the submission data in step 2) specifically means generating the corresponding submission content from the submission data using a keyword extraction algorithm. As an optional implementation, the keyword extraction algorithm in this embodiment is the TextRank algorithm; other well-known keyword extraction algorithms may also be used.
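For readers unfamiliar with TextRank, the sketch below shows the core idea in plain Python: words that co-occur within a small window are linked in a graph, and a PageRank-style iteration scores each word. The window size, damping factor, iteration count and the function name `textrank_keywords` are assumptions for this example, not values stated in the embodiment.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_k=3, window=2, damping=0.85, iterations=50):
    """Rank words by a PageRank-style score on a co-occurrence graph."""
    # Link each word to the words that follow it within the window.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:  # no co-occurrences: fall back to first distinct tokens
        return list(dict.fromkeys(tokens))[:top_k]

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


tokens = "parser crash fix parser null check parser crash test".split()
print(textrank_keywords(tokens))
```

In the sample text, "parser" co-occurs with the most distinct words and therefore ranks first.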

In this embodiment, generating the submission summary according to the feature classification and the generated submission content in step 3) specifically means using a specified template to generate a submission summary that contains the feature classification and the generated submission content, where the specified template includes the following information: @commit_i represents the i-th commit in the Fork; @author represents the submitter; @feature is the obtained feature classification, which is one of the three feature classification labels feature, bug and contribution; @content is the generated submission content; @status is the status information extracted from the submission; @change is the change information extracted from the submission. As an optional implementation, the form of the template in this embodiment is shown in FIG. 3.

As shown in fig. 4, the detailed steps of step 4) of this embodiment include:

3.1) breaking up, classifying and re-counting the data of multiple submission summaries to obtain the content and number of submission summaries corresponding to each of the three feature classification labels feature, bug and contribution;

The Fork abstract template in this embodiment is shown in FIG. 5 and includes two sub-modules, Template1 and Template2. Sub-module Template1 shows the structure and elements of the final Fork summary, and sub-module Template2 shows how the content of the Fork is composed. According to a survey of open source community developers, developers generally care whether the Fork abstract expresses the Fork information accurately, omits no important data, and highlights the changes and contribution features of each commit.

In sub-module Template1:

@fork_summary is the final desired result;

@b_commit and @e_commit indicate the start commit data and end commit data selected by the user. For convenience, this embodiment typically uses the last four characters of the SHA checksum of the commit data to identify the commit;

@fork_name is the name of the Fork obtained from the input data in the present embodiment;

@fork_content is the specific content description of the Fork generated by the present embodiment.

In sub-module Template2:

k ranges over the three labels feature, bug and contribution. The variables @num_k and @content_k are the number and content of the submissions under each label k, that is, the data obtained in the preceding counting step;

@feature is the feature class of a submission;

@feature_content is the content of each feature class;

@fork_content is the sum of all feature contents.

In general, Template2 shows the detailed work done by the Fork on each particular feature class.
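The way Template2 feeds Template1 can be sketched as below. The English wording of the generated sentences is hypothetical (the actual templates are those of FIG. 5), as are the function names and the sample commit hashes.

```python
LABELS = ("feature", "bug", "contribution")


def feature_content(label, num, contents):
    """Template2: describe the work done under one label k (@num_k, @content_k)."""
    unit = "commit" if num == 1 else "commits"
    return f"{num} {label} {unit} ({'; '.join(contents)})"


def fork_content(stats):
    """Template2: @fork_content is the concatenation of all non-empty labels."""
    return ", ".join(
        feature_content(k, *stats[k])
        for k in LABELS
        if stats.get(k, (0, []))[0] > 0
    )


def fork_summary(fork_name, b_commit, e_commit, stats):
    """Template1: combine the commit range, Fork name and content description."""
    # Commits are identified by the last four characters of their SHA.
    return (
        f"Between commits {b_commit[-4:]} and {e_commit[-4:]}, "
        f"fork {fork_name} contains {fork_content(stats)}."
    )


stats = {
    "feature": (1, ["add dark mode"]),
    "bug": (2, ["fix parser crash", "handle empty input"]),
}
print(fork_summary("alice/project", "9f8e7d6c5b4a", "0a1b2c3d4e5f", stats))
```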

To handle the various error cases that can arise when generating the Fork summary, such as an empty Fork class, an empty Fork feature, or duplicated submission data, this embodiment constructs the following rules to match the different default data and ensure that the final Fork summary reads as fluent natural language, as shown in FIG. 6, where:

Rule1 indicates:

if the Fork class count @num_k is 0, then the sum of all feature contents @fork_content is null;

Rule2 indicates:

if the Fork feature content @content_k is null, then the sum of all feature contents @fork_content is null;

Rule3 indicates:

if the commit data @b_commit and the commit data @e_commit are duplicates (their four-character identifiers are the same), the truncated @e_commit is assigned to both @b_commit and @e_commit;

Rule4 indicates:

if the sum of the Fork class counts @num_k is 0, the final desired result @fork_summary is the generated string "pair-missing, non-contributing".
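One possible reading of these four rules is sketched below. The interpretation is an assumption: Rule1 and Rule2 are taken to drop a label that has no commits or no content, Rule3 to collapse a duplicated commit range to a single identifier, and Rule4 to flag the case where a fixed fallback string should be emitted. The function name, the dictionary keys and the input layout are all illustrative.

```python
LABELS = ("feature", "bug", "contribution")


def resolve_defaults(b_commit, e_commit, stats):
    """Apply Rule1 to Rule4 and return the fields the Fork template should use."""
    kept = {}
    for k in LABELS:
        num, contents = stats.get(k, (0, []))
        if num == 0:       # Rule1: no commits under this label
            continue
        if not contents:   # Rule2: no content for this label
            continue
        kept[k] = (num, contents)

    # Rule3: compare the four-character identifiers of the two endpoints.
    b_id, e_id = b_commit[-4:], e_commit[-4:]
    single_commit = b_id == e_id
    if single_commit:
        b_id = e_id  # both endpoints take the truncated end-commit identifier

    # Rule4: all counts are zero, so the caller should emit the fallback string.
    total = sum(stats.get(k, (0, []))[0] for k in LABELS)
    return {
        "b_commit": b_id,
        "e_commit": e_id,
        "single_commit": single_commit,
        "stats": kept,
        "no_contribution": total == 0,
    }


print(resolve_defaults("aaaa1111", "bbbb2222", {"bug": (2, ["x", "y"])}))
```

Returning flags rather than finished text keeps the rules separate from the wording of the template, so the same checks can serve any phrasing.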

To further verify the automatic open source community Fork abstract generation method based on feature extraction, 30 sets of manual tests and questionnaire tests were performed in this embodiment, together with case tests involving 17 developers on GitHub. The resulting submission data classification accuracy and Fork abstract accuracy are shown in Table 1 and FIG. 7.

Table 1: Submission data classification accuracy and Fork abstract accuracy.

Label	Precision	Recall	F1-score	support
Contribution	0.59	0.79	0.67	448
Feature	0.66	0.78	0.58	343
Bug	0.64	0.67	0.72	200

In Table 1, Label, Precision, Recall, F1-score and support respectively represent the label type, the precision, the recall, the harmonic mean of precision and recall, and the number of samples carrying each label; Contribution, Feature and Bug respectively represent the three feature classification labels contribution, feature and bug. As can be seen from Table 1 and FIG. 7, the automatic open source community Fork abstract generation method based on feature extraction in this embodiment achieves a Fork abstract generation accuracy of 0.672 and a helpfulness to developers' work of 47%.

In summary, the automatic open source community Fork abstract generation method based on feature extraction in this embodiment generates the Fork abstract automatically: after a simple initial setup, it takes a commit address as input and outputs the Fork abstract immediately. In this embodiment, the tool is used to test summary generation for projects in a real OSS community.

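In addition, this embodiment also provides an automatic open source community Fork abstract generation system based on feature extraction, including:

an input program unit for acquiring input submission data;

an input processing program unit for obtaining the corresponding feature classification of the submission data through a pre-trained machine learning classification model, and generating the corresponding submission content from the submission data;

a submission summary generating program unit for generating a submission summary according to the feature classification of the submission data and the generated submission content;

a Fork abstract generating program unit for generating a Fork abstract in natural language form according to the submission summary.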

In addition, the embodiment also provides an open source community Fork abstract automatic generation system based on feature extraction, which includes a computer device, where the computer device is programmed or configured to execute the steps of the aforementioned open source community Fork abstract automatic generation method based on feature extraction, or a memory of the computer device stores a computer program that is programmed or configured to execute the aforementioned open source community Fork abstract automatic generation method based on feature extraction.

In addition, the present embodiment also provides a computer readable storage medium, which stores thereon a computer program programmed or configured to execute the aforementioned automatic generation method of the open source community Fork abstract based on feature extraction.

The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims

1. An automatic open source community Fork abstract generation method based on feature extraction, characterized in that the implementation steps comprise:

1) Get the input submission data;

2) Obtain the corresponding feature classification of the submitted data through the pre-trained machine learning classification model, and generate the submission content for the submitted data to obtain the corresponding submission content;

3) Generate a submission summary according to the feature classification of the submitted data and the generated submission content;

4) Generate a natural language fork summary based on the submission summary.

2. The method for automatically generating an open source community Fork summary based on feature extraction according to claim 1, wherein step 2) is preceded by a step of training the machine learning classification model, and the detailed steps include:

S1) Data preprocessing: first, clean the link data, duplicate issue data and non-standard-format data out of the issue data, mark the issue data containing the specified special fields, and remove stop words; then label the remaining issue data with three feature classification labels: feature, bug and contribution;

S2) Convert the preprocessed problem data into a multi-dimensional vector;

S3) Train a machine learning classification model with the converted multi-dimensional vector and its corresponding feature classification label.

3. The method for automatically generating an open source community Fork summary based on feature extraction according to claim 2, wherein the detailed steps of converting the preprocessed issue data into a multi-dimensional vector in step S2) comprise:

S2.1) Perform text feature extraction on the preprocessed issue data to obtain the word frequency count matrix of the words in the data;

S2.2) Use the word frequency statistical method TF-IDF to evaluate the weight of each word in the word frequency count matrix, and apply the weights to the word frequency count matrix to obtain a multi-dimensional vector in the form of a TF-IDF matrix.

4. The method for automatically generating an open source community Fork summary based on feature extraction according to claim 1, wherein the machine learning classification model is a random forest-based machine learning classification model.

5. The method for automatically generating an open source community Fork summary based on feature extraction according to claim 1, wherein in step 2), generating the submission content for the submission data to obtain the corresponding submission content specifically refers to using a keyword extraction algorithm on the submission data to generate the corresponding submission content.

6. The method for automatically generating an open source community Fork summary based on feature extraction according to claim 1, wherein in step 3), generating a submission summary according to the feature classification and the generated submission content specifically refers to using a specified template to generate a submission summary that includes the feature classification and the generated submission content, and the specified template includes the following information: @commit_i represents the i-th submission in the Fork; @author represents the submitter; @feature is the obtained feature classification, which is one of the three feature classification labels feature, bug and contribution; @content is the obtained submission content; @status is the status information extracted from the submission; @change is the change information extracted from the submission.

7. The method for automatically generating an open source community Fork summary based on feature extraction according to claim 1, wherein the detailed steps of step 4) include:

3.1) Break up, classify and re-count the data of multiple submission summaries to obtain the content and number of submission summaries corresponding to each of the three feature classification labels feature, bug and contribution;

3.2) According to the preset rules, place the submission summaries corresponding to the three feature classification labels feature, bug and contribution at the corresponding positions of the Fork abstract template and obtain the final Fork abstract.

8. An open source community Fork summary automatic generation system based on feature extraction, characterized in that it comprises:

Input program unit for obtaining input submission data;

The input processing program unit is used to obtain the corresponding feature classification of the submitted data through the pre-trained machine learning classification model, and generate the submitted content for the submitted data to obtain the corresponding submitted content;

The submission summary generating program unit is used to generate the submission summary according to the feature classification of the submitted data and the generated submission content;

The fork summary generator unit is used to generate a natural language fork summary based on the submission summary.

9. A system for automatically generating an open source community Fork abstract based on feature extraction, comprising a computer device, characterized in that the computer device is programmed or configured to execute the steps of the method for automatically generating an open source community Fork abstract based on feature extraction according to any one of claims 1 to 7, or a computer program programmed or configured to execute the method for automatically generating an open source community Fork abstract based on feature extraction according to any one of claims 1 to 7 is stored in a memory of the computer device.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program programmed or configured to execute the method for automatically generating an open source community Fork abstract based on feature extraction according to any one of claims 1 to 7.