Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the problems in the prior art, the invention provides a method, a system and a medium for automatically generating open source community Fork abstracts based on feature extraction.
To solve the above technical problem, the invention adopts the following technical solution:
An automatic generation method for open source community Fork abstracts based on feature extraction, comprising the following implementation steps:
1) acquiring input submission data;
2) obtaining the characteristic classification of the submission data through a pre-trained machine learning classification model, and generating the corresponding submission content from the submission data;
3) generating a submission abstract according to the characteristic classification of the submission data and the submission content;
4) generating a Fork abstract in natural-language form according to the submission abstracts.
Optionally, step 2) is preceded by a step of training the machine learning classification model, the detailed steps of which include:
S1) data preprocessing: first, cleaning link data, duplicate issue data and nonstandard-format data out of the issue data (problem data), marking the issue data that contains specified special fields, and removing stop words; then labeling the remaining issue data with the three characteristic classification labels: feature label "feature", issue label "bug" and contribution label "contribution";
S2) converting the preprocessed issue data into multidimensional vectors;
S3) training the machine learning classification model with the multidimensional vectors obtained by the conversion and the corresponding characteristic classification labels.
Optionally, the detailed steps of converting the preprocessed issue data into multidimensional vectors in step S2) include:
S2.1) extracting text features from the preprocessed issue data to obtain a word frequency count matrix of the words in the data;
S2.2) evaluating the weight of each word in the word frequency count matrix by the word frequency statistical method TF-IDF, and using the weights to convert the word frequency count matrix into multidimensional vectors in the form of a TF-IDF matrix.
Optionally, the machine learning classification model is a random forest based machine learning classification model.
Optionally, generating the corresponding submission content from the submission data in step 2) specifically means generating the submission content from the submission data with a keyword extraction algorithm.
Optionally, generating the submission abstract in step 3) according to the characteristic classification of the submission data and the generated submission content specifically means using a specified template to generate a submission abstract that contains the characteristic classification and the generated submission content, where the specified template includes the following information: @commit represents the i-th commit in the Fork; @author represents the submitter; @feature is the obtained characteristic classification of the submission, which is one of the three characteristic classification labels: feature label "feature", issue label "bug" and contribution label "contribution"; @content is the generated submission content; @status is status information extracted from the submission; @change is change information extracted from the submission.
Optionally, the detailed steps of step 4) include:
4.1) splitting up, classifying, de-duplicating and counting the data of the multiple submission abstracts to obtain, for each of the three characteristic classification labels (feature label "feature", issue label "bug" and contribution label "contribution"), the content and the number of the corresponding submission abstracts;
4.2) placing the submission abstracts obtained for the three characteristic classification labels at the corresponding positions of the Fork abstract template according to preset rules, thereby obtaining the final Fork abstract.
In addition, the invention also provides an automatic generation system for open source community Fork abstracts based on feature extraction, which comprises:
an input program unit for acquiring input submission data;
an input processing program unit for obtaining the characteristic classification of the submission data through a pre-trained machine learning classification model, and for generating the corresponding submission content from the submission data;
a submission abstract generating program unit for generating a submission abstract according to the characteristic classification of the submission data and the submission content;
a Fork abstract generating program unit for generating a Fork abstract in natural-language form according to the submission abstracts.
In addition, the invention also provides an open source community Fork abstract automatic generation system based on feature extraction, which comprises a computer device, wherein the computer device is programmed or configured to execute the steps of the open source community Fork abstract automatic generation method based on feature extraction, or a computer program which is programmed or configured to execute the open source community Fork abstract automatic generation method based on feature extraction is stored on a memory of the computer device.
In addition, the invention also provides a computer readable storage medium, which stores a computer program programmed or configured to execute the method for automatically generating the Fork abstract of the open source community based on the feature extraction.
Compared with the prior art, the invention has the following advantages: the invention acquires input submission data; obtains the characteristic classification of the submission data through a pre-trained machine learning classification model and generates the corresponding submission content from the submission data; generates a submission abstract according to the characteristic classification and the submission content; and generates a Fork abstract in natural-language form according to the submission abstracts. To address the lack of transparency of Fork information in current open source communities, the invention thereby extracts Fork-related data from open source community projects, screens and optimizes the extracted project contribution characteristics on the basis of a large amount of open source community project data, and automatically generates a natural-language Fork abstract through a machine learning algorithm.
Detailed Description
The method, system and medium for automatically generating open source community Fork abstracts based on feature extraction according to the present invention are described in further detail below, taking the Python programming language as an example. On this basis, a person skilled in the art can of course port the embodiment to other programming languages and implement the method, system and medium in the same way.
As shown in fig. 1 and fig. 2, the implementation steps of the method for automatically generating the open source community Fork abstract based on feature extraction in this embodiment include:
1) acquiring input submission data;
2) obtaining the characteristic classification of the submission data through a pre-trained machine learning classification model, and generating the corresponding submission content from the submission data;
3) generating a submission abstract according to the characteristic classification of the submission data and the submission content;
4) generating a Fork abstract in natural-language form according to the submission abstracts.
The machine learning classification model can exploit the characteristic relationship between data that carry characteristic classification labels in GitHub projects and the input submission data in order to classify submission characteristics. The issue data are therefore preprocessed and used as input to train the machine learning classification model. Finally, the submission data input by the user are fed to the trained model to predict the class of their characteristics. In this embodiment, step 2) is preceded by a step of training the machine learning classification model, the detailed steps of which include:
S1) data preprocessing: first, cleaning link data, duplicate issue data and nonstandard-format data out of the issue data (the problem data of a project), and marking the issue data that contains specified special fields; then labeling the remaining issue data with the three characteristic classification labels, for which this embodiment uses the three tags "feature", "bug" and "contribution"; finally, removing common stop words (e.g., "the" and "a"), which occur frequently but contribute little to distinguishing between documents;
S2) converting the preprocessed issue data into multidimensional vectors;
S3) training the machine learning classification model with the multidimensional vectors obtained by the conversion and the corresponding characteristic classification labels.
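The preprocessing of step S1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's actual code: the record layout (`text` and `label` keys), the stop-word list `STOP_WORDS` and the special-field markers `SPECIAL_FIELDS` are all hypothetical, since the source does not enumerate them.

```python
import re

# Hypothetical stop words and special fields; the source lists neither.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}
SPECIAL_FIELDS = ("wontfix", "duplicate")
LABELS = {"feature", "bug", "contribution"}
URL_PATTERN = re.compile(r"https?://\S+")


def preprocess_issues(issues):
    """Clean raw issue records and return token lists with their labels."""
    seen = set()
    cleaned = []
    for issue in issues:
        text, label = issue.get("text"), issue.get("label")
        # Drop nonstandard records (missing text or unknown label).
        if not isinstance(text, str) or label not in LABELS:
            continue
        # Remove links, lower-case, tokenize, and drop stop words.
        text = URL_PATTERN.sub(" ", text).lower()
        tokens = [t for t in re.findall(r"[a-z0-9_]+", text) if t not in STOP_WORDS]
        key = " ".join(tokens)
        # Drop empty records and duplicates of an earlier issue.
        if not tokens or key in seen:
            continue
        seen.add(key)
        cleaned.append({
            "tokens": tokens,
            "label": label,
            # Mark issues that contain one of the special fields.
            "special": any(field in tokens for field in SPECIAL_FIELDS),
        })
    return cleaned


raw = [
    {"text": "Add a dark mode to the settings page https://example.com/1", "label": "feature"},
    {"text": "Add a dark mode to the settings page", "label": "feature"},
    {"text": "Crash in the parser, duplicate of an older report", "label": "bug"},
    {"text": None, "label": "bug"},
]
cleaned = preprocess_issues(raw)
```

Here the second record is dropped as a duplicate of the first once the link is stripped, and the fourth is dropped for its nonstandard format.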
In this embodiment, the detailed steps of converting the preprocessed issue data into multidimensional vectors in step S2) include:
S2.1) extracting text features from the preprocessed issue data to obtain a word frequency count matrix of the words in the data;
the text features can be extracted with any known text feature extraction algorithm as required; for example, this embodiment uses a CountVectorizer model to convert the words in the texts into a word frequency count matrix, i.e. a matrix whose element text[i][j] represents the frequency of word j in text i;
S2.2) evaluating the weight of each word in the word frequency count matrix by the word frequency statistical method TF-IDF (Term Frequency-Inverse Document Frequency), and using the weights to convert the count matrix produced by CountVectorizer into a normalized TF-IDF matrix, which provides the multidimensional vectors.
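Steps S2.1) and S2.2) can be illustrated without any library. The sketch below builds the count matrix text[i][j] and then weights it with TF-IDF. The smoothed IDF formula and the L2 row normalization are assumptions chosen for illustration; the source does not state which TF-IDF variant is used.

```python
import math
from collections import Counter


def count_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in doc i."""
    vocab = sorted({word for doc in docs for word in doc})
    index = {word: j for j, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for word, n in Counter(doc).items():
            matrix[i][index[word]] = n
    return vocab, matrix


def tfidf_matrix(counts):
    """Weight a count matrix by smoothed IDF and L2-normalize each row."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Document frequency: the number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Assumed variant: idf = ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        result.append([x / norm for x in weighted])
    return result


docs = [["fix", "crash", "parser"], ["add", "parser", "option"], ["fix", "typo"]]
vocab, counts = count_matrix(docs)
tfidf = tfidf_matrix(counts)
```

In the first document, "crash" appears in only one document and therefore receives a larger weight than "fix" or "parser", which each appear in two.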
Most of the data to which the invention applies is in text form and short in length. Given these characteristics of the data, the machine learning classification model in this embodiment is based on a random forest (RandomForest), and its effect was verified experimentally. As an optional implementation, this embodiment uses pipeline technology to integrate the modules for vectorization, coefficient acquisition, machine learning classification model training and so on into a whole, which is executed repeatedly while the parameters are tuned in a loop; the result is a complete classification model that classifies input submission data automatically.
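One way to integrate vectorization, TF-IDF weighting and random forest training into a single pipeline is sketched below with scikit-learn. The choice of scikit-learn, the hyperparameters and the toy training texts are assumptions for illustration; the source names CountVectorizer, TF-IDF, RandomForest and a pipeline but gives no code or settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy training data invented for illustration only.
train_texts = [
    "add dark mode option to settings",
    "support csv export for reports",
    "fix crash when parsing empty file",
    "fix null pointer error in login",
    "update contributor guide and readme",
    "improve documentation for new contributors",
]
train_labels = ["feature", "feature", "bug", "bug", "contribution", "contribution"]

# Vectorization, TF-IDF weighting and classification chained into one model.
model = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(train_texts, train_labels)

# Classify a new piece of submission text.
predicted = model.predict(["fix crash when saving a file"])
```

Because the whole chain is one estimator, it can be refitted as a unit on each round of parameter tuning, for example by passing it to a parameter search such as `GridSearchCV`.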
In this embodiment, generating the corresponding submission content from the submission data in step 2) specifically means generating the submission content from the submission data with a keyword extraction algorithm. As an optional implementation, the keyword extraction algorithm in this embodiment is the TextRank algorithm; other well-known keyword extraction algorithms may also be used.
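For orientation, a compact TextRank-style keyword extractor might look like the sketch below: words become nodes of a co-occurrence graph and are ranked by a PageRank-style iteration. The stop-word list, window size, damping factor and iteration count are illustrative assumptions, and the sketch omits refinements such as part-of-speech filtering.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list for illustration.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "when"}


def textrank_keywords(text, top_k=3, window=2, damping=0.85, iterations=50):
    """Rank the words of a text by TextRank and return the top_k keywords."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    # Undirected co-occurrence graph: link words at most `window` apart.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    if not neighbours:
        return list(dict.fromkeys(words))[:top_k]
    # PageRank-style update: each word shares its score among its neighbours.
    scores = dict.fromkeys(neighbours, 1.0)
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbours[n]) for n in neighbours[word])
            for word in neighbours
        }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


message = "Parser crash fixed, parser option added, parser tests updated"
keywords = textrank_keywords(message)
```

In this sample message "parser" co-occurs with every other word, so it ranks first.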
In this embodiment, generating the submission abstract in step 3) according to the characteristic classification of the submission data and the generated submission content specifically means using a specified template to generate a submission abstract that contains the characteristic classification and the generated submission content, where the specified template includes the following information: @commit represents the i-th commit in the Fork; @author represents the submitter; @feature is the obtained characteristic classification of the submission, which is one of the three characteristic classification labels: feature label "feature", issue label "bug" and contribution label "contribution"; @content is the generated submission content; @status is status information extracted from the submission; @change is change information extracted from the submission. As an optional implementation, the form of the template in this embodiment is shown in fig. 3.
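Filling such a template can be sketched as below. The six placeholders follow the fields named above, but the sentence wording in `COMMIT_TEMPLATE` is hypothetical; the actual wording is the one shown in fig. 3, which is not reproduced in the text.

```python
from string import Template

LABELS = {"feature", "bug", "contribution"}

# Hypothetical wording; only the six placeholders come from the description.
COMMIT_TEMPLATE = Template(
    "Commit $commit by $author is a $feature submission about $content; "
    "status: $status; changes: $change."
)


def render_commit_abstract(commit, author, feature, content, status, change):
    """Fill the submission abstract template for a single commit."""
    if feature not in LABELS:
        raise ValueError(f"unknown characteristic classification: {feature!r}")
    return COMMIT_TEMPLATE.substitute(
        commit=commit,
        author=author,
        feature=feature,
        content=content,
        status=status,
        change=change,
    )


abstract = render_commit_abstract(
    commit=3,
    author="alice",
    feature="bug",
    content="parser crash",
    status="merged",
    change="2 files changed",
)
```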
As shown in fig. 4, the detailed steps of step 4) in this embodiment include:
4.1) splitting up, classifying, de-duplicating and counting the data of the multiple submission abstracts to obtain, for each of the three characteristic classification labels (feature label "feature", issue label "bug" and contribution label "contribution"), the content and the number of the corresponding submission abstracts;
4.2) placing the submission abstracts obtained for the three characteristic classification labels at the corresponding positions of the Fork abstract template according to preset rules, thereby obtaining the final Fork abstract.
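The statistics of the first sub-step can be sketched as a simple grouping pass. The record layout (a dict with `feature` and `content` keys) is a hypothetical one, and dropping repeated contents within a label is an assumed reading of the counting step.

```python
LABELS = ("feature", "bug", "contribution")


def group_commit_abstracts(abstracts):
    """Group submission abstracts by label and count them per label."""
    contents = {label: [] for label in LABELS}
    for item in abstracts:
        label, content = item["feature"], item["content"]
        # Keep each content once per label (assumed de-duplication).
        if label in contents and content not in contents[label]:
            contents[label].append(content)
    counts = {label: len(items) for label, items in contents.items()}
    return contents, counts


abstracts = [
    {"feature": "bug", "content": "parser crash"},
    {"feature": "feature", "content": "dark mode"},
    {"feature": "bug", "content": "login error"},
    {"feature": "bug", "content": "parser crash"},
]
contents, counts = group_commit_abstracts(abstracts)
```

The resulting per-label contents and counts are the inputs that the Fork abstract template consumes in the second sub-step.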
The Fork abstract template in this embodiment is shown in fig. 5 and includes two sub-modules, Template1 and Template2: sub-module Template1 shows the structure and elements of the final result, the Fork abstract, and sub-module Template2 shows how the content of the Fork is formed. According to a survey of open source community developers, developers generally care about whether the Fork abstract expresses the Fork information accurately, omits no important data, and highlights the changes and contribution characteristics of each submission node.
In sub-module Template1:
@fork_summary is the final result;
@b_commit and @e_commit indicate the start commit data and the end commit data selected by the user. For convenience, this embodiment typically uses the last four digits of the SHA check code of the commit data to represent the address of the commit data;
@fork_name is the name of the Fork obtained from the input data in this embodiment;
@fork_content is the specific content description of the Fork generated by this embodiment.
In sub-module Template2:
k ranges over the three elements feature, bug and contribution. The variables @num_k and @content_k correspond to the number and the content for each value of k, which are the data obtained in the preceding counting process;
@feature is the characteristic class of a submission;
@feature_content is the content of each characteristic;
@fork_content is the sum of all characteristics.
In general, Template2 shows the detailed work done by the Fork on a particular characteristic.
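How the two sub-modules fit together can be sketched as follows: a Template2-like function builds @fork_content from the per-label numbers and contents, and a Template1-like function wraps it with the Fork name and commit range. The sentence wording is hypothetical, since the actual templates are those of fig. 5, and the function names are illustrative only.

```python
LABELS = ("feature", "bug", "contribution")


def render_fork_content(counts, contents):
    """Template2 sketch: one clause per label k from num_k and content_k."""
    clauses = []
    for label in LABELS:
        num_k = counts.get(label, 0)
        content_k = contents.get(label) or []
        # Labels without commits or content are left out of the text.
        if num_k and content_k:
            clauses.append(f"{num_k} {label} commit(s) ({', '.join(content_k)})")
    return "; ".join(clauses)


def render_fork_summary(fork_name, b_commit, e_commit, fork_content):
    """Template1 sketch: wrap fork_content with the name and commit range."""
    return f"Fork {fork_name} (commits {b_commit} to {e_commit}) contains {fork_content}."


counts = {"feature": 2, "bug": 1, "contribution": 0}
contents = {
    "feature": ["dark mode", "CSV export"],
    "bug": ["parser crash"],
    "contribution": [],
}
fork_content = render_fork_content(counts, contents)
fork_summary = render_fork_summary("alice/project", "1a2b", "9f3c", fork_content)
```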
To handle the various error conditions that can arise when generating the Fork abstract, this embodiment considers situations such as an empty Fork category, an empty Fork characteristic and repeated characteristics of submission data, and constructs the following rules to match the different kinds of missing data, so as to ensure the natural-language fluency of the final Fork abstract, as shown in fig. 6, where:
Rule1 indicates:
if the Fork category count @num_k = 0, then the sum of all characteristics @fork_content is null;
Rule2 indicates:
if the Fork characteristic @content_k is null, then the sum of all characteristics @fork_content is null;
Rule3 indicates:
if the commit data @b_commit and the commit data @e_commit repeat each other (the characters from the 4th-to-last position to the last position are the same), the truncated @e_commit is assigned to both the commit data @b_commit and the commit data @e_commit;
Rule4 indicates:
if the sum of the Fork category counts @num_k is 0, the final Fork abstract is the generated string "pair-missing, non-contributing".
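The four rules can be read as a normalization pass applied before the templates are filled. The sketch below is one possible interpretation, not the embodiment's code: it treats Rule1 and Rule2 as blanking the part for the affected label k, and the function name and return layout are hypothetical.

```python
LABELS = ("feature", "bug", "contribution")
# The string quoted in Rule4 of the description.
EMPTY_FORK_MESSAGE = "pair-missing, non-contributing"


def apply_rules(counts, contents, b_commit, e_commit):
    """Normalize the template inputs according to Rule1 to Rule4."""
    parts = {}
    for label in LABELS:
        num_k = counts.get(label, 0)
        content_k = contents.get(label)
        # Rule1 and Rule2: no commits or no content means an empty part.
        parts[label] = content_k if num_k and content_k else None
    # Commit addresses are represented by the last four characters.
    b_short, e_short = b_commit[-4:], e_commit[-4:]
    # Rule3: identical addresses collapse to the truncated end commit.
    single_commit = b_short == e_short
    if single_commit:
        b_short = e_short
    # Rule4: with no commits in any category, use the fixed message.
    total = sum(counts.get(label, 0) for label in LABELS)
    fallback = EMPTY_FORK_MESSAGE if total == 0 else None
    return {
        "parts": parts,
        "b_commit": b_short,
        "e_commit": e_short,
        "single_commit": single_commit,
        "fallback": fallback,
    }


result = apply_rules(
    {"feature": 2, "bug": 0, "contribution": 1},
    {"feature": ["dark mode", "export"], "bug": [], "contribution": None},
    "9f3c1a2b7d",
    "0011aa2b7d",
)
empty = apply_rules({}, {}, "aaaa1111", "bbbb2222")
```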
To further verify the automatic generation method for open source community Fork abstracts based on feature extraction in this embodiment, 30 sets of manual tests and questionnaire tests were carried out, together with example tests involving 17 developers on GitHub; the resulting submission data classification accuracy and Fork abstract accuracy are shown in Table 1 and fig. 7.
Table 1: Submission data classification accuracy and Fork abstract accuracy.
Label | Precision | Recall | F1-score | Support
Contribution | 0.59 | 0.79 | 0.67 | 448
Feature | 0.66 | 0.78 | 0.58 | 343
Bug | 0.64 | 0.67 | 0.72 | 200
In Table 1, Label, Precision, Recall, F1-score and Support respectively denote the label type, the precision, the recall, the harmonic mean of precision and recall, and the number of samples supporting each label; Contribution, Feature and Bug respectively denote the three characteristic classification labels: contribution label "contribution", feature label "feature" and issue label "bug". As can be seen from Table 1 and fig. 7, the automatic generation method for open source community Fork abstracts based on feature extraction in this embodiment achieves a Fork abstract generation accuracy of 0.672 and is rated 47% helpful for developers' development work.
In summary, the automatic generation method for open source community Fork abstracts based on feature extraction can generate the Fork abstract automatically: taking a submission address as input, it outputs the Fork abstract immediately after a simple initialization setup. This embodiment uses the tool to test abstract generation on a project in a real open source software (OSS) community.
In addition, this embodiment also provides an automatic generation system of open source community Fork abstract based on feature extraction, including:
an input program unit for acquiring input submission data;
an input processing program unit for obtaining the characteristic classification of the submission data through a pre-trained machine learning classification model, and for generating the corresponding submission content from the submission data;
a submission abstract generating program unit for generating a submission abstract according to the characteristic classification of the submission data and the submission content;
a Fork abstract generating program unit for generating a Fork abstract in natural-language form according to the submission abstracts.
In addition, the embodiment also provides an open source community Fork abstract automatic generation system based on feature extraction, which includes a computer device, where the computer device is programmed or configured to execute the steps of the aforementioned open source community Fork abstract automatic generation method based on feature extraction, or a memory of the computer device stores a computer program that is programmed or configured to execute the aforementioned open source community Fork abstract automatic generation method based on feature extraction.
In addition, the present embodiment also provides a computer readable storage medium, which stores thereon a computer program programmed or configured to execute the aforementioned automatic generation method of the open source community Fork abstract based on feature extraction.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.