CN111061864A - Automatic open source community Fork abstract generation method, system and medium based on feature extraction - Google Patents

Automatic open source community Fork abstract generation method, system and medium based on feature extraction

Info

Publication number
CN111061864A
Authority
CN
China
Prior art keywords
submission
fork
data
feature
open source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911338392.1A
Other languages
Chinese (zh)
Other versions
CN111061864B (en
Inventor
毛新军
张超
卢遥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201911338392.1A priority Critical patent/CN111061864B/en
Publication of CN111061864A publication Critical patent/CN111061864A/en
Application granted granted Critical
Publication of CN111061864B publication Critical patent/CN111061864B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, system and medium for automatically generating open source community Fork abstracts based on feature extraction. For input submission (commit) data, the invention obtains the corresponding characteristic classification through a pre-trained machine learning classification model and generates the corresponding submission content from the submission data; it generates a submission summary from the characteristic classification and the submission content; and it generates a natural-language Fork abstract from the submission summaries. Drawing on a large amount of open source community project data, and addressing the current lack of transparency of Fork information in open source communities, the invention extracts Fork-related data from open source projects, filters and refines it to extract project contribution features, and automatically generates natural-language Fork abstracts through machine learning algorithms.

Description

Automatic open source community Fork abstract generation method, system and medium based on feature extraction
Technical Field
The invention relates to the field of open source software development, and in particular to a method, system and medium for automatically generating an open source community Fork abstract based on feature extraction. Addressing the current lack of transparency of Fork information in open source communities, it extracts project contribution features from a large amount of open source community project data and automatically generates a natural-language Fork abstract through a machine learning algorithm.
Background
In Open Source Software (OSS) development, Fork-based development (copying, derivation and branching) has become an important part of collaborative development. The purpose of a Fork is to make a full copy of a code repository, and the Fork mechanism allows developers to copy a repository without its author's consent. Developers are free to fork public repositories and make changes in their forked repositories. Forking is also a way of starting a new project.
However, the rapid development of the OSS community also presents challenges to Fork-based development. On the one hand, the rapid growth in the number of contributors has produced a large number of branches and contributions, especially for popular projects, which has enriched the ecological diversity of open source communities. On the other hand, as the number of Forks increases, existing Fork visualization tools cannot maintain a good overview of Fork information, especially of the changes in individual Forks. As a result, the development of an open source project cannot draw on the large amount of Fork data as a reference, and because existing tools do not meet developers' need for transparent Fork information, developers must search Forks manually. In addition, because developers differ greatly in experience and habits, a large number of Forks contain incomplete annotations, unclear characteristics and opaque information. Such Forks cost developers time and effort and make it hard for them to understand the goals and characteristics of other developers' Fork-based contributions. Thus, opaque Fork information and the lack of suitable tools make it difficult to identify the many Forks effectively by manual methods and for core developers to make sound decisions.
Disclosure of Invention
The technical problems to be solved by the invention are as follows: aiming at the problems in the prior art, the invention provides a method, a system and a medium for automatically generating the Fork abstract of the open source community based on feature extraction.
In order to solve the technical problems, the invention adopts the technical scheme that:
A method for automatically generating an open source community Fork abstract based on feature extraction comprises the following implementation steps:
1) acquiring input submission data;
2) obtaining the corresponding characteristic classification of the submission data through a pre-trained machine learning classification model, and generating the corresponding submission content from the submission data;
3) generating a submission summary from the characteristic classification of the submission data and the submission content;
4) generating a natural-language Fork abstract from the submission summary.
Optionally, step 2) is preceded by a step of training the machine learning classification model, and the detailed steps include:
S1) data preprocessing: first, cleaning the issue data of entries containing links, duplicate entries and entries in non-standard formats, tagging issue data that contains specified special fields, and removing stop words; then labeling the remaining issue data with one of the three characteristic classification labels feature, bug and contribution;
S2) converting the preprocessed issue data into a multidimensional vector;
S3) training the machine learning classification model with the multidimensional vectors obtained by conversion and their corresponding characteristic classification labels.
Optionally, the detailed steps of step S2) of converting the preprocessed data into a multidimensional vector include:
S2.1) performing text feature extraction on the preprocessed issue data to obtain a word frequency count matrix of the words in the data;
S2.2) evaluating the weight of each word in the word frequency count matrix using the TF-IDF statistical method, and using the weights to convert the word frequency matrix into a multidimensional vector in the form of a TF-IDF matrix.
Optionally, the machine learning classification model is a random forest based machine learning classification model.
Optionally, generating the corresponding submission content from the submission data in step 2) specifically means generating the submission content from the submission data with a keyword extraction algorithm.
Optionally, generating the submission summary in step 3) from the characteristic classification and the generated submission content specifically means using a specified template to generate a submission summary that contains both. The specified template includes the following information: @commit_i represents the i-th commit in the Fork; @author represents the submitter; @feature is the characteristic classification obtained for the submission, which is one of the three characteristic classification labels feature, bug and contribution; @content is the generated submission content; @status is the status information extracted from the submission; @change is the change information extracted from the submission.
Optionally, the detailed steps of step 4) include:
3.1) splitting up the submission summaries, grouping them by label and recounting them, to obtain the content and number of the submission summaries under each of the three characteristic classification labels feature, bug and contribution;
3.2) placing the submission summaries obtained for the three characteristic classification labels feature, bug and contribution at the corresponding positions of the Fork abstract template according to preset rules, and obtaining the final Fork abstract.
In addition, the invention also provides a system for automatically generating an open source community Fork abstract based on feature extraction, comprising:
an input program unit for acquiring input submission data;
the input processing program unit is used for obtaining corresponding characteristic classification of the submitted data through a machine learning classification model trained in advance, and generating the submitted content aiming at the submitted data to obtain the corresponding submitted content;
the submitted abstract generating program unit is used for classifying the characteristics of the submitted data and generating the submitted abstract according to the submitted content;
and the Fork abstract generating program unit is used for generating a Fork abstract of a natural language form according to the submitted abstract.
In addition, the invention also provides an open source community Fork abstract automatic generation system based on feature extraction, which comprises a computer device, wherein the computer device is programmed or configured to execute the steps of the open source community Fork abstract automatic generation method based on feature extraction, or a computer program which is programmed or configured to execute the open source community Fork abstract automatic generation method based on feature extraction is stored on a memory of the computer device.
In addition, the invention also provides a computer readable storage medium, which stores a computer program programmed or configured to execute the method for automatically generating the Fork abstract of the open source community based on the feature extraction.
Compared with the prior art, the invention has the following advantages. For input submission data, the invention obtains the corresponding characteristic classification through a pre-trained machine learning classification model and generates the corresponding submission content from the submission data; it generates a submission summary from the characteristic classification and the submission content; and it generates a natural-language Fork abstract from the submission summary. In this way, addressing the current lack of transparency of Fork information in open source communities, it extracts Fork-related data from open source community projects, filters and refines it to extract project contribution features on the basis of a large amount of open source community project data, and automatically generates a natural-language Fork abstract through a machine learning algorithm.
Drawings
FIG. 1 is a schematic diagram of the basic principle of the method according to the embodiment of the present invention.
FIG. 2 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of the submission summary template used in the embodiment of the present invention.
FIG. 4 is a schematic flow chart of step 4) in the embodiment of the present invention.
FIG. 5 is a diagram of a Fork abstract template in an embodiment of the present invention.
FIG. 6 is a diagram of rules for matching different default data according to an embodiment of the present invention.
FIG. 7 shows the results of the submitted data classification accuracy and Fork digest accuracy tests in the embodiment of the present invention.
Detailed Description
The method, system and medium for automatically generating the open source community Fork abstract based on feature extraction according to the present invention are described in further detail below, taking the Python programming language as an example. On this basis, a person skilled in the art can also port the embodiment to other programming languages and thereby likewise implement the method, system and medium.
As shown in fig. 1 and fig. 2, the implementation steps of the method for automatically generating the open source community Fork abstract based on feature extraction in this embodiment include:
1) acquiring input submission data;
2) obtaining the corresponding characteristic classification of the submission data through a pre-trained machine learning classification model, and generating the corresponding submission content from the submission data;
3) generating a submission summary from the characteristic classification of the submission data and the submission content;
4) generating a natural-language Fork abstract from the submission summary.
To classify the characteristics of submissions, the machine learning classification model can be trained on the relationship between data carrying characteristic classification labels in GitHub projects and the input submission data. The issue data is therefore preprocessed and used as input to train the machine learning classification model. Finally, the submission data input by the user is fed to the trained model to predict its characteristic class. In this embodiment, step 2) is preceded by a step of training the machine learning classification model, and the detailed steps include:
S1) data preprocessing: first, cleaning the issue data of entries containing links, duplicate entries and entries in non-standard formats, tagging issue data that contains specified special fields, and removing stop words; then labeling the remaining issue data with one of three characteristic classification labels. In this embodiment, the three labels are "feature", "bug" and "contribution". Common stop words (e.g., "the" and "a"), which occur frequently but do little to distinguish between documents, are removed;
S2) converting the preprocessed issue data into a multidimensional vector;
S3) training the machine learning classification model with the multidimensional vectors obtained by conversion and their corresponding characteristic classification labels.
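The preprocessing in step S1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's actual code: the function name `preprocess_issues`, the stop-word list and the choice to strip URLs from an entry (rather than discard the whole entry) are all hypothetical, since the source names only "the" and "a" as example stop words and does not say how linked data is cleaned.

```python
import re

# Hypothetical stop-word list; the source only names "the" and "a" as examples.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}
URL_RE = re.compile(r"https?://\S+")


def preprocess_issues(issues):
    """Clean (text, label) pairs: strip links, drop duplicates, remove stop words."""
    seen = set()
    cleaned = []
    for text, label in issues:
        text = URL_RE.sub(" ", text.lower())        # assumption: links are stripped
        tokens = re.findall(r"[a-z0-9_]+", text)    # keep simple word tokens
        tokens = [t for t in tokens if t not in STOP_WORDS]
        if not tokens:                              # nothing usable left
            continue
        key = " ".join(tokens)
        if key in seen:                             # repeated issue data
            continue
        seen.add(key)
        cleaned.append((key, label))
    return cleaned


issues = [
    ("Fix the crash in parser, see https://example.com/1", "bug"),
    ("Fix the crash in parser, see https://example.com/2", "bug"),
    ("Add a dark mode to the settings page", "feature"),
]
print(preprocess_issues(issues))
```

The two near-identical bug reports collapse into one entry once their links are removed, which is the kind of duplicate the cleaning step is meant to catch.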
In this embodiment, the step S2) of converting the preprocessed data into the multidimensional vector includes:
S2.1) performing text feature extraction on the preprocessed issue data to obtain a word frequency count matrix of the words in the data;
Text feature extraction can use any known text feature extraction algorithm as required. For example, this embodiment uses the CountVectorizer model to convert the words in a text into a word frequency count matrix, i.e., a matrix with elements text[i][j], each representing the frequency of word j in the i-th text;
S2.2) evaluating the weight of each word in the word frequency count matrix using the TF-IDF (Term Frequency-Inverse Document Frequency) statistical method, and using the weights to convert the count matrix produced by CountVectorizer into a multidimensional vector in the form of a normalized TF-IDF matrix.
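To make steps S2.1) and S2.2) concrete, the following pure-Python sketch builds a count matrix and converts it to a normalized TF-IDF matrix by hand. It is not the embodiment's CountVectorizer call: the whitespace tokenization, the function names, and the particular smoothed IDF formula with L2 row normalization are assumptions chosen for illustration, because the source does not state which settings were used.

```python
import math


def count_matrix(docs):
    """Return (vocab, counts), where counts[i][j] is the frequency of word j in text i."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: j for j, w in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in docs]
    for i, d in enumerate(docs):
        for w in d.split():
            counts[i][index[w]] += 1
    return vocab, counts


def tfidf(counts):
    """Weight counts by a smoothed IDF, then L2-normalize each row."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Document frequency: number of texts in which each word occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Assumed smoothing: idf = ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_terms)]
    out = []
    for row in counts:
        weighted = [c * idf[j] for j, c in enumerate(row)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        out.append([x / norm for x in weighted])
    return out


docs = ["fix crash parser", "add parser option"]
vocab, counts = count_matrix(docs)
matrix = tfidf(counts)
print(vocab)
print(counts)
```

In this toy input the word "parser" appears in both texts, so its weight in each row ends up lower than that of the words unique to one text, which is the effect TF-IDF weighting is meant to have.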
The data to which the invention applies is mostly short text. Given these characteristics, the machine learning classification model in this embodiment is based on a random forest (RandomForest), a choice that was checked experimentally. As an optional implementation, this embodiment uses a pipeline to integrate vectorization, weight computation and classification model training into a single unit that is executed repeatedly while parameters are tuned in a loop, finally yielding a complete classification model that automatically classifies input submission data.
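One way to combine vectorization, TF-IDF weighting and a random forest into a single pipeline is sketched below with scikit-learn. The source does not name the library calls or hyperparameters beyond CountVectorizer and RandomForest, so the step names, `n_estimators=50`, `random_state=0` and the six training sentences are made-up placeholders, and no claim is made that this reproduces the reported accuracy.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy, made-up training data standing in for labelled issue text.
texts = [
    "fix crash when parsing empty file",
    "fix null pointer error in loader",
    "add dark mode option to settings",
    "add export to csv feature",
    "update readme and contributing guide",
    "improve documentation for installation",
]
labels = ["bug", "bug", "feature", "feature", "contribution", "contribution"]

# Vectorization, weighting and classification chained as one estimator,
# so the whole chain can be refit each time a parameter is changed.
model = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(texts, labels)

predicted = model.predict(["fix crash in loader"])[0]
print(predicted)
```

Wrapping the stages in one `Pipeline` object means a parameter search can treat the vectorizer and the classifier settings together, which matches the repeated tuning loop described above.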
In this embodiment, generating the corresponding submission content from the submission data in step 2) specifically means generating the submission content from the submission data with a keyword extraction algorithm. As an optional implementation, the keyword extraction algorithm in this embodiment is the TextRank algorithm; other well-known keyword extraction algorithms may also be used.
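The idea behind TextRank keyword extraction can be sketched in a few lines: words that co-occur within a small window are linked in a graph, and a PageRank-style iteration scores each word. The version below is a simplified illustration, not the embodiment's implementation; the function name, the window size, the damping factor of 0.85 and the fixed iteration count are assumptions.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_k=3, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each word to the words that follow it inside the window.
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    if not neighbours:
        return []
    n = len(neighbours)
    score = {w: 1.0 / n for w in neighbours}
    for _ in range(iterations):
        score = {
            w: (1 - damping) / n
            + damping * sum(score[v] / len(neighbours[v]) for v in neighbours[w])
            for w in neighbours
        }
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(score, key=lambda w: (-score[w], w))
    return ranked[:top_k]


tokens = ["fix", "parser", "crash", "parser", "bug", "parser", "error"]
keywords = textrank_keywords(tokens)
print(keywords)
```

In the sample tokens every other word is adjacent only to "parser", so "parser" sits at the centre of the graph and receives the highest score.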
In this embodiment, generating the submission summary in step 3) from the characteristic classification and the generated submission content specifically means using a specified template to generate a submission summary that contains both. The specified template includes the following information: @commit_i represents the i-th commit in the Fork; @author represents the submitter; @feature is the characteristic classification obtained for the submission, which is one of the three characteristic classification labels feature, bug and contribution; @content is the generated submission content; @status is the status information extracted from the submission; @change is the change information extracted from the submission. As an optional implementation, the form of the template in this embodiment is shown in fig. 3.
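Filling such a template amounts to simple string substitution over the six fields. The sketch below shows one possible shape; the sentence wording in `TEMPLATE`, the class name `CommitSummary` and the sample values are hypothetical, since the actual template appears only in fig. 3.

```python
from dataclasses import asdict, dataclass


@dataclass
class CommitSummary:
    index: int     # @commit_i: position of the commit within the Fork
    author: str    # @author: the submitter
    feature: str   # @feature: "feature", "bug" or "contribution"
    content: str   # @content: generated submission content
    status: str    # @status: status information extracted from the submission
    change: str    # @change: change information extracted from the submission


# Hypothetical wording; the real template is given only in fig. 3.
TEMPLATE = ("Commit {index} by {author} is a {feature} commit about {content} "
            "(status: {status}; changes: {change}).")


def render_commit_summary(summary):
    """Substitute the summary's fields into the template."""
    return TEMPLATE.format(**asdict(summary))


example = CommitSummary(1, "alice", "bug", "parser crash", "merged", "2 files changed")
print(render_commit_summary(example))
```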
As shown in fig. 4, the detailed steps of step 4) of this embodiment include:
3.1) splitting up the submission summaries, grouping them by label and recounting them, to obtain the content and number of the submission summaries under each of the three characteristic classification labels feature, bug and contribution;
3.2) placing the submission summaries obtained for the three characteristic classification labels feature, bug and contribution at the corresponding positions of the Fork abstract template according to preset rules, and obtaining the final Fork abstract.
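Step 3.1) is essentially a group-and-count over the per-commit summaries. A minimal sketch, assuming each summary has already been reduced to a (label, content) pair; the function name and sample data are hypothetical.

```python
LABELS = ("feature", "bug", "contribution")


def group_by_label(commit_summaries):
    """Return {label: [contents]} for (label, content) pairs.

    Raises KeyError for a label outside the three expected ones.
    """
    grouped = {label: [] for label in LABELS}
    for label, content in commit_summaries:
        grouped[label].append(content)
    return grouped


summaries = [
    ("bug", "parser crash"),
    ("feature", "dark mode"),
    ("bug", "loader error"),
]
grouped = group_by_label(summaries)
counts = {label: len(items) for label, items in grouped.items()}
print(grouped)
print(counts)
```

Labels with no commits are kept with an empty list and a count of zero, so the later template step can detect and handle them explicitly.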
The Fork abstract template in this embodiment is shown in fig. 5 and includes two sub-modules, Template1 and Template2. Template1 shows the structure and elements of the final Fork abstract, and Template2 shows how the Fork content is formed. According to a survey of open source community developers, what people generally care about is whether the Fork abstract accurately expresses the Fork information, omits no important data, and highlights the changes and contribution characteristics of each commit.
In sub-module Template1:
@fork_summary is the final result;
@b_commit and @e_commit indicate the start commit and the end commit selected by the user. For convenience, this embodiment uses the last four characters of a commit's SHA hash to represent the address of that commit;
@fork_name is the name of the Fork obtained from the input data;
@fork_content is the specific content description of the Fork generated by this embodiment.
In sub-module Template2:
k ranges over the three labels feature, bug and contribution. The variables @num_k and @content_k are the number and content of the submission summaries under each label k, as obtained in the preceding counting step;
@feature is the characteristic class of a submission;
@feature_content is the content of each characteristic;
@fork_content is the combination of the contents of all characteristics.
In general, Template2 shows the detailed work done by the Fork on each characteristic.
To handle the error cases that can arise when generating the Fork abstract, this embodiment considers situations such as an empty Fork category, an empty Fork characteristic and repeated submission data, and constructs the following rules to match the different default data, so as to keep the final Fork abstract fluent as natural language, as shown in fig. 6, where:
Rule1 indicates:
if the Fork category count @num_k = 0, then the sum of all characteristics @fork_content is null;
Rule2 indicates:
if the Fork characteristic @content_k is null, then the sum of all characteristics @fork_content is null;
Rule3 indicates:
if the submission data @b_commit and the submission data @e_commit are repeated (the characters from the fourth-last to the last position are the same), the truncated @e_commit is assigned to both @b_commit and @e_commit;
Rule4 indicates:
if the sum of the Fork category counts @num_k is 0, the final result @fork_summary is the generated string "pair-missing, non-contributing".
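The four rules can be read as guards applied while the Fork abstract template is filled. The sketch below is one possible reading, not the patented implementation: the sentence wording, the function names and the decision to treat Rule1 and Rule2 as "omit the clause for an empty label" are assumptions, because the actual templates and rule table appear only in fig. 5 and fig. 6.

```python
FALLBACK = "pair-missing, non-contributing"  # string quoted from Rule4 in the text


def short_sha(sha):
    """Last four characters of a commit SHA, as used for @b_commit and @e_commit."""
    return sha[-4:]


def fork_summary(fork_name, b_sha, e_sha, grouped):
    """Build a Fork abstract; grouped maps each label k to its contents (@content_k)."""
    # Rule4: no commits under any label.
    if sum(len(items) for items in grouped.values()) == 0:
        return FALLBACK
    # Rule1 / Rule2 (assumed reading): a label with no count or content adds no clause.
    clauses = [
        f"{len(items)} {label} commit(s): {'; '.join(items)}"
        for label, items in grouped.items()
        if items
    ]
    fork_content = ". ".join(clauses)
    # Rule3: identical start and end commits collapse to a single reference.
    b, e = short_sha(b_sha), short_sha(e_sha)
    span = f"at commit {e}" if b == e else f"from commit {b} to commit {e}"
    return f"Fork {fork_name} {span} contains {fork_content}."


grouped = {"feature": ["dark mode"], "bug": [], "contribution": []}
print(fork_summary("alice/proj", "abc1234", "def5678", grouped))
```

Handling the empty and repeated cases before any text is assembled keeps the output free of dangling fragments such as "0 bug commit(s):", which is the fluency concern the rules are meant to address.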
To further verify the method of this embodiment, 30 groups of manual tests and questionnaire surveys were carried out, together with case tests involving 17 GitHub developers; the resulting classification accuracy of the submission data and the accuracy of the Fork abstracts are shown in Table 1 and fig. 7.
Table 1: Classification accuracy of submission data and Fork abstract accuracy.
Label          Precision  Recall  F1-score  Support
Contribution   0.59       0.79    0.67      448
Feature        0.66       0.78    0.58      343
Bug            0.64       0.67    0.72      200
In Table 1, Label, Precision, Recall, F1-score and Support denote the label type, the precision, the recall, the harmonic mean of precision and recall, and the number of samples carrying each label, respectively; Contribution, Feature and Bug denote the three characteristic classification labels contribution, feature and bug. As can be seen from Table 1 and fig. 7, the method of this embodiment achieves a Fork abstract generation accuracy of 0.672 and is reported to be 47% helpful to developers in their development work.
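For reference, the conventional F1-score is the harmonic mean of precision and recall. The short sketch below applies that formula to the Contribution row; it is included only to illustrate the definition, and the function name is hypothetical. Values recomputed from the rounded table entries need not reproduce the reported F1 column, and the source does not state how that column was derived.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Contribution row of Table 1: precision 0.59, recall 0.79.
contribution_f1 = f1_score(0.59, 0.79)
print(round(contribution_f1, 3))
```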
In summary, the method of this embodiment can generate the Fork abstract automatically: taking a commit address as input, and after simple initialization settings, it outputs the Fork abstract immediately. This embodiment uses the tool to test summary generation for projects in a real OSS community.
In addition, this embodiment also provides an automatic generation system of open source community Fork abstract based on feature extraction, including:
an input program unit for acquiring input submission data;
the input processing program unit is used for obtaining corresponding characteristic classification of the submitted data through a machine learning classification model trained in advance, and generating the submitted content aiming at the submitted data to obtain the corresponding submitted content;
the submitted abstract generating program unit is used for classifying the characteristics of the submitted data and generating the submitted abstract according to the submitted content;
and the Fork abstract generating program unit is used for generating a natural-language Fork abstract according to the submitted abstract.
In addition, the embodiment also provides an open source community Fork abstract automatic generation system based on feature extraction, which includes a computer device, where the computer device is programmed or configured to execute the steps of the aforementioned open source community Fork abstract automatic generation method based on feature extraction, or a memory of the computer device stores a computer program that is programmed or configured to execute the aforementioned open source community Fork abstract automatic generation method based on feature extraction.
In addition, the present embodiment also provides a computer readable storage medium, which stores thereon a computer program programmed or configured to execute the aforementioned automatic generation method of the open source community Fork abstract based on feature extraction.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (10)

1.一种基于特征提取的开源社区Fork摘要自动生成方法,其特征在于实施步骤包括:1. an open source community Fork abstract automatic generation method based on feature extraction, is characterized in that implementing step comprises: 1)获取输入的提交数据;1) Get the input submission data; 2)将提交数据通过预先训练好的机器学习分类模型得到起对应的特性分类,并针对提交数据进行提交内容生成得到对应的提交内容;2) Obtain the corresponding feature classification of the submitted data through the pre-trained machine learning classification model, and generate the submission content for the submitted data to obtain the corresponding submission content; 3)将提交数据的特性分类、提交内容生成提交摘要;3) Categorize the characteristics of the submitted data and generate a submission summary; 4)根据提交摘要生成自然语言式的Fork摘要。4) Generate a natural language fork summary based on the submission summary. 2.根据权利要求1所述的基于特征提取的开源社区Fork摘要自动生成方法,其特征在于,步骤2)之前还包括训练机器学习分类模型的步骤,详细步骤包括:2. The method for automatically generating an open source community Fork summary based on feature extraction according to claim 1, wherein the step 2) further comprises the step of training a machine learning classification model, and the detailed steps include: S1)进行数据预处理:首先分别对问题数据中含有链接的数据、重复问题数据、非标准格式数据进行清洗,对含有指定特殊字段的问题数据进行标记和停止单词删除;然后对剩余的问题数据标记为问题标签feature、没有问题标签bug、贡献contribution三种特性分类标签;S1) Data preprocessing: first, clean the linked data, repeated question data, and non-standard format data in the question data, mark and stop word deletion of the question data containing the specified special fields; then delete the remaining question data Labeled as issue label feature, no issue label bug, contribution contribution three feature classification labels; S2)将预处理后的问题数据转换为多维向量;S2) Convert the preprocessed problem data into a multi-dimensional vector; S3)将转换得到的多维向量及其对应的特性分类标签训练机器学习分类模型。S3) Train a machine learning classification model with the converted multi-dimensional vector and its corresponding feature classification label. 3.根据权利要求2所述的基于特征提取的开源社区Fork摘要自动生成方法,其特征在于,步骤S2)将预处理后的数据转换为多维向量的详细步骤包括:3. 
The method for automatically generating an open source community Fork summary based on feature extraction according to claim 2, wherein the detailed steps of converting the preprocessed data into a multi-dimensional vector in step S2) comprise:
S2.1) performing text feature extraction on the preprocessed issue data to obtain a word frequency count matrix of the words in the data;
S2.2) using the TF-IDF term-weighting method to evaluate the weight of each word in the word frequency count matrix, and using the weights to convert the word frequency count matrix into a multi-dimensional vector in the form of a TF-IDF matrix.
4. The method for automatically generating an open source community Fork summary based on feature extraction according to claim 1, wherein the machine learning classification model is a random forest-based machine learning classification model.
5. The method for automatically generating an open source community Fork summary based on feature extraction according to claim 1, wherein, in step 2), generating submission content for the submission data to obtain the corresponding submission content specifically means applying a keyword extraction algorithm to the submission data to generate the corresponding submission content.
6. The method for automatically generating an open source community Fork summary based on feature extraction according to claim 1, wherein, in step 3), generating a submission summary according to the classification of the submission features and the generated submission content specifically means using a specified template to generate a submission summary containing the classification of the submission features and the generated submission content, the specified template including the following information: @commiti denotes the i-th submission in the Fork; @author denotes the submitter; @feature is the obtained classification of the submission feature, which takes one of three feature classification labels: the issue label feature, the no-issue label bug, and the contribution label contribution; @content is the obtained submission content; @status is the status information extracted from the submission; @change is the change information extracted from the submission.
7. The method for automatically generating an open source community Fork summary based on feature extraction according to claim 1, wherein the detailed steps of step 4) comprise:
3.1) breaking up, classifying, and re-counting the multiple submission summaries to obtain the content and number of the submission summaries corresponding to each of the three feature classification labels: the issue label feature, the no-issue label bug, and the contribution label contribution;
3.2) according to preset rules, placing the submission summaries corresponding to the three feature classification labels (the feature label feature, the issue label bug, and the contribution label contribution) at the corresponding positions of the Fork summary template to obtain the final Fork summary.
8. A system for automatically generating an open source community Fork summary based on feature extraction, characterized by comprising:
an input program unit for obtaining input submission data;
an input processing program unit for passing the submission data through a pre-trained machine learning classification model to obtain its corresponding feature classification, and for generating submission content for the submission data to obtain the corresponding submission content;
a submission summary generation program unit for generating a submission summary from the feature classification of the submission data and the submission content;
a Fork summary generation program unit for generating a natural-language Fork summary from the submission summaries.
9. A system for automatically generating an open source community Fork summary based on feature extraction, comprising a computer device, characterized in that the computer device is programmed or configured to execute the steps of the method for automatically generating an open source community Fork summary based on feature extraction according to any one of claims 1 to 7, or a memory of the computer device stores a computer program programmed or configured to execute the method for automatically generating an open source community Fork summary based on feature extraction according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program programmed or configured to execute the method for automatically generating an open source community Fork summary based on feature extraction according to any one of claims 1 to 7.
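As an illustration only, steps S2.1) and S2.2) of claim 3 can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the patented implementation: the function names `count_matrix` and `tfidf_matrix`, the whitespace tokenization, the sample texts, and the particular TF-IDF variant (term frequency normalized by document length, idf = ln(N / df)) are all hypothetical choices that the claims do not specify.

```python
import math
from collections import Counter


def count_matrix(docs):
    """S2.1 sketch: build a vocabulary and a word frequency count matrix.

    Assumption: texts are already preprocessed and are split on whitespace.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    counts = []
    for tokens in tokenized:
        freq = Counter(tokens)
        counts.append([freq[word] for word in vocab])
    return vocab, counts


def tfidf_matrix(counts):
    """S2.2 sketch: convert the count matrix into a TF-IDF matrix.

    Assumed variant: tf = count / words in the document, idf = ln(N / df).
    """
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Document frequency: number of documents that contain each word.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in counts:
        total = sum(row)
        weighted.append([
            (row[j] / total) * math.log(n_docs / df[j]) if total and df[j] else 0.0
            for j in range(n_terms)
        ])
    return weighted


# Hypothetical sample texts standing in for preprocessed issue data.
docs = [
    "fix crash in parser",
    "add parser option",
    "fix typo in docs",
]
vocab, counts = count_matrix(docs)
vectors = tfidf_matrix(counts)
```

Each row of `vectors` is the multi-dimensional vector for one text. Under this variant a word that occurs in every text receives weight zero, so the vectors emphasize the words that distinguish one text from another.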
CN201911338392.1A 2019-12-23 2019-12-23 Automatic open source community Fork abstract generation method, system and medium based on feature extraction Active CN111061864B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911338392.1A CN111061864B (en) 2019-12-23 2019-12-23 Automatic open source community Fork abstract generation method, system and medium based on feature extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911338392.1A CN111061864B (en) 2019-12-23 2019-12-23 Automatic open source community Fork abstract generation method, system and medium based on feature extraction

Publications (2)

Publication Number Publication Date
CN111061864A true CN111061864A (en) 2020-04-24
CN111061864B CN111061864B (en) 2022-10-18

Family

ID=70300836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911338392.1A Active CN111061864B (en) 2019-12-23 2019-12-23 Automatic open source community Fork abstract generation method, system and medium based on feature extraction

Country Status (1)

Country Link
CN (1) CN111061864B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2025126324A1 (en) * 2023-12-12 2025-06-19 ファナック株式会社 User assistance device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101541170B1 (en) * 2014-10-21 2015-08-03 (주)센솔로지 Apparatus and method for summarizing text
CN107102986A (en) * 2017-04-23 2017-08-29 四川用联信息技术有限公司 Multi-threaded keyword extraction techniques in document
CN107391542A (en) * 2017-05-16 2017-11-24 浙江工业大学 A kind of open source software community expert recommendation method based on document knowledge collection of illustrative plates
CN108459874A (en) * 2018-03-05 2018-08-28 中国人民解放军国防科技大学 Code automation summarization method integrating deep learning and natural language processing
CN108563433A (en) * 2018-03-20 2018-09-21 北京大学 A kind of device based on LSTM auto-complete codes
US20180373507A1 (en) * 2016-02-03 2018-12-27 Cocycles System for generating functionality representation, indexing, searching, componentizing, and analyzing of source code in codebases and method thereof

Also Published As

Publication number Publication date
CN111061864B (en) 2022-10-18

Similar Documents

Publication Publication Date Title
CN104699611B (en) A kind of defect information extracting method that pattern is changed based on open source software defect code
CN109492106B (en) An automatic classification method of defect causes combined with text codes
CN116703328B (en) Project review method and system
CN111274817A (en) An intelligent software cost measurement method based on natural language processing technology
CN112000802A (en) Software defect positioning method based on similarity integration
CN105824791B (en) A kind of bibliography format checking method
CN118606425A (en) A method for extracting information from bidding documents based on large language model
CN114860873A (en) Method, device and storage medium for generating text abstract
CN118313348A (en) Document format typesetting method, device, computer equipment, storage medium and product
Nikiforova et al. User-Oriented Approach to Data Quality Evaluation.
CN104750484B (en) A kind of code abstraction generating method based on maximum entropy model
CN116541071A (en) A Hint-Based Learning Approach to Application Programming Interface Transfer
CN114118098B (en) Contract review method, device and storage medium based on element extraction
CN112036841A (en) Policy analysis system and method based on intelligent semantic recognition
CN118627471B (en) Automatic government affair data labeling method and system based on dependency attention diagram convolution
Wang et al. LLM×MapReduce-V2: Entropy-Driven Convolutional Test-Time Scaling for Generating Long-Form Articles from Extremely Long Resources
CN111061864A (en) Automatic open source community Fork abstract generation method, system and medium based on feature extraction
CN119067457A (en) A green energy power industry procurement document compliance inspection method and system
Hendriks et al. Recognizing and Linking Entities in Old Dutch Text: A Case Study on VOC Notary Records.
CN117421226A (en) Defect report reconstruction method and system based on large language model
CN114281998B (en) Event labeling system construction method for multi-level labeling person based on crowdsourcing technology
DE202023105413U1 (en) System for automated text generation with error correction
CN114722101B (en) Method, device and storage medium for simplifying flow chart
CN115422078A (en) Method and device for generating description document of test function operation step
CN115760495A (en) Method and device for realizing automatic labeling of legal cases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant