CN107133079B - Automatic generation method of software semantic abstract based on problem report - Google Patents

Publication number
CN107133079B
Authority
CN
China
Prior art keywords
software
code
semantic
problem report
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710380665.3A
Other languages
Chinese (zh)
Other versions
CN107133079A (en)
Inventor
余跃
王涛
尹刚
王怀民
宋晨希
张迅辉
李志星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-05-25
Filing date
2017-05-25
Publication date
2019-12-20
Application filed by National University of Defense Technology
Priority to CN201710380665.3A
Publication of CN107133079A
Application granted
Publication of CN107133079B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/425 Lexical analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a method for automatically generating software semantic abstracts based on problem reports. The method first constructs an open-source software information database. It then obtains problem report data and code change record data from a project hosting community and stores them in the problem report data table and the code change record data table of the database. Next, it uses a regular expression to extract the problem report ids that appear in the code change records. It then matches the problem reports to software code positions. Finally, it performs semantic extraction and clustering and stores the generated semantic abstracts in the software semantic abstract data table of the database. By using a software project's problem reports and code change records, the invention automatically labels code files or code segments with semantic information, which can improve the efficiency with which developers reuse software.

Description

Automatic generation method of software semantic abstract based on problem report
Technical Field
The invention relates to the field of software development, and in particular to a method for automatically generating a software semantic abstract based on problem reports.
Background
Software reuse is an approach in which existing resources are reused during software development to avoid duplicated effort. Software reuse can greatly improve the efficiency and quality of software development. For example, the well-known mobile photo-sharing application Instagram had only 5 technical staff at the start of its development, fewer than 3 of them back-end engineers. By drawing on more than ten open-source software packages, the team built the initial version of Instagram in only 8 weeks and attracted a large number of users through the stable service it provided, which shows the effect of software reuse.
Meanwhile, the growth of open-source software provides rich resources for software reuse. Unlike commercial software, the code of open-source software is public, and other developers may choose to reuse the entire software or parts of its code. Before reusing anything, however, a developer must first understand what the software, file, or piece of code to be reused actually does. For a whole software package, developers can consult its introduction and documentation; for a fine-grained code snippet, they can follow the implementation through the comments in the snippet. For a coarse-grained code file or a large body of code, however, neither approach works well: most software documentation describes overall functionality and usage, while comments are narrowly targeted, and stitching many comments together does little to explain what a large body of code does. Reading the code manually is very inefficient, which defeats the purpose of software reuse. Semantic annotation of coarse-grained code files or code segments has therefore become an urgent problem.
Project hosting platforms for open-source software (such as GitHub and GitLab) have come to play an important role in the development and maintenance of open-source software. They provide developers with project hosting and development process management functions and greatly facilitate distributed development. A project hosting platform stores the various data generated during software development, such as developers' commit records, pull-request data, and problem report (issue) information, and these data contain a large amount of semantic information. In particular, a platform holds many issues, each with a detailed title and description. An issue usually corresponds to the addition of a feature or the repair of a defect. In a well-managed software project, the whole process of submitting, resolving, and closing an issue is highly standardized. The core convention is that, when submitting code, the developer writes the number of the issue being resolved, or the content of the issue, into the commit message. Such information makes it possible to associate the software development history (e.g., modifications of files, modules, and code) with the semantic information in the problem reports, and thereby to label the semantics of software modules.
Disclosure of Invention
In order to achieve the above purpose, the invention provides a method for automatically generating a software semantic abstract based on problem reports, which comprises the following steps:
S1. Constructing an open-source software information database, which comprises a problem report data table, a code change record data table, and a software semantic abstract data table;
S2. Acquiring problem report data and code change record data from the project hosting community, and storing them in the problem report data table and the code change record data table of the database;
S3. Using a regular expression to extract, from the code change records, the ids of the problem reports that appear in them;
S4. Matching problem reports to software code positions: the description information in a problem report is associated with a code position through the code change record data, which specifically comprises the following steps:
S401. Matching problem reports with code change records: using the #id that appears in a code change record to find the problem report with that id;
S402. Semantic information merging: merging the title and description of the problem report with the description information of the code change record, denoting the merged text as d, taking d as the original semantic information of the file or code segment f, and writing it into the software semantic abstract data table of the database;
S5. Semantic extraction and clustering: for the original semantic information d, generating several topic words or phrases with a document topic generation model, and storing the generated topic words or phrases in the software semantic abstract data table of the database as the semantic abstract of the file or code segment f.
Further, in step S1, the storage format of the problem report data table is [title, description, #id], the storage format of the code change record data table is [description, change information, change position], and the storage format of the software semantic abstract data table is [code position, original semantic information, semantic abstract].
Further, in step S2, the problem report information includes the title, description, and id of the problem report; the code change record information includes the description, change information, and change position.
Further, in step S2, the community's problem report and code change record data may be obtained through an official API or through a general-purpose web crawler.
Further, in step S2, the developer records the problem report related to a code change in the description information of the submitted code change record, in the form "close #id" or "fix #id".
Further, in step S402, the code position and the original semantic information are stored in the software semantic abstract data table of the database in the format [code position, original semantic information].
Further, in step S402, the semantic abstract information is stored in the software semantic abstract data table of the database in the format [code position, semantic abstract].
For the large amount of open-source code hosted on project hosting platforms, the invention extracts software semantics from the problem reports on the platform and automatically generates semantic abstracts of code files, which helps software developers choose resources to reuse and improves the efficiency of software reuse.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, the present invention provides a method for automatically generating a software semantic abstract based on problem reports, which includes the following steps:
S1. Construct an open-source software information database, which comprises a problem report data table, a code change record data table, and a software semantic abstract data table. The storage format of the problem report data table is [title, description, #id], the storage format of the code change record data table is [description, change information, change position], and the storage format of the software semantic abstract data table is [code position, original semantic information, semantic abstract].
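As an illustration of step S1, the three tables could be laid out as follows. This is only a sketch under assumptions: the patent specifies the three record formats but not a database engine, so SQLite is used here for convenience, and every table and column name is hypothetical.

```python
import sqlite3

# Hypothetical table and column names. The patent only gives the three
# logical record formats, not a concrete schema or database engine.
SCHEMA = """
CREATE TABLE IF NOT EXISTS problem_report (
    id          INTEGER PRIMARY KEY,   -- the "#id" of the problem report
    title       TEXT NOT NULL,
    description TEXT
);
CREATE TABLE IF NOT EXISTS code_change_record (
    record_id       INTEGER PRIMARY KEY AUTOINCREMENT,
    description     TEXT,              -- e.g. the commit message
    change_info     TEXT,              -- e.g. the diff
    change_position TEXT               -- e.g. a file path
);
CREATE TABLE IF NOT EXISTS semantic_abstract (
    code_position     TEXT,
    original_semantic TEXT,            -- merged text d (step S402)
    semantic_abstract TEXT             -- topic words or phrases (step S5)
);
"""


def create_database(path=":memory:"):
    """Create the three tables of step S1 and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Calling `create_database("oss_info.db")` (an assumed file name) would create a persistent database; the default in-memory database is convenient for experiments.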
S2. Acquire the problem report data and code change record data from the project hosting community and store them in the problem report data table and the code change record data table of the database. The community's problem report and code change record data can be obtained through an official API or through a general-purpose web crawler. The problem report information includes the title, description, and id of the problem report; the code change record information includes the description, change information, and change position. The developer records the problem report related to a code change in the description information of the submitted code change record, in the form "close #id" or "fix #id".
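To make the record formats of step S2 concrete, the sketch below maps already-downloaded JSON objects to rows. It assumes payloads shaped like those returned by the GitHub REST API (issue objects with `number`, `title`, and `body`; commit-detail objects with `commit.message` and a `files` list carrying `filename` and `patch`). The function names are hypothetical, and the network request itself is deliberately left out.

```python
def issue_to_row(issue):
    """Map one issue object to the [title, description, #id] format."""
    # "body" can be null in the API response, so fall back to an empty string.
    return (issue["title"], issue.get("body") or "", issue["number"])


def commit_to_rows(commit):
    """Map one commit-detail object to rows in the
    [description, change information, change position] format,
    producing one row per changed file."""
    message = commit["commit"]["message"]
    return [
        # "patch" may be absent (for example for binary files).
        (message, changed.get("patch", ""), changed["filename"])
        for changed in commit.get("files", [])
    ]
```

Using the file path as the change position is one possible choice; a finer-grained position such as a line range would have to be derived from the patch text.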
S3. Use a regular expression to extract, from the code change records, the ids of the problem reports that appear in them.
Taking the Rails project on the project hosting platform GitHub as an example, the description information of one of the Rails code change records is as follows: "Make time travel work with subclass of Time/Date/DateTime. Closes #27614. Previously, when calling 'now' on a subclass of e.g. 'Time', it would return an instance of 'Time' instead of returning an instance of the subclass. This way, we always return the correct class". Regular expression matching extracts #27614, the number of the problem report associated with this code change record.
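The extraction in step S3 can be sketched with a regular expression such as the one below. The exact pattern is an assumption: the patent names only the forms "close #id" and "fix #id", and this sketch also accepts common inflections such as "closes" and "fixed". The function name is hypothetical.

```python
import re

# Matches "close", "closes", "closed", "fix", "fixes", "fixed"
# (case-insensitive), followed by "#" and a number. Optional whitespace and
# an optional colon are tolerated between the keyword and the "#".
ISSUE_REF = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?)\s*:?\s*#\s*(\d+)", re.IGNORECASE
)


def extract_issue_ids(message):
    """Return the referenced problem report ids in order of first appearance."""
    ids = []
    for match in ISSUE_REF.finditer(message):
        issue_id = int(match.group(1))
        if issue_id not in ids:
            ids.append(issue_id)
    return ids


message = "Make time travel work with subclass of Time/Date/DateTime. Closes #27614."
print(extract_issue_ids(message))  # [27614]
```

The leading `\b` keeps words such as "prefix" from being read as the keyword "fix".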
S4. Match problem reports to software code positions: the description information in a problem report is associated with a code position through the code change record data, which specifically comprises the following steps:
S401. Matching problem reports with code change records: the #id appearing in a code change record is used to find the problem report with that id. The code position and the original semantic information are stored in the software semantic abstract data table of the database in the format [code position, original semantic information].
The implementation of step S401 is explained below with a specific example. In step S3, the related problem report #27614 was extracted from a code change record. In step S401, the problem report numbered 27614 is first found among the problem reports of the Rails project; its title is "Time tracking using Time aids pages of times of the invention". The title and description of that problem report are then merged with the description information of the code change record to form the original semantic information, which is written into the software semantic abstract data table of the database.
S402. Semantic information merging: the title and description of the problem report are merged with the description information of the code change record. The merged text is denoted d, taken as the original semantic information of the file or code segment f, and written into the software semantic abstract data table of the database. The semantic abstract information is stored in the software semantic abstract data table of the database in the format [code position, semantic abstract].
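Steps S401 and S402 amount to a lookup followed by a text merge. The sketch below works on in-memory data rather than database tables, uses a simplified `#id` pattern, and joins the texts with newlines; all of these are assumptions, as are the function name and the sample data.

```python
import re
from collections import defaultdict

ISSUE_REF = re.compile(r"#\s*(\d+)")  # simplified stand-in for step S3


def build_original_semantics(issues, change_records):
    """issues: {id: (title, description)}
    change_records: iterable of (description, change_info, change_position)
    Returns {code position f: merged original semantic information d}."""
    merged = defaultdict(list)
    for description, _change_info, position in change_records:
        parts = []
        for issue_id in map(int, ISSUE_REF.findall(description)):
            if issue_id in issues:  # S401: find the referenced problem report
                title, body = issues[issue_id]
                parts.extend([title, body])
        if parts:  # S402: merge problem report text with the commit text
            parts.append(description)
            merged[position].append("\n".join(p for p in parts if p))
    return {position: "\n".join(texts) for position, texts in merged.items()}


# Made-up sample data for illustration only.
issues = {7: ("Login fails", "Users cannot log in")}
records = [
    ("Fix session check. Fixes #7", "diff text", "app/session.py"),
    ("Tidy README", "diff text", "README.md"),
]
print(build_original_semantics(issues, records))
```

Change records that reference no known problem report are skipped here, so `README.md` does not appear in the result. Whether such records should still contribute their own description is a design choice the patent does not address.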
S5. Semantic extraction and clustering: for the original semantic information d, a document topic generation model is used to generate several topic words or phrases, which are stored in the software semantic abstract data table of the database as the semantic abstract of the file or code segment f. The document topic generation model may be implemented with the open-source Java tool JGibbLDA.
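The patent names JGibbLDA, a Java tool, for step S5. Purely as an illustration of what such a topic model computes, the following is a small Python stand-in that runs collapsed Gibbs sampling for Latent Dirichlet Allocation (LDA). It is not the patent's implementation: the tokenizer, the hyperparameters `alpha` and `beta`, the iteration count, and the function name are all assumptions, and no stop-word removal is performed.

```python
import random
import re


def lda_top_words(docs, num_topics=2, top_n=3, iters=200,
                  alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.
    Returns (top words per topic, dominant topic index per document)."""
    rng = random.Random(seed)
    tokens = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted({w for doc in tokens for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    size = len(vocab)
    doc_topic = [[0] * num_topics for _ in tokens]        # counts per doc/topic
    topic_word = [[0] * size for _ in range(num_topics)]  # counts per topic/word
    topic_total = [0] * num_topics                        # tokens per topic
    assign = []
    for d, doc in enumerate(tokens):                      # random initialisation
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][index[w]] += 1
            topic_total[k] += 1
        assign.append(z_d)
    for _ in range(iters):
        for d, doc in enumerate(tokens):
            for i, w in enumerate(doc):
                k, wi = assign[d][i], index[w]
                # Remove the current assignment before resampling this token.
                doc_topic[d][k] -= 1
                topic_word[k][wi] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][wi] + beta)
                    / (topic_total[t] + size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wi] += 1
                topic_total[k] += 1
    topics = [
        [vocab[i] for i in sorted(range(size), key=lambda i: -topic_word[t][i])[:top_n]]
        for t in range(num_topics)
    ]
    dominant = [max(range(num_topics), key=row.__getitem__) for row in doc_topic]
    return topics, dominant
```

Here each merged text d is treated as one document. One way to obtain a semantic abstract for a file f would be to take the top words of the dominant topic of its d, for example `topics[dominant[i]]` for the i-th document; this selection rule is an assumption, since the patent does not spell it out.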
In summary, the invention can automatically label semantic information for the code file or code segment of the software through the problem report and code change record of the software, thereby improving the software reuse efficiency of developers.
It is noted that, herein, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. A method for automatically generating a software semantic abstract based on problem reports, characterized by comprising the following steps:
S1. constructing an open-source software information database, which comprises a problem report data table, a code change record data table, and a software semantic abstract data table;
S2. acquiring problem report data and code change record data from the project hosting community, and storing them in the problem report data table and the code change record data table of the database;
S3. using a regular expression to extract, from the code change records, the ids of the problem reports that appear in them;
S4. matching problem reports to software code positions: associating the description information in a problem report with a code position through the code change record data, which specifically comprises the following steps:
S401. matching problem reports with code change records: using the #id that appears in a code change record to find the problem report with that id;
S402. semantic information merging: merging the title and description of the problem report with the description information of the code change record, denoting the merged text as d, taking d as the original semantic information of the file or code segment f, and writing it into the software semantic abstract data table of the database;
S5. semantic extraction and clustering: for the original semantic information d, generating several topic words or phrases with a document topic generation model, and storing the generated topic words or phrases in the software semantic abstract data table of the database as the semantic abstract of the file or code segment f.
2. The method according to claim 1, wherein in step S1, the storage format of the problem report data table is [title, description, #id], the storage format of the code change record data table is [description, change information, change position], and the storage format of the software semantic abstract data table is [code position, original semantic information, semantic abstract].
3. The method of claim 1, wherein in step S2, the problem report information includes the title, description, and id of the problem report; the code change record information includes the description, change information, and change position.
4. The method of claim 1, wherein in step S2, the community's problem report and code change record data are obtained through an official API or through a general-purpose web crawler.
5. The method of claim 1, wherein in step S2, the developer records the problem report related to a code change in the description information of the submitted code change record, in the form "close #id" or "fix #id".
6. The method of claim 1, wherein in step S402, the code position and the original semantic information are stored in the software semantic abstract data table of the database in the format [code position, original semantic information].
7. The method of claim 1, wherein in step S402, the semantic abstract information is stored in the software semantic abstract data table of the database in the format [code position, semantic abstract].
CN201710380665.3A 2017-05-25 2017-05-25 Automatic generation method of software semantic abstract based on problem report Active CN107133079B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710380665.3A CN107133079B (en) 2017-05-25 2017-05-25 Automatic generation method of software semantic abstract based on problem report

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710380665.3A CN107133079B (en) 2017-05-25 2017-05-25 Automatic generation method of software semantic abstract based on problem report

Publications (2)

Publication Number Publication Date
CN107133079A CN107133079A (en) 2017-09-05
CN107133079B CN107133079B (en) 2019-12-20

Family

ID=59732885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710380665.3A Active CN107133079B (en) 2017-05-25 2017-05-25 Automatic generation method of software semantic abstract based on problem report

Country Status (1)

Country Link
CN (1) CN107133079B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108459874B (en) * 2018-03-05 2021-03-26 中国人民解放军国防科技大学 Code automatic summarization method integrating deep learning and natural language processing
CN108491459B (en) * 2018-03-05 2021-10-26 中国人民解放军国防科技大学 Optimization method for software code abstract automatic generation model
CN108519890B (en) * 2018-04-08 2021-07-20 武汉大学 Robust code abstract generation method based on self-attention mechanism
CN109857648B (en) * 2019-01-14 2021-12-28 复旦大学 API misuse change pattern mining method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104991858A (en) * 2015-06-12 2015-10-21 扬州大学 Method for automatically generating outline and label for code modification
CN106202203A (en) * 2016-06-23 2016-12-07 扬州大学 The method for building up of bug knowledge base based on lifelong topic model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"How Should We Read and Analyze Bug Reports: An Interactive Visualization using Extractive Summaries and Topic Evolution"; Shamima Yeasmin et al.; CASCON '15: Proceedings of the 25th Annual International Conference on Computer Science and Software Engineering; 2015-11-04; pp. 171-180 *
"RCLinker: Automated Linking of Issue Reports and Commits Leveraging Rich Contextual Information"; Tien-Duy B. Le et al.; ICPC '15: Proceedings of the 2015 IEEE 23rd International Conference on Program Comprehension; 2015-05-24; pp. 36-47 *
"LDA-based method for automatically generating topic summaries of software code" (基于LDA的软件代码主题摘要自动生成方法); 李文鹏 et al.; Computer Science (计算机科学); 2017-04-30; Vol. 44, No. 4; pp. 35-38 *

Also Published As

Publication number Publication date
CN107133079A (en) 2017-09-05

Similar Documents

Publication Publication Date Title
CN107133079B (en) Automatic generation method of software semantic abstract based on problem report
AU2019263758B2 (en) Systems and methods for generating a contextually and conversationally correct response to a query
US7571092B1 (en) Method and apparatus for on-demand localization of files
AU2016210590B2 (en) Method and System for Entity Relationship Model Generation
US7636656B1 (en) Method and apparatus for synthesizing multiple localizable formats into a canonical format
US9965472B2 (en) Content revision using question and answer generation
CN107968959B (en) Knowledge point segmentation method for teaching video
CN103198828B (en) The construction method of speech corpus and system
Shen et al. On automatic summarization of what and why information in source code changes
CN111651198B (en) Automatic code abstract generation method and device
US20130066818A1 (en) Automatic Crowd Sourcing for Machine Learning in Information Extraction
CN110232177B (en) Bidding document generation system and method in government field
CN101833555B (en) Information extracting method and device
Ockeloen et al. BiographyNet: Managing Provenance at Multiple Levels and from Different Perspectives.
WO2022226716A1 (en) Deep learning-based java program internal annotation generation method and system
CN103778200A (en) Method for extracting information source of message and system thereof
Kelley et al. A framework for creating knowledge graphs of scientific software metadata
CN103455589A (en) Product data migration method, device and system in product factory pattern
CN117272982A (en) Protocol text detection method and device based on large language model
CN117112767A (en) Question and answer result generation method, commercial query big model training method and device
CA3104292A1 (en) Systems and methods for identifying and linking events in structured proceedings
CN111488737B (en) Text recognition method, device and equipment
CN114491209A (en) Method and system for mining enterprise business label based on internet information capture
Grūzītis Multilayer corpus and toolchain for Full-Stack NLU in Latvian
CN110515653A (en) Document structure tree method, apparatus, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant