CN113064628B - Traceable and verifiable software engineering data archiving method - Google Patents


Info

Publication number
CN113064628B
Authority
CN
China
Prior art keywords
data
file
script
data unit
unit
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110367226.5A
Other languages
Chinese (zh)
Other versions
CN113064628A (en)
Inventor
朱家鑫
陈伟
吴国全
窦文生
魏峻
叶丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-04-06
Filing date
2021-04-06
Publication date
2022-06-17
Application filed by Institute of Software of CAS
Priority to CN202110367226.5A
Publication of CN113064628A
Application granted
Publication of CN113064628B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Software Systems (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • General Business, Economics & Management (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Human Computer Interaction (AREA)
  • Stored Programmes (AREA)

Abstract

The invention provides a traceable and verifiable software engineering data archiving method in the field of software engineering data. The method organizes software engineering data into data units and data volumes, stores the derivation relationships among data units together with the environment construction scripts and data derivative scripts, and provides a mechanism for automatic data re-derivation and consistency verification. The invention enables automatic tracing and verification of software engineering data both before and after archiving, and helps developers and researchers obtain highly reliable software engineering data.

Description

Traceable and verifiable software engineering data archiving method
Technical Field
The invention relates to the field of software engineering data, in particular to a traceable and verifiable software engineering data archiving method.
Background
The software engineering data addressed by the invention is the data generated by supporting tools during software development, operation and maintenance, such as version control data and defect tracking data. This data supports a wide range of software engineering research and helps improve the efficiency of software development, operation and maintenance as well as the quality of software products.
Software engineering data is diverse, the context in which it is generated is complex, and many data processing steps are opaque. As a result, data users often misunderstand the data, which in turn undermines the validity of the analysis results based on it.
Many software engineering data sharing projects have appeared, such as the GHTorrent project (https://ghtorrent.org/), which shares GitHub data, and the comprehensive Promise project (http://promise.site.uottawa.ca/SERepository/). Existing software engineering data sharing projects focus mainly on data availability, that is, uploading, storing, retrieving and downloading data. They have not established a mechanism for tracing and verifying data, in particular an automatic one, and therefore cannot prevent data misuse and data quality problems.
Disclosure of Invention
Aiming at the problem that the existing software engineering data sharing project does not have a mechanism for avoiding data misuse and data quality problems, the invention provides a traceable and verifiable software engineering data archiving method.
To this end, the invention adopts the following technical solution:
A traceable and verifiable software engineering data archiving method comprises the following steps: creating data units and data volumes, and archiving the software engineering data according to the data units and data volumes. The data unit is the minimum unit of data archiving and comprises five types of files: a data unit description file, data files, data unit document files, an environment file and script files. The data volume is a data set produced for a particular data use requirement; it comprises two types of files, namely a data volume description file and data volume document files, and references the data units it contains through the data unit index numbers in its data volume description file; wherein:
each data unit description file contains 14 fields: index number, name, complete description, short description, author, version number, creation time, license, data source type, data source index, environment index, script entry index, previous version number, next version number; the short description is a summary of the complete description, used by data users for quick browsing and retrieval; the data source type is either original data or data unit, where original data is data directly generated and stored by a software development tool; the data source index is a URL when the data source type is original data, and a data unit index number when the data source type is data unit; the environment index is the relative path of the environment file; the script entry index is the relative path of the script entry file;
the data file stores the main data;
the data unit document file describes the background, data format, usage and usage examples of the main data of the data unit;
the environment file describes the configuration of the environment and the environment construction steps;
the script files fall into four classes: environment construction script files, an environment construction script entry file, data derivative script files and a data derivative script entry file;
each data volume description file contains 11 fields: index number, name, complete description, short description, author, version number, creation time, license, data unit index number, previous version number, next version number; the short description is a summary of the complete description, used by data users for quick browsing and retrieval;
the data volume document file comprehensively and systematically describes the application problem to be solved, the data processing flow and the processing results.
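As a minimal sketch of the data unit description file, the 14 fields listed above can be modelled as a Python dataclass and serialized to JSON. The field names, the use of a list for the data source index, the string values used for the data source type and all example values are assumptions made for illustration; the text only fixes which fields exist and what they mean.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import List, Optional


@dataclass
class DataUnitDescription:
    """The 14 fields of a data unit description file (field names are assumed)."""
    index_number: str
    name: str
    complete_description: str
    short_description: str
    author: str
    version_number: str
    creation_time: str
    license: str
    data_source_type: str          # assumed encoding: "original" or "data_unit"
    data_source_index: List[str]   # URLs for original data, index numbers otherwise
    environment_index: str         # relative path of the environment file
    script_entry_index: str        # relative path of the script entry file
    previous_version_number: Optional[str] = None
    next_version_number: Optional[str] = None

    def __post_init__(self) -> None:
        if self.data_source_type not in ("original", "data_unit"):
            raise ValueError(f"unknown data source type: {self.data_source_type}")
        if self.data_source_type == "original":
            for source in self.data_source_index:
                if not source.startswith(("http://", "https://")):
                    raise ValueError(f"original data must be indexed by URL: {source}")


# Illustrative values only; none of them are taken from the patent text.
unit_a = DataUnitDescription(
    index_number="A",
    name="Rails issues",
    complete_description="Issue data of the Rails project collected from GitHub.",
    short_description="Rails issue data",
    author="participant 1",
    version_number="1.0",
    creation_time="2021-01-01T00:00:00Z",
    license="CC-BY-4.0",
    data_source_type="original",
    data_source_index=["https://github.com/rails/rails"],
    environment_index="Dockerfile",
    script_entry_index="scripts/entry.sh",
)
description_json = json.dumps(asdict(unit_a), indent=2)
```

Validating the data source index at construction time means a description that claims original data but points at a data unit index number is rejected before it is written to disk.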
Furthermore, one data unit has one and only one data unit description file; one data unit has one or more data files; one data unit has one or more data unit document files; one data unit has one and only one environment file; one data unit has one or more environment construction script files and only one environment construction script entry file; one data unit has one or more data derivative script files and only one data derivative script entry file; one data volume has one and only one data volume description file; one data volume has one or more data volume document files.
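The cardinality constraints above can be checked mechanically. The following sketch assumes that the files of a data unit or data volume have already been grouped by role; the role names and the `cardinality_errors` helper are hypothetical and not part of the described method.

```python
from typing import Dict, List, Optional, Tuple

# (minimum, maximum) number of files per role; None means "no upper bound".
Rules = Dict[str, Tuple[int, Optional[int]]]

DATA_UNIT_RULES: Rules = {
    "description": (1, 1),
    "data": (1, None),
    "document": (1, None),
    "environment": (1, 1),
    "env_script": (1, None),
    "env_script_entry": (1, 1),
    "derive_script": (1, None),
    "derive_script_entry": (1, 1),
}

DATA_VOLUME_RULES: Rules = {
    "description": (1, 1),
    "document": (1, None),
}


def cardinality_errors(files_by_role: Dict[str, List[str]], rules: Rules) -> List[str]:
    """Return one message per violated constraint; an empty list means valid."""
    errors = []
    for role, (low, high) in rules.items():
        count = len(files_by_role.get(role, []))
        if count < low or (high is not None and count > high):
            bound = f"exactly {low}" if low == high else f"at least {low}"
            errors.append(f"{role}: expected {bound}, found {count}")
    return errors
```

Returning a list of messages rather than a boolean lets an archiving tool report every violation at once.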
Further, the method for creating the data unit and the data volume comprises the following steps: filling in the data unit description file and the data volume description file; adding the data files, data unit document files, environment file and script files to form a data unit; and adding the data volume document files to form the data volume.
Furthermore, data tracing verification is performed before the software engineering data is archived or used. For a data unit, the verification comprises: executing the environment construction script entry file according to the script entry index in the data unit description file to complete environment construction; executing the data derivative script entry file according to the script entry index in the data unit description file; acquiring the upstream data according to the data source index in the data unit description file and processing it to regenerate the data; and comparing the digital fingerprint of the regenerated data file with the digital fingerprint of the original data file. If the fingerprints are consistent, the tracing verification passes; otherwise it fails. For a data volume, each referenced data unit is verified separately according to the foregoing steps.
Further, when data tracing is performed, a source data unit is searched iteratively according to a data source index in the data unit description file until the data source type is original data.
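The iterative search can be sketched as a walk over description records that stops at units whose data source type is original data. The record layout (a dictionary keyed by index number, with a list-valued data source index) and the `trace_to_origin` name are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def trace_to_origin(index_number: str,
                    descriptions: Dict[str, dict]) -> Tuple[List[str], List[str]]:
    """Follow data source indexes upstream until original data is reached.

    Returns the data units visited (in visiting order) and the URLs of the
    original data sources they ultimately derive from.
    """
    visited: List[str] = []
    origins: List[str] = []
    pending = [index_number]
    while pending:
        current = pending.pop()
        if current in visited:          # also guards against cyclic references
            continue
        visited.append(current)
        description = descriptions[current]   # KeyError: upstream unit is missing
        if description["data_source_type"] == "original":
            for url in description["data_source_index"]:
                if url not in origins:
                    origins.append(url)
        else:
            pending.extend(description["data_source_index"])
    return visited, origins
```

A missing upstream unit surfaces as a `KeyError`, which is one way to detect a data unit that cannot be traced back to original data.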
Further, when the data unit and the data volume are retrieved, the retrieval is carried out through an index number, a complete description or a short description.
Furthermore, one data unit is stored in one folder and named by the index number of the data unit; storing a data volume in a folder, and naming the data volume by the index number of the data volume; the folder is compressed into a package.
Further, the environment file in the data unit is a Dockerfile in Docker technology.
Further, the environment construction script entry file is an executable script file written as a Linux shell script; it calls the environment construction script files, which can be written in any scripting language.
Further, the data derivative script entry file is an executable script file written as a Linux shell script; it calls the data derivative script files, which can be written in any scripting language.
Compared with the prior art, the invention has the following advantages:
The invention organizes software engineering data in a standardized two-level structure (data units and data volumes), which reduces data redundancy and improves the efficiency of data use. The archived data units are linked by derivation relationships, so that the original data can always be traced, which helps data users accurately understand the whole process by which the data was generated. Each archived data unit carries a complete derivation environment and scripts, so that the correctness of the data can be verified automatically. The invention uses container technology to provide the data derivation environment and digital fingerprint technology to verify the data.
Drawings
Fig. 1 is a flowchart of an implementation of a traceable and verifiable software engineering data archiving method according to an embodiment.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a schematic diagram illustrating archiving, retrieving and using software engineering data according to a traceable and verifiable software engineering data archiving method of the present invention.
A data unit or a data volume is stored in a folder named by the index number of the data unit or data volume, and the folder is compressed into a package, preferably in zip format. Data units and data volumes may use JSON to describe their components. When data units and data volumes are organized, each component is placed under the folder of the data unit or data volume according to the content of the description file.
The environment file in the data unit is a container build file in container technology, preferably a Dockerfile in Docker technology. The environment construction script entry file is an executable script file, preferably written as a Linux shell script; it calls the environment construction script files, which can be written in any scripting language. The environment construction is completed based on the Dockerfile.
The data derivative script entry file is an executable script file, preferably written as a Linux shell script; it calls the data derivative script files, which can be written in any scripting language. The data derivative script files are executed in the built environment.
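A minimal sketch of the packaging step, using Python's standard `zipfile` module: the folder is assumed to be already named after the index number, and the `package_folder` helper and the output directory are hypothetical.

```python
import zipfile
from pathlib import Path


def package_folder(folder: Path, out_dir: Path) -> Path:
    """Compress a data unit or data volume folder into <index number>.zip."""
    archive = out_dir / f"{folder.name}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as package:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # Keep the index-number folder as the top-level entry in the package.
                package.write(path, path.relative_to(folder.parent).as_posix())
    return archive
```

Keeping the folder name inside the package means that unpacking several packages side by side reproduces one folder per index number.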
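To illustrate how the two stages fit together when Docker is used, the sketch below only assembles the command lines that an environment construction entry script and a derivation run might issue; it does not execute them. The image naming scheme, the default file names and the `derivation_commands` helper are assumptions. `docker build -f/-t` and `docker run --rm` are standard Docker CLI options.

```python
from typing import List


def derivation_commands(unit_dir: str,
                        index_number: str,
                        environment_index: str = "Dockerfile",
                        derive_entry: str = "derive.sh") -> List[List[str]]:
    """Return the build command and the run command for one data unit."""
    image = f"data-unit-{index_number.lower()}"   # Docker image names must be lowercase
    build = ["docker", "build",
             "-f", f"{unit_dir}/{environment_index}",   # environment file (Dockerfile)
             "-t", image,
             unit_dir]                                   # build context: the unit folder
    run = ["docker", "run", "--rm", image,
           "sh", derive_entry]                           # entry script runs inside the container
    return [build, run]
```

The returned lists can be passed to `subprocess.run(command, check=True)` on a machine where Docker is installed; whether the derive entry script is reachable inside the image depends on how the Dockerfile copies files, which this sketch does not cover.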
When data tracing is carried out, a source data unit is searched iteratively according to a data source index in the data unit description file until the type of the data source is original data.
When data verification is carried out, the environment construction script is first executed to build the data derivation environment; the data derivative script is then executed to regenerate the data; finally, the fingerprints of the regenerated data files are compared with the fingerprints of the existing data files. Preferably, the MD5 algorithm is used to calculate the fingerprints of the data files. If the fingerprints are consistent, the verification succeeds; otherwise the verification fails.
Only data units that can be traced back to the original data and that pass the verification can be archived; all others are rejected. Data volumes can only reference archived data units.
Tracing and verification can be performed again before the data is used, to ensure its credibility.
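The fingerprint comparison can be sketched with Python's standard `hashlib` module. The sketch assumes that the archived data files and the regenerated data files live in two folders with the same relative layout; the function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict


def md5_fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprints(folder: Path) -> Dict[str, str]:
    """Map each file's path relative to the folder to its fingerprint."""
    return {path.relative_to(folder).as_posix(): md5_fingerprint(path)
            for path in sorted(folder.rglob("*")) if path.is_file()}


def verification_passes(archived_dir: Path, regenerated_dir: Path) -> bool:
    """Pass only if both folders hold the same files with identical fingerprints."""
    return fingerprints(archived_dir) == fingerprints(regenerated_dir)
```

MD5 is used here because it is the algorithm the description names as preferred; `hashlib.sha256` can be substituted in `md5_fingerprint` if resistance to deliberate collisions is required.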
The following specific example demonstrates the method of the present invention and comprises the following steps:
1) selecting 5 researchers with software engineering development and research experience, and introducing the composition and use of data units and data volumes to them;
2) selecting GitHub as the data source;
3) having the 5 persons selected in step 1) carry out, over several rounds and according to their past research needs, the writing and organizing of data unit description files, script files and other files to produce and archive data units, and integrate data units into data volumes and archive them;
4) tracing and verifying the archived data units: executing the environment construction script to build the data derivation environment, then executing the data derivative script to regenerate the data, and comparing whether the fingerprints of the regenerated data files and the existing data files are consistent;
5) tracing and verifying the archived data volume: tracing and verifying each of the data units referenced by the data volume.
The experimental results are as follows:
First round of data production:
Participant 1 creates data unit A, whose main data is all issue data of the Rails project on GitHub; participant 2 creates data unit B, whose main data is all pull-request data of the Rails project on GitHub; participant 3 creates data unit C, whose main data is the issue data of the jQuery project on GitHub; participant 4 creates data unit D, whose main data is the commit data of the Rails project on GitHub; participant 5 creates data unit E, whose main data is the issue data of the RxJava project on GitHub.
Second round of data production:
Participant 1 creates data units F, G and H based on data units A, C and E respectively; the main data of each is the issue data of the corresponding project with category labels added.
Participant 2 creates data unit I based on data units A, B and D; its main data fuses the issue, pull-request and commit data of the Rails project by associating them.
Third round of data production:
Participant 1 produces data volume alpha, which references data units F, G and H.
The derivation relationships between the data source and the data units, consistent with the production rounds described above, are as follows:
GitHub→A→F
GitHub→C→G
GitHub→E→H
GitHub→(A, B, D)→I
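As a worked example, the derivation relationships can be written down as an upstream map and expanded into full lineage paths. The map follows the production rounds as described in the text (F, G and H derived from A, C and E; I derived from A, B and D); the `UPSTREAM` table and the `lineage_paths` helper are illustrative only.

```python
from typing import Dict, List

# Upstream sources of each data unit; "GitHub" stands for the original data source.
UPSTREAM: Dict[str, List[str]] = {
    "A": ["GitHub"], "B": ["GitHub"], "C": ["GitHub"], "D": ["GitHub"], "E": ["GitHub"],
    "F": ["A"], "G": ["C"], "H": ["E"],
    "I": ["A", "B", "D"],
}


def lineage_paths(unit: str) -> List[List[str]]:
    """Return every path from an original data source down to the given unit."""
    sources = UPSTREAM.get(unit)
    if sources is None:                 # not a data unit: an original data source
        return [[unit]]
    return [path + [unit] for source in sources for path in lineage_paths(source)]


# Lineage of the data units referenced by data volume alpha.
alpha_lineage = {unit: lineage_paths(unit) for unit in ("F", "G", "H")}
```

Here `lineage_paths("I")` yields three paths, one per upstream unit, which is the form in which a data user would inspect how a fused data unit was produced.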
The experimental results show that the method ensures that archived data can be traced back to the source of data generation, namely a software development tool, and that the data can be verified by reproducing the derivation process.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (10)

1. A traceable and verifiable software engineering data archiving method, characterized by comprising the following steps: creating data units and data volumes, and archiving the software engineering data according to the data units and data volumes; the data unit is the minimum unit of data archiving and comprises five types of files, namely a data unit description file, a data file, a data unit document file, an environment file and a script file; the data volume is a data set produced for a data use requirement and comprises two types of files, namely a data volume description file and a data volume document file; the data volume references the data units it contains through the data unit index numbers in the data volume description file; wherein:
each data unit description file includes the following fields: index number, name, complete description, short description, author, version number, creation time, license, data source type, data source index, environment index, script entry index, previous version number, and next version number; the short description is a summary of the complete description, used by data users for quick browsing and retrieval; the data source type is either original data or data unit, where original data is data directly generated and stored by a software development tool; the data source index is a URL when the data source type is original data, and a data unit index number when the data source type is data unit; the environment index is the relative path of the environment file; the script entry index is the relative path of the script entry file;
the data file is used for storing main data;
the data unit document file is used for describing relevant backgrounds, data formats, using methods and using examples of data unit main body data;
the environment file is used for describing the configuration of the environment and the environment construction step;
the script file includes: an environment construction script file, an environment construction script entry file, a data derivative script file and a data derivative script entry file;
each of the data volume description files includes fields having: index number, name, full description, short description, author, version number, creation time, license, data unit index number, previous version number, next version number; wherein, the short description is the abstract of the complete description and is used for browsing and searching by a data user;
the data volume document file is used for describing application problems to be solved, data processing flows and processing results.
2. The method of claim 1, wherein a data unit has only one data unit description file; a data unit has one or more data files; a data unit has one or more data unit document files; one data unit has only one environment file; one data unit has one or more environment construction script files and only one environment construction script entry file; one data unit has one or more data derivative script files and only one data derivative script entry file; a data volume has only one data volume description file; a data volume has one or more data volume document files.
3. The method of claim 1, wherein the data units and data volumes are created by: filling in a data unit description file and a data volume description file; adding a data file, a data unit document file, an environment file and a script file to form a data unit; and adding the data volume document files to form the data volume.
4. The method of claim 1, wherein before archiving or using the software engineering data, data tracing verification is required, and the verification method comprises:
for the data unit, executing the environment construction script entry file according to the script entry index in the data unit description file to complete environment construction; executing the data derivative script entry file according to the script entry index in the data unit description file; acquiring upstream data according to the data source index in the data unit description file, and performing data processing to complete data regeneration; comparing the digital fingerprint of the regenerated data file with the digital fingerprint of the original data file, wherein the tracing verification passes if the two fingerprints are consistent and fails otherwise;
for the data volume, the referenced data units are verified separately according to the previous steps.
5. The method of claim 4, wherein when performing data tracing, iteratively searching for a source data unit according to a data source index in a data unit description file until a data source type is original data.
6. The method of claim 1, wherein when retrieving data units, data volumes, retrieval is by index number, full description or short description.
7. The method of claim 1, wherein a data unit is stored in a folder named by the index number of the data unit; storing a data volume in a folder, and naming the data volume by the index number of the data volume; the folder is compressed into a package.
8. The method of claim 1, wherein the environment file in the data unit is a Dockerfile in Docker technology.
9. The method of claim 1, wherein the environment construction script entry file is an executable script file written as a Linux shell script for invoking the environment construction script file.
10. The method of claim 1, wherein the data derivative script entry file is an executable script file written as a Linux shell script for invoking the data derivative script file.
CN202110367226.5A 2021-04-06 2021-04-06 Traceable and verifiable software engineering data archiving method Active CN113064628B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110367226.5A CN113064628B (en) 2021-04-06 2021-04-06 Traceable and verifiable software engineering data archiving method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110367226.5A CN113064628B (en) 2021-04-06 2021-04-06 Traceable and verifiable software engineering data archiving method

Publications (2)

Publication Number Publication Date
CN113064628A CN113064628A (en) 2021-07-02
CN113064628B CN113064628B (en) 2022-06-17

Family

ID=76565974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110367226.5A Active CN113064628B (en) 2021-04-06 2021-04-06 Traceable and verifiable software engineering data archiving method

Country Status (1)

Country Link
CN (1) CN113064628B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040122872A1 (en) * 2002-12-20 2004-06-24 Pandya Yogendra C. System and method for electronic archival and retrieval of data
CN109690487A (en) * 2016-09-09 2019-04-26 华睿泰科技有限责任公司 System and method for executing the real-time migration of software container
US10754819B1 (en) * 2017-05-05 2020-08-25 Jpmorgan Chase Bank, N.A. Method and system for implementing an automated archiving tool
CN112527388A (en) * 2019-09-17 2021-03-19 中国科学院软件研究所 GitHub large-scale open source code-oriented quick code file tracing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
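基于需求管理工具的软件文档追溯管理 (Software document traceability management based on requirements management tools); 王定涛 et al.; 《科技风》; 2017-04-30 (No. 08); pp. 276-277 *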

Also Published As

Publication number Publication date
CN113064628A (en) 2021-07-02

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant