CN113064628B - Traceable and verifiable software engineering data archiving method

Info

Publication number: CN113064628B
Application number: CN202110367226.5A
Authority: CN
Inventors: 朱家鑫; 陈伟; 吴国全; 窦文生; 魏峻; 叶丹
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2021-04-06
Filing date: 2021-04-06
Publication date: 2022-06-17
Anticipated expiration: 2041-04-06
Also published as: CN113064628A

Abstract

The invention provides a traceable and verifiable software engineering data archiving method in the field of software engineering data. The method organizes software engineering data into data units and data volumes, stores the derivation relationships among data units together with the environment construction scripts and data derivation scripts, and provides a mechanism for automatic data re-derivation and consistency verification. The invention thereby enables automatic tracing and verification of software engineering data, both data awaiting archiving and data already archived, and helps developers and researchers obtain highly reliable software engineering data.

Description

Traceable and verifiable software engineering data archiving method

Technical Field

The invention relates to the field of software engineering data, in particular to a traceable and verifiable software engineering data archiving method.

Background

The software engineering data addressed by the invention are the various data generated by supporting tools during software development, operation and maintenance, such as version control data and defect tracking data. These data can support many kinds of software engineering research and help improve the efficiency of software development, operation and maintenance as well as the quality of software products.

Software engineering data are diverse, the context in which they are generated is complex, and many data processing steps are opaque. As a result, data users often understand the same data quite differently, which in turn affects the validity of the resulting data analyses.

Many software engineering data sharing projects have appeared, such as the GHTorrent project (https://ghtorrent.org/), which shares GitHub data, and the general-purpose PROMISE project (http://promise.site.uottawa.ca/SERepository/). Existing software engineering data sharing projects mainly address data availability, that is, uploading, storing, retrieving and downloading data. They have not established a mechanism for tracing and verifying data, in particular an automatic one, and therefore cannot prevent data misuse and data quality problems.

Disclosure of Invention

To address the lack of a mechanism in existing software engineering data sharing projects for preventing data misuse and data quality problems, the invention provides a traceable and verifiable software engineering data archiving method.

To achieve this purpose, the invention adopts the following technical solution:

A traceable and verifiable software engineering data archiving method comprises the following steps: creating data units and data volumes, and archiving the software engineering data according to the data units and data volumes. The data unit is the minimum unit of data archiving and comprises five types of files, namely a data unit description file, a data file, a data unit document file, an environment file and a script file. The data volume is a data set produced for a particular data use requirement; it comprises two types of files, namely a data volume description file and a data volume document file, and references the data units it contains through the data unit index numbers in the data volume description file; wherein:

each data unit description file contains 14 fields: index number, name, complete description, short description, author, version number, creation time, license, data source type, data source index, environment index, script entry index, previous version number, next version number; wherein the short description is a summary of the complete description and is used for quick browsing and retrieval by data users; the data source type is either the original data type or the data unit type, where original data are data directly generated and stored by a software development tool; the data source index takes one of two forms: for the original data type it is a URL, and for the data unit type it is a data unit index number; the environment index is the relative address of the environment file; the script entry index is the relative address of the script entry;

the data file is used for storing main data;

the data unit document file is used for describing the background, data format, usage and usage examples of the main data of the data unit;

the environment file is used for describing the configuration of the environment and the environment construction step;

the script files fall into four classes: environment construction script files, the environment construction script entry file, data derivation script files, and the data derivation script entry file;

each of the data volume description files contains 11 fields: index number, name, full description, short description, author, version number, creation time, license, data unit index number, previous version number, next version number; wherein, the short description is the abstract of the complete description and is used for quick browsing and retrieval by a data user;

the data volume document file is used for describing, comprehensively and systematically, the application problem to be solved, the data processing flow and the processing results.
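As an illustration only, the two description files could be serialized as JSON, which the detailed description mentions as one option. The invention fixes only the list of fields; the key names, the index number format, the sample values, and the choice to hold the data source index as a list and the script entry index as a pair of paths are all assumptions made for this sketch.

```python
import json

# Hypothetical key names for the 14 data unit description fields.
UNIT_FIELDS = [
    "index_number", "name", "complete_description", "short_description",
    "author", "version_number", "creation_time", "license",
    "data_source_type", "data_source_index", "environment_index",
    "script_entry_index", "previous_version_number", "next_version_number",
]

# Hypothetical key names for the 11 data volume description fields.
VOLUME_FIELDS = [
    "index_number", "name", "complete_description", "short_description",
    "author", "version_number", "creation_time", "license",
    "data_unit_index_numbers", "previous_version_number", "next_version_number",
]


def missing_fields(description: dict, required: list) -> list:
    """Return the required field names that are absent from a description."""
    return [field for field in required if field not in description]


# Illustrative values only; none of them come from the patent.
unit_description = {
    "index_number": "DU-0002",
    "name": "labelled-issues",
    "complete_description": "Issue data with a category label added to each issue.",
    "short_description": "Labelled issue data.",
    "author": "example author",
    "version_number": "1.0",
    "creation_time": "2021-04-06T00:00:00Z",
    "license": "CC-BY-4.0",
    "data_source_type": "data_unit",        # or "original_data"
    "data_source_index": ["DU-0001"],       # URLs when the type is original data
    "environment_index": "env/Dockerfile",  # relative address of the environment file
    "script_entry_index": {                 # relative addresses of the two entry files
        "environment": "scripts/build_env.sh",
        "derivation": "scripts/derive.sh",
    },
    "previous_version_number": None,
    "next_version_number": None,
}

print(json.dumps(unit_description, indent=2))
print(missing_fields(unit_description, UNIT_FIELDS))
```

A completeness check of this kind could run when a description file is filled in, before any tracing or verification is attempted.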

Furthermore, a data unit has one and only one data unit description file; a data unit has one or more data files; a data unit has one or more data unit document files; a data unit has one and only one environment file; a data unit has one or more environment construction script files and one and only one environment construction script entry file; a data unit has one or more data derivation script files and one and only one data derivation script entry file; a data volume has one and only one data volume description file; a data volume has one or more data volume document files.
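The file-count rules for a data unit can be checked mechanically. The following is a minimal sketch, assuming the files of a unit have already been grouped by type into a dictionary; the type keys and the file names in the example are hypothetical and are not prescribed by the invention.

```python
def check_unit_cardinality(files: dict) -> list:
    """Check the file-count rules for one data unit.

    `files` maps a file type to a list of relative paths.
    Returns a list of violations; an empty list means the unit conforms.
    """
    exactly_one = ["description", "environment", "env_script_entry", "derive_script_entry"]
    one_or_more = ["data", "document", "env_script", "derive_script"]
    problems = []
    for kind in exactly_one:
        count = len(files.get(kind, []))
        if count != 1:
            problems.append(f"{kind}: expected exactly 1, found {count}")
    for kind in one_or_more:
        count = len(files.get(kind, []))
        if count < 1:
            problems.append(f"{kind}: expected at least 1, found {count}")
    return problems


# Hypothetical layout of a conforming data unit.
example = {
    "description": ["description.json"],
    "data": ["data/issues.csv"],
    "document": ["docs/README.md"],
    "environment": ["env/Dockerfile"],
    "env_script": ["scripts/install_tools.sh"],
    "env_script_entry": ["scripts/build_env.sh"],
    "derive_script": ["scripts/label_issues.py"],
    "derive_script_entry": ["scripts/derive.sh"],
}

print(check_unit_cardinality(example))
```

The two rules for a data volume (exactly one description file, one or more document files) could be checked in the same way.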

Further, the method for creating the data unit and the data volume comprises the following steps: filling in the data unit description file and the data volume description file; adding the data files, data unit document files, environment file and script files to form the data unit; and adding the data volume document files to form the data volume.

Furthermore, data tracing verification must be performed before software engineering data are archived or used. The verification method is as follows. For a data unit: execute the environment construction script entry file according to the script entry index in the data unit description file to complete environment construction; execute the data derivation script entry file according to the script entry index in the data unit description file; acquire the upstream data according to the data source index in the data unit description file and process them to regenerate the data; then compare the digital fingerprint of the regenerated data file with the digital fingerprint of the original data file. If the fingerprints are consistent, tracing verification passes; otherwise it fails. For a data volume, each referenced data unit is verified separately according to the foregoing steps.

Further, when data tracing is performed, a source data unit is searched iteratively according to a data source index in the data unit description file until the data source type is original data.

Further, data units and data volumes are retrieved by index number, complete description or short description.
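Retrieval over these three fields can be sketched as a simple filter. The sketch assumes the description files have been loaded into a list of dictionaries with hypothetical key names, and it assumes exact matching on the index number and case-insensitive substring matching on the two descriptions; the invention does not specify a matching rule.

```python
def search(catalog: list, query: str) -> list:
    """Return descriptions whose index number equals the query, or whose
    complete or short description contains it (case-insensitive)."""
    needle = query.lower()
    hits = []
    for description in catalog:
        if (description["index_number"].lower() == needle
                or needle in description["complete_description"].lower()
                or needle in description["short_description"].lower()):
            hits.append(description)
    return hits


# Hypothetical catalog containing one data unit and one data volume.
catalog = [
    {"index_number": "DU-0001",
     "complete_description": "Issue data collected from a hosted repository.",
     "short_description": "Raw issue data."},
    {"index_number": "DV-0001",
     "complete_description": "Labelled issues for a classification study.",
     "short_description": "Labelled issue set."},
]

print([d["index_number"] for d in search(catalog, "labelled")])
```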

Furthermore, one data unit is stored in one folder and named by the index number of the data unit; storing a data volume in a folder, and naming the data volume by the index number of the data volume; the folder is compressed into a package.

Further, the environment file in the data unit is a Dockerfile as used in Docker technology.

Further, the environment construction script entry file is an executable script file written as a Linux shell script; it calls the environment construction script files, which can be written in any scripting language.

Further, the data derivation script entry file is an executable script file written as a Linux shell script; it calls the data derivation script files, which can be written in any scripting language.

Compared with the prior art, the invention has the following advantages:

The invention organizes software engineering data in a standardized two-level structure (data units and data volumes), which reduces data redundancy and improves efficiency of use. Archived data units carry their derivation relationships, so the original data can always be traced, helping data users understand accurately the whole process by which the data were generated. Each archived data unit also carries a complete derivation environment and derivation scripts, so the correctness of its data can be verified automatically. The invention provides the data derivation environment through container technology and verifies data through digital fingerprints.

Drawings

Fig. 1 is a flowchart of an implementation of a traceable and verifiable software engineering data archiving method according to an embodiment.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Fig. 1 is a schematic diagram illustrating archiving, retrieving and using software engineering data according to a traceable and verifiable software engineering data archiving method of the present invention.

A data unit or a data volume is stored in a folder named by the index number of that data unit or data volume, and the folder is compressed into a package, preferably in zip format. Data units and data volumes may use JSON to describe their components. When a data unit or data volume is organized, its parts are placed under its folder according to the content of the description file.
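The packaging step can be sketched with the Python standard library. This is a minimal sketch, not the method of the invention: the file name `description.json` and the index number `DU-0001` are assumptions, and only the description file is written here, whereas a real unit would also contain data, document, environment and script files.

```python
import json
import shutil
import tempfile
import zipfile
from pathlib import Path


def package_unit(workdir: Path, description: dict) -> Path:
    """Create a folder named after the index number and compress it to zip."""
    folder = workdir / description["index_number"]
    folder.mkdir(parents=True)
    # "description.json" is an assumed name for the description file.
    (folder / "description.json").write_text(
        json.dumps(description, indent=2), encoding="utf-8"
    )
    # The archive is written next to the folder, as <index number>.zip.
    archive = shutil.make_archive(
        str(folder), "zip", root_dir=workdir, base_dir=folder.name
    )
    return Path(archive)


with tempfile.TemporaryDirectory() as tmp:
    archive = package_unit(Path(tmp), {"index_number": "DU-0001", "name": "demo"})
    with zipfile.ZipFile(archive) as zf:
        print(archive.name, zf.namelist())
```

Naming both the folder and the package after the index number means a unit can be located from a data source index or a data volume description without consulting any other catalog.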

The environment file in the data unit is a container build file of a container technology, preferably a Dockerfile as used in Docker technology. The environment construction script entry file is an executable script file, preferably written as a Linux shell script; it calls the environment construction script files, which can be written in any scripting language. The environment is constructed based on the Dockerfile.

The data derivation script entry file is an executable script file, preferably written as a Linux shell script; it calls the data derivation script files, which can be written in any scripting language. The data derivation script files are executed in the constructed environment.

When data tracing is carried out, a source data unit is searched iteratively according to a data source index in the data unit description file until the type of the data source is original data.
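The iterative search can be sketched as a worklist traversal over the description files. The sketch assumes the descriptions are held in a dictionary keyed by index number, that the data source index is stored as a list (so that a unit derived from several upstream units can be represented), and that the key names and type labels shown are used; all of these are illustrative assumptions.

```python
from collections import deque


def trace_to_origin(start: str, catalog: dict) -> tuple:
    """Walk data source indexes upstream until every branch ends in original data.

    Returns (unit index numbers in visiting order, sorted original-data URLs).
    Raises KeyError if a referenced data unit is missing from the catalog.
    """
    visited, origins = [], set()
    queue = deque([start])
    while queue:
        index_number = queue.popleft()
        if index_number in visited:
            continue  # already traced through another branch
        description = catalog[index_number]
        visited.append(index_number)
        if description["data_source_type"] == "original_data":
            origins.update(description["data_source_index"])  # URLs
        else:
            queue.extend(description["data_source_index"])    # upstream unit index numbers
    return visited, sorted(origins)


# Hypothetical catalog: DU-0003 is derived from two units that hold original data.
catalog = {
    "DU-0001": {"data_source_type": "original_data",
                "data_source_index": ["https://example.org/issues"]},
    "DU-0002": {"data_source_type": "original_data",
                "data_source_index": ["https://example.org/commits"]},
    "DU-0003": {"data_source_type": "data_unit",
                "data_source_index": ["DU-0001", "DU-0002"]},
}

print(trace_to_origin("DU-0003", catalog))
```

A missing upstream unit surfaces as a `KeyError`, which corresponds to a data unit that cannot be traced back to original data.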

When data verification is carried out, the environment construction script is first executed to build the data derivation environment, and the data derivation script is then executed to regenerate the data. The fingerprint of each regenerated data file is compared with the fingerprint of the corresponding existing data file, the fingerprints preferably being calculated with the MD5 algorithm. If the fingerprints are consistent, verification succeeds; otherwise verification fails.
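The fingerprint comparison can be sketched as follows. The `regenerate` callable is a stand-in introduced for this sketch: in a real setting it would run the environment construction script entry file and then the data derivation script entry file, writing the regenerated data files into the directory it is given. Matching files by name is also an assumption.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def md5_fingerprint(path: Path) -> str:
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprints(directory: Path) -> dict:
    """Map each file name in a directory to its MD5 fingerprint."""
    return {p.name: md5_fingerprint(p) for p in directory.iterdir() if p.is_file()}


def verify_unit(existing_dir: Path, regenerate: Callable[[Path], None]) -> bool:
    """Regenerate the data files and compare them with the existing ones."""
    with tempfile.TemporaryDirectory() as tmp:
        regenerated_dir = Path(tmp)
        regenerate(regenerated_dir)  # stand-in for environment construction + derivation
        return fingerprints(existing_dir) == fingerprints(regenerated_dir)


def demo_regenerate(out_dir: Path) -> None:
    """Hypothetical derivation that always produces the same small file."""
    (out_dir / "issues.csv").write_bytes(b"id,title\n1,demo\n")


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp)
    (existing / "issues.csv").write_bytes(b"id,title\n1,demo\n")
    print(verify_unit(existing, demo_regenerate))
```

Byte-level fingerprints only match when the derivation is deterministic, so derivation scripts that embed timestamps or depend on unordered output would fail this check even when the data are logically the same.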

Only data units that can be traced back to the original data and that pass verification can be archived; all others are rejected. Data volumes can only reference archived data units.

Tracing and verification can be performed again before the data are used, ensuring their credibility.
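The two admission rules can be sketched as a small in-memory gate. The `Archive` class, its method names and the key names are hypothetical; the invention defines the rules but no programming interface. The tracing and verification outcomes are passed in as booleans so that the sketch stays self-contained.

```python
class Archive:
    """In-memory stand-in for the archive (hypothetical interface)."""

    def __init__(self):
        self.units = {}
        self.volumes = {}

    def submit_unit(self, description: dict, traceable: bool, verified: bool) -> bool:
        """Archive a data unit only if it is traceable to original data and verified."""
        if not (traceable and verified):
            return False
        self.units[description["index_number"]] = description
        return True

    def submit_volume(self, description: dict) -> bool:
        """Archive a data volume only if every referenced data unit is archived."""
        referenced = description["data_unit_index_numbers"]
        if any(index not in self.units for index in referenced):
            return False
        self.volumes[description["index_number"]] = description
        return True


archive = Archive()
print(archive.submit_unit({"index_number": "DU-0001"}, traceable=True, verified=True))
print(archive.submit_unit({"index_number": "DU-0002"}, traceable=True, verified=False))
print(archive.submit_volume({"index_number": "DV-0001",
                             "data_unit_index_numbers": ["DU-0001", "DU-0002"]}))
```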

The following specific example demonstrates the method of the invention. It comprises the following steps:

1) selecting 5 researchers with software engineering development and research experience, and introducing the composition and use of the data unit and the data volume to the researchers;

2) selecting GitHub as a data source;

3) having the 5 participants selected in step 1), over several rounds and according to their past research needs, write and organize the various files (data unit description files, script files and so on) to produce and archive data units, and then integrate data units into data volumes and archive them;

4) tracing and verifying the archived data units: executing the environment construction script to build the data derivation environment, then executing the data derivation script to regenerate the data, and comparing whether the fingerprints of the regenerated data file and the existing data file are consistent;

5) tracing and verifying the archived data volumes: tracing and verifying each data unit referenced by a data volume.

The experimental results are as follows:

First round of data production:

Participant 1 created data unit A, whose main data is all issue data of the Rails project on GitHub; participant 2 created data unit B, whose main data is all pull-request data of the Rails project on GitHub; participant 3 created data unit C, whose main data is the issue data of the jQuery project on GitHub; participant 4 created data unit D, whose main data is the commit data of the Rails project on GitHub; participant 5 created data unit E, whose main data is the issue data of the RxJava project on GitHub.

Second round of data production:

Participant 1 created data units F, G and H based on data units A, C and E; the main data of each is the corresponding project's issue data with a category label added.

Participant 2 created data unit I based on data units A, B and D; its main data is the fused data obtained by associating the issues, pull requests and commits of the Rails project.

Third round of data production:

Participant 1 created data volume alpha, which references data units F, G and H.

The data source and data unit derivation relationships are as follows:

GitHub→A→F

GitHub→C→G

GitHub→E→H

GitHub→(A B D)→I

The experimental results show that the method ensures that archived data can be traced back to the source of data generation, namely a software development tool, and that the data can be verified by reproducing the derivation process.

The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. A traceable and verifiable software engineering data archiving method, characterized by comprising the following steps: creating data units and data volumes, and archiving the software engineering data according to the data units and data volumes; the data unit is the minimum unit of data archiving and comprises five types of files, namely a data unit description file, a data file, a data unit document file, an environment file and a script file; the data volume is a data set produced for a data use requirement and comprises two types of files, namely a data volume description file and a data volume document file; the data volume references the data units it contains through the data unit index numbers in the data volume description file; wherein:

each data unit description file includes the following fields: index number, name, complete description, short description, author, version number, creation time, license, data source type, data source index, environment index, script entry index, previous version number, and next version number; wherein the short description is a summary of the complete description and is used for quick browsing and retrieval by data users; the data source type is either the original data type or the data unit type, where original data are data directly generated and stored by a software development tool; the data source index takes one of two forms: for the original data type it is a URL, and for the data unit type it is a data unit index number; the environment index is the relative address of the environment file; the script entry index is the relative address of the script entry;

the data file is used for storing main data;

the script files include: environment construction script files, an environment construction script entry file, data derivation script files and a data derivation script entry file;

each data volume description file includes the following fields: index number, name, full description, short description, author, version number, creation time, license, data unit index number, previous version number, next version number; wherein the short description is a summary of the full description and is used for browsing and retrieval by data users;

the data volume document file is used for describing application problems to be solved, data processing flows and processing results.

2. The method of claim 1, wherein a data unit has only one data unit description file; a data unit has one or more data files; a data unit has one or more data unit document files; a data unit has only one environment file; a data unit has one or more environment construction script files and only one environment construction script entry file; a data unit has one or more data derivation script files and only one data derivation script entry file; a data volume has only one data volume description file; a data volume has one or more data volume document files.

3. The method of claim 1, wherein the data units and data volumes are created by: filling in the data unit description file and the data volume description file; adding the data files, data unit document files, environment file and script files to form the data unit; and adding the data volume document files to form the data volume.

4. The method of claim 1, wherein before archiving or using the software engineering data, data tracing verification is required, and the verification method comprises:

for the data unit, executing the environment construction script entry file according to the script entry index in the data unit description file to complete environment construction; executing the data derivation script entry file according to the script entry index in the data unit description file; acquiring upstream data according to the data source index in the data unit description file, and performing data processing to complete data regeneration; comparing the digital fingerprint of the regenerated data file with the digital fingerprint of the original data file, and if the two are consistent, passing the tracing verification, otherwise not passing the tracing verification;

for the data volume, the referenced data units are verified separately according to the previous steps.

5. The method of claim 4, wherein when performing data tracing, iteratively searching for a source data unit according to a data source index in a data unit description file until a data source type is original data.

6. The method of claim 1, wherein data units and data volumes are retrieved by index number, full description or short description.

7. The method of claim 1, wherein a data unit is stored in a folder named by the index number of the data unit; storing a data volume in a folder, and naming the data volume by the index number of the data volume; the folder is compressed into a package.

8. The method of claim 1, wherein the environment file in the data unit is a Dockerfile as used in Docker technology.

9. The method of claim 1, wherein the environment construction script entry file is an executable script file written as a Linux shell script and used for calling the environment construction script files.

10. The method of claim 1, wherein the data derivation script entry file is an executable script file written as a Linux shell script and used for calling the data derivation script files.