CN115934857A

CN115934857A - Data asset classification management and storage method suitable for engineering field

Info

Publication number: CN115934857A
Application number: CN202211584645.5A
Authority: CN
Inventors: 梁斌; 李天淇; 熊浩; 陈新喜; 吴光辉; 郭志鑫; 李赟
Original assignee: China Construction Eighth Engineering Division Co Ltd
Current assignee: China Construction Eighth Engineering Division Co Ltd
Priority date: 2022-12-09
Filing date: 2022-12-09
Publication date: 2023-04-07

Abstract

The invention discloses a data asset classification management and storage method suitable for the engineering field, comprising the following steps: determining the category of each data source; converting the data into SQL statements; using SQL semantic analysis to split each SQL statement that creates a data form into a data cataloguing document for that form, and adding source, grading, classification and scope information; converting the classification of the classified fields into data form names, in preparation for automatically cataloguing new data forms; extracting the data form fields, calculating the similarity between data fields, and classifying the similarity results; creating table-building statements according to the different classifications and building the data forms of each classification; and, after a data form is created, automatically storing its name in the corresponding data catalog and establishing the mapping between the catalog and the entity data form. The method addresses the problem of scattered and fragmented data source tables, enables management and control of data assets, and allows the value of the data to be realized.

Description

Data asset classification management and storage method suitable for engineering field

Technical Field

The invention relates to the technical field of informatization management in the engineering field, in particular to a data asset classification management and storage method suitable for the engineering field.

Background

The building industry has entered a stage of stable growth; the traditional model can no longer meet the industry's requirements for high-quality development, and transformation and upgrading are imperative. As the building industry pursues the integrated development of digitalization, informatization and intelligentization, the path of integrating the building industry with the Internet has become clear, and digital technology is expected to enable the industry's high-quality development and drive its comprehensive upgrade toward digitalization, informatization and intelligentization.

The general data classification management method in the prior art cannot meet the subdivision requirements of the current building construction industry.

Specifically, the prior art lacks a finer-grained data classification and storage method. As a result, even though data is now stored in databases, it remains scattered across business departments: employees must log in to each business system to enter information, and managers must likewise log in to different systems to review and approve it. A comprehensive master data management platform is lacking, data quality is low, effective data analysis cannot be performed, and enterprise data assets cannot be used to support managers' decisions.

Disclosure of Invention

To address these problems, the invention provides a data asset classification management and storage method suitable for the engineering field. It solves the problems of low data quality and the inability to perform effective data analysis in the building construction industry, and provides a fine-grained data classification and storage method to support analysis and decision-making.

The invention is realized by the following technical scheme:

A data asset classification management and storage method suitable for the engineering field comprises the following steps:

first, determining the categories of data sources according to different business processes, and labeling the different business processes;

second, converting the data into SQL statements that create data forms, from which the SQL statements and form names can be extracted automatically;

third, through SQL semantic analysis, splitting each SQL statement that creates a data form into a data cataloguing document for that form, and adding source, grading, classification and scope information; processing the data cataloguing document into a data form, labeling its classification and scope, and storing the standard for the content of each data field in the database as a regular expression;

fourth, after the labeling work is finished, converting the classification of the classified fields into data form names, in preparation for automatically cataloguing new data forms;

fifth, extracting the data form fields, calculating the similarity between data fields, and classifying the similarity results;

sixth, creating table-building statements according to the different classifications, and building the data forms of each classification;

and seventh, after a data form is created, automatically storing the name of the data form in the corresponding data catalog, and establishing the mapping between the catalog and the entity data form.

In an embodiment of the present invention, the categories of data sources in the first step are classified into artificial data, management data, evaluation data, process data, and result data.

In an embodiment of the present invention, converting the classification of the classified fields into data form names in the fourth step includes:

storing the classification field in XX.XX.XX.XX format, with "." used as the separator that marks the classification levels;

and converting the hierarchical naming into data table names, so that the names reflect hierarchical management.

In an embodiment of the present invention, the similarity results are classified into three types in the fifth step: reference data, homologous data, and entity data.

In an embodiment of the present invention, in the sixth step, for data forms created from reference data, the first form is selected as the entity table and built, and all other fields are associated with the fields of the first form through foreign keys; for data forms created from homologous data, tables are built in the entity manner and an additional association table is created to store the similar fields and their similarity values; for entity data, the data form is created directly.

By adopting the technical scheme, the invention has the following beneficial effects:

Building on fine-grained management of data classification, the invention enriches the dimensions along which data is subdivided and adds standardized management of data asset cataloguing. By mapping the asset cataloguing standard to database storage, it forms a distinctive technical scheme that is more finely subdivided for the engineering field and allows cataloguing and storage to be managed separately.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without creative effort.

Fig. 1 is a model architecture diagram of a data asset classification management and storage method suitable for the engineering field according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made with reference to the accompanying drawings. It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. In addition, the technical features involved in the respective embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Before proceeding, the terms used in the text are explained as follows:

SQL statement: SQL (Structured Query Language) is a database query and programming language used to access data and to query, update, and manage relational database systems; in short, SQL statements are the language used to operate on a database.

Cataloguing: the process of establishing a database connection from a client to a server, whether local or remote. Its purpose is to obtain catalog information, i.e., to generate a catalog for accessing the database. The system database directory contains a list and pointers through which DB2 (a relational database management system) can find known databases, whether they are on the local system or a remote system.

SQL semantic analysis: semantic analysis is a logical stage of the SQL parsing process. Its main task is to check context-dependent properties once the syntax is known to be correct, to judge the validity of elements such as table names, operators and types, and to detect semantic ambiguity.

Text similarity calculation: computes the semantic similarity between two short texts. The output is a real number between 0 and 1, and a larger value indicates higher semantic similarity. In this embodiment, the text similarity may be calculated using TF-IDF, with the following steps:

1. Find the keywords of the two articles using the TF-IDF algorithm;

2. Take a number of keywords (for example, 20) from each article, merge them into one set, and calculate each article's word frequency for the words in the set (relative word frequency can be used to avoid differences in article length);

3. Generate the word frequency vector of each article;

4. Calculate the cosine similarity of the two vectors; the larger the value, the more similar the two articles.
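
The four steps above can be sketched as follows. This is only a minimal illustration in pure Python, assuming whitespace tokenization and a smoothed IDF; the function names (`tokenize`, `tfidf_keywords`, `cosine_similarity`, `text_similarity`) are hypothetical and are not taken from the patent, which does not name the open-source algorithm it uses.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenization (an assumption; real data may need a proper tokenizer)."""
    return text.lower().split()


def tfidf_keywords(doc_tokens, corpus_tokens, top_k=20):
    """Steps 1-2: return up to top_k tokens of one document, ranked by TF-IDF."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus_tokens)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for tokens in corpus_tokens if term in tokens)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumed variant)
        scores[term] = (count / len(doc_tokens)) * idf
    # Sort by descending score, then alphabetically so ties are deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


def cosine_similarity(vec_a, vec_b):
    """Step 4: cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def text_similarity(text_a, text_b, corpus=None, top_k=20):
    """Steps 1-4 combined; the corpus defaults to just the two texts."""
    tokens_a, tokens_b = tokenize(text_a), tokenize(text_b)
    if not tokens_a or not tokens_b:
        return 0.0
    corpus_tokens = [tokenize(t) for t in (corpus or [text_a, text_b])]
    keywords = set(tfidf_keywords(tokens_a, corpus_tokens, top_k))
    keywords |= set(tfidf_keywords(tokens_b, corpus_tokens, top_k))
    vocab = sorted(keywords)
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Step 3: relative word-frequency vectors over the merged keyword set.
    vec_a = [freq_a[w] / len(tokens_a) for w in vocab]
    vec_b = [freq_b[w] / len(tokens_b) for w in vocab]
    return cosine_similarity(vec_a, vec_b)
```

For example, `text_similarity("project budget plan", "project budget report")` shares two of three words per text, so the result lies strictly between 0 and 1, while two texts with no words in common score 0.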

Project management informatization is widely used in the engineering field, and a large amount of structured and unstructured data is generated across the stages from project initiation, design, construction, procurement and material management to acceptance. The difficulty lies in how to structure project information in a computer so that it can be stored and retrieved conveniently. If storage follows the requirements of each application, the information becomes too dispersed and information islands form easily; if the design favors convenient retrieval, compatibility problems with the application software can arise.

To address these problems, the invention designs a data asset classification scheme and a storage and retrieval method suitable for the engineering field. It performs secondary cataloguing of stored data, incorporates the concept of treating data as an asset, and builds data standards, a classification scheme, and storage and retrieval methods on the basis of database SQL statements. This effectively remedies the shortcomings of existing methods, such as unclear data resources, non-uniform data standards and insufficient visualization, improving data utilization and making data easier to use.

Referring to fig. 1, the data asset classification management and storage method applicable to the engineering field of the present invention mainly includes the following steps:

First, the categories of data sources are determined according to different business processes, and the different business processes are labeled; in this embodiment, the data sources are mainly classified into artificial data, management data, evaluation data, process data, and result data.

Second, the data is represented in the database as individual forms, and is converted into SQL statements that create data forms, from which the SQL statements and form names can be extracted automatically.
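
As an illustration of automatically extracting the table-creating statements and form names, the following sketch scans a SQL script with a regular expression. It assumes each statement has the plain form `CREATE TABLE name (...);` with no trailing engine options; the pattern and the function name `extract_create_statements` are hypothetical, and a production system would more likely rely on a real SQL parser.

```python
import re

# Assumption: statements look like "CREATE TABLE [IF NOT EXISTS] name ( ... );".
# Table options after the closing parenthesis are not handled by this sketch.
CREATE_TABLE_RE = re.compile(
    r'CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?[`"\[]?(?P<name>\w+)[`"\]]?'
    r'\s*\((?P<body>.*?)\)\s*;',
    re.IGNORECASE | re.DOTALL,
)


def extract_create_statements(sql_text):
    """Return (form_name, full_statement) pairs for every CREATE TABLE found."""
    return [(m.group("name"), m.group(0)) for m in CREATE_TABLE_RE.finditer(sql_text)]


if __name__ == "__main__":
    script = """
    INSERT INTO x VALUES (1);
    CREATE TABLE project_info (id INTEGER PRIMARY KEY, name VARCHAR(64));
    """
    for name, statement in extract_create_statements(script):
        print(name, "->", statement)
```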

Third, through SQL semantic analysis, each SQL statement that creates a data form is split into the original form of the data form (which becomes the data cataloguing document), and source, grading, classification and scope information are added. The data cataloguing document is processed into a data form, its classification and scope are labeled, and the standard for the content of each data field is stored in the database as a regular expression. This step programmatically implements the automatic creation of cataloguing documents and the restoration of data forms.
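
One possible shape for such a cataloguing document is sketched below. The class and function names (`FieldEntry`, `CatalogDocument`, `build_catalog`, `validate_row`, `restore_create_sql`) and the sample metadata values are assumptions made for illustration; the sketch also assumes the statement body contains only simple column definitions, not table-level constraints.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FieldEntry:
    name: str
    sql_type: str
    pattern: str = ".*"  # regular expression describing valid field content


@dataclass
class CatalogDocument:
    table_name: str
    source: str
    grade: str
    classification: str
    scope: str
    fields: list = field(default_factory=list)


def split_columns(body):
    """Split a CREATE TABLE body on commas that are not inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def build_catalog(create_sql, source, grade, classification, scope, patterns=None):
    """Split a CREATE TABLE statement into a cataloguing document with added metadata."""
    match = re.search(r"CREATE\s+TABLE\s+(\w+)\s*\((.*)\)", create_sql,
                      re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("not a CREATE TABLE statement")
    doc = CatalogDocument(match.group(1), source, grade, classification, scope)
    for column in split_columns(match.group(2)):
        name, _, sql_type = column.partition(" ")
        doc.fields.append(FieldEntry(name, sql_type.strip(), (patterns or {}).get(name, ".*")))
    return doc


def validate_row(doc, row):
    """Return the names of fields whose value does not match the stored pattern."""
    return [f.name for f in doc.fields
            if f.name in row and not re.fullmatch(f.pattern, str(row[f.name]))]


def restore_create_sql(doc):
    """Rebuild the table-creating statement from the cataloguing document."""
    columns = ", ".join(f"{f.name} {f.sql_type}" for f in doc.fields)
    return f"CREATE TABLE {doc.table_name} ({columns});"
```

Keeping the field standards as regular expressions next to the field list means the same document can drive both validation of incoming rows and restoration of the table definition.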

Fourth, after the labeling work is finished, the classification of the classified fields is first converted into data form names. The classification field is stored in xx.xx.xx.xx.xx format, with "." as the separator marking the levels of the classification. The hierarchical naming is then converted into data table names that reflect hierarchical management, such as data table 1.1 and data table 1.1.1. This step prepares for the automatic cataloguing of new data forms.
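
The conversion from a dotted classification code to a table name might look like the sketch below. Because an unquoted SQL identifier cannot contain ".", the sketch replaces the separator with an underscore; that choice, the `data_table` prefix, and the function names are assumptions, not details given in the patent.

```python
import re

# Assumption: each level of the classification code is a run of digits.
CODE_RE = re.compile(r"\d+(?:\.\d+)*")


def classification_to_table_name(code, prefix="data_table"):
    """Turn a code such as '1.1.1' into an identifier such as 'data_table_1_1_1'."""
    if not CODE_RE.fullmatch(code):
        raise ValueError(f"invalid classification code: {code!r}")
    return prefix + "_" + "_".join(code.split("."))


def parent_code(code):
    """Return the code one level up, or None for a top-level code."""
    head, sep, _ = code.rpartition(".")
    return head if sep else None


def level(code):
    """Depth of the code in the hierarchy (a top-level code has depth 1)."""
    return code.count(".") + 1
```

With this naming, the table for code `1.1.1` sits visibly below the table for `1.1`, which is the hierarchical management the step describes.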

Fifth, the data form fields are extracted, and the similarity between all data fields is calculated with a text similarity method (using an open-source algorithm). The similarity results are divided into three ranges, 100%, 60%-99% and 0-59%, which are classified as reference data, homologous data, and entity data, respectively.
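
The three ranges can be expressed as a small classifier. The sketch assumes similarity scores are given on a 0 to 1 scale and treats the ranges as contiguous (anything from 0.60 up to but not including 1.0 counts as homologous), since the text leaves the gaps between 59% and 60% and between 99% and 100% unspecified. The names used are hypothetical.

```python
import math
from itertools import combinations

REFERENCE, HOMOLOGOUS, ENTITY = "reference", "homologous", "entity"


def classify_similarity(score):
    """Map a similarity in [0, 1] to one of the three data categories."""
    if math.isclose(score, 1.0):
        return REFERENCE          # 100 %
    if not 0.0 <= score < 1.0:
        raise ValueError(f"similarity out of range: {score}")
    if score >= 0.60:
        return HOMOLOGOUS         # 60 % - 99 %
    return ENTITY                 # 0 % - 59 %


def classify_field_pairs(fields, similarity):
    """Classify every unordered pair of fields using a caller-supplied similarity function."""
    return {
        (a, b): classify_similarity(similarity(a, b))
        for a, b in combinations(fields, 2)
    }
```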

Sixth, table-building statements are created automatically according to the three classifications. For reference data tables, the first table is selected as the entity table, and all other fields are associated with the fields of that table through foreign keys. For homologous data tables, the tables are built in the entity manner and an additional association table is created to store the similar fields and their similarity values. For entity data tables, the tables are built directly.
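
A rough sketch of generating the three kinds of table-building statements follows. The statement layout, the association table name `field_similarity` and its columns, and all function names are illustrative assumptions; SQLite is used only because it ships with Python and lets the generated statements be executed directly.

```python
import sqlite3


def entity_ddl(table, columns):
    """Entity data: build the table directly from (name, type) pairs."""
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return f"CREATE TABLE {table} ({cols});"


def reference_ddl(table, columns, ref_column, first_table, first_column):
    """Reference data: tie a field to the matching field of the first (entity) table."""
    cols = [f"{name} {sql_type}" for name, sql_type in columns]
    cols.append(f"FOREIGN KEY ({ref_column}) REFERENCES {first_table}({first_column})")
    return f"CREATE TABLE {table} ({', '.join(cols)});"


def homologous_ddl(table, columns, relation_table="field_similarity"):
    """Homologous data: an entity-style table plus an association table of similar fields."""
    relation = (
        f"CREATE TABLE IF NOT EXISTS {relation_table} ("
        "left_field TEXT NOT NULL, right_field TEXT NOT NULL, similarity REAL NOT NULL);"
    )
    return [entity_ddl(table, columns), relation]


def apply_ddl(conn, statements):
    """Execute each generated statement on an open sqlite3 connection."""
    for statement in statements:
        conn.execute(statement)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    apply_ddl(conn, [
        entity_ddl("project", [("project_code", "TEXT PRIMARY KEY"), ("name", "TEXT")]),
        reference_ddl("contract", [("contract_id", "INTEGER PRIMARY KEY"),
                                   ("project_code", "TEXT")],
                      "project_code", "project", "project_code"),
    ])
    apply_ddl(conn, homologous_ddl("supplier", [("supplier_name", "TEXT")]))
```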

Through the list document of the built tables, the conversion from each original table to its final storage table can be traced, and the correspondence between the physical storage table and the original table can be restored in reverse, which facilitates the investigation of data discrepancies. In addition, data health metrics can be output, which facilitates operation and maintenance.
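
The forward and reverse lookups described here could be backed by a simple two-way mapping, sketched below. The class `CatalogMapping`, its method names, and the sample table names are hypothetical; the patent does not specify how the mapping is stored.

```python
from collections import defaultdict


class CatalogMapping:
    """Two-way mapping between original tables, catalog codes and storage tables."""

    def __init__(self):
        self._forward = {}                 # original table -> (catalog code, storage table)
        self._reverse = defaultdict(list)  # storage table -> list of original tables

    def register(self, original_table, catalog_code, storage_table):
        """Record that an original table was catalogued and stored under a new name."""
        self._forward[original_table] = (catalog_code, storage_table)
        self._reverse[storage_table].append(original_table)

    def storage_for(self, original_table):
        """Forward lookup: where did this original table end up?"""
        return self._forward[original_table][1]

    def catalog_for(self, original_table):
        """Forward lookup: under which catalog code is this original table filed?"""
        return self._forward[original_table][0]

    def originals_for(self, storage_table):
        """Reverse lookup: which original tables feed this storage table?"""
        return list(self._reverse.get(storage_table, []))
```

A reverse lookup that returns more than one original table is a natural starting point when investigating discrepancies between sources.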

From the foregoing it will be seen that this invention is one well adapted to attain all the ends and objects set forth above, together with other advantages which are obvious and which are inherent to the structure. It will be understood that certain features and subcombinations are of utility and may be employed without reference to other features and subcombinations. This is within the scope of the invention.

Claims

1. A data asset classification management and storage method suitable for the engineering field is characterized by comprising the following steps:

2. The data asset classification management and storage method suitable for the engineering field as claimed in claim 1, wherein the categories of data sources in the first step are artificial data, management data, evaluation data, process data and result data.

3. The data asset classification management and storage method suitable for the engineering field as claimed in claim 1, wherein converting the classification of the classified fields into data form names in the fourth step includes:

the classification field is stored in XX.XX.XX.XX format, with "." used as the separator marking the classification levels;

4. The data asset classification management and storage method suitable for the engineering field as claimed in claim 1, wherein the fifth step classifies the similarity results into three categories: reference data, homologous data, and entity data.

5. The data asset classification management and storage method suitable for the engineering field as claimed in claim 4, wherein in the sixth step, for data forms created from reference data, the first form is selected as the entity table and built, and all other fields are associated with the fields of the first form through foreign keys; for data forms created from homologous data, tables are built in the entity manner and an additional association table is created to store the similar fields and their similarity values; for entity data, the data form is created directly.