CN110688421A

CN110688421A - Intelligent customizable data management and analysis method

Info

Publication number: CN110688421A
Application number: CN201810633877.2A
Authority: CN
Inventors: 孟涛; 李佳静
Original assignee: Nanjing Network Sense To Inspect Mdt Infotech Ltd
Current assignee: Nanjing Network Sense To Inspect Mdt Infotech Ltd
Priority date: 2018-06-20
Filing date: 2018-06-20
Publication date: 2020-01-14

Abstract

The invention relates to an intelligent, customizable data governance and analysis method comprising the following steps. Step 1: construct global master data. Step 2: structure the unstructured data within the application system. Step 3: fuse data from multiple different sources in the application system on the basis of the master data to obtain standard data. Step 4: customize fields in the standard data to serve as types and tags for the data. Step 5: customize the analysis conditions, analysis range, and chart format. The method governs application system data intelligently: it structures unstructured application system data and, for multi-source heterogeneous data, aligns the data and completes missing data. It also lets the user customize the analysis conditions, define the analysis range, and customize the data presentation form, thereby enabling flexible, customizable data analysis.

Description

Intelligent customizable data management and analysis method

Technical Field

The invention relates to the field of information extraction and text analysis, in particular to an intelligent customizable data governance and analysis method.

Background

Master data is data that describes the core business entities of an enterprise, such as customers, partners, employees, products, bills of materials, and accounts. It has high business value, can be reused across business units within an enterprise, and exists in multiple heterogeneous application systems. Master data is difficult to manage for several reasons. In terms of definition, it lacks a uniform standard, a clear definition, and a defined scope. In terms of process, management processes such as data creation and maintenance are inconsistent. In terms of quality, the data lacks integrity, consistency, and accuracy, and contains many duplicates. Master data is also difficult to share, because ownership of the data is unclear, sharing channels are poor, and access control is difficult.

In multi-source heterogeneous data, the same concept can have different names because of differences in aliases, abbreviations, translations, everyday expressions, and written forms, so data alignment is required. Data may also be missing and must be completed. In addition, application systems contain a large amount of unstructured data, such as case records, written judgments, and official documents, on which data analysis cannot be performed directly. All of these problems call for intelligent data governance methods.

Moreover, most current analysis tools produce fixed results for given data, and flexible, customizable data analysis is difficult to achieve. For example, they do not let the user customize the analysis conditions, define the analysis range, or customize the presentation form of the data.

Disclosure of Invention

1. The technical problem to be solved is as follows:

To address the above problems, the invention provides an intelligent, customizable data governance and analysis method. The method first constructs global master data and applies information extraction to unstructured data so that it becomes structured; it then completes data governance on the basis of the global master data to obtain standard data. The user can customize fields in the standard data for classification or labeling. Finally, results are displayed according to the analysis conditions, analysis range, and presentation form defined by the user.

2. The technical scheme is as follows:

An intelligent customizable data governance and analysis method, characterized in that the method comprises the following steps:

Step 1: construct global master data.

Step 2: structure the unstructured data within the application system.

Step 3: fuse data from multiple different sources in the application system on the basis of the master data to obtain standard data.

Step 4: customize fields in the standard data to serve as types and tags for the data.

Step 5: customize the analysis conditions, analysis range, and chart format.

Step 6: generate a data analysis result according to the customization made in step 5.

Further, the specific process of constructing the global master data in step 1 is as follows. The database of the application system and website data from related domains serve as the principal sources of master data. A series of conversion rules is designed to obtain master data from the application system database and the website data of related domains. The conversion rules include, but are not limited to: converting a table name in the relational schema into a concept name in the master data, converting a relationship between tables into a relationship between concepts in the master data, and converting a field name in the relational schema into an attribute name in the master data. Master data may also be defined manually.

Further, in step 2 the unstructured data in the application system is structured by means of information extraction, where the extracted information includes, but is not limited to, master data.

Further, the method of step 3 for fusing data from multiple different sources on the basis of the master data comprises data alignment and missing data completion.

Data alignment performs knowledge fusion on the master data of multiple heterogeneous data sources. Similarity detection rules are applied to master data from different domains to detect identical or similar concepts and attributes. The similarity detection rules comprise semantic similarity detection, concept similarity detection, attribute similarity detection, and data format similarity detection. After similarity detection, identical and similar master data across the heterogeneous data sources can be unified.

Missing data completion distinguishes external missing data from internal missing data. External missing data is obtained from external websites using web crawler technology. Internal missing data is completed using association rule mining: association rule mining uncovers the rule relationships hidden among the attributes of a data set, and these rules allow an unknown attribute value to be inferred from the existing condition attribute values, thereby filling in the data set.

Data that has undergone data alignment and missing data completion becomes standard data, on which statistics and analysis can be performed.
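The association-rule completion of internal missing data described above can be illustrated with a minimal sketch. It is not the patented implementation: it mines only single-antecedent rules of the form (attribute = value) -> (target = value), treats `min_support` as an absolute row count, and uses made-up records. The function names `mine_rules` and `fill_missing` are hypothetical.

```python
from collections import Counter


def mine_rules(records, target, min_support=2, min_confidence=0.8):
    """Mine single-antecedent rules (attr=value) -> (target=value).

    Returns {(attr, value): (target_value, confidence)}.
    """
    antecedent_counts = Counter()
    pair_counts = Counter()
    for row in records:
        if row.get(target) is None:
            continue  # only rows with a known target can support a rule
        for attr, value in row.items():
            if attr == target or value is None:
                continue
            antecedent_counts[(attr, value)] += 1
            pair_counts[(attr, value, row[target])] += 1

    rules = {}
    for (attr, value, target_value), count in pair_counts.items():
        confidence = count / antecedent_counts[(attr, value)]
        if count < min_support or confidence < min_confidence:
            continue
        # Keep only the most confident consequent for each antecedent.
        if confidence > rules.get((attr, value), (None, 0.0))[1]:
            rules[(attr, value)] = (target_value, confidence)
    return rules


def fill_missing(records, target, rules):
    """Return copies of the records with missing targets filled where a rule applies."""
    filled = []
    for row in records:
        row = dict(row)
        if row.get(target) is None:
            candidates = [rules[(a, v)] for a, v in row.items() if (a, v) in rules]
            if candidates:
                row[target] = max(candidates, key=lambda c: c[1])[0]
        filled.append(row)
    return filled


# Made-up records for illustration only.
records = [
    {"district": "Hanyang", "city": "Wuhan"},
    {"district": "Hanyang", "city": "Wuhan"},
    {"district": "Wuchang", "city": "Wuhan"},
    {"district": "Hanyang", "city": None},
]
rules = mine_rules(records, target="city")
completed = fill_missing(records, target="city", rules=rules)
```

Here the rule (district = Hanyang) -> (city = Wuhan) is supported by two rows with confidence 1.0, so the last record is filled; the Wuchang rule is discarded because only one row supports it. A production system would more likely use a multi-item algorithm such as Apriori or FP-growth.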

Further, in step 4, fields are customized in the standard data and used as types and tags for the data as follows: after the user has defined a new field, the data categories and labels are generated by either a rule-based method or a machine learning-based method.

Further, customizing the analysis conditions, analysis range, and report format in step 5 specifically comprises:

Customizing the analysis conditions: specifying the fields or attributes from which data is extracted.

Customizing the analysis range: for a specified field, a value range can be set so that only data within that range is extracted.

Customizing the presentation form of the data: the presentation forms include a list, a pie chart, a trend chart, a histogram, and a relationship chart.

Further, the data analysis result in step 6 is generated as follows:

Corresponding SQL statements are generated automatically according to the report format customized by the user, the database is queried, and the query result is rendered in the corresponding form, such as a trend chart, and displayed to the user.

3. Advantageous effects:

The method provided by the invention governs application system data intelligently: it structures unstructured application system data and, for multi-source heterogeneous data, aligns the data and completes missing data. It also lets the user customize the analysis conditions, define the analysis range, and customize the data presentation form, thereby enabling flexible, customizable data analysis.

Drawings

FIG. 1 is a flow diagram of an intelligent, customizable data governance and analysis method.

Detailed Description

The invention will be further described with reference to the accompanying drawings.

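As shown in FIG. 1, an intelligent customizable data governance and analysis method comprises the following steps:

Step 1: construct global master data.

Step 2: structure the unstructured data within the application system.

Step 3: fuse data from multiple different sources in the application system on the basis of the master data to obtain standard data.

Step 4: customize fields in the standard data to serve as types and tags for the data.

Step 5: customize the analysis conditions, analysis range, and chart format.

Step 6: generate a data analysis result according to the customization made in step 5.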

The specific process of constructing the global master data in step 1 is as follows. The database of the application system and website data from related domains serve as the principal sources of master data. A series of conversion rules is designed to obtain master data from the application system database and the website data of related domains. The conversion rules include, but are not limited to: converting a table name in the relational schema into a concept name in the master data, converting a relationship between tables into a relationship between concepts in the master data, and converting a field name in the relational schema into an attribute name in the master data. For example, in a hospital information system, the database tables include patient records, bed records, and patient sign records. Patient master data is established from the patient record table, whose fields become attribute names, including patient identification number, patient name, bed number, admission date, main diagnosis, and condition. A "manifestation" relationship exists between the patient and the sign record, that is, "the patient presents signs", and it is converted into a relationship between concepts.

A relational database may have a complete data schema, including complete table structures and integrity constraints. The relation (table) names in the database can therefore be converted into concepts in the master data, and some of the field names can be converted into attributes in the master data. Master data can also be defined manually.
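The three conversion rules can be sketched as follows. This is only an illustrative sketch, assuming the relational schema is already available as plain Python structures; the `Concept` class, the function `schema_to_master_data`, and the table and field names are hypothetical and modeled on the hospital example.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A master-data concept derived from one relational table."""
    name: str
    attributes: list = field(default_factory=list)


def schema_to_master_data(tables, foreign_keys):
    """Apply the three conversion rules to a relational schema.

    tables:       {table_name: [field_name, ...]}
    foreign_keys: [(child_table, parent_table, relation_name), ...]
    """
    # Rule 1: table name -> concept name.
    # Rule 3: field name -> attribute name.
    concepts = {
        table: Concept(name=table, attributes=list(columns))
        for table, columns in tables.items()
    }
    # Rule 2: table-to-table relationship -> concept-to-concept relationship.
    relations = [
        (parent, relation, child)
        for child, parent, relation in foreign_keys
        if child in concepts and parent in concepts
    ]
    return concepts, relations


# Hypothetical schema modeled on the hospital example.
tables = {
    "patient": ["patient_id", "name", "bed_number", "admission_date", "main_diagnosis"],
    "sign_record": ["record_id", "patient_id", "sign", "recorded_at"],
}
foreign_keys = [("sign_record", "patient", "presents")]

concepts, relations = schema_to_master_data(tables, foreign_keys)
```

The resulting `relations` list holds (subject concept, relation, object concept) triples such as ("patient", "presents", "sign_record"), which is one simple way to represent the relationship between concepts.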

In step 2, the unstructured data in the application system is structured by means of information extraction, where the extracted information includes, but is not limited to, master data. Unstructured business data, such as case records, documents, and official documents, is converted into structured data by an information extraction method. For example, a medical case record contains master data such as "patient", "symptom", "test result", "treatment method", and "medicine"; the values of these master data items are extracted from the case record and converted into structured data.
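As a rough illustration of turning a case record into structured data, the sketch below uses hand-written regular expressions. The text does not specify the extraction technique, so the patterns, the attribute names, and the sample note are all assumptions; a real system would more likely use a trained named-entity or relation extraction model.

```python
import re

# Hypothetical patterns: each maps a master-data attribute to a regex with
# exactly one capture group.
PATTERNS = {
    "patient": re.compile(r"Patient:\s*([^;.]+)"),
    "symptom": re.compile(r"complains of\s+([^;.]+)"),
    "test_result": re.compile(r"Test result:\s*([^;.]+)"),
    "treatment": re.compile(r"treated with\s+([^;.]+)"),
}


def extract_record(text):
    """Return a structured dict; attributes with no match are set to None."""
    record = {}
    for attribute, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[attribute] = match.group(1).strip() if match else None
    return record


# Made-up note for illustration only.
note = (
    "Patient: Zhang San; complains of fever and cough. "
    "Test result: WBC elevated. The patient was treated with amoxicillin."
)
record = extract_record(note)
```

Each extracted value becomes one field of a structured row, so the note can be stored and queried alongside data that was structured from the start.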

In step 3, the method for fusing data from multiple different sources on the basis of the master data comprises data alignment and missing data completion.

Data alignment: in multi-source heterogeneous data, the same concept may have different names because of differences in aliases, abbreviations, translations, everyday expressions, and written forms. For example, "NS" is an abbreviation of "normal saline" (physiological saline); "301 Hospital" is an alias of the "Chinese People's Liberation Army General Hospital"; and common English expressions for the concept "indications" include "Indications", "Indications and Uses", "major (principal) Indications", "Uses", and "actions and use".

Knowledge fusion of the master data of multiple heterogeneous data sources is therefore required. Similarity detection rules are applied to master data from different domains to detect identical or similar concepts and attributes. The similarity detection rules comprise semantic similarity detection, concept similarity detection, attribute similarity detection, and data format similarity detection. After similarity detection, identical and similar master data across the heterogeneous data sources can be unified.

Each explicitly defined synonym can be found, for example, on the basis of word vectors; alternatively, concepts that are synonymous with several instances of a given concept can be identified as new concepts of the same type. One option is to train Google's word2vec model to learn synonyms and related words; this involves data processing, model training, and parameter tuning.
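Semantic similarity detection on word vectors can be sketched with cosine similarity. The vectors below are toy values invented purely for illustration; in practice they would come from a trained embedding model such as word2vec. The function names and the 0.9 threshold are likewise assumptions.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
VECTORS = {
    "indications": (0.9, 0.1, 0.0),
    "uses": (0.8, 0.2, 0.1),
    "dosage": (0.1, 0.9, 0.2),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(term, canonical_terms, threshold=0.9):
    """Map a term to its most similar canonical term, or None if none is close enough."""
    if term not in VECTORS:
        return None
    best, best_score = None, threshold
    for candidate in canonical_terms:
        if candidate not in VECTORS:
            continue
        score = cosine(VECTORS[term], VECTORS[candidate])
        if score >= best_score:
            best, best_score = candidate, score
    return best


aligned = align("uses", ["indications", "dosage"])
unaligned = align("dosage", ["indications"])
```

With these toy values, "uses" is unified with "indications", while "dosage" stays unaligned because its similarity falls below the threshold. The threshold trades recall against false merges and would need tuning on real data.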

Missing data completion: to augment and refine the master data, non-relational data needs to be collected and filled in. Missing data completion distinguishes external missing data from internal missing data. External missing data is obtained from external websites using web crawler technology. For example, for sales data recorded under "Hanyang District", the Baidu Baike (Baidu Encyclopedia) website is crawled and the value is completed as "Hanyang District, Wuhan City, Hubei Province".

Internal missing data is completed using association rule mining: association rule mining uncovers the rule relationships hidden among the attributes of a data set, and these rules allow an unknown attribute value to be inferred from the existing condition attribute values, thereby filling in the data set. For example, if a user's age is missing, the year and month of birth can be obtained from the identification number and the age filled in accordingly.
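The age example can be sketched as a deterministic rule rather than a mined one. The sketch assumes the 18-character resident identity number format, in which characters 7 to 14 encode the birth date as YYYYMMDD; it does not validate the check digit. The field names `id_number` and `age` and both function names are hypothetical.

```python
from datetime import date


def birth_date_from_id(id_number):
    """Parse the birth date from characters 7-14 (YYYYMMDD) of an 18-character ID."""
    if len(id_number) != 18 or not id_number[6:14].isdigit():
        return None
    year = int(id_number[6:10])
    month = int(id_number[10:12])
    day = int(id_number[12:14])
    try:
        return date(year, month, day)
    except ValueError:
        return None  # the digits do not form a valid calendar date


def fill_age(record, today):
    """Return a copy of the record with a missing 'age' derived from 'id_number'."""
    record = dict(record)
    if record.get("age") is None:
        born = birth_date_from_id(record.get("id_number") or "")
        if born is not None:
            # Subtract one year if this year's birthday has not occurred yet.
            before_birthday = (today.month, today.day) < (born.month, born.day)
            record["age"] = today.year - born.year - before_birthday
    return record


# Made-up ID string used only to exercise the parser; it is not a real ID.
patient = fill_age(
    {"id_number": "110101199003070000", "age": None},
    today=date(2020, 1, 14),
)
```

Passing `today` explicitly keeps the function deterministic and easy to test; records whose ID is absent or malformed are returned unchanged.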

In step 4, fields are customized in the standard data and used as types and tags for the data as follows:

After the user has defined a new field, the data categories and labels are generated by either a rule-based method or a machine learning-based method. Rule-based method: for example, under the rule "systolic pressure > 140, diastolic pressure > 90 is hypertension; otherwise it is normal", a patient's blood pressure can be classified from the systolic and diastolic readings to produce a class label. Machine learning-based method: a portion of labeled data is used as a training set to train a machine learning model, which can then automatically classify newly entered data and produce class labels. For example, staging heart failure in a patient is difficult in clinical practice because a simple and effective model is lacking. A machine learning method can address this by training a model, such as an SVM, on previously standardized cases to obtain a heart failure diagnosis and staging model. For a new case, the staging label can then be generated automatically.
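The rule-based method can be illustrated with the blood pressure rule. The text lists the two thresholds without saying whether both must be exceeded; this sketch assumes that exceeding either one yields the label "hypertension", which is an interpretation rather than something the text states. The field name `bp_class` and the sample readings are hypothetical.

```python
def blood_pressure_label(systolic, diastolic, systolic_limit=140, diastolic_limit=90):
    """Label a reading 'hypertension' if either limit is exceeded, else 'normal'."""
    # Assumption: the two conditions are combined with "or".
    if systolic > systolic_limit or diastolic > diastolic_limit:
        return "hypertension"
    return "normal"


# Made-up readings for illustration only.
readings = [
    {"patient_id": "P001", "systolic": 150, "diastolic": 85},
    {"patient_id": "P002", "systolic": 120, "diastolic": 80},
]

# Populate the user-defined field "bp_class" with the generated label.
for row in readings:
    row["bp_class"] = blood_pressure_label(row["systolic"], row["diastolic"])
```

Keeping the limits as parameters lets the same function serve a user who defines a different rule for the custom field.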

The analysis conditions, the analysis range, and the report format in step 5 are customized as follows:

For example, the user sets the target location to "Wuhan", the organization to "kindergarten", the disease name to "hand, foot and mouth disease", the time range to "the last three months", and the result format to "trend chart".

The data analysis result in step 6 is generated as follows: corresponding SQL statements are generated automatically according to the report format customized by the user, the database is queried, and the query result is rendered in the corresponding form, such as a trend chart, and displayed to the user.
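Generating an SQL statement from the user's customization can be sketched with Python's standard `sqlite3` module. The table `cases`, its columns, the helper `build_trend_query`, and the sample rows are hypothetical; the sketch covers only the trend-chart case (a count per day) and leaves chart rendering out.

```python
import sqlite3

# Whitelist of columns the user may filter or group on (hypothetical schema).
ALLOWED_COLUMNS = {"city", "organization", "disease", "report_date"}


def build_trend_query(table, equals, date_column, start, end):
    """Build a parameterized SQL statement that counts rows per day within a range.

    The table name is interpolated directly, so it must come from trusted code.
    """
    for column in list(equals) + [date_column]:
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unsupported column: {column}")
    where = [f"{column} = ?" for column in equals]
    where.append(f"{date_column} BETWEEN ? AND ?")
    sql = (
        f"SELECT {date_column}, COUNT(*) FROM {table} "
        f"WHERE {' AND '.join(where)} "
        f"GROUP BY {date_column} ORDER BY {date_column}"
    )
    params = list(equals.values()) + [start, end]
    return sql, params


# In-memory database with made-up rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cases (city TEXT, organization TEXT, disease TEXT, report_date TEXT)"
)
conn.executemany(
    "INSERT INTO cases VALUES (?, ?, ?, ?)",
    [
        ("Wuhan", "kindergarten", "hand, foot and mouth disease", "2018-05-01"),
        ("Wuhan", "kindergarten", "hand, foot and mouth disease", "2018-05-01"),
        ("Wuhan", "kindergarten", "hand, foot and mouth disease", "2018-05-02"),
        ("Wuhan", "school", "influenza", "2018-05-02"),
    ],
)

sql, params = build_trend_query(
    "cases",
    {"city": "Wuhan", "organization": "kindergarten", "disease": "hand, foot and mouth disease"},
    "report_date",
    "2018-03-20",
    "2018-06-20",
)
trend = conn.execute(sql, params).fetchall()
```

User-supplied values are passed as bound parameters rather than pasted into the SQL text, and column names are checked against a whitelist, which keeps the generated statement safe from injection. The rows in `trend` are (date, count) pairs that a charting layer could plot as a trend chart.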

Claims

1. An intelligent customizable data governance and analysis method, characterized in that the method comprises the following steps:

step 1: constructing global master data;

step 2: structuring unstructured data in an application system;

step 3: fusing data from a plurality of different sources in the application system based on the master data to obtain standard data;

step 4: customizing fields in the standard data as types and labels of the data;

step 5: customizing analysis conditions, an analysis range, and a chart format;

2. The intelligent customizable data governance and analysis method of claim 1, wherein the specific process of constructing the global master data in step 1 is as follows: a database of the application system and website data of related domains are used as the principal sources of master data; a series of conversion rules is designed to obtain master data from the application system database and the website data of related domains; the conversion rules include, but are not limited to: converting a table name in the relational schema into a concept name in the master data, converting a relationship between tables into a relationship between concepts in the master data, and converting a field name in the relational schema into an attribute name in the master data; the master data may also be defined manually.

3. The intelligent customizable data governance and analysis method of claim 1, wherein in step 2 the unstructured data in the application system is structured by means of information extraction, and the extracted information includes, but is not limited to, master data.

4. The intelligent customizable data governance and analysis method of claim 1, wherein the method of step 3 for fusing data from a plurality of different sources based on the master data comprises data alignment and missing data completion;

the data alignment performs knowledge fusion on the master data of a plurality of heterogeneous data sources; similarity detection rules are applied to master data from different domains to detect identical or similar concepts and attributes; the similarity detection rules comprise semantic similarity detection, concept similarity detection, attribute similarity detection, and data format similarity detection; after similarity detection, identical and similar master data in the plurality of heterogeneous data sources can be unified;

the missing data completion distinguishes external missing data from internal missing data; external missing data is obtained from external websites using web crawler technology; internal missing data is completed using association rule mining; association rule mining uncovers the rule relationships hidden among the attributes of a data set, and these rules allow an unknown attribute value to be inferred from the existing condition attribute values, thereby filling in the data set.

5. The intelligent customizable data governance and analysis method of claim 1, wherein in step 4 fields are customized in the standard data and used as types and labels of the data as follows:

after the user has defined a new field, the data categories and labels are generated by either a rule-based method or a machine learning-based method.

6. The intelligent customizable data governance and analysis method of claim 1, wherein customizing the analysis conditions, the analysis range, and the report format in step 5 specifically comprises:

customizing the analysis conditions: specifying the fields or attributes from which data is extracted;

customizing the analysis range: for a specified field, a value range can be set so that only data within that range is extracted;

7. The intelligent customizable data governance and analysis method of claim 1, wherein the data analysis result in step 6 is generated as follows: corresponding SQL statements are generated automatically according to the report format customized by the user, the database is queried, and the query result is rendered in the corresponding form, such as a trend chart, and displayed to the user.