CN112667617A

CN112667617A - Visual data cleaning system and method based on natural language

Info

Publication number: CN112667617A
Application number: CN202011617367.XA
Authority: CN
Inventors: 尹源
Original assignee: Nanjing Chengqin Education Technology Co ltd
Current assignee: Nanjing Chengqin Education Technology Co ltd
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2021-04-16

Abstract

The invention relates to the technical field of data processing, and in particular to a natural language-based visual data cleaning system and method. In the invention, connection information for the data source to be cleaned is specified through a server; the first N pieces of data to be cleaned are obtained and their field types and formats are parsed; useless fields are removed; cleaning modules are set up and data synchronization and cleaning tasks are triggered; the cleaning rules of the cleaning modules are reversely parsed into a data cleaning script, the script is executed on the data, and the cleaned data are passed into the analysis library; these steps are repeated until all data have been cleaned. The invention allows data to be cleaned without mastering the development and use of data cleaning tools, lowers the technical threshold of big data application services, improves the user experience of big data services, and addresses the flexibility and maintainability problems of traditional data cleaning systems, reducing the cost of use for data cleaning staff and improving efficiency.

Description

Visual data cleaning system and method based on natural language

Technical Field

The invention relates to the technical field of data processing, in particular to a visual data cleaning system and method based on natural language.

Background

With the development of big data technology in recent years, new analytical techniques have become available for massive raw logs, internet records, historical data and the like. Analyzing such massive data can reveal a great deal of valuable information that would otherwise go unnoticed. The first step of big data analysis is to collect the data scattered across various places, clean it, and store the cleaned data in a warehouse. This process is called ETL and involves three steps: Extract (data extraction), Transform (data conversion) and Load (data loading).

In the past, different data sources required different cleaning tools, and different programs and scripts had to be written to clean each source. These approaches require the user to master the use of various cleaning tools and to have considerable development skills. The result is a high usage threshold for data cleaning systems (expertise related to the data source or the cleaning tool has to be learned) and high maintenance costs for the data cleaning process.

In the invention document No. CN201710011044.8, a data cleaning method and a data cleaning apparatus are disclosed, the data cleaning method including: acquiring original sample data to be cleaned; determining at least one data screening mechanism for cleaning the original sample data, and acquiring a screening value set by a user for each data screening mechanism according to the original sample data; and screening the original sample data according to the at least one data screening mechanism and the screening value set by the user so as to clean the original sample data. According to the technical scheme, the original sample data can be comprehensively cleaned, the dependence of a data cleaning process on operators can be reduced, the accuracy and the stability of a data cleaning result are ensured, and meanwhile, the data cleaning duration can be effectively shortened.

In an invention document with a patent number of CN201810143012, a data cleaning method and a data cleaning system are disclosed. The data cleaning method comprises the following steps: step S10: selecting a data source to be cleaned from heterogeneous data sources through a graphical interface; the heterogeneous data source comprises a text file and database data; step S11: editing a data cleaning rule through a graphical interface; step S12: data cleansing is performed through a graphical interface. According to the data cleaning method, the data source to be cleaned is selected from the heterogeneous data sources through the graphical interface, fusion cleaning of different data sources can be achieved, meanwhile, a user can clean data through simple operation on the graphical interface, the development and use method of a data cleaning tool does not need to be mastered, the technical threshold of big data application service is lowered, and the user experience of the big data service is improved.

In summary, traditional data cleaning systems mostly rely on script writing and configuration files, or on drag-and-drop controls. These approaches are simple to implement, but their learning and maintenance costs are high and their flexibility is low.

Disclosure of Invention

In view of the defects of the prior art, the invention discloses a visual data cleaning system and method based on natural language, to solve the problem that traditional data cleaning systems mostly rely on script writing and configuration files, or on drag-and-drop controls, which are simple to implement but have high learning and maintenance costs and low flexibility.

The invention is realized by the following technical scheme:

In a first aspect, the invention discloses a visual data cleaning method based on natural language, which comprises the following steps:

S1: after the system is successfully initialized, the connection information for the data source to be cleaned is specified through the server;

S2: after the data source is successfully connected, the first N pieces of data to be cleaned are acquired, and their field types and formats are parsed;

S3: the data fields to be accessed are confirmed through the graphical interface, a first round of screening is carried out, and useless fields are removed;

S4: the natural language cleaning configuration is entered, the cleaning modules are set, and the data synchronization and cleaning tasks are triggered;

S5: the cleaning rules of the cleaning modules are reversely parsed into a data cleaning script, and the script is executed on the data;

S6: the cleaned data are transferred into the analysis library, and the above steps are repeated until all data have been cleaned, whereupon cleaning is complete.
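The batch loop formed by steps S2 to S6 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (batches, infer_field_types, run_cleaning) are hypothetical, cleaning rules are assumed to be plain Python predicates rather than a generated script, and an in-memory list stands in for the analysis library.

```python
from typing import Callable, Iterable, Iterator

Row = dict
Rule = Callable[[Row], bool]  # assumption: a cleaning rule is a keep/drop predicate


def batches(rows: Iterable[Row], n: int) -> Iterator[list]:
    """Yield successive lists of at most n rows (the 'first N pieces' of S2)."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:
        yield batch


def infer_field_types(batch: list) -> dict:
    """S2: record the Python type name first seen for each field in a sample."""
    types = {}
    for row in batch:
        for field, value in row.items():
            types.setdefault(field, type(value).__name__)
    return types


def run_cleaning(source: Iterable[Row], keep_fields: list, rules: list, n: int) -> list:
    """Run S2 to S6 batch by batch until the source is exhausted."""
    analysis_library = []  # S6 target; stand-in for a real analysis store
    for batch in batches(source, n):                                        # S2
        projected = [{f: row[f] for f in keep_fields} for row in batch]     # S3
        cleaned = [row for row in projected if all(rule(row) for rule in rules)]  # S5
        analysis_library.extend(cleaned)                                    # S6
    return analysis_library


source = [
    {"id": 1, "score": 90, "tmp": "x"},
    {"id": 2, "score": -5, "tmp": "y"},
    {"id": 3, "score": 70, "tmp": "z"},
]
cleaned = run_cleaning(source, keep_fields=["id", "score"],
                       rules=[lambda row: row["score"] >= 0], n=2)
```

In this sketch the useless field tmp is dropped in S3 and the row with a negative score is filtered out in S5, two rows at a time.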

Further, in the method, specifying the connection information for the data source to be cleaned includes providing server host information, a user name and password, and a database for a remote data source, and providing the corresponding directory and file path for a local data source.

Furthermore, in the method, each time a cleaning module is added, the system gives a natural language prompt according to the analyzed source field information, and assists the user in configuring the cleaning rule by using the natural language.

Furthermore, in the method, to identify abnormal attributes in the data set, a corresponding weight is first assigned to each attribute; the mean and standard deviation of the field values of each attribute are then computed; a confidence interval is set for each attribute from its mean and standard deviation; and an attribute is judged abnormal according to whether its value lies within the confidence interval.

Furthermore, the method uses a reduction algorithm based on attribute importance as a logic rule to clean the data attributes: the discernibility matrix is calculated for the decision table S = {U, Q, V, F}, and the core attributes in the discernibility matrix are assigned to the attribute set obtained after attribute reduction, wherein U, Q, V, F are the attributes of the data.

Furthermore, in the method, the core attributes are removed from the discernibility matrix, leaving the remaining attribute combination items; the occurrence frequency of each condition attribute is calculated and all attribute frequencies are sorted in descending order; the attribute with the highest frequency is selected and denoted a, and Red = Red ∪ {a}; the combination items containing the condition attribute a are deleted from all combination items of the matrix; whether the discernibility matrix is empty is then judged: if it is not empty, the selection and deletion continue, and if it is empty, the procedure ends, where Red is the final reduction result.

Furthermore, in the method, for the data logic constraint rules formulated by the reduction algorithm based on attribute importance, a rule closure set is obtained by a mathematical method, and whether a field value violates a rule constraint is judged automatically, thereby determining the correctness of the logic rules.

In a second aspect, the invention discloses a natural language-based visual data cleaning system, which includes a visual cleaning process canvas, a natural language conversion module, a server and a memory storing execution instructions, wherein when the server executes the execution instructions stored in the memory, the server executes the natural language visual data cleaning method of the first aspect.

Furthermore, the visual cleaning process canvas supports drawing the data cleaning logic by dragging: different data cleaning component blocks are added, and the component blocks are connected by data flow paths.

Furthermore, when a data cleaning component block is added, the cleaning logic is described in natural language. The natural language conversion module checks and parses the statement input by the user. If parsing succeeds, the statement is converted into a corresponding underlying data filtering query statement and passed to the underlying data cleaning execution module; if parsing fails, an exception status code is returned to the visual cleaning process canvas, which displays the corresponding exception information to prompt the user.

The beneficial effects of the invention are as follows:

By combining a graphical interface with a natural language engine, the invention lets a user clean data through simple operations on the graphical interface, without mastering the development and use of data cleaning tools. This lowers the technical threshold of big data application services, improves the user experience of big data services, and addresses the flexibility and maintainability problems of traditional data cleaning systems. Visualization and natural language interaction reduce the cost of use for data cleaning staff and improve efficiency, giving the invention strong prospects for market application.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a diagram of the principle steps of a visualization data cleansing method based on natural language.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

This embodiment discloses a natural language-based visual data cleaning method, shown in FIG. 1, which comprises steps S1 to S6 described above.

In this embodiment, specifying the connection information for the data source to be cleaned includes: for a remote data source, providing server host information, a user name and password, and a database; for a local data source, such as an Excel file or a log file, providing the corresponding directory and file path.
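The two kinds of connection information can be modelled as two small record types. The sketch below is only illustrative: the class names RemoteSource and LocalSource, the describe helper and the port field are assumptions that do not appear in the original text, which only mentions host information, user name, password, database, directory and file path.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Union


@dataclass(frozen=True)
class RemoteSource:
    host: str
    port: int        # assumption: treated as part of the "server host information"
    username: str
    password: str
    database: str


@dataclass(frozen=True)
class LocalSource:
    directory: PurePosixPath
    filename: str

    @property
    def path(self) -> PurePosixPath:
        """Full path of the local file, e.g. an Excel or log file."""
        return self.directory / self.filename


DataSource = Union[RemoteSource, LocalSource]


def describe(source: DataSource) -> str:
    """Return a human-readable summary that never includes the password."""
    if isinstance(source, RemoteSource):
        return f"remote {source.username}@{source.host}:{source.port}/{source.database}"
    return f"local {source.path}"


remote = RemoteSource("db.example.com", 3306, "etl", "secret", "school")
local = LocalSource(PurePosixPath("/data/logs"), "app.log")
```

Keeping the two cases as separate types makes it harder to submit a remote source without credentials or a local source without a path.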

In this embodiment, each time a cleaning module is added, the system provides a natural language prompt according to the parsed source field information and assists the user in configuring cleaning rules in natural language. An example for data filtering is: "import only data whose class is not equal to professional elective course".
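One simple way to produce such prompts is to keep a sentence template per field type and fill in the field names found during parsing. The following sketch assumes this template approach; the type labels ("number", "date", "text"), the template wording and the function name suggest_prompts are all hypothetical.

```python
# Assumption: one prompt template per coarse field type.
TEMPLATES = {
    "number": "only import data where {field} is greater than <value>",
    "date": "only import data where {field} is after <date>",
    "text": "only import data where {field} is not equal to <text>",
}


def suggest_prompts(field_types: dict) -> list:
    """Return one natural language prompt per field whose type has a template."""
    return [
        TEMPLATES[ftype].format(field=field)
        for field, ftype in field_types.items()
        if ftype in TEMPLATES
    ]


prompts = suggest_prompts({"class": "text", "score": "number", "photo": "binary"})
```

Fields whose type has no template, such as photo here, simply produce no prompt.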

In this embodiment, to identify abnormal attributes in the data set, a corresponding weight is first assigned to each attribute; the mean and standard deviation of the field values of each attribute are then computed; a confidence interval is set for each attribute from its mean and standard deviation; and an attribute is judged abnormal according to whether its value lies within the confidence interval.
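The text does not state how the interval is derived or how the weights are used. The sketch below therefore makes two explicit assumptions: the interval is mean ± k standard deviations with k = 2, and the weights are summed over the abnormal attributes of a record to give a record-level score. All function names are hypothetical.

```python
from statistics import mean, pstdev


def confidence_intervals(rows: list, fields: list, k: float = 2.0) -> dict:
    """Map each field to (mean - k*std, mean + k*std) over all rows."""
    intervals = {}
    for field in fields:
        values = [row[field] for row in rows]
        mu, sigma = mean(values), pstdev(values)
        intervals[field] = (mu - k * sigma, mu + k * sigma)
    return intervals


def abnormal_fields(row: dict, intervals: dict) -> list:
    """Fields of this row whose value falls outside its confidence interval."""
    return [f for f, (low, high) in intervals.items() if not low <= row[f] <= high]


def anomaly_score(row: dict, intervals: dict, weights: dict) -> float:
    """Assumed use of the weights: sum them over the abnormal fields."""
    return sum(weights[f] for f in abnormal_fields(row, intervals))


rows = [
    {"score": 80, "age": 20},
    {"score": 82, "age": 21},
    {"score": 78, "age": 19},
    {"score": 81, "age": 20},
    {"score": 79, "age": 22},
    {"score": 300, "age": 20},  # the score here is far from the others
]
intervals = confidence_intervals(rows, ["score", "age"])
weights = {"score": 0.7, "age": 0.3}
```

With these sample rows only the score of the last record lies outside its interval, so that record receives the weight assigned to score.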

Example 2

In this embodiment, a reduction algorithm based on attribute importance is used as a logic rule to clean the data attributes: the discernibility matrix is calculated for the decision table S = {U, Q, V, F}, and the core attributes in the discernibility matrix are assigned to the attribute set obtained after attribute reduction, where U, Q, V and F are the attributes of the data.

In this embodiment, the core attributes are removed from the discernibility matrix, leaving the remaining attribute combination items; the occurrence frequency of each condition attribute is calculated and all attribute frequencies are sorted in descending order; the attribute with the highest frequency is selected and denoted a, and Red = Red ∪ {a}; the combination items containing the condition attribute a are deleted from all combination items of the matrix; whether the discernibility matrix is empty is then judged: if it is not empty, the selection and deletion continue, and if it is empty, the procedure ends, where Red is the final reduction result.
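The frequency-based reduction described in this example can be sketched as follows, using the common rough-set reading in which each entry of the discernibility matrix is the set of condition attributes on which two rows with different decisions differ, and core attributes are the single-attribute entries. That reading, the tie-breaking by attribute name and the function names are assumptions, not details given in the text.

```python
from collections import Counter
from itertools import combinations


def discernibility_items(rows: list, condition_attrs: list, decision_attr: str) -> list:
    """One attribute set per pair of rows whose decision values differ."""
    items = []
    for r1, r2 in combinations(rows, 2):
        if r1[decision_attr] != r2[decision_attr]:
            diff = frozenset(a for a in condition_attrs if r1[a] != r2[a])
            if diff:
                items.append(diff)
    return items


def reduce_attributes(rows: list, condition_attrs: list, decision_attr: str) -> set:
    items = discernibility_items(rows, condition_attrs, decision_attr)
    # Core attributes are the entries that contain exactly one attribute.
    red = {next(iter(item)) for item in items if len(item) == 1}
    # Entries already distinguished by a core attribute need no further work.
    items = [item for item in items if not (item & red)]
    while items:  # repeat until the matrix is empty
        freq = Counter(a for item in items for a in item)
        # Highest frequency first; ties broken by name to stay deterministic.
        a = min(freq, key=lambda attr: (-freq[attr], attr))
        red.add(a)  # Red = Red ∪ {a}
        items = [item for item in items if a not in item]
    return red


table = [
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 0, "b": 1, "c": 0, "d": "yes"},
    {"a": 1, "b": 0, "c": 1, "d": "yes"},
    {"a": 1, "b": 1, "c": 1, "d": "yes"},
]
red = reduce_attributes(table, ["a", "b", "c"], "d")
```

For this small table the only core attribute is b; the one remaining entry {a, c} is then covered by selecting a, so the sketch returns the reduct {a, b}.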

In this embodiment, for the data logic constraint rules formulated by the reduction algorithm based on attribute importance, a rule closure set is obtained by a mathematical method, and whether a field value violates a rule constraint is judged automatically, thereby determining the correctness of the logic rules.
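The text does not say which mathematical method produces the rule closure. One possible reading, assumed here purely for illustration, treats each constraint rule as an implication "if these field values hold, then that field value must hold" and computes the closure by forward chaining. The rule format, the example fields and the function names are all hypothetical.

```python
def closure(facts: set, rules: list) -> set:
    """Forward-chain: add a rule's conclusion once all its premises hold."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def violations(record: dict, rules: list) -> list:
    """Derived (field, value) pairs that contradict the record's own values."""
    derived = closure(set(record.items()), rules)
    return sorted(
        ((f, v) for f, v in derived if f in record and record[f] != v),
        key=repr,
    )


# Each rule is (set of premise pairs, conclusion pair).
rules = [
    (frozenset({("level", "undergraduate")}), ("max_years", 6)),
    (frozenset({("max_years", 6)}), ("review", "required")),
]
bad = violations({"level": "undergraduate", "max_years": 8}, rules)
good = violations({"level": "undergraduate", "max_years": 6}, rules)
```

A record is reported as violating a constraint when the closure derives a value for one of its fields that differs from the value actually stored.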

In this embodiment, starting from the classification of the condition attributes, the condition attributes are removed cumulatively and the relative positive regions are calculated and compared, which determines whether a core attribute or any important attribute has been removed. The condition attributes that meet the requirements are then added to the attribute reduction set, and the final attribute reduction set is output.

In this embodiment, the attribute reduction algorithm is improved: the condition attributes are subdivided, and the attribute reduction set is output directly by comparing the relative positive regions obtained after condition attributes are removed. The improved algorithm starts from the classification of the condition attributes, removes condition attributes cumulatively, and calculates and compares the relative positive regions to judge whether a core attribute or any important attribute has been removed. Finally, the condition attributes that meet the requirements are added to the attribute reduction set, and the final attribute reduction set is output.
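The comparison of relative positive regions can be sketched with the standard rough-set definition: the positive region of a set of condition attributes is the set of rows whose equivalence class under those attributes has a single decision value. The sketch below removes attributes one at a time and keeps the removal only when the positive region is unchanged. It shows the general technique, not the improved algorithm of this embodiment, whose details are not given; the function names are hypothetical.

```python
from collections import defaultdict


def positive_region(rows: list, attrs: list, decision: str) -> set:
    """Indices of rows whose equivalence class under attrs has one decision."""
    classes = defaultdict(list)
    for index, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(index)
    region = set()
    for members in classes.values():
        if len({rows[i][decision] for i in members}) == 1:
            region.update(members)
    return region


def positive_region_reduct(rows: list, condition_attrs: list, decision: str) -> list:
    """Cumulatively drop attributes whose removal keeps the positive region."""
    full = positive_region(rows, condition_attrs, decision)
    kept = list(condition_attrs)
    for attr in condition_attrs:
        trial = [a for a in kept if a != attr]
        if positive_region(rows, trial, decision) == full:
            kept = trial  # attr is dispensable: nothing is lost without it
    return kept


table = [
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 0, "b": 1, "c": 0, "d": "yes"},
    {"a": 1, "b": 0, "c": 1, "d": "yes"},
    {"a": 1, "b": 1, "c": 1, "d": "yes"},
]
reduct = positive_region_reduct(table, ["a", "b", "c"], "d")
```

The result depends on the order in which attributes are tried: here a is tested first and found dispensable, after which neither b nor c can be removed without shrinking the positive region.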

Example 3

This embodiment discloses a visual data cleaning system based on natural language, which comprises a visual cleaning process canvas, a natural language conversion module, a server and a memory storing execution instructions, wherein when the server executes the execution instructions stored in the memory, the server executes the natural language-based visual data cleaning method.

Unlike traditional data cleaning systems, in this embodiment the user can draw the data cleaning logic directly on the visual cleaning process canvas of the user interface by dragging, that is, by adding different data cleaning component blocks and connecting the component blocks with data flow paths.
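A canvas of component blocks connected by data flow paths is naturally a directed graph. The following sketch assumes a simple chain of blocks, each a function from rows to rows, and uses the standard library graphlib module (Python 3.9+) to run the blocks in dependency order. The Canvas class and its method names are hypothetical.

```python
from graphlib import TopologicalSorter


class Canvas:
    """Minimal stand-in for the visual cleaning process canvas."""

    def __init__(self):
        self.blocks = {}  # block name -> function(rows) -> rows
        self.paths = {}   # block name -> set of upstream block names

    def add_block(self, name, func):
        self.blocks[name] = func
        self.paths.setdefault(name, set())

    def connect(self, upstream, downstream):
        """Draw a data flow path from one block to another."""
        self.paths[downstream].add(upstream)

    def run(self, rows):
        # static_order() yields upstream blocks first and raises
        # graphlib.CycleError if the paths form a loop.
        for name in TopologicalSorter(self.paths).static_order():
            rows = self.blocks[name](rows)
        return rows


canvas = Canvas()
canvas.add_block("only_passing", lambda rows: [r for r in rows if r["score"] >= 60])
canvas.add_block("drop_missing", lambda rows: [r for r in rows if r["score"] is not None])
canvas.connect("drop_missing", "only_passing")
result = canvas.run([{"score": None}, {"score": 50}, {"score": 75}])
```

Because drop_missing is connected upstream of only_passing, it runs first even though it was added second, so the comparison never sees a missing score.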

When a data cleaning component block is added, the cleaning logic can be described in natural language, for example by entering "the creation date is more than 2019". The natural language conversion module checks and parses the statement input by the user. If parsing succeeds, the statement is converted into a corresponding underlying data filtering query statement and passed to the underlying data cleaning execution module; if parsing fails, an exception status code is returned to the canvas module, which displays the corresponding exception information to prompt the user.
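The conversion step can be illustrated with a very small pattern-based parser. The sketch assumes that the underlying query statement is a parameterized SQL WHERE clause, that only a handful of English comparison phrases are supported, and that 4001 is the exception status code; none of these details come from the original text, and a real natural language engine would be far more capable.

```python
import re

# Assumption: supported comparison phrases and their SQL operators.
OPERATORS = {
    "is greater than": ">",
    "is more than": ">",
    "is less than": "<",
    "is not equal to": "<>",
    "is equal to": "=",
}

PATTERN = re.compile(
    r"^(?:the\s+)?(?P<field>[a-z][a-z ]*?)\s+(?P<op>"
    + "|".join(re.escape(phrase) for phrase in OPERATORS)
    + r")\s+(?P<value>\S.*)$",
    re.IGNORECASE,
)

PARSE_ERROR = 4001  # hypothetical exception status code


def to_filter(statement: str) -> dict:
    """Convert one natural language condition into a parameterized filter."""
    match = PATTERN.match(statement.strip())
    if match is None:
        return {"ok": False, "code": PARSE_ERROR,
                "message": f"cannot parse: {statement!r}"}
    # The pattern only admits letters and spaces, so the column name is safe.
    column = match["field"].strip().lower().replace(" ", "_")
    operator = OPERATORS[match["op"].lower()]
    # The value is passed as a parameter, never spliced into the SQL text.
    return {"ok": True, "sql": f"WHERE {column} {operator} ?",
            "params": [match["value"].strip()]}


parsed = to_filter("the creation date is more than 2019")
failed = to_filter("make it nice")
```

On success the caller would hand the sql and params values to the execution module; on failure the canvas would display the message associated with the returned code.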

In conclusion, by combining a graphical interface with a natural language engine, the invention lets a user clean data through simple operations on the graphical interface, without mastering the development and use of data cleaning tools. This lowers the technical threshold of big data application services, improves the user experience of big data services, and addresses the flexibility and maintainability problems of traditional data cleaning systems. Visualization and natural language interaction reduce the cost of use for data cleaning staff and improve efficiency, giving the invention strong prospects for market application.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A natural language-based visual data cleaning method, characterized in that the method comprises the following steps:

S1: after the system is successfully initialized, specifying, through the server, the connection information for the data source to be cleaned;

S2: after the data source is successfully connected, obtaining the first N pieces of data to be cleaned and parsing their field types and formats;

S3: confirming, through the graphical interface, the data fields to be accessed, performing a first round of screening, and removing useless fields;

S4: entering the natural language cleaning configuration, setting the cleaning modules, and choosing to trigger the data synchronization and cleaning tasks;

S5: reversely parsing the cleaning rules of the cleaning modules into a data cleaning script, and executing the script on the data;

S6: transferring the cleaned data into the analysis library, and repeating the above steps until all data have been cleaned, whereupon cleaning is complete.

2. The natural language-based visual data cleaning method according to claim 1, wherein, in the method, specifying the connection information for the data source to be cleaned comprises: providing server host information, a user name and password, and a database for a remote data source; and providing the corresponding directory and file path for a local data source.

3. The natural language-based visual data cleaning method according to claim 1, characterized in that, in the method, each time a cleaning module is added, the system provides natural language prompts according to the parsed source field information and assists the user in configuring cleaning rules in natural language.

4. The natural language-based visual data cleaning method according to claim 1, wherein, in the method, to identify abnormal attributes in the data set, a corresponding weight is first assigned to each attribute; the mean and standard deviation of the field values of each attribute are then computed; a confidence interval is set for each attribute according to the mean and standard deviation; and whether the attribute is abnormal is judged according to whether the attribute value is within the confidence interval.

5. The natural language-based visual data cleaning method according to claim 1, wherein the method uses a reduction algorithm based on attribute importance as a logic rule to clean data attributes, the discernibility matrix is calculated for the decision table S = {U, Q, V, F}, and the core attributes in the discernibility matrix are assigned to the attribute set obtained after attribute reduction, in which U, Q, V, F are the attributes of the data.

6. The natural language-based visual data cleaning method according to claim 5, characterized in that, in the method, the core attributes are removed from the discernibility matrix, leaving the remaining attribute combination items; the frequency of occurrence of each condition attribute is calculated and all attribute frequencies are arranged in descending order; the attribute with the highest frequency is selected and denoted a, and Red = Red ∪ {a}; the combination items containing the condition attribute a are deleted from all combination items of the matrix; whether the discernibility matrix is empty is judged: if the discernibility matrix is not empty, the selection and deletion continue, and if the discernibility matrix is empty, the procedure ends, where Red is the final reduction result obtained.

7. The natural language-based visual data cleaning method according to claim 1, characterized in that, in the method, for the data logic constraint rules formulated by the reduction algorithm based on attribute importance, a rule closure set is obtained by a mathematical method, and whether a field value violates the rule constraints is determined automatically, thereby determining the correctness of the logic rules.

8. A visual data cleaning system based on natural language, characterized in that it comprises a visual cleaning process canvas, a natural language conversion module, a server and a memory storing execution instructions, wherein when the server executes the execution instructions stored in the memory, the server executes the natural language-based visual data cleaning method according to any one of claims 1 to 7.

9. The natural language-based visual data cleaning system according to claim 8, wherein the visual cleaning process canvas supports drawing the data cleaning logic by dragging, adding different data cleaning component blocks, and connecting the component blocks using data flow paths.

10. The natural language-based visual data cleaning system according to claim 8, wherein, when a data cleaning component block is added, the cleaning logic is described in natural language and the natural language conversion module checks and parses the statement input by the user; if the parsing is successful, the statement is converted into the corresponding underlying data filtering query statement and transmitted to the underlying data cleaning execution module; if the parsing fails, an exception status code is returned to the visual cleaning process canvas, which displays the corresponding exception information to prompt the user.