CN112667617A - Visual data cleaning system and method based on natural language - Google Patents
Visual data cleaning system and method based on natural language
- Publication number
- CN112667617A (application CN202011617367.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- cleaning
- attribute
- natural language
- cleaned
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Stored Programmes (AREA)
Abstract
The invention relates to the technical field of data processing, and in particular to a visual data cleaning system and method based on natural language. A server specifies the connection information of the data source to be cleaned; the first N pieces of data to be cleaned are acquired, their field types and formats are analyzed, and useless fields are removed; a cleaning module is set up, and the data synchronization and cleaning tasks to be triggered are selected; the cleaning rules of the cleaning module are then reverse-parsed into a data cleaning script, the script is executed, and the cleaned data is transferred into an analysis library. These steps are repeated until all data has been cleaned. With the invention, users can clean data without mastering the development and use of data cleaning tools, which lowers the technical threshold of big data application services, improves the user experience of big data services, addresses the limited flexibility and maintainability of traditional data cleaning systems, reduces the cost of use for data cleaning workers, and improves efficiency.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a visual data cleaning system and method based on natural language.
Background
With the development of big data technology in recent years, new analytical means have become available for massive raw logs, internet records, historical data, and the like. Analyzing such massive data can reveal a great deal of valuable information that would otherwise go unnoticed. The first step in big data analysis is to collect the data scattered across various places, clean it, and store the cleaned data in a warehouse. This process is called ETL and involves three steps: Extract (data extraction), Transform (data conversion), and Load (data loading).
In the past, cleaning data from different data sources required different cleaning tools, and different programs and scripts had to be written for each data source. These approaches require the user to master a variety of cleaning tools and to have considerable skill in developing with them. As a result, the threshold for using a data cleaning system is high (expertise in the data source or the cleaning tool has to be learned), and so is the maintenance cost of the data cleaning process.
In the invention document No. CN201710011044.8, a data cleaning method and a data cleaning apparatus are disclosed, the data cleaning method including: acquiring original sample data to be cleaned; determining at least one data screening mechanism for cleaning the original sample data, and acquiring a screening value set by a user for each data screening mechanism according to the original sample data; and screening the original sample data according to the at least one data screening mechanism and the screening value set by the user so as to clean the original sample data. According to the technical scheme, the original sample data can be comprehensively cleaned, the dependence of a data cleaning process on operators can be reduced, the accuracy and the stability of a data cleaning result are ensured, and meanwhile, the data cleaning duration can be effectively shortened.
In an invention document with a patent number of CN201810143012, a data cleaning method and a data cleaning system are disclosed. The data cleaning method comprises the following steps: step S10: selecting a data source to be cleaned from heterogeneous data sources through a graphical interface; the heterogeneous data source comprises a text file and database data; step S11: editing a data cleaning rule through a graphical interface; step S12: data cleansing is performed through a graphical interface. According to the data cleaning method, the data source to be cleaned is selected from the heterogeneous data sources through the graphical interface, fusion cleaning of different data sources can be achieved, meanwhile, a user can clean data through simple operation on the graphical interface, the development and use method of a data cleaning tool does not need to be mastered, the technical threshold of big data application service is lowered, and the user experience of the big data service is improved.
In summary, traditional data cleaning systems mostly rely on script writing combined with configuration files, or on dragging and dropping controls. These approaches are simple to implement, but their learning and maintenance costs are high and their flexibility is low.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention discloses a visual data cleaning system and method based on natural language. It addresses the problem that traditional data cleaning systems mostly rely on script writing combined with configuration files, or on dragging and dropping controls, which are simple to implement but costly to learn and maintain and offer little flexibility.
The invention is realized by the following technical scheme:
In a first aspect, the invention discloses a visual data cleaning method based on natural language, which comprises the following steps:
S1, the system is initialized successfully, and the server specifies the connection information of the data source to be cleaned;
S2, after the data source is connected successfully, the first N pieces of data to be cleaned are acquired, and their field types and formats are analyzed;
S3, the data fields to be imported are confirmed through the graphical interface, a first round of screening is carried out, and useless fields are removed;
S4, the natural language cleaning configuration is entered, a cleaning module is set up, and the data synchronization and cleaning tasks to be triggered are selected;
S5, the cleaning rules of the cleaning module are reverse-parsed into a data cleaning script, and the script is executed on the data;
S6, the cleaned data is transferred into an analysis library, and the above steps are repeated until all data has been cleaned, thereby completing the cleaning.
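By way of non-limiting illustration, the field analysis of step S2 could be sketched in Python as follows. The source does not specify how field types are determined; the function names `infer_value_type` and `infer_field_types`, the type labels, the majority-vote strategy, and the single date format `%Y-%m-%d` are all assumptions made for this sketch.

```python
from collections import Counter
from datetime import datetime


def infer_value_type(value: str) -> str:
    """Classify one raw string as 'empty', 'int', 'float', 'date', or 'text'."""
    text = value.strip()
    if not text:
        return "empty"
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        pass
    try:
        datetime.strptime(text, "%Y-%m-%d")  # assumed date format
        return "date"
    except ValueError:
        return "text"


def infer_field_types(rows: list[dict[str, str]], n: int = 100) -> dict[str, str]:
    """Infer one type per field from the first n rows by majority vote."""
    votes: dict[str, Counter] = {}
    for row in rows[:n]:
        for field, value in row.items():
            kind = infer_value_type(value)
            if kind != "empty":  # empty cells carry no type information
                votes.setdefault(field, Counter())[kind] += 1
    return {field: counter.most_common(1)[0][0] for field, counter in votes.items()}


# Hypothetical sample standing in for the first N records of a data source.
sample = [
    {"id": "1", "score": "88.5", "created": "2019-03-01", "name": "Li"},
    {"id": "2", "score": "91.0", "created": "2020-07-15", "name": "Wang"},
    {"id": "3", "score": "", "created": "2021-01-02", "name": "Zhao"},
]
field_types = infer_field_types(sample, n=3)
```

Sampling only the first N records keeps the analysis cheap, at the cost of possibly misclassifying a field whose later values differ from the sample.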
Further, in the method, specifying the connection information of the data source to be cleaned includes: for a remote data source, providing the server host information, the user name and password, and the database; for a local data source, providing the corresponding directory and file path.
Furthermore, in the method, each time a cleaning module is added, the system gives a natural language prompt according to the analyzed source field information, and assists the user in configuring the cleaning rule by using the natural language.
Furthermore, in the method, abnormal attributes in the data set are identified as follows: a weight is first assigned to each attribute, then the mean and standard deviation of each attribute's field values are computed, a confidence interval is set for each attribute from its mean and standard deviation, and an attribute is judged abnormal according to whether its value falls within the confidence interval.
Furthermore, the method uses a reduction algorithm based on attribute importance as the logic rule for cleaning data attributes: a discernibility matrix is computed for the decision table S = {U, Q, V, F}, and the core attributes in the discernibility matrix are assigned to the attribute set obtained after attribute reduction, where U, Q, V, and F are the attributes of the data.
Furthermore, in the method, the core attributes are removed from the discernibility matrix, and all remaining attribute combination items are retained. The occurrence frequency of each condition attribute is calculated, the attribute frequencies are sorted in descending order, the attribute with the highest frequency is selected as a, and every combination item containing the condition attribute a is deleted from the combination items of the matrix. The matrix is then checked: if it is not empty, combination items containing the selected condition attribute continue to be deleted; if it is empty, the procedure ends, and Red, the attribute set obtained after attribute reduction, is the final reduction result.
Furthermore, in the method, for the data logic constraint rules formulated by the reduction algorithm based on attribute importance, a rule closure set is derived mathematically, and whether a field value violates a rule constraint is judged automatically, thereby determining whether the logic rule is correct.
In a second aspect, the invention discloses a natural language-based visual data cleaning system, which includes a visual cleaning process canvas, a natural language conversion module, a server and a memory storing execution instructions, wherein when the server executes the execution instructions stored in the memory, the server executes the natural language visual data cleaning method of the first aspect.
Furthermore, the visual cleaning process canvas supports drawing the data cleaning logic by dragging and dropping: different data cleaning component blocks are added, and the component blocks are connected by data flow paths.
Furthermore, when a data cleaning component block is added, the cleaning logic is described in natural language, and the natural language conversion module checks and parses the statement entered by the user. If parsing succeeds, the statement is converted into the corresponding underlying data filtering query statement and passed to the underlying data cleaning execution module; if parsing fails, an exception status code is returned to the visual cleaning process canvas, which displays the corresponding exception information to prompt the user.
The invention has the beneficial effects that:
By combining a graphical interface with a natural language engine, the invention lets a user clean data through simple operations on the graphical interface without mastering the development and use of data cleaning tools. This lowers the technical threshold of big data application services, improves the user experience of big data services, and addresses the limited flexibility and maintainability of traditional data cleaning systems. Using visualization technology and natural language interaction reduces the cost of use for data cleaning workers and improves efficiency, giving the invention strong prospects for market application.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram of the principle steps of a visualization data cleansing method based on natural language.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
This embodiment discloses a visual data cleaning method based on natural language, shown in FIG. 1, which includes the following steps:
S1, the system is initialized successfully, and the server specifies the connection information of the data source to be cleaned;
S2, after the data source is connected successfully, the first N pieces of data to be cleaned are acquired, and their field types and formats are analyzed;
S3, the data fields to be imported are confirmed through the graphical interface, a first round of screening is carried out, and useless fields are removed;
S4, the natural language cleaning configuration is entered, a cleaning module is set up, and the data synchronization and cleaning tasks to be triggered are selected;
S5, the cleaning rules of the cleaning module are reverse-parsed into a data cleaning script, and the script is executed on the data;
S6, the cleaned data is transferred into an analysis library, and the above steps are repeated until all data has been cleaned, thereby completing the cleaning.
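Steps S1 to S6 can be pictured as a loop over batches of N records. The following minimal Python sketch is an illustration only and not the implementation described here: the names `batches` and `run_cleaning` are hypothetical, the data source is stood in for by an in-memory list, the cleaning rule is a plain predicate rather than a generated script, and the analysis library is a list.

```python
from typing import Callable, Iterable, Iterator

Row = dict[str, object]


def batches(rows: Iterable[Row], n: int) -> Iterator[list[Row]]:
    """Yield successive lists of at most n rows (S2: take N records at a time)."""
    batch: list[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:
        yield batch


def run_cleaning(
    source: Iterable[Row],
    keep_fields: list[str],
    rule: Callable[[Row], bool],
    sink: list[Row],
    n: int = 2,
) -> int:
    """Process the source batch by batch until it is exhausted; return rows kept."""
    kept = 0
    for batch in batches(source, n):
        for row in batch:
            # S3: drop fields that were not confirmed for import.
            projected = {k: row[k] for k in keep_fields if k in row}
            # S5: apply the cleaning rule (a predicate stands in for the script).
            if rule(projected):
                # S6: transfer the cleaned row to the analysis library.
                sink.append(projected)
                kept += 1
    return kept


# Hypothetical data: 'tmp' is a useless field, rows before 2020 are filtered out.
source = [
    {"id": 1, "year": 2018, "tmp": "x"},
    {"id": 2, "year": 2020, "tmp": "y"},
    {"id": 3, "year": 2021, "tmp": "z"},
]
analysis_library: list[Row] = []
count = run_cleaning(source, ["id", "year"], lambda r: r["year"] > 2019, analysis_library)
```

Working in fixed-size batches bounds memory use regardless of how large the data source is.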
In this embodiment, specifying the connection information of the data source to be cleaned includes: for a remote data source, providing the server host information, the user name and password, and the database; for a local data source, such as an Excel file or a log file, providing the corresponding directory and file path.
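As a hedged illustration, the two kinds of connection information could be modeled as two small record types. The class and field names below (`RemoteSource`, `LocalSource`, `describe`) are hypothetical, and the `port` field is an assumption that the source does not mention.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Union


@dataclass(frozen=True)
class RemoteSource:
    """Connection information for a remote data source."""
    host: str
    port: int  # assumption: the source only mentions "server host information"
    username: str
    password: str
    database: str


@dataclass(frozen=True)
class LocalSource:
    """Location of a local data source such as an Excel file or a log file."""
    directory: Path
    filename: str

    @property
    def path(self) -> Path:
        return self.directory / self.filename


DataSource = Union[RemoteSource, LocalSource]


def describe(source: DataSource) -> str:
    """Return a short label for the source, omitting the password."""
    if isinstance(source, RemoteSource):
        return f"remote:{source.username}@{source.host}:{source.port}/{source.database}"
    return f"local:{source.path.as_posix()}"


remote = RemoteSource("db.example.com", 3306, "etl", "secret", "school")
local = LocalSource(Path("data/logs"), "app.log")
```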
In this embodiment, each time a cleaning module is added, the system provides a natural language prompt based on the analyzed source field information and assists the user in configuring cleaning rules in natural language. An example is a data filtering rule such as: import only data whose class is not equal to professional elective course.
In this embodiment, abnormal attributes in the data set are identified as follows: a weight is first assigned to each attribute, then the mean and standard deviation of each attribute's field values are computed, a confidence interval is set for each attribute from its mean and standard deviation, and an attribute is judged abnormal according to whether its value falls within the confidence interval.
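The interval test can be illustrated with a short Python sketch. It assumes the confidence interval has the form mean ± k × standard deviation with k = 2, which the source does not state, and it leaves out the per-attribute weights because the source does not say how they enter the calculation. The function names are hypothetical.

```python
from statistics import mean, pstdev


def confidence_intervals(rows, fields, k=2.0):
    """Build a (low, high) interval per field as mean -/+ k * standard deviation."""
    intervals = {}
    for field in fields:
        values = [row[field] for row in rows]
        mu, sigma = mean(values), pstdev(values)
        intervals[field] = (mu - k * sigma, mu + k * sigma)
    return intervals


def abnormal_fields(row, intervals):
    """Return the fields of one record whose values fall outside their interval."""
    return [f for f, (low, high) in intervals.items() if not low <= row[f] <= high]


# Hypothetical reference data containing one implausible age.
reference = [{"age": v} for v in (20, 21, 19, 20, 22, 21, 20, 95)]
intervals = confidence_intervals(reference, ["age"], k=2.0)
flagged = abnormal_fields({"age": 95}, intervals)
```

Because the outlier itself inflates the mean and standard deviation, a smaller k or a robust statistic may be preferable on heavily contaminated data; that choice is outside what the source specifies.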
Example 2
In this embodiment, a reduction algorithm based on attribute importance is used as the logic rule for cleaning data attributes: a discernibility matrix is computed for the decision table S = {U, Q, V, F}, and the core attributes in the discernibility matrix are assigned to the attribute set obtained by attribute reduction, where U, Q, V, and F are the attributes of the data.
In this embodiment, the core attributes are removed from the discernibility matrix, and all remaining attribute combination items are retained. The occurrence frequency of each condition attribute is calculated, the attribute frequencies are sorted in descending order, the attribute with the highest frequency is selected as a, and every combination item containing the condition attribute a is deleted from the combination items of the matrix. The matrix is then checked: if it is not empty, combination items containing the selected condition attribute continue to be deleted; if it is empty, the procedure ends, and Red, the attribute set obtained by attribute reduction, is the final reduction result.
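The following Python sketch shows one common reading of a frequency-based reduction over a discernibility matrix. It is an interpretation, not the procedure as specified: it assumes that the core consists of attributes appearing alone in a matrix entry, that entries containing an attribute already in Red are dropped, and that each selected attribute a is added to Red. The function names and the example decision table are hypothetical.

```python
from collections import Counter
from itertools import combinations


def discernibility_items(rows, conditions, decision):
    """For each pair of rows with different decisions, collect the condition
    attributes on which the two rows differ."""
    items = []
    for x, y in combinations(rows, 2):
        if x[decision] != y[decision]:
            diff = frozenset(a for a in conditions if x[a] != y[a])
            if diff:
                items.append(diff)
    return items


def reduce_attributes(rows, conditions, decision):
    """Greedy attribute reduction driven by attribute frequency."""
    items = discernibility_items(rows, conditions, decision)
    # Core attributes: those that appear as a single-attribute entry.
    red = {next(iter(item)) for item in items if len(item) == 1}
    # Entries already covered by the core need no further attention.
    items = [item for item in items if not item & red]
    while items:
        freq = Counter(a for item in items for a in item)
        # Highest frequency first; ties broken by name to stay deterministic.
        a = min(freq, key=lambda name: (-freq[name], name))
        red.add(a)
        items = [item for item in items if a not in item]
    return red


# Hypothetical decision table with three condition attributes.
table = [
    {"outlook": "sunny", "temp": "hot", "windy": "no", "play": "no"},
    {"outlook": "sunny", "temp": "hot", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "temp": "hot", "windy": "no", "play": "yes"},
    {"outlook": "rain", "temp": "mild", "windy": "no", "play": "yes"},
    {"outlook": "rain", "temp": "cool", "windy": "yes", "play": "no"},
]
red = reduce_attributes(table, ["outlook", "temp", "windy"], "play")
```

In this example `outlook` is the only core attribute, and one further attribute is needed to separate the two `rain` rows.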
In this embodiment, for the data logic constraint rules formulated by the reduction algorithm based on attribute importance, a rule closure set is derived mathematically, and whether a field value violates a rule constraint is judged automatically, thereby determining whether the logic rule is correct.
In this embodiment, starting from the classification of the condition attributes, condition attributes are removed cumulatively and the relative positive regions are calculated and compared, which determines whether a core attribute or any other important attribute has been removed. The condition attributes that meet the requirement are then added to the attribute reduction set, and the final attribute reduction set is output.
This embodiment thus improves on the basic attribute reduction algorithm: the condition attributes are subdivided, and the attribute reduction set is output directly by comparing the relative positive regions obtained after condition attributes are removed.
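A minimal Python sketch of a positive-region comparison is given below for illustration. It uses the standard rough-set definition, in which a row belongs to the positive region when every row sharing its condition values has the same decision, and it treats an attribute as removable when dropping it leaves the positive region unchanged. The function names, the removal order, and the example table are assumptions, not details taken from the source.

```python
from collections import defaultdict


def positive_region(rows, attrs, decision):
    """Indices of rows whose equivalence class under attrs has a single decision."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(i)
    pos = set()
    for members in classes.values():
        if len({rows[i][decision] for i in members}) == 1:
            pos.update(members)
    return pos


def reduce_by_positive_region(rows, conditions, decision):
    """Remove condition attributes cumulatively while the positive region is kept."""
    full = positive_region(rows, conditions, decision)
    kept = list(conditions)
    for a in conditions:
        trial = [c for c in kept if c != a]
        if positive_region(rows, trial, decision) == full:
            kept = trial  # a is dispensable, so the removal accumulates
    return kept


# Hypothetical decision table with three condition attributes.
table = [
    {"outlook": "sunny", "temp": "hot", "windy": "no", "play": "no"},
    {"outlook": "sunny", "temp": "hot", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "temp": "hot", "windy": "no", "play": "yes"},
    {"outlook": "rain", "temp": "mild", "windy": "no", "play": "yes"},
    {"outlook": "rain", "temp": "cool", "windy": "yes", "play": "no"},
]
reduct = reduce_by_positive_region(table, ["outlook", "temp", "windy"], "play")
```

The result depends on the order in which attributes are tried, since a decision table can have more than one valid reduction set.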
Example 3
This embodiment discloses a visual data cleaning system based on natural language, which comprises a visual cleaning process canvas, a natural language conversion module, a server, and a memory storing execution instructions. When the server executes the execution instructions stored in the memory, it performs the natural language-based visual data cleaning method.
Unlike a traditional data cleaning system, this embodiment lets the user draw the data cleaning logic directly on the visual cleaning process canvas by dragging and dropping in the user interface, that is, by adding different data cleaning component blocks and connecting them with data flow paths.
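One way to picture the canvas is as a directed graph in which component blocks are nodes and data flow paths are edges. The sketch below, which is an assumption rather than the described implementation, derives an execution order for the blocks with the standard-library `graphlib` module (Python 3.9 or later); the block names and the function `execution_order` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical component blocks placed on the canvas.
blocks = {
    "source": "read the student course table",
    "filter": "keep rows matching the natural language rule",
    "dedupe": "drop duplicate rows",
    "sink": "write to the analysis library",
}

# Data flow paths drawn between blocks, as (upstream, downstream) pairs.
paths = [("source", "filter"), ("filter", "dedupe"), ("dedupe", "sink")]


def execution_order(blocks, paths):
    """Order blocks so that each one runs after all of its upstream blocks."""
    predecessors = {name: set() for name in blocks}
    for upstream, downstream in paths:
        predecessors[downstream].add(upstream)
    # static_order() raises graphlib.CycleError if the paths form a loop.
    return list(TopologicalSorter(predecessors).static_order())


order = execution_order(blocks, paths)
```

Rejecting cyclic paths at this stage gives the canvas a natural place to report an invalid drawing before any data is processed.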
When a data cleaning component block is added, the cleaning logic can be described in natural language, for example by entering "the creation date is greater than 2019". The natural language conversion module checks and parses the statement entered by the user. If parsing succeeds, the statement is converted into the corresponding underlying data filtering query statement and passed to the underlying data cleaning execution module; if parsing fails, an exception status code is returned to the canvas module, which displays the corresponding exception information to prompt the user.
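The parse-then-convert behavior can be illustrated with a deliberately small Python sketch. The source does not describe the grammar, the query language, or the status codes, so everything here is assumed: a fixed set of English comparison phrases, a SQL-like filter expression as output, and the codes 0, 400, and 404. A production system would also pass values as query parameters rather than embedding them in the query text.

```python
import re

# Assumed phrase-to-operator table; not taken from the source.
OPERATORS = {
    "is greater than": ">",
    "is less than": "<",
    "is not equal to": "<>",
    "is equal to": "=",
}
PATTERN = re.compile(
    r"^(?P<field>.+?)\s+(?P<op>"
    + "|".join(re.escape(phrase) for phrase in OPERATORS)
    + r")\s+(?P<value>.+)$",
    re.IGNORECASE,
)


def translate(sentence, known_fields):
    """Convert one sentence into a filter expression, or return an error code."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return {"code": 400, "error": "cannot parse sentence"}
    field = match.group("field").lower()
    if field not in known_fields:
        return {"code": 404, "error": f"unknown field: {field}"}
    value = match.group("value")
    if re.fullmatch(r"-?\d+(\.\d+)?", value):
        literal = value  # numeric literal, used as-is
    else:
        literal = "'" + value.replace("'", "''") + "'"  # quoted string literal
    operator = OPERATORS[match.group("op").lower()]
    return {"code": 0, "query": f"{known_fields[field]} {operator} {literal}"}


# Hypothetical mapping from display names to column names.
fields = {"creation date": "creation_date", "class": "class_name"}
ok = translate("creation date is greater than 2019", fields)
bad = translate("drop the weird rows", fields)
```

Returning a status code alongside the message lets the canvas decide how to present the failure without parsing error text.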
In conclusion, by combining a graphical interface with a natural language engine, the invention lets a user clean data through simple operations on the graphical interface without mastering the development and use of data cleaning tools. This lowers the technical threshold of big data application services, improves the user experience of big data services, and addresses the limited flexibility and maintainability of traditional data cleaning systems. Using visualization technology and natural language interaction reduces the cost of use for data cleaning workers and improves efficiency, giving the invention strong prospects for market application.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A visual data cleaning method based on natural language is characterized by comprising the following steps:
S1, the system is initialized successfully, and the server specifies the connection information of the data source to be cleaned;
S2, after the data source is connected successfully, the first N pieces of data to be cleaned are acquired, and their field types and formats are analyzed;
S3, the data fields to be imported are confirmed through the graphical interface, a first round of screening is carried out, and useless fields are removed;
S4, the natural language cleaning configuration is entered, a cleaning module is set up, and the data synchronization and cleaning tasks to be triggered are selected;
S5, the cleaning rules of the cleaning module are reverse-parsed into a data cleaning script, and the script is executed on the data;
S6, the cleaned data is transferred into an analysis library, and the above steps are repeated until all data has been cleaned, thereby completing the cleaning.
2. The visual data cleaning method based on natural language according to claim 1, wherein, the method for specifying the data source related connection information needing cleaning comprises providing server host information, user name and password and a database for the remote data source; corresponding directory and file paths are provided for the local data sources.
3. The visual data cleaning method based on natural language as claimed in claim 1, wherein, in the method, each time a cleaning module is added, the system gives a natural language prompt according to the analyzed source field information to assist the user to configure the cleaning rule by using the natural language.
4. The visual data cleaning method based on natural language as claimed in claim 1, wherein in the method, the identification of abnormal attributes in the data set is performed by firstly giving corresponding weight to each attribute, then counting the average value and standard deviation of each attribute field value, setting a confidence interval for each attribute according to the weight, and judging whether the attribute is abnormal according to whether the attribute value is in the confidence interval.
5. The visual data cleaning method based on natural language according to claim 1, wherein the method uses a reduction algorithm based on attribute importance as the logic rule for cleaning data attributes, computes a discernibility matrix for the decision table S = {U, Q, V, F}, and assigns the core attributes in the discernibility matrix to the attribute set obtained by attribute reduction, wherein U, Q, V, and F are the attributes of the data.
6. The visual data cleaning method based on natural language according to claim 5, wherein, in the method, the core attributes are removed from the discernibility matrix, and all remaining attribute combination items are retained; the occurrence frequency of each condition attribute is calculated, the attribute frequencies are sorted in descending order, the attribute with the highest frequency is selected as a, and every combination item containing the condition attribute a is deleted from the combination items of the matrix; and whether the matrix is empty is judged: if not, combination items containing the selected condition attribute continue to be deleted, and if so, the procedure ends, wherein Red is the finally obtained reduction result.
7. The visualized data cleaning method based on natural language as claimed in claim 1, wherein in the method, a rule closure set is obtained by a mathematical method for a data logic constraint rule formulated by a reduction algorithm based on attribute importance, and whether a field value violates a rule constraint is automatically judged, so as to judge whether the logic rule is correct or incorrect.
8. A natural language based visual data cleansing system, comprising a visual cleansing flow canvas, a natural language conversion module, a server and a memory storing execution instructions, wherein when the execution instructions stored in the memory are executed by the server, the server executes the natural language visual data cleansing method according to any one of claims 1 to 7.
9. The natural language based visual data cleansing system of claim 8 wherein the visual cleansing process canvas supports cleansing logic for drawing data in a drag-and-drop manner, adding different data cleansing component blocks, and connecting component blocks using data flow paths.
10. The visual data cleaning system based on natural language according to claim 8, wherein the natural language conversion module uses natural language description cleaning logic to check and analyze the statement input by the user when adding the data cleaning component block, and if the analysis is successful, the statement is converted into a corresponding bottom layer data filtering query statement and transmitted to the bottom layer data cleaning execution module; and if the analysis fails, returning an abnormal state code to the visual cleaning process canvas to display corresponding abnormal information to prompt a user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011617367.XA CN112667617A (en) | 2020-12-30 | 2020-12-30 | Visual data cleaning system and method based on natural language |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011617367.XA CN112667617A (en) | 2020-12-30 | 2020-12-30 | Visual data cleaning system and method based on natural language |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112667617A (en) | 2021-04-16 |
Family
ID=75412038
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011617367.XA Pending CN112667617A (en) | 2020-12-30 | 2020-12-30 | Visual data cleaning system and method based on natural language |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112667617A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113138982A (en) * | 2021-05-25 | 2021-07-20 | 黄柱挺 | Big data cleaning method |
-
2020
- 2020-12-30 CN CN202011617367.XA patent/CN112667617A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Baier et al. | Matching events and activities by integrating behavioral aspects and label analysis | |
CN103827853B (en) | Method and system for minimizing rule sets | |
CN109918437A (en) | Distributed data processing method, apparatus and data assets management system | |
EP2250589A2 (en) | Systems and methods for mapping enterprise data | |
CN112558931A (en) | Intelligent model construction and operation method for user workflow mode | |
CN108052542B (en) | Multidimensional data analysis method based on presto data | |
CN112000656A (en) | Intelligent data cleaning method and device based on metadata | |
US20140324908A1 (en) | Method and system for increasing accuracy and completeness of acquired data | |
CN117667702A (en) | Knowledge graph-based software testing method, device, equipment and storage medium | |
CN114996331B (en) | Data mining control method and system | |
CN111695979A (en) | Method, device and equipment for analyzing relation between raw material and finished product | |
CN112416800A (en) | Intelligent contract testing method, device, equipment and storage medium | |
CN116244367A (en) | Visual big data analysis platform based on multi-model custom algorithm | |
Baier et al. | Matching of events and activities-an approach using declarative modeling constraints | |
CN115454702A (en) | Log fault analysis method and device, storage medium and electronic equipment | |
CN115657890A (en) | PRA robot customizable method | |
CN115576834A (en) | Software test multiplexing method, system, terminal and medium for supporting fault recovery | |
CN117333012A (en) | Financial risk tracking management system, device and storage medium based on data mining | |
US11423045B2 (en) | Augmented analytics techniques for generating data visualizations and actionable insights | |
CN112667617A (en) | Visual data cleaning system and method based on natural language | |
CN117519656A (en) | Software development system based on intelligent manufacturing | |
CN117792882A (en) | Communication network fault log analysis method based on large language model assistance | |
Yano et al. | A practical approach to automated business process discovery | |
US20200327125A1 (en) | Systems and methods for hierarchical process mining | |
CN112416918A (en) | Data management system and working method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||