CN112463782B - Data cleaning method and system based on optimized edit distance

Info

Publication number: CN112463782B
Application number: CN202011406088.9A
Authority: CN
Inventors: 金震; 李明; 王兆君; 曹朝辉; 杨海建
Original assignee: Beijing SunwayWorld Science and Technology Co Ltd
Current assignee: Beijing SunwayWorld Science and Technology Co Ltd
Priority date: 2020-12-03
Filing date: 2020-12-03
Publication date: 2022-03-18
Anticipated expiration: 2040-12-03
Also published as: CN112463782A

Abstract

The invention provides a data cleaning method and system based on optimized edit distance. The method comprises the following steps: acquiring a plurality of pieces of original data through a preset method, and performing semantic recognition on each piece of original data; receiving a query keyword input by a user; performing online Excel data cleaning on each piece of original data based on the query keyword according to a preset algorithm and/or a preset mapping relation and/or a semantic recognition result; and, during the data cleaning process, allowing the user to intervene manually. The data cleaning method and system based on optimized edit distance greatly improve cleaning efficiency, lower the threshold for data cleaning, and improve the accuracy of data cleaning.

Description

Data cleaning method and system based on optimized edit distance

Technical Field

The invention relates to the technical field of data cleaning, in particular to a data cleaning method and system based on optimized edit distance.

Background

At present, most traditional data cleaning methods use an approximate string matching algorithm based on edit distance. Most of these methods are signature-based and use an index structure to support approximate string matching. However, as the amount of information to be cleaned increases, such approximate string matching algorithms often produce matching errors, which reduces data cleaning efficiency.

Disclosure of Invention

The invention aims to provide a data cleaning method and system based on optimized edit distance.

The data cleaning method based on the optimized edit distance provided by the embodiment of the invention comprises the following steps:

acquiring a plurality of original data through a preset method, and performing semantic recognition on each original data;

receiving a query keyword input by a user;

performing online Excel data cleaning on each original data based on the query keyword according to a preset algorithm and/or a preset mapping relation and/or a semantic recognition result;

and in the data cleaning process, receiving a setting instruction input by a user and executing corresponding operation.

Preferably, the obtaining of the plurality of original data by a preset method specifically includes:

extracting the original data from the preset service system,

and/or

and accessing one or more of an ODBC data source, an XML data source, an Excel table and a text report to obtain the original data.

Preferably, the online Excel data cleaning is performed on each original data based on the query keyword according to a preset algorithm and/or a preset mapping relation and/or a semantic recognition result, and specifically includes:

judging whether each original data is matched with the query keyword according to a preset algorithm, and if so, outputting;

and/or

judging whether each original data is matched with the query keyword according to a preset mapping relation, and if so, outputting;

and/or

judging whether each original data is matched with the query keyword or not according to a semantic recognition result, and outputting the result if the original data is matched with the query keyword;

the processes all adopt an Excel-like processing mode.

Preferably, the receiving a setting instruction input by a user and executing a corresponding operation specifically includes:

in the data cleaning process, receiving an instruction which is input by a user and used for setting the value list range and the upper and lower limit values of the standard template attribute, and executing corresponding operation;

and/or

receiving a setting instruction of a user for designating a category and assigning a responsible person for the original data in batches, and executing corresponding operation;

and/or

and receiving an instruction set by a user for one or more combinations of the cleaning rule, the matching rule and the matching strategy, and executing corresponding operation.

Preferably, the data cleansing method based on the optimized edit distance further includes:

receiving a definition operation input by a user to generate a check item list or generating the check item list according to a preset generation rule at intervals of a preset time;

the list of check items includes: one or more of a preset algorithm, a preset mapping relation and a semantic recognition result are combined;

selecting any inspection item from the inspection item list as a target inspection item;

acquiring a historical operating record in a preset historical database;

traversing each group of records in the historical operating records, and selecting a record combination related to the target inspection item as a record to be sorted;

calculating the sorting index of each record of the records to be sorted:

where σ_i is the sorting index of the i-th record among the records to be sorted; t_i is the service time corresponding to the target inspection item in the i-th record; T_i is the total duration corresponding to the i-th record; e_i is the total number of inspection items involved in the i-th record; r_0 is the experience weight corresponding to the target inspection item; n is the total number of records among the records to be sorted; k is the total number of records in the historical operating records; j_1, j_2, j_3 and j_4 are preset weights; and τ is a preset error coefficient;

sorting all of the records to be sorted in descending order of sorting index, and selecting the first γ records to form the sequence to be checked;

calculating the check value of each record of the sequence to be checked:

where μ_c is the check value of the c-th record in the sequence to be checked; p_c is the total data cleaning duration in the c-th record; E_c is the total number of inspection items involved in the c-th record; ρ is a preset determination coefficient; A_c is the number of processing operations performed in the c-th record; B_c^m is the time taken by the m-th processing operation that uses the target inspection item in the c-th record; D is the total number of processing operations that use the target inspection item in the c-th record; and L_c is the size of the original data in the c-th record;

when the number of records in the sequence to be checked whose check value is greater than or equal to a preset check threshold is greater than or equal to a preset number threshold, adding the target inspection item to a selectable list; otherwise, adding it to a non-selectable list;

sorting the inspection items in the inspection item list in descending order of their corresponding check values, and selecting the first μ inspection items to form a recommendation list;

and, in the data cleaning process, if the inspection item selected through the user's manual intervention is in the non-selectable list, reminding the user and pushing the recommendation list to the user for reference.

Preferably, the number threshold is adjusted according to the following preset method, including:

where V is the adjusted number threshold; int is the integer-taking function; ε is a preset adjustment coefficient; V_0 is the number threshold before adjustment; N is the total number of inspection items in the inspection item list; k is the total number of records in the historical operating records; Z_h is the number of record groups in the historical operating records that involve the h-th inspection item in the inspection item list; w is a preset separation threshold; G is a preset comparison threshold; max is the maximum function; min is the minimum function; "and" denotes logical conjunction; and "or" denotes logical disjunction.

The embodiment of the invention provides a data cleaning system based on optimized edit distance, which comprises:

the acquisition module acquires a plurality of original data through a preset method;

the semantic recognition module is used for performing semantic recognition on each original data;

the first receiving module is used for receiving the query keywords input by the user;

the data cleaning module is used for cleaning each original data on line Excel data based on the query keywords according to a preset algorithm and/or a preset mapping relation and/or a semantic recognition result;

and the second receiving module is used for receiving a setting instruction input by a user and executing corresponding operation in the data cleaning process.

Preferably, the obtaining module performs operations including:

extracting the original data from the preset service system,

and/or

Preferably, the data cleansing module performs operations including:

and/or

the processes all adopt an Excel-like processing mode.

Preferably, the second receiving module performs operations including:

and/or

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow chart of a method for data cleansing based on optimized edit distance in an embodiment of the present invention;

FIG. 2 is a diagram illustrating an optimized edit distance based data cleansing system according to an embodiment of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

The embodiment of the invention provides a data cleaning method based on optimized edit distance, as shown in fig. 1, comprising the following steps:

receiving a query keyword input by a user;

The working principle of the technical scheme is as follows:

The original data are imported (acquired) through a preset method. First, semantic recognition is performed on each piece of original data. Second, the user inputs a query keyword through an operation client. Then, data matching the query keyword are searched for in the original data according to a preset algorithm and/or a preset mapping relation and/or the semantic recognition result, and the matching data are output (that is, data cleaning); the data cleaning process uses an Excel-like mode of operation. During data cleaning, setting instructions input by the user are received and the corresponding operations are executed, so the user can set the parameters of the data cleaning process through the operation client. During semantic recognition, each piece of original data is compared with the models in a preset semantic database to recognize its semantics; semantic recognition of both structured and unstructured data is supported. The preset algorithm is specifically the V-chunk-gram algorithm based on variable-length signatures, which assigns different signatures to the strings in the string set and to the query string while using a variable-length signature dictionary; it optimizes the traditional edit-distance-based chunk-gram algorithm and can obtain higher-quality chunks. The preset mapping relation is, for example, as follows: the system establishes a mapping relationship between two written forms of a place name such as "east city of Beijing city"; when the query keyword is one of the forms, the system outputs the other form that matches it. Online Excel specifically means processing data online in a manner similar to an Excel table, comparable to the online Excel tool in Tencent Docs.
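The matching step described above can be illustrated with a minimal sketch. It uses the classic Levenshtein edit distance as a stand-in, not the V-chunk-gram algorithm itself, whose details are not given here. The function names, the `max_distance` parameter, and the sample strings are all hypothetical.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def clean(records: List[str], keyword: str,
          mapping: Dict[str, str], max_distance: int = 2) -> List[str]:
    """Return records close to the keyword or to its mapped form.

    Assumption: a record 'matches' when its edit distance to a target
    string is at most max_distance.
    """
    targets = {keyword}
    if keyword in mapping:  # preset mapping relation
        targets.add(mapping[keyword])
    return [r for r in records
            if any(edit_distance(r, t) <= max_distance for t in targets)]


# Hypothetical sample data
records = ["Beijing Dongcheng", "Beijng Dongcheng", "Shanghai Pudong"]
mapping = {"BJ Dongcheng": "Beijing Dongcheng"}
matches = clean(records, "BJ Dongcheng", mapping, max_distance=1)
```

In this sketch the mapping relation widens the set of target strings before the approximate comparison runs, so a misspelled record can still be found through the mapped form of the keyword.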

The beneficial effects of the above technical scheme are: with the approximate string matching technique based on optimized edit distance, the V-chunk-gram algorithm based on variable-length signatures matches objects more accurately and increases data cleaning accuracy as the amount of information keeps growing. The data cleaning process also uses online Excel with the full Excel mode of operation, which greatly improves cleaning efficiency and lowers the threshold for data cleaning. In addition, the user can set the parameters of the data cleaning process while cleaning is under way, which improves the user experience, raises the level of intelligence of data cleaning, and reduces its operational complexity.

The embodiment of the invention provides a data cleaning method based on optimized edit distance, which obtains a plurality of original data through a preset method, and specifically comprises the following steps:

extracting the original data from the preset service system,

and/or

The working principle of the technical scheme is as follows:

original data can be directly extracted from a business system; the ODBC (Open Database Connectivity) data source can be directly accessed to obtain the original data; the XML data source can be directly accessed to obtain the original data; the Excel table can be directly accessed to obtain original data; the text report can be directly accessed to obtain the original data; the raw data can also be obtained by a combination of the above methods.
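As a rough illustration of combining several sources, the sketch below reads records from an XML document and from a comma-separated text report using only the Python standard library, then concatenates them. ODBC and Excel access are omitted because they would need third-party drivers. The function names, the tag name, and the sample inputs are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET
from typing import List


def load_from_xml(xml_text: str, tag: str) -> List[str]:
    """Collect the non-empty text of every element named `tag`."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(tag)
            if el.text and el.text.strip()]


def load_from_text(report: str, column: int = 0) -> List[str]:
    """Read one column of a comma-separated text report, skipping blanks."""
    reader = csv.reader(io.StringIO(report))
    return [row[column].strip() for row in reader
            if len(row) > column and row[column].strip()]


def acquire(*sources: List[str]) -> List[str]:
    """Combine records from several sources, keeping their order."""
    combined: List[str] = []
    for source in sources:
        combined.extend(source)
    return combined


# Hypothetical inputs
xml_text = "<rows><addr>A</addr><addr> B </addr><addr></addr></rows>"
report = "x,1\ny,2\n\n"
raw = acquire(load_from_xml(xml_text, "addr"), load_from_text(report))
```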

The beneficial effects of the above technical scheme are: the embodiment of the invention can obtain the original data through a plurality of methods, can meet more application scenes and improve the user experience.

The embodiment of the invention provides an optimized edit distance-based data cleaning method, which is used for performing online Excel data cleaning on each original data based on a query keyword according to a preset algorithm and/or a preset mapping relation and/or a semantic recognition result, and specifically comprises the following steps:

and/or

the processes all adopt an Excel-like processing mode.

The working principle of the technical scheme is as follows:

The similarity between each piece of original data and the query keyword input by the user is calculated according to the preset algorithm, and the original data are sorted in descending order of similarity and displayed as the query result. Whether each piece of original data matches the query keyword can also be determined according to the mapping relation; matching data are output and displayed as the query result. Likewise, whether each piece of original data matches the query keyword can be determined according to the semantic recognition result; matching data are output and displayed as the query result.
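The similarity ranking described above might look like the following sketch. It uses `difflib.SequenceMatcher` from the Python standard library as an assumed similarity measure in place of the preset algorithm; the `min_similarity` cut-off and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in for the preset algorithm."""
    return SequenceMatcher(None, a, b).ratio()


def rank_by_similarity(records: List[str], keyword: str,
                       min_similarity: float = 0.0) -> List[Tuple[str, float]]:
    """Score every record against the keyword and sort in descending order."""
    scored = [(r, similarity(r, keyword)) for r in records]
    kept = [pair for pair in scored if pair[1] >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data
ranked = rank_by_similarity(["abxx", "zzzz", "abcd"], "abcd")
```

Records that fall below `min_similarity` are dropped before sorting, so the displayed query result contains only plausible matches, most similar first.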

The beneficial effects of the above technical scheme are: according to the embodiment of the invention, the online Excel data cleaning is carried out on each original data based on the query keywords according to the preset algorithm and/or the preset mapping relation and/or the semantic recognition result, the data cleaning efficiency is improved, and the data cleaning threshold is reduced.

The embodiment of the invention provides a data cleaning method based on optimized edit distance, which receives a setting instruction input by a user and executes corresponding operation, and specifically comprises the following steps:

and/or

The working principle of the technical scheme is as follows:

During data cleaning, the user can supplement the value list range and the upper and lower limit values of the standard template attributes. The user can assign categories and responsible persons to the original data in batches, define cleaning rules in batches and clean data in batches by running those rules, and define matching strategies, matching rules, data characteristics and the like. The value list range and the upper and lower limit values of the standard template attributes are specifically as follows: when Excel-like data processing is performed online, the attributes of the data processing template are defined (for example, with rows and columns, the list range is the number of rows and columns, and the upper and lower limit values are the maximum and minimum values of the rows and columns). The category of the original data is specified (for example, address data or weather data), a responsible person is assigned (for example, "provided by XXX company"), a cleaning rule is selected (a preset algorithm, a preset mapping relation, or a semantic recognition result), and a matching rule (for example, one-to-one or one-to-many matching) and a matching strategy (for example, using two cleaning rules for data cleaning, that is, matching in two ways) are set.
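One possible way to represent these user settings is sketched below. The class names, field names, and rule identifiers (`algorithm`, `mapping`, `semantic`, `one_to_one`, `one_to_many`) are hypothetical labels chosen for illustration; they are not taken from the system described here.

```python
from dataclasses import dataclass, field
from typing import Dict, List

KNOWN_RULES = {"algorithm", "mapping", "semantic"}   # assumed identifiers
KNOWN_MATCHING = {"one_to_one", "one_to_many"}       # assumed identifiers


@dataclass
class TemplateAttribute:
    """A template attribute (e.g. rows or columns) with its limits."""
    name: str
    lower: int
    upper: int

    def within(self, value: int) -> bool:
        return self.lower <= value <= self.upper


@dataclass
class CleaningSettings:
    """Settings the user may change while cleaning is running."""
    attributes: List[TemplateAttribute] = field(default_factory=list)
    cleaning_rules: List[str] = field(default_factory=lambda: ["algorithm"])
    matching_rule: str = "one_to_one"

    def validate(self) -> None:
        unknown = set(self.cleaning_rules) - KNOWN_RULES
        if unknown:
            raise ValueError(f"unknown cleaning rules: {sorted(unknown)}")
        if self.matching_rule not in KNOWN_MATCHING:
            raise ValueError(f"unknown matching rule: {self.matching_rule}")


def assign_batch(records: List[str], category: str,
                 owner: str) -> List[Dict[str, str]]:
    """Attach the same category and responsible person to every record."""
    return [{"value": r, "category": category, "owner": owner}
            for r in records]


# A matching strategy that uses two cleaning rules together
settings = CleaningSettings(
    attributes=[TemplateAttribute("rows", 1, 1000)],
    cleaning_rules=["algorithm", "mapping"],
    matching_rule="one_to_many",
)
settings.validate()
tagged = assign_batch(["record 1", "record 2"], "address data", "XXX company")
```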

The beneficial effects of the above technical scheme are: the embodiment of the invention can support the user to set each parameter of the data cleaning process in the data cleaning process, further improves the data cleaning efficiency and improves the user experience.

The embodiment of the invention provides a data cleaning method based on optimized edit distance, which further comprises the following steps:

acquiring a historical operating record in a preset historical database;

calculating the sorting index of each record of the records to be sorted:

calculating the check value of each record of the sequence to be checked:

The working principle of the technical scheme is as follows:

Through the operation client, the user can select the list of inspection items to be checked; the system can also generate this list automatically. The inspection items may be cleaning rules (a preset algorithm, a mapping relation, or a semantic recognition result), matching rules, and the like. The preset historical database stores the records of data cleaning performed by the system. The start duration (the service time of the target inspection item) is, for example, as follows: during data cleaning, the user manually intervenes to switch to cleaning according to the preset algorithm, and the time elapsed since the start of data cleaning is the start duration. The total number of inspection items involved in a record is, for example, as follows: if data cleaning used both the preset algorithm and the mapping relation, the total number is 2. The records are ranked by the calculated sorting index, and the first γ records form the sequence to be checked. A processing operation specifically means comparing one piece of original data with the query keyword input by the user by means of the inspection item and judging whether they match; the processing time is the time taken to make that judgment. When the number of records in the sequence to be checked whose check value is greater than or equal to the preset check threshold is greater than or equal to the preset number threshold, the inspection item has passed the check and is worth recommending, so it is added to the selectable list; otherwise, it is added to the non-selectable list. Meanwhile, all inspection items in the inspection item list are sorted in descending order of their corresponding check values to form the recommendation list. During data cleaning, when the user intervenes manually through the operation client and defines (selects) an inspection item that is in the non-selectable list, the user is reminded and the recommendation list is pushed to the user.
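The classification and recommendation steps can be sketched as follows, taking the per-record check values as given inputs, since their computation is not reproduced here. Scoring each item by the mean of its check values is an assumption made for this sketch, and all names and numbers are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def classify_items(check_values: Dict[str, List[float]],
                   check_threshold: float,
                   count_threshold: int) -> Tuple[List[str], List[str]]:
    """Split inspection items into selectable and non-selectable lists."""
    selectable: List[str] = []
    non_selectable: List[str] = []
    for item, values in check_values.items():
        passing = sum(1 for v in values if v >= check_threshold)
        if passing >= count_threshold:
            selectable.append(item)
        else:
            non_selectable.append(item)
    return selectable, non_selectable


def recommend(check_values: Dict[str, List[float]], top_n: int) -> List[str]:
    """Rank items in descending order of score and keep the first top_n.

    Assumption: an item's score is the mean of its per-record check values.
    """
    scores = {item: sum(v) / len(v) for item, v in check_values.items() if v}
    return sorted(scores, key=lambda item: scores[item], reverse=True)[:top_n]


def review_choice(chosen: str, non_selectable: List[str],
                  recommendation: List[str]) -> Optional[str]:
    """Return a reminder when the user picks a non-selectable item."""
    if chosen in non_selectable:
        return (f"'{chosen}' did not pass the check; "
                f"consider: {', '.join(recommendation)}")
    return None


# Hypothetical check values for three inspection items
values = {
    "algorithm": [0.9, 0.8, 0.4],
    "mapping": [0.3, 0.2, 0.9],
    "semantic": [0.8, 0.8, 0.8],
}
selectable, non_selectable = classify_items(values, 0.6, 2)
recommendation = recommend(values, top_n=2)
reminder = review_choice("mapping", non_selectable, recommendation)
```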

The beneficial effects of the above technical scheme are: the embodiment of the invention checks, either at the user's request or automatically, whether each inspection item in the system is selectable, and ranks the inspection items. When the user manually intervenes in the data cleaning process and is about to select an inspection item outside the selectable list, the system reminds the user, pushes the recommendation list to the user, and suggests selecting the inspection items near the top of the recommendation list to intervene in the data cleaning process.

The embodiment of the invention provides a data cleaning method based on optimized edit distance, wherein the number threshold is adjusted according to the following preset method, and the method comprises the following steps:

The working principle of the technical scheme is as follows:

As time goes on, the historical database keeps growing. The number threshold adapts to the continual updating of the historical database (that is, to the increase in the number of record groups in the historical records and in the total number of inspection items involved in each record). The number threshold before adjustment is generally the number threshold currently set by the system.

The beneficial effects of the above technical scheme are: according to the embodiment of the invention, the number threshold value set by the system is adaptively adjusted according to the increase of the number of record groups in the historical records and the increase of the total number of the inspection items in each record, so that the accuracy of self-inspection of the system is improved, the working efficiency of the system is improved, and meanwhile, the system is more intelligent.

The working principle of the technical scheme is as follows:

The embodiment of the invention provides a data cleaning system based on optimized edit distance, wherein an acquisition module executes the following operations:

extracting the original data from the preset service system,

and/or

The working principle of the technical scheme is as follows:

The embodiment of the invention provides a data cleaning system based on optimized edit distance, wherein the data cleaning module executes the following operations:

and/or

the processes all adopt an Excel-like processing mode.

The working principle of the technical scheme is as follows:

The embodiment of the invention provides a data cleaning system based on optimized edit distance, and the second receiving module executes the following operations:

and/or

The working principle of the technical scheme is as follows:

The embodiment of the invention provides a data cleaning system based on optimized edit distance, which further comprises:

the self-adaptive detection module is used for checking whether the inspection item is selectable;

the adaptive detection module performs operations comprising:

acquiring a historical operating record in a preset historical database;

calculating the sorting index of each record of the records to be sorted:

calculating the check value of each record of the sequence to be checked:

where μ_c is the check value of the c-th record in the sequence to be checked; p_c is the total data cleaning duration in the c-th record; E_c is the total number of inspection items involved in the c-th record; ρ is a preset determination coefficient; A_c is the number of processing operations performed in the c-th record; B_c^m is the time taken by the m-th processing operation that uses the target inspection item in the c-th record; D is the total number of processing operations that use the target inspection item in the c-th record; and L_c is the size of the original data in the c-th record;

The working principle of the technical scheme is as follows:

The embodiment of the invention provides a data cleaning system based on optimized edit distance, wherein an adaptive detection module adjusts the number threshold according to the following preset method, and the method comprises the following steps:

where V is the adjusted number threshold; int is the integer-taking function; ε is a preset adjustment coefficient; V_0 is the number threshold before adjustment; N is the total number of inspection items in the inspection item list; k is the total number of records in the historical operating records; Z_h is the number of record groups in the historical operating records that involve the h-th inspection item in the inspection item list; w is a preset separation threshold; G is a preset comparison threshold; max is the maximum function; min is the minimum function; "and" denotes logical conjunction; and "or" denotes logical disjunction.

The working principle of the technical scheme is as follows:

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A data cleaning method based on optimized edit distance is characterized by comprising the following steps:

receiving a query keyword input by a user;

in the data cleaning process, receiving a setting instruction input by a user and executing corresponding operation;

acquiring a historical operating record in a preset historical database;

calculating the sorting index of each record of the records to be sorted:

calculating the check value of each record of the sequence to be checked:

where μ_c is the check value of the c-th record in the sequence to be checked; p_c is the total data cleaning duration in the c-th record; E_c is the total number of inspection items involved in the c-th record; ρ is a preset determination coefficient; A_c is the number of processing operations performed in the c-th record; B_c^m is the time taken by the m-th processing operation that uses the target inspection item in the c-th record; D is the total number of processing operations that use the target inspection item in the c-th record; and L_c is the size of the original data in the c-th record;

2. The method for cleaning data based on optimized edit distance according to claim 1, wherein the obtaining of the plurality of original data by the presetting method specifically includes:

extracting the original data from the preset service system,

and/or

3. The data cleaning method based on the optimized edit distance as claimed in claim 1, wherein the online Excel-like data cleaning is performed on each original data based on the query keyword according to a preset algorithm and/or a preset mapping relationship and/or a semantic recognition result, specifically comprising:

and/or

the processes all adopt an Excel-like processing mode.

4. The data cleaning method based on the optimized edit distance as claimed in claim 1, wherein the receiving a setting instruction input by a user and executing a corresponding operation specifically includes:

and/or

5. The optimized edit distance-based data cleansing method according to claim 1, wherein the number threshold is adjusted according to a preset method comprising:

6. An optimized edit distance based data cleansing system comprising:

the second receiving module is used for receiving a setting instruction input by a user and executing corresponding operation in the data cleaning process;

the adaptive detection module performs operations comprising:

acquiring a historical operating record in a preset historical database;

calculating the sorting index of each record of the records to be sorted:

calculating the check value of each record of the sequence to be checked:

where μ_c is the check value of the c-th record in the sequence to be checked; p_c is the total data cleaning duration in the c-th record; E_c is the total number of inspection items involved in the c-th record; ρ is a preset determination coefficient; A_c is the number of processing operations performed in the c-th record; B_c^m is the time taken by the m-th processing operation that uses the target inspection item in the c-th record; D is the total number of processing operations that use the target inspection item in the c-th record; and L_c is the size of the original data in the c-th record;

7. The optimized edit distance-based data cleansing system of claim 6, wherein the obtaining module performs operations comprising:

extracting the original data from the preset service system,

and/or

8. The optimized edit distance-based data cleansing system of claim 6, wherein the data cleansing module performs operations comprising:

and/or

the processes all adopt an Excel-like processing mode.

9. The optimized edit distance-based data cleansing system of claim 6, wherein the second receiving module performs operations comprising:

and/or