CN112463782B - Data cleaning method and system based on optimized edit distance - Google Patents

Data cleaning method and system based on optimized edit distance

Publication number
CN112463782B
Authority
CN
China
Prior art keywords
record
preset
data
records
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011406088.9A
Other languages
Chinese (zh)
Other versions
CN112463782A (en)
Inventor
金震
李明
王兆君
曹朝辉
杨海建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SunwayWorld Science and Technology Co Ltd
Original Assignee
Beijing SunwayWorld Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SunwayWorld Science and Technology Co Ltd
Priority to CN202011406088.9A
Publication of CN112463782A
Application granted
Publication of CN112463782B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/248 Presentation of query results

Abstract

The invention provides a data cleaning method and system based on optimized edit distance. The method comprises the following steps: acquiring a plurality of pieces of original data through a preset method, and performing semantic recognition on each piece of original data; receiving a query keyword input by a user; performing online Excel data cleaning on each piece of original data based on the query keyword according to a preset algorithm and/or a preset mapping relation and/or a semantic recognition result; and, during the data cleaning process, allowing the user to intervene manually in that process. The data cleaning method and system based on optimized edit distance greatly improve cleaning efficiency, lower the threshold for data cleaning and improve the accuracy of data cleaning.

Description

Data cleaning method and system based on optimized edit distance
Technical Field
The invention relates to the technical field of data cleaning, in particular to a data cleaning method and system based on optimized edit distance.
Background
At present, most traditional data cleaning methods use approximate string matching algorithms based on edit distance; most of these are signature-based and use an index structure to support approximate string matching. However, as the amount of information to be cleaned grows, such algorithms often produce matching errors, which reduces data cleaning efficiency.
Disclosure of Invention
The invention aims to provide a data cleaning method and system based on optimized edit distance.
The data cleaning method based on the optimized edit distance provided by the embodiment of the invention comprises the following steps:
acquiring a plurality of original data through a preset method, and performing semantic recognition on each original data;
receiving a query keyword input by a user;
performing online Excel data cleaning on each original data based on the query keyword according to a preset algorithm and/or a preset mapping relation and/or a semantic recognition result;
and in the data cleaning process, receiving a setting instruction input by a user and executing corresponding operation.
Preferably, the obtaining of the plurality of original data by a preset method specifically includes:
extracting the original data from a preset business system,
and/or
accessing one or more of an ODBC data source, an XML data source, an Excel table and a text report to obtain the original data.
Preferably, the online Excel data cleaning is performed on each original data based on the query keyword according to a preset algorithm and/or a preset mapping relation and/or a semantic recognition result, and specifically includes:
judging whether each piece of original data matches the query keyword according to a preset algorithm, and if so, outputting it;
and/or
judging whether each piece of original data matches the query keyword according to a preset mapping relation, and if so, outputting it;
and/or
judging whether each piece of original data matches the query keyword according to a semantic recognition result, and if so, outputting it;
all of the above processes use an Excel-like processing mode.
Preferably, the receiving a setting instruction input by a user and executing a corresponding operation specifically includes:
in the data cleaning process, receiving an instruction input by a user for setting the value list range and the upper and lower limit values of the standard template attributes, and executing the corresponding operation;
and/or
receiving a setting instruction from a user for designating categories and assigning responsible persons to the original data in batches, and executing the corresponding operation;
and/or
receiving an instruction from a user for setting one or more of the cleaning rule, the matching rule and the matching strategy, and executing the corresponding operation.
Preferably, the data cleansing method based on the optimized edit distance further includes:
receiving a definition operation input by a user to generate an inspection item list, or generating the inspection item list according to a preset generation rule at preset time intervals;
the inspection item list includes a combination of one or more of: a preset algorithm, a preset mapping relation and a semantic recognition result;
selecting any inspection item from the inspection item list as a target inspection item;
acquiring the historical operating records in a preset historical database;
traversing each group of records in the historical operating records, and selecting the records related to the target inspection item and combining them as the records to be sorted;
calculating the sorting index of each record in the records to be sorted:
[Formula image BDA0002814244610000031; the formula is not available in this text]
where σ_i is the sorting index of the i-th record in the records to be sorted, t_i is the service time corresponding to the target inspection item in the i-th record, T_i is the total time length corresponding to the i-th record, e_i is the total number of inspection items involved in the i-th record, r_0 is the experience weight value corresponding to the target inspection item, n is the total number of records in the records to be sorted, k is the total number of records in the historical operating records, j_1, j_2, j_3 and j_4 are preset weight values, and τ is a preset error coefficient;
sorting all records in the records to be sorted in descending order of sorting index, and selecting the first γ records to form the sequence to be checked;
calculating the check value of each record in the sequence to be checked:
[Formula image BDA0002814244610000032; the formula is not available in this text]
where μ_c is the check value of the c-th record in the sequence to be checked, p_c is the total data cleaning time in the c-th record, E_c is the total number of inspection items involved in the c-th record, ρ is a preset determination coefficient, A_c is the number of processing operations performed in the c-th record, B_c^m is the time taken by the m-th processing operation that uses the target inspection item in the c-th record, D is the total number of processing operations that use the target inspection item in the c-th record, and L_c is the size of the original data in the c-th record;
when the number of records in the sequence to be checked whose check value is greater than or equal to a preset check threshold is greater than or equal to the preset number threshold, adding the target inspection item to a selectable list; otherwise, adding it to a non-selectable list;
sorting the inspection items in the inspection item list in descending order of their check values, and selecting the first μ inspection items to form a recommendation list;
and in the data cleaning process, if a manual intervention operation performed by the user involves an inspection item in the non-selectable list, reminding the user and pushing the recommendation list to the user for reference.
Preferably, the number threshold is adjusted according to the following preset method, including:
[Formula image BDA0002814244610000041; the formula is not available in this text]
[Formula image BDA0002814244610000042; the formula is not available in this text]
where V is the adjusted number threshold, int is the integer-taking function, ε is a preset adjustment coefficient, V_0 is the number threshold before adjustment, N is the total number of inspection items in the inspection item list, k is the total number of records in the historical operating records, Z_h is the number of record groups in the historical operating records that involve the h-th inspection item in the inspection item list, w is a preset separation threshold, G is a preset comparison threshold, max is the maximum function, min is the minimum function, and the two logical connectives in the formula denote "and" and "or" respectively.
The embodiment of the invention provides a data cleaning system based on optimized edit distance, which comprises:
the acquisition module acquires a plurality of original data through a preset method;
the semantic recognition module is used for performing semantic recognition on each original data;
the first receiving module is used for receiving the query keywords input by the user;
the data cleaning module is used for performing online Excel data cleaning on each piece of original data based on the query keyword according to a preset algorithm and/or a preset mapping relation and/or a semantic recognition result;
and the second receiving module is used for receiving a setting instruction input by a user and executing corresponding operation in the data cleaning process.
Preferably, the obtaining module performs operations including:
extracting the original data from a preset business system,
and/or
accessing one or more of an ODBC data source, an XML data source, an Excel table and a text report to obtain the original data.
Preferably, the data cleansing module performs operations including:
judging whether each piece of original data matches the query keyword according to a preset algorithm, and if so, outputting it;
and/or
judging whether each piece of original data matches the query keyword according to a preset mapping relation, and if so, outputting it;
and/or
judging whether each piece of original data matches the query keyword according to a semantic recognition result, and if so, outputting it;
all of the above processes use an Excel-like processing mode.
Preferably, the second receiving module performs operations including:
in the data cleaning process, receiving an instruction input by a user for setting the value list range and the upper and lower limit values of the standard template attributes, and executing the corresponding operation;
and/or
receiving a setting instruction from a user for designating categories and assigning responsible persons to the original data in batches, and executing the corresponding operation;
and/or
receiving an instruction from a user for setting one or more of the cleaning rule, the matching rule and the matching strategy, and executing the corresponding operation.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of a method for data cleansing based on optimized edit distance in an embodiment of the present invention;
FIG. 2 is a diagram illustrating an optimized edit distance based data cleansing system according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
The embodiment of the invention provides a data cleaning method based on optimized edit distance, as shown in fig. 1, comprising the following steps:
acquiring a plurality of original data through a preset method, and performing semantic recognition on each original data;
receiving a query keyword input by a user;
performing online Excel data cleaning on each original data based on the query keyword according to a preset algorithm and/or a preset mapping relation and/or a semantic recognition result;
and in the data cleaning process, receiving a setting instruction input by a user and executing corresponding operation.
The working principle of the technical scheme is as follows:
Each piece of original data is imported (acquired) through a preset method. First, semantic recognition is performed on each piece of original data. Second, the user inputs a query keyword through an operation client. Then, data matching the query keyword is searched for in the original data according to a preset algorithm and/or a preset mapping relation and/or the semantic recognition result, and the matching data is output (that is, data cleaning); the data cleaning process uses an Excel-like mode of operation. Meanwhile, during data cleaning, a setting instruction input by the user is received and the corresponding operation is executed, so the user can set the parameters of the data cleaning process through the operation client. During semantic recognition, each piece of original data is compared with the models in a preset semantic database to recognize its semantics; semantic recognition of both structured and unstructured data is supported. The preset algorithm is specifically a V-chunk-gram algorithm based on variable-length signatures, which divides the strings in the string set and the query string into different signatures while using a variable-length signature dictionary; it optimizes the traditional edit-distance-based chunk-gram algorithm and can obtain higher-quality chunks. The preset mapping relation is, for example, as follows: the system establishes a mapping relation between two expressions for the same place (both appear as "east city of Beijing city" in this English text), and when the query keyword is one expression, the system outputs the other expression that matches it. Online Excel specifically means processing data online in a manner similar to an Excel table, like the online Excel tool in Tencent Docs.
The beneficial effects of the above technical scheme are as follows. With approximate string matching based on the optimized edit distance, the V-chunk-gram algorithm based on variable-length signatures matches objects more accurately as the amount of information keeps growing, which increases the accuracy of data cleaning. The data cleaning process also uses online Excel and follows the Excel mode of operation throughout, which greatly improves cleaning efficiency and lowers the threshold for data cleaning. In addition, the user can set the parameters of the data cleaning process while cleaning is in progress, which improves the user experience, raises the intelligence level of data cleaning and reduces its operational complexity.
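The signature-based approximate matching discussed above can be illustrated with a minimal sketch. The text does not give the details of the V-chunk-gram algorithm, so the code below is not that algorithm: it swaps in the classic fixed-length q-gram count filter followed by exact edit-distance verification. The function names, the choice of q = 2 and the sample strings are all assumptions made for illustration.

```python
from collections import Counter


def qgrams(s: str, q: int = 2) -> Counter:
    """Return the multiset of fixed-length q-grams of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def passes_count_filter(s: str, t: str, tau: int, q: int = 2) -> bool:
    """Count filter: two strings within edit distance tau must share at least
    max(len(s), len(t)) - q + 1 - q * tau q-grams."""
    required = max(len(s), len(t)) - q + 1 - q * tau
    if required <= 0:
        return True  # the bound is vacuous, so the filter cannot prune
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())
    return shared >= required


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def approximate_matches(records, query, tau, q=2):
    """Prune candidates with the cheap filter, then verify the survivors."""
    return [r for r in records
            if passes_count_filter(r, query, tau, q)
            and edit_distance(r, query) <= tau]
```

The filter only discards pairs that cannot be within the distance bound, so the verified result is the same as checking every record with `edit_distance`; the benefit is that most non-matching records never reach the quadratic-time verification step.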
The embodiment of the invention provides a data cleaning method based on optimized edit distance, which obtains a plurality of original data through a preset method, and specifically comprises the following steps:
extracting the original data from a preset business system,
and/or
accessing one or more of an ODBC data source, an XML data source, an Excel table and a text report to obtain the original data.
The working principle of the technical scheme is as follows:
original data can be directly extracted from a business system; the ODBC (Open Database Connectivity) data source can be directly accessed to obtain the original data; the XML data source can be directly accessed to obtain the original data; the Excel table can be directly accessed to obtain original data; the text report can be directly accessed to obtain the original data; the raw data can also be obtained by a combination of the above methods.
The beneficial effects of the above technical scheme are: the embodiment of the invention can obtain the original data through a plurality of methods, can meet more application scenes and improve the user experience.
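As a rough illustration of acquiring original data from more than one kind of source, the sketch below reads records from a delimited text report and from an XML source using only the Python standard library, then combines them. ODBC and Excel access need third-party drivers and are left out. The function names, the column name and the tag name are hypothetical and are not taken from the patent.

```python
import csv
import io
import xml.etree.ElementTree as ET


def records_from_csv_text(text: str, column: str) -> list:
    """Read one column from a delimited text report with a header row."""
    reader = csv.DictReader(io.StringIO(text))
    values = (row.get(column) for row in reader)
    return [v.strip() for v in values if v and v.strip()]


def records_from_xml_text(text: str, tag: str) -> list:
    """Collect the text of every element named `tag` in an XML source."""
    root = ET.fromstring(text)
    return [el.text.strip() for el in root.iter(tag)
            if el.text and el.text.strip()]


def acquire(sources) -> list:
    """Concatenate the records produced by several (loader, args) pairs."""
    merged = []
    for loader, args in sources:
        merged.extend(loader(*args))
    return merged
```

Each loader returns plain strings, so any later matching step can treat records uniformly regardless of where they came from.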
The embodiment of the invention provides an optimized edit distance-based data cleaning method, which is used for performing online Excel data cleaning on each original data based on a query keyword according to a preset algorithm and/or a preset mapping relation and/or a semantic recognition result, and specifically comprises the following steps:
judging whether each piece of original data matches the query keyword according to a preset algorithm, and if so, outputting it;
and/or
judging whether each piece of original data matches the query keyword according to a preset mapping relation, and if so, outputting it;
and/or
judging whether each piece of original data matches the query keyword according to a semantic recognition result, and if so, outputting it;
all of the above processes use an Excel-like processing mode.
The working principle of the technical scheme is as follows:
The similarity between each piece of original data and the query keyword input by the user is calculated according to the preset algorithm, and the original data is sorted in descending order of similarity and displayed as the query result. Whether each piece of original data matches the query keyword can also be determined according to the mapping relation; matching data is output and displayed as the query result. Likewise, whether each piece of original data matches the query keyword can be determined according to the semantic recognition result; matching data is output and displayed as the query result.
The beneficial effects of the above technical scheme are: according to the embodiment of the invention, the online Excel data cleaning is carried out on each original data based on the query keywords according to the preset algorithm and/or the preset mapping relation and/or the semantic recognition result, the data cleaning efficiency is improved, and the data cleaning threshold is reduced.
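The three matching routes described above (similarity ranking, mapping lookup and semantic match) might be sketched as follows. This is an illustration under stated assumptions, not the patented method: similarity is taken to be one minus the edit distance divided by the longer string's length, the mapping is a plain dictionary from a keyword to a set of equivalent records, and the semantic recognition result is modelled as a label per record. All names and sample values are hypothetical.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def similarity(s: str, t: str) -> float:
    """Assumed similarity: 1 - distance / longer length, in [0, 1]."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest


def rank_by_similarity(records, query):
    """Sort records in descending order of similarity to the query."""
    return sorted(records, key=lambda r: similarity(r, query), reverse=True)


def match_by_mapping(records, query, mapping):
    """Output the records that the mapping declares equivalent to the query."""
    targets = mapping.get(query, set())
    return [r for r in records if r in targets]


def match_by_semantics(records, query_label, labels):
    """Output the records whose recognized label equals the query's label."""
    return [r for r in records if labels.get(r) == query_label]
```

Keeping the three routes as separate functions mirrors the "and/or" wording of the text: a caller can apply any one of them or combine their outputs.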
The embodiment of the invention provides a data cleaning method based on optimized edit distance, which receives a setting instruction input by a user and executes corresponding operation, and specifically comprises the following steps:
in the data cleaning process, receiving an instruction input by a user for setting the value list range and the upper and lower limit values of the standard template attributes, and executing the corresponding operation;
and/or
receiving a setting instruction from a user for designating categories and assigning responsible persons to the original data in batches, and executing the corresponding operation;
and/or
receiving an instruction from a user for setting one or more of the cleaning rule, the matching rule and the matching strategy, and executing the corresponding operation.
The working principle of the technical scheme is as follows:
During data cleaning, the user can supplement the value list range and the upper and lower limit values of the standard template attributes. The user can assign categories and responsible persons to the original data in batches, define cleaning rules in batches and clean data in batches by running those rules, and define matching strategies, matching rules, data characteristics and the like. Specifically, when Excel-like data processing is performed online, the attributes of the data processing template are defined (for example, for rows and columns, the list range is the number of rows and columns, and the upper and lower limit values are the maximum and minimum values of the rows and columns); the category of the original data is specified (for example, address data or weather data); a responsible person is assigned (for example, the data is provided by XXX company); and the user can select a cleaning rule (the preset algorithm, the preset mapping relation or the semantic recognition result), a matching rule (for example, 1-to-1 matching or 1-to-many matching) and a matching strategy (for example, cleaning data with 2 cleaning rules, that is, matching in two ways).
The beneficial effects of the above technical scheme are: the embodiment of the invention can support the user to set each parameter of the data cleaning process in the data cleaning process, further improves the data cleaning efficiency and improves the user experience.
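One way to picture the setting instructions described above is a small settings object that validates an instruction before applying it. The sketch below is only an assumption about how such settings could be held; the field names, the rule identifiers and the dictionary-shaped instruction are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field

# Hypothetical identifiers for the options named in the text.
CLEANING_RULES = {"preset_algorithm", "mapping_relation", "semantic_recognition"}
MATCHING_RULES = {"one_to_one", "one_to_many"}


@dataclass
class CleaningSettings:
    cleaning_rules: list = field(default_factory=lambda: ["preset_algorithm"])
    matching_rule: str = "one_to_one"
    category: str = ""
    responsible_person: str = ""

    def apply(self, instruction: dict) -> None:
        """Validate a user instruction, then update the settings in place."""
        rules = instruction.get("cleaning_rules", self.cleaning_rules)
        unknown = set(rules) - CLEANING_RULES
        if unknown:
            raise ValueError(f"unknown cleaning rules: {sorted(unknown)}")
        rule = instruction.get("matching_rule", self.matching_rule)
        if rule not in MATCHING_RULES:
            raise ValueError(f"unknown matching rule: {rule}")
        # Nothing is changed until every field has passed validation.
        self.cleaning_rules = list(rules)
        self.matching_rule = rule
        self.category = instruction.get("category", self.category)
        self.responsible_person = instruction.get(
            "responsible_person", self.responsible_person)
```

In this sketch, selecting two cleaning rules corresponds to the "matching in two ways" strategy mentioned in the text; an invalid instruction leaves the current settings untouched.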
The embodiment of the invention provides a data cleaning method based on optimized edit distance, which further comprises the following steps:
receiving a definition operation input by a user to generate an inspection item list, or generating the inspection item list according to a preset generation rule at preset time intervals;
the inspection item list includes a combination of one or more of: a preset algorithm, a preset mapping relation and a semantic recognition result;
selecting any inspection item from the inspection item list as a target inspection item;
acquiring the historical operating records in a preset historical database;
traversing each group of records in the historical operating records, and selecting the records related to the target inspection item and combining them as the records to be sorted;
calculating the sorting index of each record in the records to be sorted:
[Formula image BDA0002814244610000091; the formula is not available in this text]
where σ_i is the sorting index of the i-th record in the records to be sorted, t_i is the service time corresponding to the target inspection item in the i-th record, T_i is the total time length corresponding to the i-th record, e_i is the total number of inspection items involved in the i-th record, r_0 is the experience weight value corresponding to the target inspection item, n is the total number of records in the records to be sorted, k is the total number of records in the historical operating records, j_1, j_2, j_3 and j_4 are preset weight values, and τ is a preset error coefficient;
sorting all records in the records to be sorted in descending order of sorting index, and selecting the first γ records to form the sequence to be checked;
calculating the check value of each record in the sequence to be checked:
[Formula image BDA0002814244610000101; the formula is not available in this text]
where μ_c is the check value of the c-th record in the sequence to be checked, p_c is the total data cleaning time in the c-th record, E_c is the total number of inspection items involved in the c-th record, ρ is a preset determination coefficient, A_c is the number of processing operations performed in the c-th record, B_c^m is the time taken by the m-th processing operation that uses the target inspection item in the c-th record, D is the total number of processing operations that use the target inspection item in the c-th record, and L_c is the size of the original data in the c-th record;
when the number of records in the sequence to be checked whose check value is greater than or equal to a preset check threshold is greater than or equal to the preset number threshold, adding the target inspection item to a selectable list; otherwise, adding it to a non-selectable list;
sorting the inspection items in the inspection item list in descending order of their check values, and selecting the first μ inspection items to form a recommendation list;
and in the data cleaning process, if a manual intervention operation performed by the user involves an inspection item in the non-selectable list, reminding the user and pushing the recommendation list to the user for reference.
The working principle of the technical scheme is as follows:
The user can select, through the operation client, the list of inspection items to be checked; the system can also generate this list automatically. An inspection item may be a cleaning rule (the preset algorithm, the mapping relation or the semantic recognition result), a matching rule and so on. The preset historical database stores the system's data cleaning records. The start time length is, for example, the time from the start of data cleaning to the moment when the user replaces manual intervention with cleaning according to the preset algorithm. The total number of inspection items involved in a record is, for example, 2 when both the preset algorithm and the mapping relation are used during data cleaning. All records in the historical operating records are sorted by the calculated sorting index, and the first γ records are selected and combined as the sequence to be checked. A processing operation specifically means using the inspection item to compare one piece of original data with the query keyword input by the user and judging whether they match; the processing time is the time taken to make this judgment. When the number of records in the sequence to be checked whose check value is greater than or equal to the preset check threshold is greater than the preset number threshold, the inspection item is considered to have passed the check and to be worth recommending, and it is added to the selectable list; otherwise it is added to the non-selectable list. Meanwhile, all inspection items in the inspection item list are sorted in descending order of their check values and combined into the recommendation list. During data cleaning, when the user intervenes manually through the operation client and defines (selects) an inspection item in the non-selectable list, that is, the user's manual intervention operation exists in the non-selectable list, the user is reminded and the recommendation list is pushed to the user.
The beneficial effects of the above technical scheme are as follows. The embodiment of the invention checks, at the user's request or automatically, whether each inspection item in the system can be selected, and ranks the inspection items. During data cleaning, when the user intervenes manually and is about to select an inspection item outside the selectable list, the system reminds the user, pushes the recommendation list, and suggests that the user choose an item near the top of the recommendation list to intervene in the data cleaning process.
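The list-building logic described above can be sketched once per-record check values are available. The formulas for the sorting index and the check value are not reproduced in this text, so the sketch below takes the check values as given inputs. It also assumes, purely for illustration, that an inspection item's overall check value is the mean of its per-record values; the text does not say how that aggregate is formed. All names and numbers are hypothetical.

```python
def classify_item(check_values, check_threshold, number_threshold):
    """Selectable if enough records reach the check threshold (both >=)."""
    passing = sum(1 for v in check_values if v >= check_threshold)
    return "selectable" if passing >= number_threshold else "non-selectable"


def build_lists(item_check_values, check_threshold, number_threshold, top_mu):
    """Split items into selectable / non-selectable and rank a recommendation list."""
    selectable, non_selectable = [], []
    for item, values in item_check_values.items():
        label = classify_item(values, check_threshold, number_threshold)
        (selectable if label == "selectable" else non_selectable).append(item)
    # Assumption: aggregate each item's per-record check values by their mean.
    score = {item: (sum(v) / len(v) if v else 0.0)
             for item, v in item_check_values.items()}
    recommended = sorted(score, key=score.get, reverse=True)[:top_mu]
    return selectable, non_selectable, recommended


def on_manual_intervention(item, non_selectable, recommended):
    """Warn and push the recommendation list when a non-selectable item is chosen."""
    if item in non_selectable:
        return {"warn": True, "recommendations": list(recommended)}
    return {"warn": False, "recommendations": []}
```

The comparison in `classify_item` follows the claim wording (greater than or equal to on both thresholds); a stricter comparison on the number threshold would only change the single `>=` in the return statement.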
The embodiment of the invention provides a data cleaning method based on optimized edit distance, wherein the number threshold is adjusted according to the following preset method, and the method comprises the following steps:
[Formula image BDA0002814244610000111; the formula is not available in this text]
[Formula image BDA0002814244610000121; the formula is not available in this text]
where V is the adjusted number threshold, int is the integer-taking function, ε is a preset adjustment coefficient, V_0 is the number threshold before adjustment, N is the total number of inspection items in the inspection item list, k is the total number of records in the historical operating records, Z_h is the number of record groups in the historical operating records that involve the h-th inspection item in the inspection item list, w is a preset separation threshold, G is a preset comparison threshold, max is the maximum function, min is the minimum function, and the two logical connectives in the formula denote "and" and "or" respectively.
The working principle of the technical scheme is as follows:
As time goes on, the historical database keeps growing; the number threshold adapts to the continuous updating of the historical database (that is, the increase in the number of record groups in the historical records and the increase in the total number of inspection items involved in each record); the number threshold before adjustment is generally the number threshold currently set by the system.
The beneficial effects of the above technical scheme are: according to the embodiment of the invention, the number threshold value set by the system is adaptively adjusted according to the increase of the number of record groups in the historical records and the increase of the total number of the inspection items in each record, so that the accuracy of self-inspection of the system is improved, the working efficiency of the system is improved, and meanwhile, the system is more intelligent.
The embodiment of the invention provides a data cleaning system based on optimized edit distance, which comprises:
the acquisition module acquires a plurality of original data through a preset method;
the semantic recognition module is used for performing semantic recognition on each original data;
the first receiving module is used for receiving the query keywords input by the user;
the data cleaning module is used for performing online Excel-like data cleaning on each original data based on the query keywords according to a preset algorithm and/or a preset mapping relation and/or a semantic recognition result;
and the second receiving module is used for receiving a setting instruction input by a user and executing corresponding operation in the data cleaning process.
The working principle of the technical scheme is as follows:
Each original data is imported (acquired) by a preset method. First, semantic recognition is performed on each original data. Second, the user inputs a query keyword through the client. Then, according to the preset algorithm and/or the preset mapping relation and/or the semantic recognition result, the data matching the query keyword is searched for in the original data and output (that is, data cleaning); the data cleaning process uses an Excel-like operation mode. During data cleaning, setting instructions input by the user are received and the corresponding operations are executed, so the user can set the parameters of the data cleaning process through the client. During semantic recognition, each original data is compared with the models in a preset semantic database to recognize its semantics; both structured and unstructured data are supported. The preset algorithm is specifically the V-chunk-gram algorithm based on variable-length signatures: it assigns different signatures to the strings in the string set and to the query string and uses a variable-length signature dictionary, which optimizes the traditional edit-distance-based chunk-gram algorithm and can obtain higher-quality chunks. The preset mapping relation is, for example: the system establishes a mapping relation between "east city of Beijing city" and "east city of Beijing city", and when the query keyword is "east city of Beijing city", the system outputs the matching "east city of Beijing city". Online Excel specifically means processing data online in a manner similar to an Excel spreadsheet, like the online Excel tool in Tencent Docs.
The beneficial effects of the above technical scheme are: with the approximate string matching technique based on optimized edit distance, the V-chunk-gram algorithm based on variable-length signatures matches objects more accurately and raises the data cleaning accuracy even as the amount of information keeps growing; the data cleaning process uses online Excel and the full set of Excel-style operations, which greatly improves cleaning efficiency and lowers the barrier to data cleaning; in addition, the user can set the parameters of the data cleaning process while cleaning is in progress, which improves user experience, raises the level of intelligence of data cleaning, and reduces its operational complexity.
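For readers unfamiliar with edit-distance matching, the following Python sketch shows the plain Levenshtein distance and a similarity-based ranking of records against a query keyword. It deliberately does not implement the V-chunk-gram algorithm, whose details are not given in this text; the normalization by the longer string length, the function names, and the `min_similarity` cutoff are assumptions chosen for illustration.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance using two rows of dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1]; 1.0 means identical strings (assumed scaling)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def rank_matches(records: List[str], keyword: str,
                 min_similarity: float = 0.6) -> List[Tuple[str, float]]:
    """Keep records similar enough to the keyword, most similar first."""
    scored = [(record, similarity(record, keyword)) for record in records]
    kept = [pair for pair in scored if pair[1] >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Signature-based methods such as chunk-gram variants are typically used to filter candidates before a distance computation like this one, which is quadratic in the string lengths.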
The embodiment of the invention provides a data cleaning system based on optimized edit distance, wherein an acquisition module executes the following operations:
extracting the original data from a preset business system,
and/or,
accessing one or more of an ODBC data source, an XML data source, an Excel table and a text report to obtain the original data.
The working principle of the technical scheme is as follows:
original data can be directly extracted from a business system; the ODBC (Open Database Connectivity) data source can be directly accessed to obtain the original data; the XML data source can be directly accessed to obtain the original data; the Excel table can be directly accessed to obtain original data; the text report can be directly accessed to obtain the original data; the raw data can also be obtained by a combination of the above methods.
The beneficial effects of the above technical scheme are: the embodiment of the invention can obtain the original data through a plurality of methods, can meet more application scenes and improve the user experience.
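As a rough illustration of acquiring original data from several sources and combining them, the Python sketch below reads records from an XML document and from a comma-separated text report using only the standard library. ODBC and Excel access are omitted because they need third-party drivers or libraries. The function names, the element tag, and the column layout are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET
from typing import Iterable, List


def from_xml(xml_text: str, tag: str) -> List[str]:
    """Collect the non-empty text of every element named `tag`."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(tag)
            if el.text and el.text.strip()]


def from_text_report(text: str, column: int = 0) -> List[str]:
    """Read one column from a comma-separated text report, skipping blanks."""
    reader = csv.reader(io.StringIO(text))
    return [row[column].strip() for row in reader
            if len(row) > column and row[column].strip()]


def acquire(*sources: Iterable[str]) -> List[str]:
    """Merge the records obtained from any combination of sources."""
    merged: List[str] = []
    for source in sources:
        merged.extend(source)
    return merged
```

For example, `acquire(from_xml(doc, "addr"), from_text_report(report))` returns the XML records followed by the report records in one list.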
The embodiment of the invention provides a data cleaning system based on optimized edit distance, wherein the data cleaning module executes the following operations:
judging whether each original data is matched with the query keyword according to a preset algorithm, and if so, outputting;
and/or,
judging whether each original data is matched with the query keyword according to a preset mapping relation, and if so, outputting;
and/or,
judging whether each original data is matched with the query keyword according to a semantic recognition result, and if so, outputting;
the above processes all adopt an Excel-like processing mode.
The working principle of the technical scheme is as follows:
The similarity between each original data and the query keyword input by the user is calculated according to the preset algorithm, and the original data are sorted by similarity from largest to smallest and displayed as the query result; whether each original data matches the query keyword can also be determined according to the mapping relation, and matching data is output and displayed as the query result; likewise, whether each original data matches the query keyword can be determined according to the semantic recognition result, and matching data is output and displayed as the query result.
The beneficial effects of the above technical scheme are: according to the embodiment of the invention, the online Excel data cleaning is carried out on each original data based on the query keywords according to the preset algorithm and/or the preset mapping relation and/or the semantic recognition result, the data cleaning efficiency is improved, and the data cleaning threshold is reduced.
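The "and/or" combination of the three matching modes can be sketched in Python as a list of interchangeable rule functions. This is an illustration under stated assumptions rather than the patented method: `difflib.SequenceMatcher` stands in for the preset algorithm, the mapping relation is modeled as a dictionary from keyword to a set of equivalent records, and the semantic recognition result is modeled as a dictionary of labels. All names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Set

# A rule decides whether one record matches the query keyword.
Rule = Callable[[str, str], bool]


def algorithm_rule(threshold: float) -> Rule:
    """Stand-in for the preset algorithm: a similarity ratio with a cutoff."""
    return lambda record, keyword: (
        SequenceMatcher(None, record, keyword).ratio() >= threshold
    )


def mapping_rule(mapping: Dict[str, Set[str]]) -> Rule:
    """Match when the record is mapped to the keyword."""
    return lambda record, keyword: record in mapping.get(keyword, set())


def semantic_rule(labels: Dict[str, str]) -> Rule:
    """Match when record and keyword carry the same recognized label."""
    return lambda record, keyword: (
        keyword in labels and labels.get(record) == labels[keyword]
    )


def clean(records: List[str], keyword: str, rules: List[Rule],
          require_all: bool = False) -> List[str]:
    """Output records matched by any rule ("or") or by all rules ("and")."""
    combine = all if require_all else any
    return [r for r in records if combine(rule(r, keyword) for rule in rules)]
```

Keeping each mode behind the same function signature makes it easy to enable one mode, several modes, or all of them for a given cleaning run.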
The embodiment of the invention provides a data cleaning system based on optimized edit distance, and the second receiving module executes the following operations:
in the data cleaning process, receiving an instruction which is input by a user and used for setting the value list range and the upper and lower limit values of the standard template attribute, and executing corresponding operation;
and/or,
receiving a setting instruction of a user for designating a category and assigning a responsible person for the original data in batches, and executing corresponding operation;
and/or,
receiving an instruction set by a user for one or more combinations of the cleaning rule, the matching rule and the matching strategy, and executing corresponding operation.
The working principle of the technical scheme is as follows:
During data cleaning, the user can supplement the value list range and the upper and lower limit values of the standard template attributes. The user can assign categories and responsible persons to original data in batches, define cleaning rules in batches and run them to clean data in batches, and define matching strategies, matching rules, data characteristics, and the like. The value list range and the upper and lower limit values of the standard template attributes are specifically: when Excel-like data processing is performed online, the attributes of the data processing template are defined (for example, with rows and columns, the list range is the number of rows and columns, and the upper and lower limit values are the maximum and minimum of the rows and columns). The category of the original data can be specified (for example, address data or weather data), a responsible person can be assigned (for example, provided by XXX company), a cleaning rule can be selected (preset algorithm, preset mapping relation, or semantic recognition result), and a matching rule (for example, 1-to-1 matching or 1-to-many matching) and a matching strategy (for example, cleaning with 2 cleaning rules, that is, matching in two ways) can be set.
The beneficial effects of the above technical scheme are: the embodiment of the invention can support the user to set each parameter of the data cleaning process in the data cleaning process, further improves the data cleaning efficiency and improves the user experience.
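One way to picture the setting instructions is as updates to a small settings record. The Python sketch below is an assumption-laden illustration: the field names, the default values, and the representation of an instruction as a dictionary are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class CleaningSettings:
    """Hypothetical container for the parameters a user may set."""
    value_list_range: Optional[Tuple[int, int]] = None  # (rows, columns)
    lower_limit: Optional[float] = None
    upper_limit: Optional[float] = None
    category: str = ""
    responsible_person: str = ""
    cleaning_rules: Tuple[str, ...] = ("algorithm",)
    matching_rule: str = "one-to-one"


def apply_instruction(settings: CleaningSettings,
                      instruction: Dict[str, Any]) -> CleaningSettings:
    """Return a new settings object with the instructed fields changed."""
    allowed = {f.name for f in fields(CleaningSettings)}
    unknown = set(instruction) - allowed
    if unknown:
        raise ValueError(f"unknown setting(s): {sorted(unknown)}")
    updated = replace(settings, **instruction)
    # Reject limits that contradict each other.
    if (updated.lower_limit is not None and updated.upper_limit is not None
            and updated.lower_limit > updated.upper_limit):
        raise ValueError("lower_limit must not exceed upper_limit")
    return updated
```

Because the settings object is immutable, each instruction yields a new object, so earlier settings remain available if an instruction has to be undone.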
The embodiment of the invention provides a data cleaning system based on optimized edit distance, which further comprises:
the self-adaptive detection module is used for checking whether the inspection item is selectable;
the adaptive detection module performs operations comprising:
receiving a definition operation input by a user to generate a check item list or generating the check item list according to a preset generation rule at intervals of a preset time;
the list of check items includes: one or more of a preset algorithm, a preset mapping relation and a semantic recognition result are combined;
selecting any inspection item from the inspection item list as a target inspection item;
acquiring a historical operating record in a preset historical database;
traversing each group of records in the historical operating records, and selecting a record combination related to the target inspection item as a record to be sorted;
calculating the sorting index of each record of the records to be sorted:
Figure BDA0002814244610000161
where σ_i is the sorting index of the i-th record in the records to be sorted, t_i is the service time corresponding to the target inspection item in the i-th record, T_i is the total duration corresponding to the i-th record, e_i is the total number of inspection items involved in the i-th record, r_0 is the experience weight corresponding to the target inspection item, n is the total number of records in the records to be sorted, k is the total number of records in the historical operating records, j_1, j_2, j_3 and j_4 are preset weights, and τ is a preset error coefficient;
sorting the records to be sorted from largest to smallest according to the sorting index, and combining the first γ records as the sequence to be checked;
calculating the check value of each record of the sequence to be checked:
Figure BDA0002814244610000162
wherein μ_c is the check value of the c-th record in the sequence to be checked, p_c is the total duration of data cleaning in the c-th record, E_c is the total number of inspection items involved in the c-th record, ρ is a preset determination coefficient, A_c is the number of processing operations performed in the c-th record, B_c^m is the time taken by the m-th processing operation that uses the target inspection item in the c-th record, D is the total number of processing operations that use the target inspection item in the c-th record, and L_c is the size of the original data in the c-th record;
when the number of records in the sequence to be checked whose check value is greater than or equal to a preset inspection threshold is greater than or equal to the preset number threshold, listing the inspection item in a selectable list; otherwise, listing the inspection item in a non-selectable list;
sorting the inspection items in the inspection item list from largest to smallest according to their corresponding check values, and combining the first μ inspection items to obtain a recommendation list;
and in the data cleaning process, if the manual intervention operation performed by the user exists in the non-selectable list, reminding the user, and pushing the recommendation list to the user for reference.
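The record-selection and reminder steps listed above can be sketched in Python as follows. The sketch is only illustrative: the sorting-index formula appears in the patent as a figure, so precomputed index values are taken as input, and the record layout (`id`, `items`), the function names, and the shape of the returned reminder are hypothetical.

```python
from typing import Any, Dict, List, Sequence


def sequence_to_check(history: List[Dict[str, Any]], target_item: str,
                      sorting_index: Dict[int, float],
                      gamma: int) -> List[Dict[str, Any]]:
    """Pick the records involving the target item and keep the top gamma."""
    # Records to be sorted: those whose inspection items include the target.
    candidates = [rec for rec in history if target_item in rec["items"]]
    # Sort from largest to smallest sorting index (values supplied by caller).
    ranked = sorted(candidates, key=lambda rec: sorting_index[rec["id"]],
                    reverse=True)
    return ranked[:gamma]


def on_manual_intervention(chosen_item: str, non_selectable: Sequence[str],
                           recommendation: Sequence[str]) -> Dict[str, Any]:
    """Remind the user and push the recommendation list when needed."""
    if chosen_item in non_selectable:
        return {"remind": True, "recommendation": list(recommendation)}
    return {"remind": False, "recommendation": []}
```

The first function covers the traversal, sorting, and top-γ selection; the second covers the final step, where a manual choice from the non-selectable list triggers the reminder.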
The working principle of the technical scheme is as follows:
Through the client, the user can select the list of inspection items to be checked; the system can also generate this list automatically. An inspection item may be a cleaning rule (cleaning according to the preset algorithm, the mapping relation, or the semantic recognition result), a matching rule, and the like. The preset historical database stores the records of data cleaning performed by the system. The service time (start time length) is, for example: during data cleaning, the user manually intervenes and switches the data cleaning to the preset algorithm, and the length of time counted from the moment that data cleaning starts is the service time. The total number of inspection items involved in a record is, for example: if the preset algorithm and the mapping relation are both used during data cleaning, the total number is 2. The records in the historical operating records are sorted by their calculated sorting index, and the first γ records are combined as the sequence to be checked. A processing operation specifically means using the inspection item to compare one original data with the query keyword input by the user and judging whether they match; the processing time is the time taken to make that judgment. When the number of records in the sequence to be checked whose check value is greater than or equal to the preset inspection threshold is greater than or equal to the preset number threshold, the inspection item has passed the check and is worth recommending, so it is listed in the selectable list; otherwise, it is listed in the non-selectable list. Meanwhile, all inspection items in the inspection item list are sorted from largest to smallest according to their corresponding check values and combined into a recommendation list. During data cleaning, when the user intervenes manually through the client and defines (selects) an inspection item in the non-selectable list, that is, the user's manual intervention operation falls in the non-selectable list, the user is reminded and the recommendation list is pushed to the user.
The beneficial effects of the above technical scheme are: in the embodiment of the invention, the inspection items in the system are checked for selectability and ranked, either at the user's request or automatically by the system; when the user manually intervenes in the data cleaning process and is about to select an inspection item outside the selectable list, the user is reminded and the recommendation list is pushed, suggesting that the user choose an inspection item near the top of the recommendation list to intervene in the data cleaning process.
The embodiment of the invention provides a data cleaning system based on optimized edit distance, wherein an adaptive detection module adjusts the number threshold according to the following preset method, and the method comprises the following steps:
Figure BDA0002814244610000181
Figure BDA0002814244610000182
wherein V is the adjusted number threshold, int is the integer-taking function, ε is a preset adjustment coefficient, V_0 is the number threshold before adjustment, N is the total number of inspection items in the inspection item list, k is the total number of records in the historical operating records, Z_h is the number of record groups in the historical operating records that involve the h-th inspection item in the inspection item list, w is a preset separation threshold, G is a preset comparison threshold, max is the maximum function, min is the minimum function, ∧ denotes logical AND, and ∨ denotes logical OR.
The working principle of the technical scheme is as follows:
As time goes on, the historical database keeps growing; the number threshold adapts to the continuous updating of the historical database (that is, the increase in the number of record groups in the historical records and the increase in the total number of inspection items involved in each record); the number threshold before adjustment is generally the number threshold currently set by the system.
The beneficial effects of the above technical scheme are: according to the embodiment of the invention, the number threshold value set by the system is adaptively adjusted according to the increase of the number of record groups in the historical records and the increase of the total number of the inspection items in each record, so that the accuracy of self-inspection of the system is improved, the working efficiency of the system is improved, and meanwhile, the system is more intelligent.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (9)

1. A data cleaning method based on optimized edit distance is characterized by comprising the following steps:
acquiring a plurality of original data through a preset method, and performing semantic recognition on each original data;
receiving a query keyword input by a user;
performing online Excel data cleaning on each original data based on the query keyword according to a preset algorithm and/or a preset mapping relation and/or a semantic recognition result;
in the data cleaning process, receiving a setting instruction input by a user and executing corresponding operation;
receiving a definition operation input by a user to generate a check item list or generating the check item list according to a preset generation rule at intervals of a preset time;
the list of check items includes: one or more of a preset algorithm, a preset mapping relation and a semantic recognition result are combined;
selecting any inspection item from the inspection item list as a target inspection item;
acquiring a historical operating record in a preset historical database;
traversing each group of records in the historical operating records, and selecting a record combination related to the target inspection item as a record to be sorted;
calculating the sorting index of each record of the records to be sorted:
Figure FDA0003154774150000011
where σ_i is the sorting index of the i-th record in the records to be sorted, t_i is the service time corresponding to the target inspection item in the i-th record, T_i is the total duration corresponding to the i-th record, e_i is the total number of inspection items involved in the i-th record, r_0 is the experience weight corresponding to the target inspection item, n is the total number of records in the records to be sorted, k is the total number of records in the historical operating records, j_1, j_2, j_3 and j_4 are preset weights, and τ is a preset error coefficient;
sorting the records to be sorted from largest to smallest according to the sorting index, and combining the first γ records as the sequence to be checked;
calculating the check value of each record of the sequence to be checked:
Figure FDA0003154774150000021
wherein μ_c is the check value of the c-th record in the sequence to be checked, p_c is the total duration of data cleaning in the c-th record, E_c is the total number of inspection items involved in the c-th record, ρ is a preset determination coefficient, A_c is the number of processing operations performed in the c-th record, B_c^m is the time taken by the m-th processing operation that uses the target inspection item in the c-th record, D is the total number of processing operations that use the target inspection item in the c-th record, and L_c is the size of the original data in the c-th record;
when the number of records in the sequence to be checked whose check value is greater than or equal to a preset inspection threshold is greater than or equal to the preset number threshold, listing the inspection item in a selectable list; otherwise, listing the inspection item in a non-selectable list;
sorting the inspection items in the inspection item list from largest to smallest according to their corresponding check values, and combining the first μ inspection items to obtain a recommendation list;
and in the data cleaning process, if the manual intervention operation performed by the user exists in the non-selectable list, reminding the user, and pushing the recommendation list to the user for reference.
2. The method for cleaning data based on optimized edit distance according to claim 1, wherein the obtaining of the plurality of original data by the presetting method specifically includes:
extracting the original data from a preset business system,
and/or,
accessing one or more of an ODBC data source, an XML data source, an Excel table and a text report to obtain the original data.
3. The data cleaning method based on the optimized edit distance as claimed in claim 1, wherein the online Excel-like data cleaning is performed on each original data based on the query keyword according to a preset algorithm and/or a preset mapping relationship and/or a semantic recognition result, specifically comprising:
judging whether each original data is matched with the query keyword according to a preset algorithm, and if so, outputting;
and/or,
judging whether each original data is matched with the query keyword according to a preset mapping relation, and if so, outputting;
and/or,
judging whether each original data is matched with the query keyword according to a semantic recognition result, and if so, outputting;
the above processes all adopt an Excel-like processing mode.
4. The data cleaning method based on the optimized edit distance as claimed in claim 1, wherein the receiving a setting instruction input by a user and executing a corresponding operation specifically includes:
in the data cleaning process, receiving an instruction which is input by a user and used for setting the value list range and the upper and lower limit values of the standard template attribute, and executing corresponding operation;
and/or,
receiving a setting instruction of a user for designating a category and assigning a responsible person for the original data in batches, and executing corresponding operation;
and/or,
receiving an instruction set by a user for one or more combinations of the cleaning rule, the matching rule and the matching strategy, and executing corresponding operation.
5. The optimized edit distance-based data cleansing method according to claim 1, wherein the number threshold is adjusted according to a preset method comprising:
Figure FDA0003154774150000031
Figure FDA0003154774150000032
wherein V is the adjusted number threshold, int is the integer-taking function, ε is a preset adjustment coefficient, V_0 is the number threshold before adjustment, N is the total number of inspection items in the inspection item list, k is the total number of records in the historical operating records, Z_h is the number of record groups in the historical operating records that involve the h-th inspection item in the inspection item list, w is a preset separation threshold, G is a preset comparison threshold, max is the maximum function, min is the minimum function, ∧ denotes logical AND, and ∨ denotes logical OR.
6. An optimized edit distance based data cleansing system comprising:
the acquisition module acquires a plurality of original data through a preset method;
the semantic recognition module is used for performing semantic recognition on each original data;
the first receiving module is used for receiving the query keywords input by the user;
the data cleaning module is used for performing online Excel-like data cleaning on each original data based on the query keywords according to a preset algorithm and/or a preset mapping relation and/or a semantic recognition result;
the second receiving module is used for receiving a setting instruction input by a user and executing corresponding operation in the data cleaning process;
the self-adaptive detection module is used for checking whether the inspection item is selectable;
the adaptive detection module performs operations comprising:
receiving a definition operation input by a user to generate a check item list or generating the check item list according to a preset generation rule at intervals of a preset time;
the list of check items includes: one or more of a preset algorithm, a preset mapping relation and a semantic recognition result are combined;
selecting any inspection item from the inspection item list as a target inspection item;
acquiring a historical operating record in a preset historical database;
traversing each group of records in the historical operating records, and selecting a record combination related to the target inspection item as a record to be sorted;
calculating the sorting index of each record of the records to be sorted:
Figure FDA0003154774150000041
where σ_i is the sorting index of the i-th record in the records to be sorted, t_i is the service time corresponding to the target inspection item in the i-th record, T_i is the total duration corresponding to the i-th record, e_i is the total number of inspection items involved in the i-th record, r_0 is the experience weight corresponding to the target inspection item, n is the total number of records in the records to be sorted, k is the total number of records in the historical operating records, j_1, j_2, j_3 and j_4 are preset weights, and τ is a preset error coefficient;
sorting the records to be sorted from largest to smallest according to the sorting index, and combining the first γ records as the sequence to be checked;
calculating the check value of each record of the sequence to be checked:
Figure FDA0003154774150000051
wherein μ_c is the check value of the c-th record in the sequence to be checked, p_c is the total duration of data cleaning in the c-th record, E_c is the total number of inspection items involved in the c-th record, ρ is a preset determination coefficient, A_c is the number of processing operations performed in the c-th record, B_c^m is the time taken by the m-th processing operation that uses the target inspection item in the c-th record, D is the total number of processing operations that use the target inspection item in the c-th record, and L_c is the size of the original data in the c-th record;
when the number of records in the sequence to be checked whose check value is greater than or equal to a preset inspection threshold is greater than or equal to the preset number threshold, listing the inspection item in a selectable list; otherwise, listing the inspection item in a non-selectable list;
sorting the inspection items in the inspection item list from largest to smallest according to their corresponding check values, and combining the first μ inspection items to obtain a recommendation list;
and in the data cleaning process, if the manual intervention operation performed by the user exists in the non-selectable list, reminding the user, and pushing the recommendation list to the user for reference.
7. The optimized edit distance-based data cleansing system of claim 6, wherein the obtaining module performs operations comprising:
extracting the original data from a preset business system,
and/or,
accessing one or more of an ODBC data source, an XML data source, an Excel table and a text report to obtain the original data.
8. The optimized edit distance-based data cleansing system of claim 6, wherein the data cleansing module performs operations comprising:
judging whether each original data is matched with the query keyword according to a preset algorithm, and if so, outputting;
and/or,
judging whether each original data is matched with the query keyword according to a preset mapping relation, and if so, outputting;
and/or,
judging whether each original data is matched with the query keyword according to a semantic recognition result, and if so, outputting;
the above processes all adopt an Excel-like processing mode.
9. The optimized edit distance-based data cleansing system of claim 6, wherein the second receiving module performs operations comprising:
during the data cleaning process, receiving an instruction input by the user for setting the value list range and the upper and lower limit values of a standard template attribute, and executing the corresponding operation;
and/or,
receiving a setting instruction from the user for designating a category and assigning a responsible person to the original data in batches, and executing the corresponding operation;
and/or,
receiving an instruction from the user setting one or more of the cleaning rule, the matching rule and the matching strategy, and executing the corresponding operation.
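One way to picture the first operation of claim 9 is a small rule object per template attribute, holding an optional value list and optional lower and upper limits. The class, field names, and sample attributes below are assumptions for illustration and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttributeRule:
    """User-configured constraints for one standard template attribute."""
    allowed_values: Optional[frozenset] = None
    lower: Optional[float] = None
    upper: Optional[float] = None

    def violations(self, value):
        problems = []
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append("not in value list")
        if isinstance(value, (int, float)):
            if self.lower is not None and value < self.lower:
                problems.append("below lower limit")
            if self.upper is not None and value > self.upper:
                problems.append("above upper limit")
        return problems


def check_record(record, rules):
    """Return a dict of attribute -> list of violations (empty if the record is clean)."""
    report = {}
    for attr, rule in rules.items():
        if attr in record:
            problems = rule.violations(record[attr])
            if problems:
                report[attr] = problems
    return report


# Hypothetical rules a user might set during cleaning.
rules = {
    "unit": AttributeRule(allowed_values=frozenset({"mm", "cm"})),
    "length": AttributeRule(lower=0.0, upper=500.0),
}
report = check_record({"unit": "inch", "length": 620.0}, rules)
```

A record that satisfies every rule yields an empty report, so the cleaning step can pass it through unchanged.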
CN202011406088.9A 2020-12-03 2020-12-03 Data cleaning method and system based on optimized edit distance Active CN112463782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011406088.9A CN112463782B (en) 2020-12-03 2020-12-03 Data cleaning method and system based on optimized edit distance

Publications (2)

Publication Number Publication Date
CN112463782A CN112463782A (en) 2021-03-09
CN112463782B true CN112463782B (en) 2022-03-18

Family

ID=74805457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011406088.9A Active CN112463782B (en) 2020-12-03 2020-12-03 Data cleaning method and system based on optimized edit distance

Country Status (1)

Country Link
CN (1) CN112463782B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114036919B (en) * 2021-11-23 2022-05-24 北京三维天地科技股份有限公司 Electronic experiment record book management method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559225A (en) * 2013-10-21 2014-02-05 北京航空航天大学 Cleaning method for Web service resource library data and server
CN110321466A (en) * 2019-06-14 2019-10-11 广发证券股份有限公司 A kind of security information duplicate checking method and system based on semantic analysis
CN111694823A (en) * 2020-05-15 2020-09-22 平安科技(深圳)有限公司 Organization standardization method and device, electronic equipment and storage medium
CN111858728A (en) * 2020-06-29 2020-10-30 国家计算机网络与信息安全管理中心 Data extraction method, device and equipment for different data sources and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9430463B2 (en) * 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
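王业 (Wang Ye). "Approximate String Matching Based on Edit Distance and Its Optimization Techniques" (基于编辑距离的近似字符串匹配及其优化技术). China Master's Theses Full-text Database, Information Science and Technology (中国优秀硕士学位论文全文数据库 信息科技辑). 2015. *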

Also Published As

Publication number Publication date
CN112463782A (en) 2021-03-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant