CN111597178A - Method, system, equipment and medium for cleaning repeating data - Google Patents


Publication number
CN111597178A
Authority
CN
China
Prior art keywords
data
similarity
neighbor
split
phrase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010419288.1A
Other languages
Chinese (zh)
Inventor
刘国梁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Genersoft Information Technology Co Ltd
Original Assignee
Shandong Inspur Genersoft Information Technology Co Ltd
Application filed by Shandong Inspur Genersoft Information Technology Co Ltd
Priority to CN202010419288.1A
Publication of CN111597178A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation

Abstract

The invention discloses a method, a system, a device, and a storage medium for cleaning duplicate data. The method comprises the following steps: splitting the data to be queried, and matching a plurality of discontinuous keywords from the split data against a database to obtain a neighbor data set; calculating the similarity between each piece of data in the neighbor data set and the data to be queried, and determining whether the neighbor data set contains data whose similarity is greater than a threshold; in response to such data existing in the neighbor data set, splitting the data whose similarity is greater than the threshold, and matching each split phrase against the data to be queried; and, in response to every phrase of the data whose similarity is greater than the threshold being matched successfully, deleting that data. By identifying the neighbors of the data to be queried through key attributes and checking similarity only within those neighbors, the method, system, device, and medium improve efficiency and locate similar or duplicate data quickly and accurately.

Description

Method, system, equipment and medium for cleaning repeating data
Technical Field
The present invention relates to the field of data processing, and more particularly, to a method, a system, a computer device, and a readable medium for cleaning duplicate data.
Background
In recent years, as hardware and software have continued to develop and data analysis has come to play an increasingly important role in national and enterprise development, governments and enterprises have paid more and more attention to data analysis and processing. However, large amounts of data are stored separately in different departments or subsidiaries, which creates information barriers, so the unified management of data has become an urgent problem. In the ERP products used by large enterprises in particular, the group and each subsidiary run one or more information systems of their own, and breaking through these information barriers is an important step in enterprise digitalization.
The master data product developed on the GSP framework is a key link in breaking down data barriers: with the GSP framework, a master data management system adapted to enterprise requirements can be defined quickly for the actual scenario. Data cleaning nevertheless remains a key part of master data management, because the data of each business system must be cleaned of dirty data and duplicate data before it can form master data. For example, across different business systems the record for "China National Railway Group Co., Ltd." may appear under several names, such as "China Railway Group" or an abbreviation rendered literally as "Middle Iron Group", and because each information system is operated and maintained independently, their primary keys are difficult to keep consistent. Therefore, before business system data is collected to form master data, similar or duplicate data must be removed through data cleaning.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a method, a system, a computer device, and a computer-readable storage medium for cleaning duplicate data. The neighbors of the data to be queried are identified through key attributes and similarity is checked only within those neighbors, so that similar or duplicate data can be located quickly and accurately and data cleaning becomes more efficient without making the operation harder for the user.
Based on the above object, one aspect of the embodiments of the present invention provides a method for cleaning duplicate data, comprising the following steps: splitting the data to be queried, and matching a plurality of discontinuous keywords from the split data against a database to obtain a neighbor data set; calculating the similarity between each piece of data in the neighbor data set and the data to be queried, and determining whether the neighbor data set contains data whose similarity is greater than a threshold; in response to such data existing in the neighbor data set, splitting the data whose similarity is greater than the threshold, and matching each split phrase against the data to be queried; and, in response to every phrase of the data whose similarity is greater than the threshold being matched successfully, deleting that data.
In some embodiments, the method further comprises: in response to no data in the neighbor data set having a similarity greater than the threshold, calculating the similarity for all data in the database.
In some embodiments, determining whether the neighbor data set contains data whose similarity is greater than the threshold further comprises: sorting the data in the neighbor data set by similarity in descending order, and determining whether the similarity of the top-ranked data is greater than the threshold.
In some embodiments, matching each split phrase against the data to be queried comprises: matching each split phrase against the phrases of the data to be queried; and, in response to a phrase not being matched, splitting that phrase into basic units and matching the basic units against the data to be queried.
Another aspect of the embodiments of the present invention provides a system for cleaning duplicate data, comprising: a first splitting module configured to split the data to be queried and match a plurality of discontinuous keywords from the split data against a database to obtain a neighbor data set; a calculation module configured to calculate the similarity between each piece of data in the neighbor data set and the data to be queried, and determine whether the neighbor data set contains data whose similarity is greater than a threshold; a second splitting module configured to, in response to such data existing in the neighbor data set, split the data whose similarity is greater than the threshold and match each split phrase against the data to be queried; and an execution module configured to delete the data in response to every phrase of the data whose similarity is greater than the threshold being matched successfully.
In some embodiments, the system further comprises: a second calculation module configured to calculate the similarity for all data in the database in response to no data in the neighbor data set having a similarity greater than the threshold.
In some embodiments, the calculation module is further configured to: sort the data in the neighbor data set by similarity in descending order, and determine whether the similarity of the top-ranked data is greater than the threshold.
In some embodiments, the second splitting module is further configured to: match each split phrase against the phrases of the data to be queried; and, in response to a phrase not being matched, split that phrase into basic units and match the basic units against the data to be queried.
Another aspect of the embodiments of the present invention provides a computer device, comprising: at least one processor; and a memory storing computer instructions executable on the processor, the instructions, when executed by the processor, implementing the steps of the above method.
A further aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
The invention has the following beneficial technical effects: the neighbors of the data to be queried are identified through key attributes and similarity is checked only within those neighbors, so that similar or duplicate data can be located quickly and accurately, data cleaning becomes more efficient without making the operation harder for the user, and similar or duplicate data can be queried more conveniently.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other embodiments from them without creative effort.
FIG. 1 is a schematic diagram of an embodiment of the method for cleaning duplicate data according to the present invention;
FIG. 2 is a schematic hardware structure diagram of an embodiment of the computer device for cleaning duplicate data according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that the terms "first" and "second" in the embodiments of the present invention are used only to distinguish two entities or parameters that share a name but are not the same. They serve merely for convenience of description, should not be construed as limiting the embodiments of the present invention, and are not explained again in the following embodiments.
In view of the above object, a first aspect of the embodiments of the present invention provides an embodiment of a method for cleaning duplicate data. FIG. 1 is a schematic diagram of an embodiment of the method for cleaning duplicate data according to the present invention. As shown in FIG. 1, the embodiment of the present invention includes the following steps:
S1, splitting the data to be queried, and matching a plurality of discontinuous keywords from the split data against a database to obtain a neighbor data set;
S2, calculating the similarity between each piece of data in the neighbor data set and the data to be queried, and determining whether the neighbor data set contains data whose similarity is greater than a threshold;
S3, in response to such data existing in the neighbor data set, splitting the data whose similarity is greater than the threshold, and matching each split phrase against the data to be queried; and
S4, in response to every phrase of the data whose similarity is greater than the threshold being matched successfully, deleting that data.
In a master data product, duplicate data is invalid data that master data must avoid and eliminate first, and forming formal master data through data cleaning is the most common approach. A common cleaning method calculates similarity using the Euclidean distance, but when a record has many attributes and the data volume is large, this calculation becomes inefficient. The invention therefore exploits a characteristic of ERP data: for records such as materials, customers, organizations, and personnel, key attributes such as name, model, and employee number are generally enough to tell whether two records are duplicates. Before the similarity calculation, a discontinuous fuzzy query retrieves the neighbor data of the record, and the similarity check is then run only on that neighbor data. Data whose similarity is greater than the threshold is returned directly to the front end for the user to review and clean; if no such data exists, the similarity calculation is run over all data.
The data to be queried is split, and a plurality of discontinuous keywords from the split data are matched against the database to obtain the neighbor data set. The embodiment of the invention can search for neighbor data with a discontinuous fuzzy query. In an ERP product, most records can be identified as duplicates by name, but a conventional fuzzy search does not suit this scenario. Take "China National Railway Group Co., Ltd.": because the business systems are isolated from one another, the same entity may be recorded under different names, such as "China Railway Group" and shorter abbreviations of it. An ordinary fuzzy query uses a like '%{key}%' statement directly, which matches only records that contain the keyword as one contiguous string and can therefore return incomplete results. For this reason a discontinuous fuzzy query is used when searching for neighbor data. The keyword may be split character by character, for example:
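// keyWordValue is the keyword string; resultMsg becomes the LIKE pattern.
string resultMsg = "";
for (int i = 0; i < keyWordValue.Length; i++)
{
    // Wrap each character in wildcards so other text may sit between characters.
    resultMsg += "%" + keyWordValue[i] + "%";
}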
A fuzzy query is then run with the split query pattern, so that as much as possible of the similar or duplicate data corresponding to the record is retrieved. The neighbor data obtained through this discontinuous fuzzy keyword query is organized into the neighbor data set. For example, if the data to be queried is "China National Railway Group Co., Ltd.", it can be split into individual Chinese characters, and several discontinuous characters, such as those meaning "middle", "iron", and "set", are selected as keywords and matched against the database to obtain the neighbor data set.
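As an illustration only, the discontinuous fuzzy query can be sketched in Python against an in-memory SQLite database. The table name `company`, the column `name`, the function names, and the wildcard escaping are assumptions made for this sketch and are not taken from the embodiment.

```python
import sqlite3


def discontinuous_like_pattern(keyword: str) -> str:
    """Wrap every character in % wildcards, e.g. 'abc' -> '%a%%b%%c%'."""
    parts = []
    for ch in keyword:
        # Escape LIKE metacharacters so they are matched literally (assumption).
        if ch in ("%", "_", "\\"):
            ch = "\\" + ch
        parts.append(f"%{ch}%")
    return "".join(parts)


def find_neighbors(conn: sqlite3.Connection, keyword: str) -> list[str]:
    """Return the names that contain the keyword's characters in order."""
    pattern = discontinuous_like_pattern(keyword)
    rows = conn.execute(
        # Hypothetical table and column names.
        "SELECT name FROM company WHERE name LIKE ? ESCAPE '\\' ORDER BY name",
        (pattern,),
    ).fetchall()
    return [row[0] for row in rows]
```

Because every character is surrounded by wildcards, the pattern matches any name in which the characters occur in the same order, with or without other text between them. The result can therefore be broad, which is why a similarity check on the neighbor data set follows.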
The similarity between each piece of data in the neighbor data set and the data to be queried is calculated, and it is determined whether the neighbor data set contains data whose similarity is greater than the threshold. The similarity can be calculated with the Euclidean distance: the similarity between every record in the neighbor data set and the data to be queried is calculated and compared against the similarity threshold set by the user. Because the neighbor data set is much smaller than the global data set, the similarity calculation and check are very efficient, and given the characteristics of ERP data the accuracy is also high.
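The description does not state how a record is turned into a vector before the Euclidean distance is taken. The following Python sketch assumes character-count vectors and maps the distance d to a similarity of 1 / (1 + d); both choices, and the function names, are illustrative assumptions rather than the embodiment's definition.

```python
import math
from collections import Counter


def euclidean_similarity(a: str, b: str) -> float:
    """Similarity in (0, 1]; identical character counts give 1.0."""
    counts_a, counts_b = Counter(a), Counter(b)
    # Euclidean distance between the two character-count vectors.
    distance = math.sqrt(
        sum((counts_a[ch] - counts_b[ch]) ** 2 for ch in set(counts_a) | set(counts_b))
    )
    # Assumed mapping from distance to similarity.
    return 1.0 / (1.0 + distance)


def score_neighbors(query: str, neighbors: list[str]) -> list[tuple[str, float]]:
    """Pair every neighbor with its similarity to the query."""
    return [(name, euclidean_similarity(query, name)) for name in neighbors]
```

Scoring only the neighbor data set keeps the number of distance computations proportional to the size of that set rather than to the whole database.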
In some embodiments, determining whether the neighbor data set contains data whose similarity is greater than the threshold further comprises: sorting the data in the neighbor data set by similarity in descending order, and determining whether the similarity of the top-ranked data is greater than the threshold. Once the data is sorted in descending order of similarity, comparing only the top-ranked record is enough to tell whether any data exceeds the threshold. If the similarity of the top-ranked record is greater than the threshold, the following records can be checked in order, and the comparison stops at the first record whose similarity is less than or equal to the threshold.
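The sort-and-stop check described above can be sketched as follows. The function name and the (name, similarity) pair layout are assumptions made for illustration.

```python
def records_above_threshold(
    scored: list[tuple[str, float]], threshold: float
) -> list[tuple[str, float]]:
    """Return the records whose similarity exceeds the threshold, best first."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    hits = []
    for name, similarity in ranked:
        # Everything after the first non-qualifying record is lower still.
        if similarity <= threshold:
            break
        hits.append((name, similarity))
    return hits
```

An empty result means that even the top-ranked record is not above the threshold, so no record in the neighbor data set qualifies.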
In response to the neighbor data set containing data whose similarity is greater than the threshold, that data is split, and each split phrase is matched against the data to be queried. If data above the threshold exists, the data below the threshold can simply be discarded, and the data above the threshold is returned to the front end for the user to choose from. The data above the threshold may also be split and each split phrase matched against the data to be queried, in order to decide whether the data needs to be cleaned. If every similarity calculated for the neighbor data set is below the threshold, the method proceeds directly to cleaning against the global data.
In some embodiments, matching each split phrase against the data to be queried comprises: matching each split phrase against the phrases of the data to be queried; and, in response to a phrase not being matched, splitting that phrase into basic units and matching the basic units against the data to be queried. For example, suppose the data whose similarity is greater than the threshold is "China Railway Group". It can be split into "China", "Railway", and "Group", and each of these phrases is matched against the data to be queried; here all three phrases match. If a phrase does not match, for example the phrase "steel" (a two-character word in the original Chinese), it is split further into its individual characters, meaning "steel" and "iron", and matching continues with those characters.
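A minimal Python sketch of this two-level matching follows. It assumes phrases are separated by whitespace, that the basic unit is a single character, and that every unit of an unmatched phrase must occur in the data to be queried; the description does not fix any of these choices, and the function names are hypothetical.

```python
def split_phrases(text: str) -> list[str]:
    """Assumed phrase splitter: break the text on whitespace."""
    return text.split()


def matches_query(candidate: str, query: str) -> bool:
    """True if every phrase of the candidate is found in the query."""
    for phrase in split_phrases(candidate):
        if phrase in query:
            continue  # the whole phrase matched
        # Fallback: split the phrase into basic units (characters) and
        # require each of them to appear in the query.
        if not all(unit in query for unit in phrase):
            return False
    return True
```

The character-level fallback is deliberately permissive: a phrase whose characters all occur somewhere in the query counts as matched even if they are scattered, which is one reason to review deletions afterwards.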
In response to every phrase of the data whose similarity is greater than the threshold being matched successfully, the data is deleted. If every phrase of such data matches, the data is very likely a duplicate of the data to be queried and can be deleted. To guard against wrongful deletion, the deleted data can be checked again and restored if it was deleted by mistake.
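One way to keep deletions reversible is a soft delete, sketched below in Python. The `RecordStore` class and its recycle-bin list are hypothetical and only illustrate the idea of checking and restoring deleted data; the description does not say how recovery is implemented.

```python
class RecordStore:
    """Holds active records and keeps deleted ones so they can be restored."""

    def __init__(self, records: list[str]) -> None:
        self.active = list(records)
        self.recycle_bin: list[str] = []

    def delete(self, record: str) -> None:
        """Move a suspected duplicate out of the active set."""
        self.active.remove(record)
        self.recycle_bin.append(record)

    def restore(self, record: str) -> None:
        """Undo a deletion that review showed to be wrong."""
        self.recycle_bin.remove(record)
        self.active.append(record)
```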
In some embodiments, the method further comprises: in response to no data in the neighbor data set having a similarity greater than the threshold, calculating the similarity for all data in the database. This check runs only when the check on the neighbor data set finds no similar data. Calculating the Euclidean distance over all data is time-consuming, but because it covers the global data set its accuracy is the highest, so it remains necessary when the user cannot identify duplicate data through the previous step. This step ensures that duplicate or similar data does not enter the master data as dirty data, so the data quality of the master data is not affected.
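The neighbor-first, global-fallback control flow can be sketched as below. The similarity function is passed in as a parameter, and the character-set overlap shown is only a placeholder for the example, not the Euclidean-distance measure of the embodiment; all names are assumptions.

```python
from typing import Callable


def char_overlap(a: str, b: str) -> float:
    """Placeholder similarity: Jaccard overlap of the two character sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def find_similar(
    query: str,
    neighbors: list[str],
    all_records: list[str],
    threshold: float,
    similarity: Callable[[str, str], float] = char_overlap,
) -> tuple[str, list[str]]:
    """Check the neighbor set first; scan all records only if it yields nothing."""

    def over_threshold(records: list[str]) -> list[str]:
        return [r for r in records if similarity(query, r) > threshold]

    hits = over_threshold(neighbors)
    if hits:
        return "neighbor", hits
    return "global", over_threshold(all_records)
```

The first element of the result records which stage produced the hits, which makes it easy to see how often the more expensive global scan is actually needed.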
With the above approach, a discontinuous fuzzy keyword query is run before global cleaning, and the most likely duplicate or similar data is retrieved for the user to decide whether to remove it; if no similar data can be found through the keywords, the subsequent global cleaning is performed. First, in terms of cleaning efficiency: with large data volumes, the method effectively reduces the time spent on similarity calculation, shortens the user's waiting time, and makes the product feel more responsive. Second, in terms of cleaning accuracy: given the characteristics of ERP data, most similar data across different business systems can be identified with one or two keywords, and because the similarity check is then run on the data retrieved through those keywords to find the records with the highest similarity, the accuracy is higher. Finally, in terms of the user interface: if the fuzzy query finds no similar data, the global similarity check continues automatically, so no additional operation is required from the user and manual work does not become more complex.
It should be particularly noted that the steps in the embodiments of the method for cleaning duplicate data may be combined with one another, replaced, added, or deleted. Methods for cleaning duplicate data obtained through such reasonable permutations and combinations also fall within the scope of the present invention, and the scope of the present invention should not be limited to the described embodiments.
In view of the above object, a second aspect of the embodiments of the present invention provides a system for cleaning duplicate data, comprising: a first splitting module configured to split the data to be queried and match a plurality of discontinuous keywords from the split data against a database to obtain a neighbor data set; a calculation module configured to calculate the similarity between each piece of data in the neighbor data set and the data to be queried, and determine whether the neighbor data set contains data whose similarity is greater than a threshold; a second splitting module configured to, in response to such data existing in the neighbor data set, split the data whose similarity is greater than the threshold and match each split phrase against the data to be queried; and an execution module configured to delete the data in response to every phrase of the data whose similarity is greater than the threshold being matched successfully.
In some embodiments, the system further comprises: a second calculation module configured to calculate the similarity for all data in the database in response to no data in the neighbor data set having a similarity greater than the threshold.
In some embodiments, the calculation module is further configured to: sort the data in the neighbor data set by similarity in descending order, and determine whether the similarity of the top-ranked data is greater than the threshold.
In some embodiments, the second splitting module is further configured to: match each split phrase against the phrases of the data to be queried; and, in response to a phrase not being matched, split that phrase into basic units and match the basic units against the data to be queried.
In view of the above object, a third aspect of the embodiments of the present invention provides a computer device, comprising: at least one processor; and a memory storing computer instructions executable on the processor, the instructions being executed by the processor to perform the following steps: S1, splitting the data to be queried, and matching a plurality of discontinuous keywords from the split data against a database to obtain a neighbor data set; S2, calculating the similarity between each piece of data in the neighbor data set and the data to be queried, and determining whether the neighbor data set contains data whose similarity is greater than a threshold; S3, in response to such data existing in the neighbor data set, splitting the data whose similarity is greater than the threshold, and matching each split phrase against the data to be queried; and S4, in response to every phrase of the data whose similarity is greater than the threshold being matched successfully, deleting that data.
In some embodiments, the steps further comprise: in response to no data in the neighbor data set having a similarity greater than the threshold, calculating the similarity for all data in the database.
In some embodiments, determining whether the neighbor data set contains data whose similarity is greater than the threshold further comprises: sorting the data in the neighbor data set by similarity in descending order, and determining whether the similarity of the top-ranked data is greater than the threshold.
In some embodiments, matching each split phrase against the data to be queried comprises: matching each split phrase against the phrases of the data to be queried; and, in response to a phrase not being matched, splitting that phrase into basic units and matching the basic units against the data to be queried.
FIG. 2 is a schematic hardware structure diagram of an embodiment of the computer device for cleaning duplicate data according to the present invention.
Taking the apparatus shown in fig. 2 as an example, the apparatus includes a processor 301 and a memory 302, and may further include: an input device 303 and an output device 304.
The processor 301, the memory 302, the input device 303 and the output device 304 may be connected by a bus or other means, and fig. 2 illustrates the connection by a bus as an example.
The memory 302, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for cleaning duplicate data in the embodiments of the present application. By running the non-volatile software programs, instructions, and modules stored in the memory 302, the processor 301 executes the various functional applications and data processing of the server, that is, implements the method for cleaning duplicate data of the above method embodiments.
The memory 302 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created through use of the method for cleaning duplicate data, and the like. Further, the memory 302 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 302 optionally includes memory located remotely from the processor 301, which may be connected to the local module via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 303 may receive information such as a user name and a password that are input. The output means 304 may comprise a display device such as a display screen.
Program instructions/modules corresponding to one or more methods of cleansing duplicate data are stored in the memory 302 and, when executed by the processor 301, perform the methods of cleansing duplicate data in any of the method embodiments described above.
Any embodiment of the computer device that executes the method for cleaning duplicate data can achieve the same or similar effects as the corresponding method embodiment.
The invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, performs the method as above.
Finally, it should be noted that, as one of ordinary skill in the art can appreciate, all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program of the method for cleaning duplicate data can be stored in a computer-readable storage medium, and when executed, the program can include the processes of the embodiments of the methods described above. The storage medium of the program may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
Furthermore, the methods disclosed according to embodiments of the present invention may also be implemented as a computer program executed by a processor, and the computer program may be stored in a computer-readable storage medium. When executed by the processor, the computer program performs the above-described functions defined in the methods disclosed in embodiments of the invention.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Further, it should be appreciated that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in a variety of forms such as synchronous RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with the following components designed to perform the functions herein: a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, and/or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The serial numbers of the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure, including the claims, of embodiments of the invention is limited to these examples. Within the idea of an embodiment of the invention, technical features of the above embodiment or of different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A method for cleaning duplicate data, comprising the steps of:
splitting data to be queried, and matching in a database according to a plurality of discontinuous keywords in the split data to obtain a neighbor data set;
calculating the similarity between each piece of data in the neighbor data set and the data to be queried, and determining whether any data in the neighbor data set has a similarity greater than a threshold;
in response to data with a similarity greater than the threshold existing in the neighbor data set, splitting the data with the similarity greater than the threshold, and matching each split phrase with the data to be queried; and
deleting the data in response to each phrase of the data with the similarity greater than the threshold being successfully matched.
2. The method of claim 1, further comprising:
in response to no data in the neighbor data set having a similarity greater than a threshold, calculating a similarity for all data in the database.
3. The method of claim 1, wherein the determining whether there is data in the neighbor data set with a similarity greater than a threshold further comprises:
sorting the data in the neighbor data set by similarity in descending order, and determining whether the similarity of the top-ranked data is greater than the threshold.
4. The method of claim 1, wherein the matching each split phrase with the data to be queried comprises:
matching each split phrase with the phrases of the data to be queried; and
in response to there being unmatched phrases, splitting each split phrase into basic units, and matching the basic units with the data to be queried.
5. A system for cleaning duplicate data, comprising:
a first splitting module configured to split data to be queried and to match in a database according to a plurality of discontinuous keywords in the split data to obtain a neighbor data set;
a calculation module configured to calculate the similarity between each piece of data in the neighbor data set and the data to be queried, and to determine whether any data in the neighbor data set has a similarity greater than a threshold;
a second splitting module configured to, in response to data with a similarity greater than the threshold existing in the neighbor data set, split the data with the similarity greater than the threshold and match each split phrase with the data to be queried; and
an execution module configured to delete the data in response to each phrase of the data with the similarity greater than the threshold being successfully matched.
6. The system of claim 5, further comprising:
a second calculation module configured to calculate the similarity of all data in the database in response to no data in the neighbor data set having a similarity greater than a threshold.
7. The system of claim 5, wherein the calculation module is further configured to:
sort the data in the neighbor data set by similarity in descending order, and determine whether the similarity of the top-ranked data is greater than the threshold.
8. The system of claim 5, wherein the second splitting module is further configured to:
match each split phrase with the phrases of the data to be queried; and
in response to there being unmatched phrases, split each split phrase into basic units, and match the basic units with the data to be queried.
9. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method of any one of claims 1 to 4.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
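
The method of claims 1 to 4 can be read as a small pipeline: build a neighbor set from non-adjacent keywords, rank it by similarity, fall back to the whole database if nothing passes the threshold, then confirm each candidate phrase by phrase. The Python sketch below is only one possible reading and is not taken from the application. Whitespace tokenisation, the use of every second phrase as the "discontinuous keywords", `difflib.SequenceMatcher` as the similarity measure, single characters as the "basic units", the default threshold of 0.8, and all function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def split_phrases(text):
    # Assumption: phrases are whitespace-separated tokens.
    return text.split()


def neighbor_set(query, database, step=2):
    # Take every `step`-th phrase of the query (non-adjacent keywords)
    # and keep the records that contain all of them.
    keywords = split_phrases(query)[::step]
    return [rec for rec in database if all(k in rec for k in keywords)]


def similarity(a, b):
    # Assumption: difflib's ratio stands in for the unspecified measure.
    return SequenceMatcher(None, a, b).ratio()


def phrases_all_match(candidate, query):
    # Claim 1: every phrase of the candidate must match the query.
    query_phrases = set(split_phrases(query))
    unmatched = [p for p in split_phrases(candidate) if p not in query_phrases]
    # Claim 4: phrases that fail to match are split into basic units
    # (characters here), which are matched against the query text.
    return all(ch in query for phrase in unmatched for ch in phrase)


def find_duplicates(query, database, threshold=0.8):
    def ranked(records):
        # Claim 3: sort by similarity, largest first.
        return sorted(((similarity(r, query), r) for r in records), reverse=True)

    scored = ranked(neighbor_set(query, database))
    if not scored or scored[0][0] <= threshold:
        # Claim 2: nothing in the neighbor set passes, so score all records.
        scored = ranked(database)
    return [r for score, r in scored
            if score > threshold and phrases_all_match(r, query)]


def clean(query, database, threshold=0.8):
    # Return a copy of the database without the confirmed duplicates.
    duplicates = set(find_duplicates(query, database, threshold))
    return [rec for rec in database if rec not in duplicates]
```

In this reading the data to be queried is an incoming record that is checked against the stored records, and `clean` returns a new list rather than deleting in place. A production system would replace the linear scan in `neighbor_set` with an indexed keyword lookup in the database.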
CN202010419288.1A 2020-05-18 2020-05-18 Method, system, equipment and medium for cleaning repeating data Pending CN111597178A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010419288.1A CN111597178A (en) 2020-05-18 2020-05-18 Method, system, equipment and medium for cleaning repeating data

Publications (1)

Publication Number Publication Date
CN111597178A true CN111597178A (en) 2020-08-28

Family

ID=72182947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010419288.1A Pending CN111597178A (en) 2020-05-18 2020-05-18 Method, system, equipment and medium for cleaning repeating data

Country Status (1)

Country Link
CN (1) CN111597178A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008310939A (en) * 2007-05-15 2008-12-25 Sony Corp Data processing apparatus and method, program, and storage medium
US20130013597A1 (en) * 2011-06-17 2013-01-10 Alibaba Group Holding Limited Processing Repetitive Data
CN107832450A (en) * 2017-11-23 2018-03-23 安徽科创智慧知识产权服务有限公司 Method for cleaning Data duplication record
CN109635084A (en) * 2018-11-30 2019-04-16 宁波深擎信息科技有限公司 A kind of real-time quick De-weight method of multi-source data document and system
CN111046227A (en) * 2019-11-29 2020-04-21 腾讯科技(深圳)有限公司 Video duplicate checking method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112783883A (en) * 2021-01-22 2021-05-11 广东电网有限责任公司东莞供电局 Power data standardized cleaning method and device under multi-source data access
CN113656393A (en) * 2021-08-24 2021-11-16 北京百度网讯科技有限公司 Data processing method, data processing device, electronic equipment and storage medium
CN113656393B (en) * 2021-08-24 2024-01-12 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and storage medium
CN114942923A (en) * 2022-07-11 2022-08-26 深圳新闻网传媒股份有限公司 Cloud platform-based unified management system for big data calculation and analysis
CN115543979A (en) * 2022-09-29 2022-12-30 广州鼎甲计算机科技有限公司 Method, device, equipment, storage medium and program product for deleting repeated data
CN115543979B (en) * 2022-09-29 2023-08-08 广州鼎甲计算机科技有限公司 Method, apparatus, device, storage medium and program product for deleting duplicate data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200828