CN113127461A - Data cleaning method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113127461A
Authority
CN
China
Prior art keywords
general
special
members
member set
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911416103.5A
Other languages
Chinese (zh)
Other versions
CN113127461B (en)
Inventor
张英杰
袁伟
朱礼军
吴思
曹燕
张静
赵辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute Of Scientific And Technical Information Of China
Original Assignee
Institute Of Scientific And Technical Information Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute Of Scientific And Technical Information Of China
Priority to CN201911416103.5A
Publication of CN113127461A
Application granted
Publication of CN113127461B
Current legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to a data cleaning method, which comprises the following steps: acquiring sample data; determining a general member set according to the sample data, wherein the general member set includes all general members in the sample data; determining a special member set according to the sample data, wherein the special member set includes all special members in the sample data; and obtaining a cleaning data set according to the general member set and the special member set.

Description

Data cleaning method and device, electronic equipment and storage medium
Technical Field
The present application belongs to the field of databases, and in particular, relates to a data cleaning method, a data cleaning apparatus, an electronic device, and a storage medium.
Background
The background of this application is the construction of a database that uses a researcher's ID card number as the primary key, built from 26 existing scientific and technological talent databases using the researcher's education experience, research institutions, research projects, articles, and patents as keywords.
Existing scientific and technological talent databases have high data redundancy, and missing, duplicated, and conflicting data are common. Existing algorithms cannot intelligently assess data quality, so data cleaning struggles to meet data quality requirements. Moreover, the database holds on the order of hundreds of millions of tuples, so it is difficult to build in a short time with the traditional approach of manual cleaning and keyword-based screening. A method for automatically cleaning scientific and technological talent data based on artificial intelligence is therefore provided.
Disclosure of Invention
The present application is directed to a data cleaning method, a data cleaning apparatus, an electronic device, and a storage medium.
An embodiment of the present application provides a data cleaning method, including: acquiring sample data; determining a general member set according to the sample data, wherein the general member set includes all general members in the sample data; determining a special member set according to the sample data, wherein the special member set includes all special members in the sample data; and obtaining a cleaning data set according to the general member set and the special member set.
Optionally, the method may further include: searching the general member set for members meeting a preset selection condition to obtain a general member subset; and searching the special member set for members meeting the preset selection condition to obtain a special member subset. In this case, obtaining a cleaning data set according to the general member set and the special member set comprises: obtaining the cleaning data set based on the general member subset and the special member subset.
Optionally, in the method, the general member set and the special member set may contain at least one same attribute.
Further, the method may further include: determining an evaluation factor for each member in the set of general members; determining the evaluation factor for each member of the set of special members; wherein the evaluation factor is obtained according to a preset linear expression of the at least one same attribute of the member.
Still further, in the method, the preset selection condition may be a comparison result of the evaluation factor and a first threshold.
Another embodiment of the present application also provides a data cleaning apparatus, including: a sample data acquisition unit, configured to acquire sample data; a general member set establishing unit, configured to establish a general member set according to the sample data; a special member set establishing unit, configured to establish a special member set according to the sample data; and a cleaning data acquisition unit, configured to determine a cleaning data set according to the general member set and the special member set.
Optionally, the apparatus may further include: the general member subset establishing unit is connected between the general member set establishing unit and the cleaning data acquiring unit and used for establishing a general member subset according to the general member set; and the special member subset establishing unit is connected between the special member set establishing unit and the cleaning data acquiring unit and is used for establishing a special member subset according to the special member set.
Further, the apparatus may further include: the evaluation factor determination unit is used for determining the evaluation factors of the members in the general member set, the evaluation factors of the members in the general member set are used for establishing the general member subset, the evaluation factor determination unit is also used for determining the evaluation factors of the members in the special member set, and the evaluation factors of the members in the special member set are used for establishing the special member subset.
Another embodiment of the present application further provides an electronic device including a processor and a memory, and a program stored in the memory and executable by the processor, wherein when the program is executed, the processor performs any one of the methods described above.
Another embodiment of the present application also provides a storage medium storing a program executable by a processor, the processor performing any one of the methods described above when the program is executed.
By utilizing the method, the device, the electronic equipment and the storage medium, the purpose of cleaning large-scale data can be simply and quickly realized through interaction between the two sets by decomposing the sample data set into the general member set and the special member set. The method has the advantages of simplicity, rapidness and easy automation.
Drawings
FIG. 1 shows a flow diagram of a data cleansing method according to an embodiment of the present application.
Fig. 2 shows a further schematic flow diagram of the embodiment shown in fig. 1.
Fig. 3 shows a schematic flow chart of a data cleansing method according to another embodiment of the present application.
FIG. 4 shows a schematic block diagram of a data cleansing apparatus according to another embodiment of the present application.
FIG. 5 shows a block diagram of an electronic device according to an example embodiment.
Detailed Description
The following describes embodiments of the present disclosure relating to a data cleaning method, a data cleaning apparatus, an electronic device, and a storage medium by way of specific embodiments; those skilled in the art will understand the advantages and effects of the present invention from this disclosure. The invention is capable of other and different embodiments, and its several details may be modified in various respects, all without departing from the spirit and scope of the present invention. The drawings are for illustrative purposes only and are not drawn to scale. The following embodiments explain the related technical content of the present invention in further detail, but the disclosure is not intended to limit the scope of the present invention.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present application are used for distinguishing between different objects and not for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this application, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the application. As used in the specification and claims of this application, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this application refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
FIG. 1 shows a flow diagram of a data cleansing method according to an embodiment of the present application.
As shown in fig. 1, the method 1000 may include: S110, S120, S130 and S140. In S110, sample data H may be acquired. Optionally, the sample data H may contain dirty data, which may include at least one of missing data, redundant data, and abnormal data. Optionally, the sample data H may be one table, or two or more tables.
In S120, a general member set G may be determined according to the sample data H. The general member set G may include all or part of the general members in the sample data H. Optionally, every member of the sample data H may be a general member.
In S130, a special member set S may be determined according to the sample data H. The special member set S may include all or part of the special members in the sample data H. Optionally, a special member may be a member of the sample data H whose data is complete.
Optionally, S130 may be performed before S120. S120 may also be performed in parallel with, or interleaved with, S130.
In S140, a cleaning data set can be obtained according to the aforementioned general member set G and special member set S.
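The flow of S110 to S140 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation described by the application: members are modelled as Python dictionaries, a missing value is `None` or an empty string, and the rule used to combine the two sets in S140 (keeping the general members that also appear in the special set) is a hypothetical choice, since the text does not fix one. The names `is_complete`, `split_members`, and `cleaning_data_set` are illustrative.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Optional[Any]]  # one member (row) of the sample data H


def is_complete(row: Row) -> bool:
    """A member is complete when no attribute is None or an empty string."""
    return all(value is not None and value != "" for value in row.values())


def split_members(sample_data: List[Row]) -> Tuple[List[Row], List[Row]]:
    """S120 and S130: build the general set G and the special set S."""
    general = list(sample_data)                               # every member of H
    special = [row for row in sample_data if is_complete(row)]  # complete members only
    return general, special


def cleaning_data_set(general: List[Row], special: List[Row]) -> List[Row]:
    """S140 (assumed rule): keep general members that are also special members."""
    return [row for row in general if row in special]


# Hypothetical sample data H with one incomplete member.
H = [
    {"id": 111, "start": 2008, "end": 2010},
    {"id": 333, "start": 2016, "end": None},
]
G, S = split_members(H)
cleaned = cleaning_data_set(G, S)
```

Because S120 and S130 read the sample data independently, the two list constructions in `split_members` could equally run in either order or in parallel.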
As shown in fig. 2, optionally, the method 1000 may further include: obtaining a general member subset (a subset of the general boundary G) from the general member set (general boundary) G; and obtaining a special member subset (a subset of the special boundary S) from the special member set (special boundary) S. In this case, S140 may include deriving the cleaning data set from the general member subset and the special member subset.
Optionally, obtaining the general member subset from the general member set G may include: searching the members of the general member set G according to a preset retrieval condition, and establishing the general member subset from the members of G that meet the condition. Optionally, obtaining the special member subset from the special member set S may include: searching the members of the special member set S according to a preset retrieval condition, and establishing the special member subset from the members of S that meet the condition. Optionally, the retrieval condition used to establish the general member subset and the retrieval condition used to establish the special member subset may be the same.
Optionally, the general member set G and the special member set S each include a plurality of attributes, wherein at least one of the attributes of the general member set G is the same as one of the attributes of the special member set S. Further, all attributes of the general member set G and of the special member set S may be the same.
Optionally, the method 1000 may further include determining an evaluation factor. Optionally, the method 1000 may include determining an evaluation factor for each member of the general member set G and determining an evaluation factor for each member of the special member set S. Optionally, the method of determining the evaluation factor of each member in the general member set G and the method of determining the evaluation factor of each member in the special member set S may be the same.
Optionally, the evaluation factor may be obtained from a linear expression of at least one of the attributes shared by the general member set G and the special member set S. That is, the evaluation factor can be obtained by multiplying the numerical value of each such shared attribute of a member by a preset coefficient and summing the resulting products. Optionally, the result of comparing the evaluation factor with the first threshold may serve as the aforementioned retrieval condition.
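The linear evaluation factor and the threshold-based retrieval condition can be illustrated with a short sketch. It assumes that the shared attributes have already been converted to numbers and that the comparison is "greater than the first threshold"; the text only says that a comparison result is used, so the direction is an assumption. The function names, attribute names, and coefficient values are hypothetical.

```python
from typing import Dict, List

Member = Dict[str, float]  # shared attribute name -> numeric value


def evaluation_factor(member: Member, coefficients: Dict[str, float]) -> float:
    """Linear expression: sum of (preset coefficient * attribute value)."""
    return sum(coeff * member[attr] for attr, coeff in coefficients.items())


def select_subset(members: List[Member],
                  coefficients: Dict[str, float],
                  first_threshold: float) -> List[Member]:
    """Keep members whose evaluation factor exceeds the threshold (assumed direction)."""
    return [m for m in members
            if evaluation_factor(m, coefficients) > first_threshold]


# Hypothetical coefficients and members, chosen so the arithmetic is exact.
coefficients = {"a": 0.5, "b": 2.0}
members = [
    {"a": 1.0, "b": 0.25},  # 0.5 * 1.0 + 2.0 * 0.25 = 1.0
    {"a": 0.0, "b": 0.25},  # 0.5 * 0.0 + 2.0 * 0.25 = 0.5
]
subset = select_subset(members, coefficients, first_threshold=0.75)
```

The same `select_subset` call can be applied to the general member set and to the special member set, which matches the option that both subsets use the same retrieval condition.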
Optionally, the method 1000 may further include the following steps. A member d is extracted from the sample data H, and it is determined whether member d is a positive example. If d is a positive example, all hypotheses that are inconsistent with member d are removed from the general member set G. Each hypothesis s in the special member set S that is inconsistent with member d is removed from S, and all minimal generalizations h of s are added to S, where h is consistent with d and some member of G is more general than h. Then every hypothesis that is more general than another hypothesis in S is removed from S.
If d is a negative example, all hypotheses that are inconsistent with d are removed from the special member set S. Each hypothesis g in the general member set G that is inconsistent with member d is removed from G, and all minimal specializations h of g are added to G, where h is consistent with member d and some member of the special member set S is more specific than h. Then every hypothesis that is more specific than another hypothesis in the general member set G is removed from G.
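The update rules above read like the candidate-elimination procedure over a version space, with G as the general boundary and S as the special boundary. The sketch below illustrates that procedure under assumptions the text does not state: a hypothesis is a tuple of attribute constraints, `"?"` accepts any value, `None` accepts no value, and each attribute has a known finite domain. All function names are illustrative.

```python
from typing import List, Optional, Sequence, Set, Tuple

ANY = "?"  # wildcard: any value is accepted for this attribute
Hypothesis = Tuple[Optional[str], ...]  # None means "no value is accepted"
Example = Tuple[Tuple[str, ...], bool]  # (member d, is_positive)


def covers(h: Hypothesis, x: Sequence[str]) -> bool:
    """True if hypothesis h classifies member x as positive."""
    return all(c == ANY or c == v for c, v in zip(h, x))


def at_least_as_general(h1: Hypothesis, h2: Hypothesis) -> bool:
    """True if h1 covers every member that h2 covers."""
    if any(c is None for c in h2):  # h2 covers nothing at all
        return True
    return all(a == ANY or a == b for a, b in zip(h1, h2))


def min_generalization(s: Hypothesis, x: Sequence[str]) -> Hypothesis:
    """Smallest change to s that makes it cover x."""
    return tuple(v if c is None else (c if c == v else ANY) for c, v in zip(s, x))


def min_specializations(g: Hypothesis, x: Sequence[str],
                        domains: Sequence[Sequence[str]]) -> List[Hypothesis]:
    """Smallest changes to g that make it reject x."""
    out: List[Hypothesis] = []
    for i, c in enumerate(g):
        if c == ANY:
            out.extend(g[:i] + (v,) + g[i + 1:] for v in domains[i] if v != x[i])
    return out


def candidate_elimination(examples: Sequence[Example],
                          domains: Sequence[Sequence[str]]):
    n = len(domains)
    S: Set[Hypothesis] = {tuple([None] * n)}  # most specific boundary
    G: Set[Hypothesis] = {tuple([ANY] * n)}   # most general boundary
    for x, positive in examples:
        if positive:
            G = {g for g in G if covers(g, x)}
            new_s: Set[Hypothesis] = set()
            for s in S:
                if covers(s, x):
                    new_s.add(s)
                else:
                    h = min_generalization(s, x)
                    if any(at_least_as_general(g, h) for g in G):
                        new_s.add(h)
            # Drop hypotheses that are more general than another one in S.
            S = {s for s in new_s
                 if not any(s != t and at_least_as_general(s, t) for t in new_s)}
        else:
            S = {s for s in S if not covers(s, x)}
            new_g: Set[Hypothesis] = set()
            for g in G:
                if not covers(g, x):
                    new_g.add(g)
                else:
                    new_g.update(h for h in min_specializations(g, x, domains)
                                 if any(at_least_as_general(h, s) for s in S))
            # Drop hypotheses that are more specific than another one in G.
            G = {g for g in new_g
                 if not any(g != t and at_least_as_general(t, g) for t in new_g)}
    return S, G
```

A positive example can only shrink G and grow S, and a negative example does the opposite, so the two boundaries move toward each other as more members d are processed.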
Fig. 3 shows a schematic flow chart of a data cleansing method according to another embodiment of the present application.
As shown in fig. 3, method 2000 may include: s210, S220, S230, S240, S250, and S260.
In S210, sample data H may be acquired. The sample data may comprise at least one table. In the example embodiment, the sample data H may include Table 1, Table 2, and Table 3, and may be the collection of Tables 1-3.
Table 1 Education history table
Name | ID card number | Education level | Start time | End time
Zhang San | 111 | Bachelor's | 2006 | 2010
Li Si | 222 | Master's | 2011 | 2014
Wang Wu | 333 | Doctorate | 2013 | 2016
Table 2 Research project table
Research project | ID card number | Research content | Start time | End time
A | 111 | AAA | 2008 | 2010
B | 222 | BBB | 2013 | 2014
C | 333 | CCC | 2016
Table 3 Research results table
[Table 3 appears as images BDA0002351234320000071 and BDA0002351234320000081 in the original publication]
In S220, a general member set D may be determined according to the sample data H. The general member set D may include members retrieved from the sample data H according to a preset retrieval condition, where the members are rows in Tables 1 to 3. For example, the ID card numbers 111, 222, and 333 (the primary key) may be used as the retrieval condition to obtain a subset of the sample data H, and this subset is used as the general member set D.
In S230, a special member set S may be determined according to the sample data H. The special member set S may include members retrieved from the sample data H according to a preset retrieval condition. For example, the ID card numbers 111, 222, and 333 (the primary key) may be used as the retrieval condition to obtain a subset of the sample data H, and this subset may be used as the special member set S. Members of the special member set S are not allowed to have missing data. In the example embodiment, the special member set S therefore excludes row C of Table 2 and row F of Table 3.
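S220 and S230 can be illustrated on the rows of Table 2. This is a sketch only: the dictionary keys are hypothetical names for the table columns, and because row C has one empty cell whose column cannot be told from the flattened table, the code assumes it is the end time.

```python
from typing import Any, Dict, List, Optional, Set

Row = Dict[str, Optional[Any]]

TABLE_2: List[Row] = [
    {"project": "A", "id_card": 111, "content": "AAA", "start": 2008, "end": 2010},
    {"project": "B", "id_card": 222, "content": "BBB", "start": 2013, "end": 2014},
    # Assumption: the empty cell of row C is the end time.
    {"project": "C", "id_card": 333, "content": "CCC", "start": 2016, "end": None},
]


def retrieve_by_id(rows: List[Row], id_cards: Set[int]) -> List[Row]:
    """Retrieval condition: the row's primary key is one of the given ID card numbers."""
    return [row for row in rows if row["id_card"] in id_cards]


def has_missing_data(row: Row) -> bool:
    return any(value is None for value in row.values())


keys = {111, 222, 333}
D = retrieve_by_id(TABLE_2, keys)                                       # general member set
S = [r for r in retrieve_by_id(TABLE_2, keys) if not has_missing_data(r)]  # special member set
```

With these rows, D keeps all three projects while S drops row C, which mirrors the exclusion described for the special member set.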
In S240, an evaluation factor for each member of the general member set D and the special member set S may be determined. One or more attributes may be selected among the common attributes of the general member set D and the special member set S as shown in the exemplary embodiment, and the evaluation factor may be calculated from the expression of the one or more attributes. The evaluation factor can be calculated by the following formula, for example.
J=Σwd*wp
Here J is the evaluation factor, and the education level wd and the achievement level wp are the two attributes that participate in its calculation. In the example embodiment, the education level attribute may be quantized; for example, the values of the education level wd may be defined as 0.2, 0.5 and 1.0. The achievement level attribute wp may also be quantized; for example, the weights wp of research results that are papers of grades 1-3 may be 0.8, 0.5 and 0.2 respectively, and the weights wp of research results that are patents of grades 1-5 may be 0.5, 0.4, 0.3, 0.2 and 0.1 respectively.
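As a worked illustration of J = Σ wd × wp, the sketch below computes the factor for a researcher with a doctorate and one grade-1 paper. The paper and patent weights are the ones listed above; assigning 0.2, 0.5 and 1.0 to bachelor's, master's and doctorate in that order is an assumption, because the text lists the three values without naming the levels. The function name `researcher_factor` is illustrative.

```python
from typing import Dict, List, Tuple

# Assumed mapping of the three listed values onto education levels.
EDUCATION_WEIGHT: Dict[str, float] = {"bachelor": 0.2, "master": 0.5, "doctorate": 1.0}

# Weights wp as listed in the text: papers of grades 1-3, patents of grades 1-5.
RESULT_WEIGHT: Dict[str, Dict[int, float]] = {
    "paper": {1: 0.8, 2: 0.5, 3: 0.2},
    "patent": {1: 0.5, 2: 0.4, 3: 0.3, 4: 0.2, 5: 0.1},
}


def researcher_factor(education: str, results: List[Tuple[str, int]]) -> float:
    """J = sum over the researcher's results of wd * wp."""
    wd = EDUCATION_WEIGHT[education]
    return sum(wd * RESULT_WEIGHT[kind][grade] for kind, grade in results)


# Doctorate with one grade-1 paper: J = 1.0 * 0.8
j_doctorate = researcher_factor("doctorate", [("paper", 1)])
# Master's with the same result: J = 0.5 * 0.8
j_master = researcher_factor("master", [("paper", 1)])
```

Under the assumed mapping, only the first researcher passes a threshold of 0.5, since 0.8 exceeds it and 0.4 does not.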
In S250, a subset of the general members may be determined according to the evaluation factor of each member in the set of general members D. As shown in the example embodiment, members having an evaluation factor greater than 0.5 may be selected from the set of general members D to constitute a subset of general members. The general member subset shown in the exemplary embodiment is shown in table 4.
In S260, a special member subset may be determined according to the evaluation factor of each member in the special member set S. As shown in the example embodiment, members having an evaluation factor greater than 0.5 may be selected from the special member set S to form the special member subset. In the example embodiment, the special member subset is an empty set.
The execution sequence of step S250 and step S260 may be exchanged, may be executed in parallel, or may be executed alternately.
The target d is to evaluate the research results according to the weights set by the user, select the ID card numbers of the researchers whose total evaluation factor is greater than 0.5, and form a new table.
Finally, the result is calculated: the result of the search over the D subset is shown in Table 4, while the result of the search over the S subset is cleaned out, because the Table 2 record of the researcher with ID card number 333 is incomplete.
Table 4 Merged table of the D subset
ID card number | Name | Education level | Research project | Result type | Achievement level
333 | Wang Wu | Doctorate |  | Paper | 1
FIG. 4 shows a schematic block diagram of a data cleansing apparatus according to another embodiment of the present application.
As shown in fig. 4, the apparatus 3000 may include: a sample data acquisition unit 310, a special member set establishing unit 321, a general member set establishing unit 322, and a cleaning data acquisition unit 351.
The sample data acquisition unit 310 may be configured to acquire sample data. Optionally, the sample data H may contain dirty data, which may include at least one of missing data, redundant data, and abnormal data. Optionally, the sample data H may be one table, or two or more tables.
The special member set establishing unit 321 may be connected to the sample data acquisition unit 310, and may determine the special member set according to the sample data H. Optionally, the special member set establishing unit 321 may retrieve, from the sample data H, the special members that meet a preset retrieval condition to form the special member set. The special member set may not contain members with missing data.
The general member set creating unit 322 may be connected to the sample data acquiring unit 310, and may determine the general member set according to the sample data H. Alternatively, the general member set creating unit 322 may retrieve, in the sample data H, general members that satisfy a preset retrieval condition, and form a general member set.
The cleaning data acquisition unit 351 may be connected to the special member set creation unit 321 and the general member set creation unit 322, respectively. The cleaning data set can be derived from the set of special members and the set of general members.
Optionally, the apparatus 3000 may further include a special member subset establishing unit 331. The special member subset establishing unit 331 may be connected between the special member set establishing unit 321 and the cleaning data acquisition unit 351, and may establish a special member subset from the special member set.
Optionally, the apparatus 3000 may further include a general member subset establishing unit 332. The general member subset establishing unit 332 may be connected between the general member set establishing unit 322 and the cleaning data acquisition unit 351, and may establish a general member subset from the general member set. Further, the cleaning data acquisition unit 351 may obtain the cleaning data set according to the special member subset and the general member subset.
Optionally, the apparatus 3000 may further include an evaluation factor determination unit 341 and an evaluation factor determination unit 342. The evaluation factor determination unit 341 may be connected between the special member set establishing unit 321 and the special member subset establishing unit 331, and may be configured to determine the evaluation factor of each member of the special member set established by unit 321. The special member subset establishing unit 331 may establish the special member subset according to these evaluation factors.
The evaluation factor determination unit 342 may be connected between the general member set establishing unit 322 and the general member subset establishing unit 332, and may be configured to determine the evaluation factor of each member of the general member set established by unit 322. The general member subset establishing unit 332 may establish the general member subset according to these evaluation factors.
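The way the units of fig. 4 connect can be sketched as a small composition of callables. This is an illustrative model only: the class name `CleaningApparatus`, its field names, and the idea of passing each unit in as a function are assumptions, and each evaluation factor determination unit is folded into the corresponding subset establishing step for brevity.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
RowFilter = Callable[[List[Row]], List[Row]]


@dataclass
class CleaningApparatus:
    acquire: Callable[[], List[Row]]                      # unit 310
    build_special: RowFilter                              # unit 321
    build_general: RowFilter                              # unit 322
    special_subset: RowFilter                             # units 341 and 331
    general_subset: RowFilter                             # units 342 and 332
    combine: Callable[[List[Row], List[Row]], List[Row]]  # unit 351

    def run(self) -> List[Row]:
        h = self.acquire()
        # The two branches only share the sample data, so they could run in parallel.
        special = self.special_subset(self.build_special(h))
        general = self.general_subset(self.build_general(h))
        return self.combine(special, general)


# Hypothetical wiring with a precomputed "score" standing in for the evaluation factor.
rows = [{"id": 1, "score": 0.8}, {"id": 2, "score": None}, {"id": 3, "score": 0.3}]
apparatus = CleaningApparatus(
    acquire=lambda: rows,
    build_special=lambda h: [r for r in h if r["score"] is not None],
    build_general=lambda h: list(h),
    special_subset=lambda s: [r for r in s if r["score"] > 0.5],
    general_subset=lambda g: [r for r in g if r["score"] is not None and r["score"] > 0.5],
    combine=lambda s, g: [r for r in g if r in s],
)
result = apparatus.run()
```

Keeping each unit as a separate callable mirrors the block diagram: any one unit can be replaced without touching the others.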
FIG. 5 shows a block diagram of an electronic device according to an example embodiment.
An electronic device 200 according to this embodiment of the present application is described below with reference to fig. 5. The electronic device 200 shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 5, the electronic device 200 is embodied in the form of a general purpose computing device. The components of the electronic device 200 may include, but are not limited to: at least one processing unit 210, at least one memory unit 220, a bus 230 connecting different system components (including the memory unit 220 and the processing unit 210), a display unit 240, and the like.
The memory unit 220 stores program code executable by the processing unit 210 to cause the processing unit 210 to perform the methods according to the various exemplary embodiments of the present application described herein. For example, the processing unit 210 may perform a method as shown in any of figs. 1-3.
The memory unit 220 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 2201 and/or a cache memory unit 2202, and may further include a read only memory unit (ROM) 2203.
The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 230 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 200 may also communicate with one or more external devices 300 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 200, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 200 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 250. Also, the electronic device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 260. The network adapter 260 may communicate with other modules of the electronic device 200 via the bus 230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
An embodiment of the present application also provides a storage medium storing a program executable by a processor, wherein, when the program is executed, the processor performs any one of the data cleaning methods described above.
With the method, the device, the electronic equipment and the storage medium described above, the sample data set is decomposed into a general member set and a special member set, and large-scale data can be cleaned simply and quickly through interaction between the two sets. The method is simple, fast, and easy to automate.
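As a rough illustration of the decomposition described above, the following Python sketch splits sample data into a general member set and a special member set and then lets the two sets interact to form a cleaning data set. The predicate `is_special`, the record fields `name` and `verified`, and the interaction rule (a general member is dropped when a special member has the same key) are all assumptions made for illustration; this passage does not prescribe them.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def split_members(
    samples: List[Record],
    is_special: Callable[[Record], bool],
) -> Tuple[List[Record], List[Record]]:
    """Partition sample data into (general_set, special_set).

    `is_special` is a hypothetical, caller-supplied rule; how a special
    member is recognized is not defined in this passage.
    """
    general_set = [r for r in samples if not is_special(r)]
    special_set = [r for r in samples if is_special(r)]
    return general_set, special_set


def build_cleaning_set(
    general_set: List[Record],
    special_set: List[Record],
    key: str = "name",
) -> List[Record]:
    """Assumed interaction rule: special members take precedence, and a
    general member is kept only if no special member shares its key."""
    special_keys = {r[key] for r in special_set}
    kept_general = [r for r in general_set if r[key] not in special_keys]
    return special_set + kept_general


# Hypothetical sample data: "verified" marks a special member here.
samples = [
    {"name": "alpha", "verified": True},
    {"name": "alpha", "verified": False},
    {"name": "beta", "verified": False},
]
general, special = split_members(samples, lambda r: bool(r["verified"]))
cleaning_set = build_cleaning_set(general, special)
```

Each step is a single pass over a list or a set lookup, which is one way such a scheme could stay fast on large inputs; a different interaction rule can be substituted without changing `split_members`.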
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to as a "circuit," "module," or "system." Furthermore, the present application may take the form of a computer program product embodied in any tangible expression medium having computer-usable program code embodied in the medium.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments. The technical features of the embodiments may be combined arbitrarily. For the sake of brevity, not all possible combinations are described, but any combination of the technical features should be considered within the scope of the present specification as long as the combination involves no contradiction.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the description of the embodiments is only intended to facilitate understanding of the methods of the present application and their core concepts. Meanwhile, a person skilled in the art may, according to the idea of the present application, change or modify the embodiments and applications within the scope of the present application. In view of the above, the description should not be taken as limiting the application.

Claims (10)

1. A method of data cleansing, comprising:
acquiring sample data;
determining a set of general members according to the sample data, wherein the set of general members includes all general members in the sample data;
determining a set of special members according to the sample data, wherein the set of special members comprises all special members in the sample data;
and obtaining a cleaning data set according to the general member set and the special member set.
2. The method of claim 1, further comprising:
searching members meeting preset selection conditions in the general member set to obtain a general member subset;
searching members meeting the preset selection condition in the special member set to obtain a special member subset;
wherein the obtaining a cleaning data set according to the general member set and the special member set comprises:
obtaining the cleaning data set according to the general member subset and the special member subset.
3. The method of claim 1, wherein the general member set and the special member set include at least one same attribute.
4. The method of claim 3, further comprising:
determining an evaluation factor for each member in the set of general members;
determining the evaluation factor for each member of the set of special members;
wherein the evaluation factor is obtained according to a preset linear expression of the at least one same attribute of the member.
5. The method according to claim 4, wherein the preset selection condition is a comparison result of the evaluation factor with a first threshold value.
6. A data cleansing apparatus comprising:
a sample data acquisition unit, configured to acquire sample data;
a general member set establishing unit, configured to establish a general member set according to the sample data;
a special member set establishing unit, configured to establish a special member set according to the sample data;
and a cleaning data acquisition unit, configured to determine a cleaning data set according to the general member set and the special member set.
7. The apparatus of claim 6, further comprising:
a general member subset establishing unit, connected between the general member set establishing unit and the cleaning data acquisition unit and configured to establish a general member subset according to the general member set;
and a special member subset establishing unit, connected between the special member set establishing unit and the cleaning data acquisition unit and configured to establish a special member subset according to the special member set.
8. The apparatus of claim 7, further comprising:
an evaluation factor determination unit, configured to determine evaluation factors of the members in the general member set, the evaluation factors of the members in the general member set being used for establishing the general member subset,
wherein the evaluation factor determination unit is further configured to determine evaluation factors of the members in the special member set, the evaluation factors of the members in the special member set being used for establishing the special member subset.
9. An electronic device comprising a processor and a memory, and a program stored in the memory that is executable by the processor, the processor performing the method of any of claims 1-5 when the program is executed.
10. A storage medium storing a program executable by a processor, the processor performing the method of any one of claims 1-5 when the program is executed.
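
The selection recited in claims 2 to 5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the shared attribute names (`completeness`, `consistency`), the weights and bias of the linear expression, the value of the first threshold, the direction of the comparison (`>=`), and the concatenation of the two subsets into the cleaning data set are all hypothetical choices that the claims leave open.

```python
from typing import Dict, List

Member = Dict[str, float]

# Hypothetical coefficients of the preset linear expression over the
# attributes shared by both member sets (assumed names and values).
WEIGHTS: Dict[str, float] = {"completeness": 0.5, "consistency": 0.5}
BIAS = 0.0
FIRST_THRESHOLD = 0.5  # assumed value of the "first threshold"


def evaluation_factor(member: Member) -> float:
    """Linear expression: BIAS + sum of weight * shared attribute value."""
    return BIAS + sum(w * member[attr] for attr, w in WEIGHTS.items())


def select_subset(members: List[Member]) -> List[Member]:
    """Keep members whose evaluation factor reaches the first threshold.

    The comparison direction (>=) is an assumption.
    """
    return [m for m in members if evaluation_factor(m) >= FIRST_THRESHOLD]


# Hypothetical member sets sharing the same two attributes.
general_set = [
    {"completeness": 1.0, "consistency": 0.5},
    {"completeness": 0.25, "consistency": 0.25},
]
special_set = [
    {"completeness": 0.5, "consistency": 0.5},
    {"completeness": 0.0, "consistency": 0.5},
]

general_subset = select_subset(general_set)
special_subset = select_subset(special_set)

# Assumed combination rule: the cleaning data set is the two subsets together.
cleaning_set = general_subset + special_subset
```

Because the same evaluation factor and threshold are applied to both sets, the requirement in claim 3 that the sets share at least one attribute is what makes the factor computable for every member.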
CN201911416103.5A 2019-12-31 2019-12-31 Data cleaning method and device, electronic equipment and storage medium Active CN113127461B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911416103.5A CN113127461B (en) 2019-12-31 2019-12-31 Data cleaning method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113127461A (en) 2021-07-16
CN113127461B (en) 2023-11-24

Family

ID=76769513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911416103.5A Active CN113127461B (en) 2019-12-31 2019-12-31 Data cleaning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113127461B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115879441A (en) * 2022-11-10 2023-03-31 中国科学技术信息研究所 Text novelty detection method and device, electronic equipment and readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104616116A (en) * 2015-02-13 2015-05-13 武汉金锐达科技有限公司 Bank client service system and method
US20160124955A1 (en) * 2014-10-29 2016-05-05 Red Hat, Inc. Dual overlay query processing
US20170032277A1 (en) * 2015-07-29 2017-02-02 International Business Machines Corporation Automated intelligent data navigation and prediction tool
CN107993139A (en) * 2017-11-15 2018-05-04 华融融通(北京)科技有限公司 A kind of anti-fake system of consumer finance based on dynamic regulation database and method
CN109299169A (en) * 2018-10-24 2019-02-01 中国平安人寿保险股份有限公司 Data visualization method, system, terminal and computer readable storage medium
CN109522302A (en) * 2018-11-09 2019-03-26 南京医渡云医学技术有限公司 Medical data processing method, device, electronic equipment and computer-readable medium
CN110134675A (en) * 2019-05-23 2019-08-16 大连海事大学 A kind of data cleaning method and system towards oceanographic data stream
CN110427358A (en) * 2019-02-22 2019-11-08 北京沃东天骏信息技术有限公司 Data cleaning method and device and information recommendation method and device

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
SHIVESH RANJAN ET AL.: "Curriculum Learning Based Approaches for Noise Robust Speaker Recognition", 《IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》, vol. 26, no. 1, pages 197, XP058385053, DOI: 10.1109/TASLP.2017.2765832 *
YU, C ET AL.: "System Identification in Presence of Outliers", 《IEEE TRANSACTIONS ON CYBERNETICS》, vol. 46, no. 5, pages 1 - 9 *
孙舟;田贺平;潘鸣宇;王伟贤;张禄;陈光;: "有效解决数据缺失问题的聚集查询算法", 计算机工程与应用, no. 24, pages 77 - 83 *
沈亮亮: "面向不完备数据的分类方法研究", 《中国优秀硕士学位论文全文数据库 (信息科技辑)》, no. 7, pages 138 - 1163 *
潘鸣宇;张禄;龙国标;李香龙;马冬雪;徐亮;: "用于重复充电运营记录的基于块采样的高效聚集查询算法", 计算机应用, no. 06, pages 76 - 80 *
王永红;: "定量专利分析的样本选取与数据清洗", 情报理论与实践, no. 01, pages 93 - 96 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115879441A (en) * 2022-11-10 2023-03-31 中国科学技术信息研究所 Text novelty detection method and device, electronic equipment and readable storage medium
CN115879441B (en) * 2022-11-10 2024-04-12 中国科学技术信息研究所 Text novelty detection method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN113127461B (en) 2023-11-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant