CN113656385A - Data cleaning method, data cleaning device, storage medium and electronic equipment - Google Patents

Data cleaning method, data cleaning device, storage medium and electronic equipment

Info

Publication number
CN113656385A
Authority
CN
China
Prior art keywords
cleaning
data
item
time
original data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010396586.3A
Other languages
Chinese (zh)
Inventor
邱俊傑
尹伟
郭利伟
邢大飞
李怡姗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202010396586.3A
Publication of CN113656385A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/018 Certifying business or products
    • G06Q30/0185 Product, service or business identity fraud
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0609 Buyer or seller confidence or verification

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Finance (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Development Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • Operations Research (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a data cleaning method, a data cleaning device, a storage medium and an electronic device. The data cleaning method comprises the following steps: periodically identifying the data to be cleaned of each item from the original data of that item; receiving a cleaning instruction; when the current time reaches the cleaning start time, counting the original data volume of all items and the to-be-cleaned data volume of the items within the cleaning range; and, in the period from the cleaning start time to the cleaning end time, deducting from the original data volume of the items within the cleaning range once every deduction period, and re-ranking the original data volumes of all items after each deduction. With this data cleaning method, the original data volume is reduced by deductions based on the to-be-cleaned data volume, so the ranking list becomes more accurate, the data to be cleaned does not need to be deleted from the underlying database, and the stability of the data as a whole is preserved.

Description

Data cleaning method, data cleaning device, storage medium and electronic equipment
Technical Field
The present invention generally relates to a data cleaning method, a data cleaning apparatus, a storage medium, and an electronic device.
Background
While a website is operating, detailed data of many kinds is generated continuously. Taking e-commerce websites as an example, common detail data includes the sales of a certain brand, the sales volume of a certain shop, and the like. Counting data of the same type and immediately ranking it from highest to lowest according to a given rule produces a real-time statistical data ranking list. For example, ranking the brands of one category in real time by their sales volume for the current day yields a real-time ranking list of same-day brand sales volume for that category. Such a ranking list can be implemented in software from the following parts: an underlying database, a data acquisition module, a data statistical processing module, a real-time ranking calculation module and a front-end display module.
A real-time statistical ranking list places high demands on the accuracy and timeliness of its data. However, the statistical criteria in e-commerce scenarios are complex and are affected by human factors such as order brushing (fake orders). To keep the data accurate, continuous, timely and efficient, invalid data needs to be cleaned at fixed key time nodes. A common approach removes the invalid data directly from the underlying database, recomputes the statistics on the remaining correct data, and refreshes the ranking of the list.
The defects of the prior art are as follows:
First, existing ranking list data cleaning techniques can cause the ranking shown at the front end, or the value corresponding to a rank, to change abruptly, which leads users viewing the list to question its accuracy.
Second, existing ranking list data cleaning techniques delete the data directly from the database, which affects the stability of the data as a whole.
The above information disclosed in this background section is only for enhancement of understanding of the background of the invention, and it may therefore contain information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
In this summary, concepts in a simplified form are introduced that are further described in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
In order to solve the problem in the prior art that deleting data directly from the database impairs the stability of the data as a whole, the invention provides a data cleaning method for a statistical ranking list, which comprises the following steps:
periodically identifying the data to be cleaned of each item from the original data of that item;
receiving a cleaning instruction, wherein the cleaning instruction comprises a cleaning start time, a cleaning end time and a cleaning range;
when the current time reaches the cleaning start time, counting the original data volume of all items and the to-be-cleaned data volume of the items within the cleaning range;
and, in the period from the cleaning start time to the cleaning end time, deducting from the original data volume of the items within the cleaning range once every deduction period, and re-ranking the original data volumes of all items after each deduction, wherein the total amount deducted from the original data volume of an item over the period equals the to-be-cleaned data volume of that item.
According to an embodiment of the invention, deducting from the original data volume of the items within the cleaning range once every deduction period in the period from the cleaning start time to the cleaning end time, and re-ranking the original data volumes of all items after each deduction, wherein the total amount deducted from the original data volume of an item over the period equals the to-be-cleaned data volume of that item, comprises the following steps:
calculating the time point of each deduction according to the cleaning start time, the cleaning end time and the deduction period;
calculating the deduction amount to be applied at each time point to the original data volume of each item within the cleaning range, according to the time of each time point, the cleaning start time and the to-be-cleaned data volume;
when the current time reaches any time point, subtracting the deduction amount corresponding to that time point from the original data volume of each item within the cleaning range, and re-ranking the original data volumes of all items after each deduction to update the ranking list.
According to one embodiment of the invention, the deduction amounts applied to the original data volume of the same item within the cleaning range differ from one time point to another.
According to one embodiment of the present invention, the deduction amount to be subtracted from the original data volume of each item at each time point is calculated according to the following equation:
$Q_m = a\,\Delta t^{2} + \left[2a\,(t_m - t_0) + b\right]\Delta t \qquad (1)$
where $Q_m$ is the deduction amount corresponding to the m-th time point, $\Delta t$ is the deduction period, $t_0$ is the cleaning start time, $t_m$ is the time of the m-th time point, $a$ is a preset value less than zero, $b$ is given by an expression that appears only as an equation image in the source (Figure BDA0002487775270000031) and is not reproduced here, and $A$ is the total data volume to be cleaned.
According to one embodiment of the invention, a is generated by a random number generator.
According to an embodiment of the present invention, subtracting, when the current time reaches any time point, the deduction amount corresponding to that time point from the original data volume of each item within the cleaning range, and re-ranking the original data volumes of all items after each deduction to update the ranking list, comprises the following steps:
a deduction step: after the current time reaches a time point, subtracting the deduction amount corresponding to that time point from the original data volume of each item within the cleaning range;
ranking the original data volumes of all items;
and determining whether any time point remains after the current time, and if so, returning to the deduction step.
According to one embodiment of the invention, the data cleaning method is applied to an e-commerce scenario, in which the original data is the sales records, the original data volume is the sales volume, the data to be cleaned is the false sales records, and the to-be-cleaned data volume is the false sales volume.
According to one embodiment of the invention, a data cleaning device comprises:
an information processing module;
a data module, connected to the information processing module and configured to periodically identify the data to be cleaned of each item from the original data of that item;
an instruction module, connected to the information processing module and configured to send a cleaning instruction to the information processing module, the cleaning instruction comprising a cleaning start time, a cleaning end time and a cleaning range;
wherein the information processing module is configured to deduct from the original data volume of the items within the cleaning range once every deduction period in the period from the cleaning start time to the cleaning end time, and to re-rank the original data volumes of all items after each deduction, the total amount deducted from the original data volume of an item over the period being equal to the to-be-cleaned data volume of that item.
The invention also proposes a computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements a data cleansing method as described above.
The invention also proposes an electronic device, characterized in that it comprises:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the data cleansing method as described above via execution of the executable instructions.
According to the above technical solution, the data cleaning method has the following advantages and positive effects:
With the data cleaning method, the original data volume is reduced by deductions based on the to-be-cleaned data volume, so the ranking list becomes more accurate, the data to be cleaned does not need to be deleted from the underlying database, and the stability of the data as a whole is preserved. In addition, the ranking list changes once every deduction period and moves closer to the true level with each change, so a user viewing the ranking list perceives it as more reliable and stable, which improves the user experience.
Drawings
Various objects, features and advantages of the present invention will become more apparent from the following detailed description of preferred embodiments of the invention, when considered in conjunction with the accompanying drawings. The drawings are merely exemplary of the invention and are not necessarily drawn to scale. In the drawings, like reference characters designate the same or similar parts throughout the different views. Wherein:
FIG. 1 is a flow diagram illustrating a method of data cleansing in accordance with an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating a data cleansing apparatus according to an exemplary embodiment;
FIG. 3 is a schematic diagram of an electronic device shown in accordance with an exemplary embodiment;
FIG. 4 is a schematic diagram illustrating a computer-readable storage medium according to an example embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar structures, and thus their detailed description will be omitted.
FIG. 1 shows a flow diagram of a data cleaning method for a statistical ranking list. The data cleaning method includes steps S1 to S4. The method can be implemented by a data cleaning device for a statistical ranking list; referring to FIG. 2, the data cleaning device 1 includes an instruction module 12, a data module 11 and an information processing module 13.
S1: identifying the data to be cleaned of each item from the original data of that item.
The data module 11 comprises a real-time unit 111 and a database 112 connected to the real-time unit 111. The database 112 is used to store the data to be cleaned of each item. The real-time unit 111 is configured to store the original data of each item, periodically identify the data to be cleaned within that original data, and send the data to be cleaned to the database 112 for storage.
The data module 11 is further configured to periodically identify the data to be cleaned in the original data of each item according to a predetermined rule and to push the identified data to the database 112. The database 112 stores the data to be cleaned once it is received.
In the e-commerce application scenario, one item corresponds to one commodity of one merchant, and the original data of each item is the sales record of each commodity of each merchant. For example, each time a merchant sells a commodity on an e-commerce platform, a sales record for that commodity is generated. Counting the sales records gives the sales volume of the merchant's commodity over a period, where the period can be one day, one week, one month or one year. The data to be cleaned of each item is the false sales records of each commodity. A false sales record may be, for example, a sales record generated when a merchant brushes orders in violation of the rules. Order brushing typically means that a merchant pays people to pose as customers and buy goods in its online store, using fake purchases to raise the store's ranking and sales and to obtain favorable reviews that attract customers. False sales records can be identified according to preset rules. For example, when a customer's transaction frequency within a period exceeds a preset threshold, all sales records associated with that customer in the period may be regarded as false sales records. As another example, when the logistics information associated with a sales record is empty or false, the sales record is determined to be a false sales record. When a sales record is identified as false, it is copied to the database 112 and marked as a false sales record.
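The two example rules in the paragraph above can be sketched as follows. This is only an illustration under stated assumptions: the `SalesRecord` fields, the function name `find_fake_records` and the threshold value are hypothetical and do not come from the patent, and checking whether logistics information is false (rather than empty) is left out.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SalesRecord:
    # Hypothetical record layout, chosen only for this sketch.
    record_id: str
    item_id: str
    customer_id: str
    logistics_info: Optional[str]  # None or "" means no logistics information


def find_fake_records(records, max_orders_per_customer):
    """Return the records flagged by two illustrative rules.

    Rule 1: the customer placed more orders in the period than the threshold.
    Rule 2: the record has no logistics information.
    """
    orders_per_customer = Counter(r.customer_id for r in records)
    flagged = []
    for r in records:
        too_frequent = orders_per_customer[r.customer_id] > max_orders_per_customer
        no_logistics = not r.logistics_info
        if too_frequent or no_logistics:
            flagged.append(r)
    return flagged


# All records below are assumed to fall within one identification period.
records = [
    SalesRecord("r1", "item-A", "c1", "SF123"),
    SalesRecord("r2", "item-A", "c1", "SF124"),
    SalesRecord("r3", "item-A", "c1", "SF125"),
    SalesRecord("r4", "item-B", "c2", ""),
    SalesRecord("r5", "item-B", "c3", "YT900"),
]
fake = find_fake_records(records, max_orders_per_customer=2)
fake_ids = [r.record_id for r in fake]
```

In this sketch the flagged records are returned as copies of references rather than removed from `records`, mirroring the description that false records are copied to the database 112 and marked, not deleted.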
S2: receiving a cleaning instruction, wherein the cleaning instruction comprises a cleaning start time, a cleaning end time and a cleaning range.
The instruction module 12 is connected to the information processing module 13 and is configured to send a cleaning instruction to the information processing module 13. The instruction module 12 includes an instruction input unit 121 and an instruction storage unit 122. The instruction storage unit 122 is connected to the instruction input unit 121 and to the information processing module 13. The instruction input unit 121 is used to enter a cleaning instruction and send it to the instruction storage unit 122. On receiving the cleaning instruction, the instruction storage unit 122 stores it and also forwards it to the information processing module 13.
The cleaning instruction includes a cleaning start time, a cleaning end time and a cleaning range. The cleaning start time is the time at which the cleaning task begins: cleaning of the data to be cleaned starts when the current time reaches the cleaning start time. The cleaning end time is the time at which the cleaning task finishes: cleaning of the data to be cleaned ends when the current time reaches the cleaning end time.
The cleaning range is the set of items that need to be cleaned. The cleaning range is set manually; for example, it may be one or more specified items, or it may be all items.
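A cleaning instruction with the three fields described above could be represented as in the following sketch. The class name, field names, the use of `None` to mean "all items" and the example dates are assumptions made for illustration; the patent does not specify a data format.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class CleaningInstruction:
    start_time: datetime
    end_time: datetime
    # Set of item ids to clean; None is used here (by assumption) for "all items".
    scope: Optional[FrozenSet[str]]

    def __post_init__(self):
        # The cleaning window must have positive length.
        if self.end_time <= self.start_time:
            raise ValueError("end_time must be later than start_time")

    def covers(self, item_id: str) -> bool:
        """Return True if the item belongs to the cleaning range."""
        return self.scope is None or item_id in self.scope


# Hypothetical one-minute cleaning window over two specified items.
instruction = CleaningInstruction(
    start_time=datetime(2020, 5, 12, 0, 0, 0),
    end_time=datetime(2020, 5, 12, 0, 1, 0),
    scope=frozenset({"item-A", "item-B"}),
)
```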
S3: when the current time reaches the cleaning start time, counting the original data volume of all items and the to-be-cleaned data volume of the items within the cleaning range.
The information processing module 13 includes a cleaning task component 132 and a real-time task component 131. The cleaning task component 132 is connected to the database 112, the instruction storage unit 122 and the real-time task component 131. The real-time task component 131 is connected to the real-time unit 111 and is configured to obtain the original data of all items from the real-time unit 111 in real time and send it to the cleaning task component 132. After the cleaning task component 132 receives the cleaning instruction from the instruction storage unit 122, and when the current time reaches the cleaning start time in that instruction, the cleaning task component 132 obtains the data to be cleaned of the items within the cleaning range from the database 112.
After obtaining the data to be cleaned of the items within the cleaning range and the original data of all items, the cleaning task component 132 counts the to-be-cleaned data volume of the items within the cleaning range and the original data volume of all items. In an e-commerce scenario, the original data volume of each item may be the sales volume of each commodity of each merchant, and the to-be-cleaned data volume of the items within the cleaning range may be the false sales volume of the commodities specified by the cleaning range, for example a particular category of a merchant's commodities. The sales volume of each commodity of each merchant is counted from its sales records, and the false sales volume of each commodity within the cleaning range is counted from the false sales records of those commodities.
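The counting in step S3 can be sketched as below. The representation of records as `(item_id, quantity)` pairs and the function name `count_volumes` are assumptions for illustration only.

```python
from collections import Counter


def count_volumes(all_records, fake_records, scope):
    """Count raw volume for every item and to-be-cleaned volume for items in scope.

    all_records / fake_records: iterables of (item_id, quantity) pairs.
    scope: a set of item ids, or None for all items (an assumed convention).
    """
    raw_volume = Counter()
    for item_id, qty in all_records:
        raw_volume[item_id] += qty

    to_clean = Counter()
    for item_id, qty in fake_records:
        # Only items inside the cleaning range contribute to the cleaning volume.
        if scope is None or item_id in scope:
            to_clean[item_id] += qty

    return dict(raw_volume), dict(to_clean)


all_records = [("A", 5), ("A", 3), ("B", 4), ("C", 6)]
fake_records = [("A", 3), ("C", 2)]
raw, to_clean = count_volumes(all_records, fake_records, scope={"A", "B"})
```

Note that item C has false records but lies outside the cleaning range in this example, so its raw volume is counted for ranking while nothing is scheduled for deduction from it.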
S4: in the period from the cleaning start time to the cleaning end time, deducting from the original data volume of the items within the cleaning range once every deduction period, and re-ranking the original data volumes of all items after each deduction, wherein the total amount deducted from the original data volume of an item in the period from the cleaning start time to the cleaning end time equals the to-be-cleaned data volume of that item.
The cleaning task component 132 is also configured to perform step S4 above.
Step S4 includes steps S41 to S43.
S41: calculating the time point of each deduction according to the cleaning start time, the cleaning end time and the deduction period.
In the information processing module 13, the duration from the cleaning start time to the cleaning end time is calculated first; this duration is divided by the deduction period and the quotient is truncated to an integer, which is the number n of time points. The time of each time point is $t_m = t_0 + m\,\Delta t$, where $\Delta t$ is the deduction period, $t_0$ is the cleaning start time, m is the serial number of the time point (m = 1, 2, 3, ..., n), and n is the number of time points.
The deduction period is preset in the information processing module 13, and its value can be chosen according to actual conditions. The deduction period is typically much shorter than the duration between the cleaning start time and the cleaning end time. The deduction period is preferably in the range of 0.2 to 5 seconds, and more preferably 1 second. The duration between the cleaning start time and the cleaning end time is preferably an integer multiple of the deduction period.
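The time-point calculation of step S41 can be sketched as follows. The function name and the choice of integer time units (for example milliseconds) are assumptions; integer units are suggested here only because floor division on floating-point values can lose a point through rounding.

```python
def deduction_time_points(t0, t_end, period):
    """Return the times t0 + m * period for m = 1..n, with n = floor((t_end - t0) / period).

    Times are plain numbers in one consistent unit; integers avoid
    floating-point rounding in the floor division.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    if t_end <= t0:
        raise ValueError("t_end must be later than t0")
    n = int((t_end - t0) // period)  # number of time points
    return [t0 + m * period for m in range(1, n + 1)]


# Window of 5 units with a period of 1: the last point coincides with t_end.
points_exact = deduction_time_points(100, 105, 1)

# Window of 10 units with a period of 3: the remainder after the last point is unused.
points_partial = deduction_time_points(0, 10, 3)
```

The second call illustrates why the text prefers a window that is an integer multiple of the period: otherwise the final time point falls before the cleaning end time.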
S42: calculating the deduction amount to be applied at each time point to the original data volume of each item within the cleaning range, according to the time of each time point, the cleaning start time and the to-be-cleaned data volume.
The sum of the deduction amounts applied to the original data volume of an item equals the to-be-cleaned data volume of that item; that is, each deduction removes a portion of the to-be-cleaned data volume. In the e-commerce scenario, each deduction amount for an item is a portion of the false sales volume of the corresponding commodity.
The information processing module 13 is further configured to calculate the deduction amount to be applied at each time point according to the time of each time point, the cleaning start time and the to-be-cleaned data volume.
Preferably, the deduction amounts applied to the original data volume of the same item within the cleaning range differ from one time point to another, so that the original data volume remaining after each deduction decreases nonlinearly over time.
Specifically, the deduction amount to be subtracted from the original data volume of each item at each time point can be calculated according to the following equation:
$Q_m = a\,\Delta t^{2} + \left[2a\,(t_m - t_0) + b\right]\Delta t \qquad (1)$
where $Q_m$ is the deduction amount corresponding to the m-th time point, $\Delta t$ is the deduction period, $t_0$ is the cleaning start time, $t_m$ is the time of the m-th time point, $a$ is a preset value less than zero, $b$ is given by an expression that appears only as an equation image in the source (Figure BDA0002487775270000081) and is not reproduced here, and $A$ is the total data volume to be cleaned.
Here, a may be generated by a random number generator. The value of a determines how strongly the original data volume remaining after each deduction changes over time.
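Equation (1) can be evaluated directly, as in the sketch below. Because the expression for b survives only as an image reference, b is passed in by the caller here; the numeric values of a and b are hypothetical, and the function returns the value of equation (1) exactly as written without assuming a sign convention for the deduction.

```python
def deduction_amount(a, b, dt, t_m, t0):
    """Evaluate equation (1): Q_m = a*dt**2 + (2*a*(t_m - t0) + b)*dt.

    b is an input because its defining expression is not reproduced in the text.
    """
    return a * dt ** 2 + (2 * a * (t_m - t0) + b) * dt


# Hypothetical parameters: a < 0 as required by the text, b chosen arbitrarily.
a, b, dt, t0 = -0.5, -2.0, 1.0, 0.0
amounts = [deduction_amount(a, b, dt, t0 + m * dt, t0) for m in (1, 2, 3)]

# Consecutive amounts differ by the constant 2*a*dt**2, so no two are equal.
steps = [amounts[i + 1] - amounts[i] for i in range(len(amounts) - 1)]
```

With these inputs the amounts change by the same nonzero step at every time point, which matches the stated preference that the deduction amounts of one item differ between time points.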
S43: when the current time reaches any time point, the deduction amount corresponding to that time point is subtracted from the original data volume of each item within the cleaning range, and after each deduction the original data volumes of all items are re-ranked to update the ranking list.
Step S43 includes steps S431 to S434.
S431: after the current time reaches a time point, subtract the deduction amount corresponding to that time point from the original data volume of each item within the cleaning range, and go to step S432.
In an e-commerce scenario, the deduction amount is subtracted from the sales volume of each commodity within the cleaning range, so that the sales volume of each such commodity moves closer to its true level.
S432: rank the original data volumes of all items, and go to step S433.
Each time a time point is reached and the deduction has been applied, the original data volumes of all items are ranked. In the e-commerce scenario, the sales volumes of all commodities are re-ranked. In this way the ranking list changes with every deduction, so that it changes step by step.
This step may further include pushing the new ranking list to the cache 21 of the front-end device 2, so that the display module 22 of the front-end device 2 can quickly obtain the latest ranking list from the cache 21 and display it. The front-end device 2 may be a mobile device such as a mobile phone or a tablet computer, or it may be a computer.
S433: determine whether any time point remains after the current time; if so, go to step S431, otherwise go to step S434.
S434: finish the cleaning.
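The loop formed by steps S431 to S434 can be sketched as a simple simulation. Waiting for each time point to arrive is replaced here by iterating over a precomputed list, and the deduction amounts are given as positive quantities to subtract; the function name, the data layout and the example figures are assumptions for illustration.

```python
def run_cleaning(raw_volume, deductions_by_point):
    """Apply one batch of deductions per time point and re-rank after each batch.

    raw_volume: {item_id: volume} for all items.
    deductions_by_point: list with one {item_id: amount} dict per time point,
        covering only the items within the cleaning range.
    Returns the final volumes and the ranking produced after every time point.
    """
    volumes = dict(raw_volume)  # work on a copy; the source data is not modified
    history = []
    for deductions in deductions_by_point:  # S433: continue while time points remain
        for item_id, amount in deductions.items():  # S431: apply the deduction
            volumes[item_id] -= amount
        # S432: rank all items by volume, highest first; ties broken by item id.
        ranking = sorted(volumes, key=lambda i: (-volumes[i], i))
        history.append(ranking)
    return volumes, history  # S434: cleaning finished


raw = {"A": 100, "B": 90, "C": 80}
# Item A has 30 units to clean, spread over three time points in unequal parts.
schedule = [{"A": 5}, {"A": 10}, {"A": 15}]
final_volumes, rankings = run_cleaning(raw, schedule)
```

In this example item A slides from first to third place over three steps instead of dropping in one jump, which is the gradual change the method aims for.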
In the data cleaning method of this embodiment, the original data volume is reduced by deductions based on the to-be-cleaned data volume, so the ranking list becomes more accurate, the data to be cleaned does not need to be deleted from the underlying database 112, and the stability of the data as a whole is preserved. In addition, the ranking list changes once every deduction period and moves closer to the true level with each change, so a user viewing the ranking list perceives it as more reliable and stable, which improves the user experience.
In the present embodiment, because the deduction amount to be subtracted from the original data volume of each item at each time point is calculated using equation (1) above, the original data volume changes with time according to:
$y_m = a\,(t_m - t_0)^{2} + b\,(t_m - t_0) + B$
where $y_m$ is the original data volume after the deduction at the m-th time point, $B$ is the original data volume before any deduction, $\Delta t$ is the deduction period, $t_0$ is the cleaning start time, $t_m$ is the time of the m-th time point, $a$ is a preset value less than zero, and $b$ is given by an expression that appears only as an equation image in the source (Figure BDA0002487775270000091) and is not reproduced here.
It can be seen that the original data volume decreases over time along a parabolic trend.
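As a worked check of how equation (1) relates to the parabola above, the short derivation below treats the parabola as a function y(t) of continuous time. That is an assumption made here for illustration, since the text defines y only at the time points. Under that assumption, the right-hand side of equation (1) equals the change of y over one deduction period starting at $t_m$:

```latex
\begin{aligned}
y(t) &= a\,(t - t_0)^{2} + b\,(t - t_0) + B \\
y(t_m + \Delta t) - y(t_m)
  &= a\left[(t_m - t_0 + \Delta t)^{2} - (t_m - t_0)^{2}\right] + b\,\Delta t \\
  &= a\,\Delta t^{2} + \left[2a\,(t_m - t_0) + b\right]\Delta t \\
Q_{m+1} - Q_m &= 2a\,\Delta t^{2} \qquad (\text{since } t_{m+1} - t_m = \Delta t)
\end{aligned}
```

Because a is nonzero, consecutive values of equation (1) always differ by $2a\,\Delta t^{2}$, which agrees with the statement that the deduction amounts at different time points differ from each other. The sign convention of $Q_m$ and the expression for b cannot be recovered from the text, so both are left open here.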
An electronic device 800 according to this embodiment of the invention is described below with reference to FIG. 3. The electronic device 800 shown in FIG. 3 is only an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present invention.
As shown in FIG. 3, the electronic device 800 takes the form of a general-purpose computing device. The components of the electronic device 800 may include, but are not limited to: at least one processing unit 810, at least one storage unit 820, and a bus 830 that couples the various system components, including the storage unit 820 and the processing unit 810.
The storage unit stores program code that is executable by the processing unit 810 to cause the processing unit 810 to perform the steps according to various exemplary embodiments of the present invention described in the "exemplary methods" section of this specification.
The storage unit 820 may include readable media in the form of volatile memory units such as a random access memory unit (RAM)8201 and/or a cache memory unit 8202, and may further include a read only memory unit (ROM) 8203.
The storage unit 820 may also include a program/utility 8204 having a set (at least one) of program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 830 may be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 800 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 800, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 800 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 850. Also, the electronic device 800 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via the network adapter 860. As shown, the network adapter 860 communicates with the other modules of the electronic device 800 via the bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the data cleaning method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing one of the data cleansing methods described above in the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.
Referring to FIG. 4, a program product 900 for implementing the above data cleaning method according to an embodiment of the present invention is described. It may take the form of a portable compact disc read-only memory (CD-ROM) containing program code, and may be run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the data cleaning method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
Although the present invention has been disclosed with reference to certain embodiments, numerous variations and modifications may be made to the described embodiments without departing from the scope and ambit of the present invention. It is to be understood, therefore, that this invention is not limited to the particular embodiments disclosed, but it is intended to cover modifications within the scope of the appended claims and their equivalents.

Claims (10)

1. A data cleaning method for a statistical ranking list, characterized by comprising the following steps:
periodically identifying the data to be cleaned of each item from the original data of that item;
receiving a cleaning instruction, wherein the cleaning instruction comprises a cleaning start time, a cleaning end time and a cleaning range;
when the current time reaches the cleaning start time, counting the original data volume of all items and the to-be-cleaned data volume of the items within the cleaning range;
and, in the period from the cleaning start time to the cleaning end time, deducting from the original data volume of the items in the cleaning range once every deduction period, and re-ranking the original data volumes of all items after each deduction, wherein the total original data volume deducted from an item over the period is equal to the to-be-cleaned data volume of that item.
2. The data cleaning method of claim 1, wherein the step of, in the period from the cleaning start time to the cleaning end time, deducting from the original data volume of the items in the cleaning range once every deduction period and re-ranking the original data volumes of all items after each deduction, wherein the total original data volume deducted from an item over the period is equal to the to-be-cleaned data volume of that item, comprises the following steps:
calculating the time point of each deduction according to the cleaning start time, the cleaning end time and the deduction period;
calculating, for each time point, the deduction amount to be deducted from the original data volume of each item in the cleaning range according to the time of that time point, the cleaning start time and the to-be-cleaned data volume;
when the current time reaches any time point, deducting, from the original data volume of each item in the cleaning range, the deduction amount of that item corresponding to the time point currently reached, and re-ranking the original data volumes of all items after each deduction to update the ranking list.
3. The data cleansing method according to claim 2, wherein the deduction amounts at the respective time points for the original data volume of the same item in the cleaning range are different from each other.
4. The data cleansing method according to claim 2, wherein the deduction amount to be deducted from the original data volume of each item at each time point is calculated according to the following equation:
Figure FDA0002487775260000011
wherein Q_m is the deduction amount corresponding to the m-th time point, Δt is the deduction period, t_0 is the cleaning start time, t_m is the time of the m-th time point, a is a preset value less than zero,
Figure FDA0002487775260000021
and A is the total data volume to be cleaned.
5. The data cleansing method of claim 4, wherein a is generated by a random number generator.
6. The data cleaning method of claim 2, wherein the step of, when the current time reaches any time point, deducting, from the original data volume of each item in the cleaning range, the deduction amount of that item corresponding to the time point currently reached, and re-ranking the original data volumes of all items after each deduction to update the ranking list, comprises the following steps:
a deduction step: after the current time reaches a time point, deducting, from the original data volume of each item in the cleaning range, the deduction amount corresponding to the time point currently reached;
re-ranking the original data volumes of all items;
and judging whether a time point that has not yet been reached exists after the current time, and if so, returning to the deduction step.
7. The data cleaning method according to any one of claims 1 to 6, wherein the data cleaning method is applied to an e-commerce scenario, the original data is sales records, the original data volume is a sales volume, the data to be cleaned is fake sales records, and the to-be-cleaned data volume is a fake sales volume.
8. A data cleaning device for a statistical ranking list, characterized by comprising:
an information processing module;
a data module, connected to the information processing module and used for periodically identifying the data to be cleaned of each item from the original data of that item;
an instruction module, connected to the information processing module and used for sending a cleaning instruction to the information processing module, wherein the cleaning instruction comprises a cleaning start time, a cleaning end time and a cleaning range;
wherein the information processing module is used for deducting from the original data volume of the items in the cleaning range once every deduction period within the period from the cleaning start time to the cleaning end time, and re-ranking the original data volumes of all items after each deduction, wherein the total original data volume deducted from an item over the period is equal to the to-be-cleaned data volume of that item.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the data cleansing method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the data cleansing method of any of claims 1-7 via execution of the executable instructions.
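
The gradual deduction and re-ranking flow recited in claims 1, 2 and 6 can be illustrated with a minimal Python sketch. It rests on stated assumptions: the equation of claim 4 is not reproduced in this text, so the weighting exp(a·(t_m − t_0)/Δt) used below is a hypothetical placeholder chosen only because it takes a preset a < 0 and yields a different amount at each time point; the function names, the default a = −0.5 and the sample sales figures are likewise illustrative and do not come from the specification.

```python
import math
from typing import Dict, List


def deduction_time_points(start: float, end: float, period: float) -> List[float]:
    """Time points spaced one deduction period apart, within (start, end]."""
    count = int((end - start) // period)
    return [start + period * m for m in range(1, count + 1)]


def deduction_amounts(total: int, points: List[float], start: float,
                      period: float, a: float = -0.5) -> List[int]:
    """Split `total` into one integer deduction per time point.

    ASSUMPTION: the weights exp(a * (t_m - t_0) / period) are a placeholder,
    not the equation of claim 4, which is unavailable in this text.
    """
    if not points:
        raise ValueError("cleaning window is shorter than one deduction period")
    weights = [math.exp(a * (t - start) / period) for t in points]
    weight_sum = sum(weights)
    amounts, cumulative, done = [], 0.0, 0
    for w in weights:
        cumulative += w
        target = round(total * cumulative / weight_sum)  # cumulative deduction so far
        amounts.append(target - done)
        done = target
    amounts[-1] += total - done  # absorb any floating-point rounding residue
    return amounts


def run_cleaning(volumes: Dict[str, int], to_clean: Dict[str, int],
                 start: float, end: float, period: float) -> List[List[str]]:
    """Deduct at every time point and re-rank all items after each deduction."""
    points = deduction_time_points(start, end, period)
    schedule = {item: deduction_amounts(n, points, start, period)
                for item, n in to_clean.items()}
    rankings = []
    # A real system would wait until the clock reaches points[m]; here we just iterate.
    for m in range(len(points)):
        for item, amounts in schedule.items():
            volumes[item] -= amounts[m]
        rankings.append(sorted(volumes, key=lambda k: volumes[k], reverse=True))
    return rankings


if __name__ == "__main__":
    sales = {"A": 1000, "B": 600, "C": 500}   # original data volumes (hypothetical)
    fake = {"A": 600}                          # to-be-cleaned volume within the cleaning range
    for ranking in run_cleaning(sales, fake, start=0, end=40, period=10):
        print(ranking)
```

Computing each deduction as the difference of rounded cumulative targets keeps every amount an integer while guaranteeing that the amounts for an item add up exactly to its to-be-cleaned data volume, which is the invariant stated in claim 1.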
CN202010396586.3A 2020-05-12 2020-05-12 Data cleaning method, data cleaning device, storage medium and electronic equipment Pending CN113656385A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010396586.3A CN113656385A (en) 2020-05-12 2020-05-12 Data cleaning method, data cleaning device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010396586.3A CN113656385A (en) 2020-05-12 2020-05-12 Data cleaning method, data cleaning device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113656385A true CN113656385A (en) 2021-11-16

Family

ID=78476857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010396586.3A Pending CN113656385A (en) 2020-05-12 2020-05-12 Data cleaning method, data cleaning device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113656385A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108536395A (en) * 2018-04-03 2018-09-14 北京京东尚科信息技术有限公司 A kind of method and apparatus of cleaning hard disk
CN109299169A (en) * 2018-10-24 2019-02-01 中国平安人寿保险股份有限公司 Data visualization method, system, terminal and computer readable storage medium
CN109766497A (en) * 2019-01-22 2019-05-17 网易(杭州)网络有限公司 Ranking list generation method and device, storage medium, electronic equipment
CN110401843A (en) * 2019-08-06 2019-11-01 广州虎牙科技有限公司 List data-updating method, device, equipment and medium in platform is broadcast live
US20200042611A1 (en) * 2018-07-31 2020-02-06 Market Advantage, Inc. System, computer program product and method for generating embeddings of textual and quantitative data
US20210357795A1 (en) * 2020-05-15 2021-11-18 International Business Machines Corporation Transferring large datasets by using data generalization
US20230297542A1 (en) * 2022-02-25 2023-09-21 Timothy John Ryder Shinkle Cloud based AI Recycle Bin (AiRB)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
PENG, N等: "Finding Interesting Cleaning Rules from Dirty Data", 2017 10TH INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE AND DESIGN (ISCID), 7 May 2018 (2018-05-07) *
唐钰;陈浩;叶柏龙;: "基于逆向清理的实时异构数据整合模型研究", 计算机工程, no. 23, 5 December 2012 (2012-12-05) *
杜巍;高长元;翟丽丽;: "基于新鲜度度量的多样性推荐模型研究", 情报理论与实践, no. 08, 1 March 2018 (2018-03-01) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination