CN109947757B - System and method for cleaning and processing mass data in real time - Google Patents

Info

Publication number
CN109947757B
CN109947757B (application CN201910231734.3A)
Authority
CN
China
Prior art keywords
data
data packet
packet
storing
original
Prior art date
Legal status
Active
Application number
CN201910231734.3A
Other languages
Chinese (zh)
Other versions
CN109947757A (en)
Inventor
苟雨轩
谢赟
周龙
Current Assignee
Shanghai Datatom Information Technology Co ltd
Original Assignee
Shanghai Datatom Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Datatom Information Technology Co ltd
Priority to CN201910231734.3A
Publication of CN109947757A
Application granted
Publication of CN109947757B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a system for cleaning and processing mass data in real time, comprising: a data storage module, which integrates source data from all parties into original data packets and stores them; a data sub-packet distribution module, which periodically classifies the original data packets stored in the data storage module into sub-packets, stores the sub-packets, and then deletes the original packets; a data cleaning module, which periodically parses each classified sub-packet in turn and then verifies and deduplicates the parsed data record by record; and a distributed storage module, which stores the qualified data cleaned by the data cleaning module. The invention also discloses a corresponding method for cleaning and processing mass data in real time. The invention can clean and store mass data in real time.

Description

System and method for real-time cleaning and processing of mass data
Technical Field
The invention relates to the technical field of data processing, and in particular to the real-time cleaning and processing of mass data.
Background
In the early days of the internet, data storage needs could be met by open-source relational databases such as MySQL and PostgreSQL, and data cleaning and storage posed no obstacles. With the development of the internet, data of all kinds has grown explosively, and the amount of non-conforming data has grown with the data volume; data cleaning, data storage, and data timeliness can no longer keep pace. On the storage side, the industry has slowly evolved from relational databases to distributed Hadoop storage (Hadoop is a distributed system infrastructure developed by the Apache Foundation), but this improves only the storage side.
Data timeliness means, in short, that collected data can be obtained in real time; however, collected data can only be used after it has been verified as qualified and deduplicated. Most existing technologies are time-consuming at the cleaning stage and cannot process data in real time, so data piles up and timeliness is lost. Cleaning and storing large volumes of data in real time is therefore the core requirement. At present only Alibaba Cloud products and the in-house real-time big-data systems of a few companies achieve this effect, but such products are expensive and unaffordable for ordinary companies; some open-source architecture schemes can also achieve real-time processing, but they place extremely high technical demands on engineers.
Disclosure of Invention
The invention aims to provide a system and a method for cleaning and processing mass data in real time, capable of cleaning and storing mass data in real time.
The technical solution for achieving this aim is as follows:
The system for cleaning and processing mass data in real time according to the invention comprises:
a data storage module, which integrates source data from all parties into original data packets and stores them;
a data sub-packet distribution module, which periodically classifies the original data packets stored in the data storage module into sub-packets, stores the sub-packets, and then deletes the original packets;
a data cleaning module, which periodically parses each classified sub-packet in turn and then verifies and deduplicates the parsed data record by record; and
a distributed storage module, which stores the qualified data cleaned by the data cleaning module.
Preferably, the system further comprises an internet interaction module, connected to the distributed storage module, for human-machine interaction to perform real-time data statistics and real-time data search.
Preferably, the data storage module comprises:
a first folder module, which stores the original data packets; and
a second folder module, which comprises several subfolders storing the classified sub-packets.
The invention also provides a method for cleaning and processing mass data in real time, comprising the following steps:
integrating source data from all parties into original data packets and storing them;
periodically classifying the original data packets into sub-packets, storing the sub-packets, and then deleting the original packets;
periodically parsing each classified sub-packet in turn, and then verifying and deduplicating the parsed data record by record; and
performing distributed storage of the qualified data.
Preferably, the original data packets are stored in one folder, and the classified sub-packets are stored in the subfolders of another folder.
Preferably, parsing a data packet comprises: if the packet does not exist or contains no data, deleting it; otherwise, decompressing it.
Verifying and deduplicating the parsed data record by record comprises the following steps:
checking whether every mandatory field exists; if any is missing, moving on to verify the next record; otherwise, proceeding to the next step;
checking whether every mandatory field conforms to its preset rule; if any does not, moving on to verify the next record; otherwise, proceeding to the next step;
concatenating the several mandatory fields into a character string, encrypting it, and inserting it into a PostgreSQL table; if the insert succeeds, storing the record; if it fails, deleting the record and verifying the next record.
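The verify-and-deduplicate loop above can be sketched in Python. This is an illustrative sketch, not code from the patent: the field names are hypothetical, MD5 stands in for the unspecified one-way encryption, and an in-memory set stands in for the PostgreSQL table whose unique key rejects a duplicate insert.

```python
import hashlib

# Hypothetical mandatory field names; the patent does not name them.
MANDATORY_FIELDS = ("field_a", "field_b", "field_c")

def fingerprint(record):
    """One-way encrypt (here: MD5) the concatenated mandatory fields."""
    joined = "|".join(str(record[f]) for f in MANDATORY_FIELDS)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def clean(records, rules, seen):
    """Record-by-record verification and deduplication.

    `seen` stands in for the PostgreSQL table with a unique key on the
    fingerprint column: `fp in seen` corresponds to a failed insert.
    `rules` maps a field name to a predicate; fields without a rule pass.
    """
    kept = []
    for rec in records:
        # Step 1: every mandatory field must be present.
        if any(f not in rec for f in MANDATORY_FIELDS):
            continue  # drop this record, verify the next one
        # Step 2: every mandatory field must satisfy its preset rule.
        if any(not rules.get(f, lambda v: True)(rec[f])
               for f in MANDATORY_FIELDS):
            continue
        # Step 3: "insert" the fingerprint; a duplicate is dropped.
        fp = fingerprint(rec)
        if fp in seen:
            continue
        seen.add(fp)
        kept.append(rec)
    return kept
```

A usage note: replacing `seen` with a real `INSERT` into a uniquely keyed PostgreSQL table gives the same semantics, with the database arbitrating duplicates across concurrent cleaning tasks.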
Preferably, the method further comprises: entering keywords, and performing real-time statistics and real-time search over the distributed-stored data according to the keywords.
The beneficial effects of the invention are as follows. The invention distributes data into sub-packets and deletes the data from the original directory, dispersing the data and breaking the whole into parts to facilitate cleaning, while the deletion of the original packets avoids repeated reads of the same packet and wasted disk space. The invention uses a PostgreSQL (relational database management system) database with a temporary table for intermediate deduplication. The invention stores data in Elasticsearch (a Lucene-based search server combining a distributed search engine and a data analysis engine), using the _bulk endpoint of its REST API (Representational State Transfer) for batch inserts, achieving tens of thousands of insertions per second and supporting full-text retrieval, thereby achieving data timeliness and real-time search over stored data. The invention deploys Elasticsearch as a distributed cluster, which on the one hand reduces the load on each service, and on the other hand ensures that if one service stops, overall data processing and search are unaffected, giving good fault tolerance.
Drawings
FIG. 1 is a block diagram of a system for real-time cleaning of mass data in accordance with the present invention;
FIG. 2 is a flow chart of a method of the present invention for real-time cleaning of mass data;
FIG. 3 is a flow chart of step S2 of the method for real-time cleaning processing of mass data of the present invention;
FIG. 4 is a flow chart of steps S3 and S4 of the method for real-time cleaning processing of mass data of the present invention;
fig. 5 is a flow chart of the present invention for checking and deduplication of parsed data item by item.
Detailed Description
The invention will be further explained with reference to the drawings.
Referring to fig. 1, the system for real-time cleaning and processing mass data of the present invention includes: the system comprises a data storage module 1, a data sub-packet distribution module 2, a data cleaning module 3, a distributed storage module 4 and an internet interaction module 5.
The data storage module 1 integrates the source data from all parties into original data packets. The data sub-packet distribution module 2 periodically classifies the original data packets stored in the data storage module 1 into sub-packets, has the data storage module 1 store the sub-packets, and then deletes the original packets.
The data cleaning module 3 periodically parses the classified sub-packets in turn, verifies and deduplicates the parsed data record by record, and the qualified data is stored by the distributed storage module 4. Finally, through the internet interaction module 5, real-time data statistics and real-time data search are performed on a web page by clicking a button or entering keywords in an input box; through the interaction of these modules, the original data is cleaned in real time and can be searched in real time.
The data storage module 1 comprises a first folder module and a second folder module. The original data packets are stored in the first folder module. The second folder module comprises several subfolders that store the classified sub-packets.
Referring to fig. 2, a method for real-time cleaning and processing mass data according to the present invention includes the following steps:
and S1, integrating source data of all parties into an original data packet and storing the original data packet. In this embodiment, 3 servers with 2c4G storage disks of 64G are prepared, the operating system is linux centros 7, and the servers are numbered as 1,2 and 3. Selecting a server No. 1 as original data storage, newly building two folders in an opt folder, for example, two folders of data (data) and branchdata (branch data), wherein the data folder is used for storing original data packets, the branchdata is used for storing sub-packet data packets, and 10 sub-folders are newly built under the branchdata, named as No. 1-10, and are used for randomly storing the data packets.
S2: periodically classify the original data packets into sub-packets, store them, and then delete the original packets. In combination with the above, one system timed task is started on server No. 1, executing the cleaning program once every minute. Each run randomly draws 40 packets from the data folder and moves them at random into folders 1-10 under the branchdata directory; after the move completes, the 40 packets are deleted from the data directory. This is shown in fig. 3.
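The random draw-and-move of step S2 can be sketched in Python. Illustrative only: the batch size of 40 and the 1-10 subfolder range come from the embodiment; the function name and signature are assumptions.

```python
import random
import shutil
from pathlib import Path

def distribute(data_dir, branch_dir, batch=40):
    """Step S2: draw up to `batch` packets at random from the data folder
    and move each into a randomly chosen subfolder 1-10 of branchdata.
    shutil.move removes the source file, which matches the patent's
    delete-after-move requirement without a separate delete pass."""
    packets = [p for p in Path(data_dir).iterdir() if p.is_file()]
    chosen = random.sample(packets, min(batch, len(packets)))
    for pkt in chosen:
        target = Path(branch_dir) / str(random.randint(1, 10))
        shutil.move(str(pkt), str(target / pkt.name))
    return len(chosen)
```

Run once per minute (e.g. from cron on server No. 1), this disperses the raw packets so that several cleaning tasks can each lock a different subfolder.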
S3: periodically parse each classified sub-packet in turn, then verify and deduplicate the parsed data record by record. In this embodiment, the open-source relational database PostgreSQL is installed on server No. 1; a table is created containing a field unique_key, which is set as a unique key; the table is configured as a temporary table and write-error logging is disabled. Three system timed tasks are started on server No. 1, each executing once per minute. Each run randomly locks one folder among 1-10 in the branchdata directory and then begins decompressing packets; if a packet decompresses to no data or is an error packet (not a data packet), it is deleted directly and the next packet is decompressed. After the JSON (JavaScript Object Notation) data of a decompressed packet is obtained, it is converted into an array and each check is performed step by step; data that does not conform to the preset rules is deleted directly and its count is recorded. The preset rules are designed per field, for example phone-number checking: the phone number must be an 11-digit number whose first digit is 1, whose second digit is 3, 4, 5, 7, or 8, and whose remaining 9 digits are 0-9; the phone number 16034568790 therefore fails the rule. For qualified data, key fields are selected and concatenated into a character string, for example fields A, B, and C; the string is one-way encrypted and inserted into the PostgreSQL table. If the insert succeeds, the record is not a duplicate; it is appended to array A and its count is recorded.
If the insert returns an error, the record is a duplicate and is dropped. Finally, the data in array A is handed to the storage module and the decompressed packet is deleted. This is shown in figs. 4 and 5.
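The phone-number preset rule described above maps naturally to a regular expression. A minimal Python sketch using only the rule as stated (11 digits, first digit 1, second digit 3, 4, 5, 7, or 8, remaining digits 0-9); the function name is an assumption.

```python
import re

# Preset rule from the embodiment: an 11-digit number starting with 1
# whose second digit is 3, 4, 5, 7 or 8.
PHONE_RULE = re.compile(r"1[34578]\d{9}")

def phone_ok(value):
    """Return True if `value` conforms to the preset phone-number rule."""
    return PHONE_RULE.fullmatch(str(value)) is not None
```

The document's own example behaves as expected: 16034568790 fails because its second digit is 6.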
S4: perform distributed storage of the qualified data. Elasticsearch is installed on servers No. 2 and No. 3, and the two servers are deployed together as a distributed cluster. The data to be stored is inserted into Elasticsearch in random batches on server No. 2 or No. 3, and the data can also be pushed to a third-party data interface.
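Elasticsearch's `_bulk` REST endpoint expects an NDJSON body of alternating action and source lines, which is what enables the batch-insert rates the invention relies on. A minimal Python sketch that builds such a body; the index name is illustrative, and the HTTP call itself (e.g. a POST to the cluster's /_bulk URL) is omitted.

```python
import json

def bulk_body(index, docs):
    """Build the NDJSON body for the Elasticsearch _bulk endpoint:
    one action line ({"index": ...}) followed by one source line
    per document, terminated by a trailing newline as _bulk requires."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

Batching many records into one `_bulk` request, rather than one request per record, is the design choice that makes tens of thousands of insertions per second feasible.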
A web service is built on server No. 3. The user enters a key field on the page, a request is then issued to the distributed cluster, the key data is searched in Elasticsearch, and statistics are computed over the data.
During data collection, the data provider pushes 80-100 data packets per minute, each containing 1000-5000 records, which is equivalent to 80,000-500,000 records pushed per minute; the data contains a large amount of non-conforming data and a large amount of duplicate data, and the client requires the data to be searchable in real time.
The above embodiments are provided only to illustrate the invention and not to limit it. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention; all equivalent technical solutions therefore also fall within the scope of the invention, which should be defined by the claims.

Claims (4)

1. A system for cleaning and processing mass data in real time, characterized by comprising:
a data storage module, which integrates source data from all parties into original data packets and stores them;
a data sub-packet distribution module, which periodically classifies the original data packets stored in the data storage module into sub-packets, stores the sub-packets, and then deletes the original packets;
a data cleaning module, which periodically parses each classified sub-packet in turn and then verifies and deduplicates the parsed data record by record; and
a distributed storage module, which stores the qualified data cleaned by the data cleaning module;
the data storage module comprising:
a first folder module, which stores the original data packets; and
a second folder module, which comprises several subfolders storing the classified sub-packets;
wherein parsing a data packet comprises: if the packet does not exist or contains no data, deleting it; otherwise, decompressing it;
and wherein verifying and deduplicating the parsed data record by record comprises the following steps:
checking whether every mandatory field exists; if any is missing, moving on to verify the next record; otherwise, proceeding to the next step;
checking whether every mandatory field conforms to its preset rule; if any does not, moving on to verify the next record; otherwise, proceeding to the next step;
concatenating the several mandatory fields into a character string, encrypting it, and inserting it into a PostgreSQL table; if the insert succeeds, storing the record; if it fails, deleting the record and verifying the next record.
2. The system for cleaning and processing mass data in real time according to claim 1, characterized by further comprising: an internet interaction module, connected to the distributed storage module, for human-machine interaction to perform real-time data statistics and real-time data search.
3. A method for cleaning and processing mass data in real time, characterized by comprising the following steps:
integrating source data from all parties into original data packets and storing them;
periodically classifying the original data packets into sub-packets, storing the sub-packets, and then deleting the original packets;
periodically parsing each classified sub-packet in turn, and then verifying and deduplicating the parsed data record by record; and
performing distributed storage of the qualified data;
wherein the original data packets are stored in one folder, and the classified sub-packets are stored in the subfolders of another folder;
wherein parsing a data packet comprises: if the packet does not exist or contains no data, deleting it; otherwise, decompressing it;
and wherein verifying and deduplicating the parsed data record by record comprises the following steps:
checking whether every mandatory field exists; if any is missing, moving on to verify the next record; otherwise, proceeding to the next step;
checking whether every mandatory field conforms to its preset rule; if any does not, moving on to verify the next record; otherwise, proceeding to the next step;
concatenating the several mandatory fields into a character string, encrypting it, and inserting it into a PostgreSQL table; if the insert succeeds, storing the record; if it fails, deleting the record and verifying the next record.
4. The method for cleaning and processing mass data in real time according to claim 3, characterized in that the method further comprises: entering keywords, and performing real-time statistics and real-time search over the distributed-stored data according to the keywords.
CN201910231734.3A 2019-03-26 2019-03-26 System and method for cleaning and processing mass data in real time Active CN109947757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910231734.3A CN109947757B (en) 2019-03-26 2019-03-26 System and method for cleaning and processing mass data in real time


Publications (2)

Publication Number Publication Date
CN109947757A CN109947757A (en) 2019-06-28
CN109947757B (en) 2023-03-14

Family

ID=67010913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910231734.3A Active CN109947757B (en) 2019-03-26 2019-03-26 System and method for cleaning and processing mass data in real time

Country Status (1)

Country Link
CN (1) CN109947757B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002099574A (en) * 2000-09-26 2002-04-05 Tsutaya Online:Kk Content search system and collection and control system for content search data
CN106709003A (en) * 2016-12-23 2017-05-24 长沙理工大学 Hadoop-based mass log data processing method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hadoop在气象数据存储中的应用 (Application of Hadoop in meteorological data storage); 任晓炜 et al.; 《气象研究与应用》 (Journal of Meteorological Research and Application); 2019-03-15 (No. 01); full text *


Similar Documents

Publication Publication Date Title
US8918363B2 (en) Data processing service
CN102810089B (en) Short link system and implementation method based on content
US20120016901A1 (en) Data Storage and Processing Service
US20160314160A1 (en) Database system and method
CN105069111B (en) Block level data duplicate removal method based on similitude in cloud storage
CN106452450B (en) Method and system for data compression
US10606816B2 (en) Compression-aware partial sort of streaming columnar data
CN107729399B (en) Data processing method and device
CN102460404A (en) Generating obfuscated data
CN103902698A (en) Data storage system and data storage method
CN104021132A (en) Method and system for verification of consistency of backup data of host database and backup database
CN104573124A (en) Education cloud application statistics method based on parallelized association rule algorithm
US10296614B2 (en) Bulk data insertion in analytical databases
CN107515878A (en) The management method and device of a kind of data directory
CN104584524A (en) Aggregating data in a mediation system
EP2556446A1 (en) Columnar storage representations of records
CN110727406A (en) Data storage scheduling method and device
CN106649602A (en) Way, device and server of processing business object data
CN110109874A (en) A kind of non-stop layer distributed document retrieval method based on block chain
CN104881475A (en) Method and system for randomly sampling big data
CN109947757B (en) System and method for cleaning and processing mass data in real time
Du et al. Deduplicated disk image evidence acquisition and forensically-sound reconstruction
CN113419896A (en) Data recovery method and device, electronic equipment and computer readable medium
Srivastava Learning Elasticsearch 7. x: Index, Analyze, Search and Aggregate Your Data Using Elasticsearch (English Edition)
Cormode et al. Time‐decayed correlated aggregates over data streams

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant