CN109947757B - System and method for cleaning and processing mass data in real time - Google Patents

Info

Publication number
CN109947757B
CN109947757B (application CN201910231734.3A)
Authority
CN
China
Prior art keywords
data
data packet
packet
storing
original
Prior art date
Legal status
Active
Application number
CN201910231734.3A
Other languages
Chinese (zh)
Other versions
CN109947757A (en)
Inventor
苟雨轩
谢赟
周龙
Current Assignee
Shanghai Datatom Information Technology Co ltd
Original Assignee
Shanghai Datatom Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Datatom Information Technology Co ltd
Priority to CN201910231734.3A
Publication of CN109947757A
Application granted
Publication of CN109947757B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a system for cleaning and processing mass data in real time, comprising: a data storage module, which integrates source data from all parties into original data packets and stores them; a data sub-packet distribution module, which periodically classifies the original data packets stored in the data storage module into sub-packets, stores the sub-packets, and then deletes the original packets; a data cleaning module, which periodically parses each classified sub-packet in turn and then verifies and deduplicates the parsed data record by record; and a distributed storage module, which stores the qualified data cleaned by the data cleaning module. The invention also discloses a corresponding method for cleaning and processing mass data in real time. The invention can clean and store mass data in real time.

Description

System and method for real-time cleaning and processing of mass data
Technical Field
The invention relates to the technical field of data processing, and in particular to the real-time cleaning and processing of mass data.
Background
In the early days of the internet, data storage needs could be met by open-source relational databases such as MySQL and PostgreSQL, and data cleaning and storage posed no obstacles. With the development of the internet, data of all kinds has grown explosively, and the amount of non-conforming data has grown with the data volume; data cleaning, data storage, and data timeliness can no longer keep pace. On the storage side, the industry has slowly evolved from relational databases to distributed Hadoop storage (Hadoop is a distributed system infrastructure developed by the Apache Foundation), but this improves only the storage side.
Data timeliness means, in short, that collected data can be obtained in real time; however, collected data can only be used after it has been verified as qualified and deduplicated. Most existing technologies are time-consuming at the cleaning stage and cannot process data in real time, so data piles up and timeliness is lost. Cleaning and storing large volumes of data in real time is therefore the core requirement. At present only Alibaba Cloud products and the in-house real-time big-data systems of a few companies achieve this effect, but such products are expensive and unaffordable for ordinary companies; some open-source architecture schemes can also achieve real-time processing, but they place extremely high technical demands on engineers.
Disclosure of Invention
The invention aims to provide a system and a method for cleaning and processing mass data in real time, capable of cleaning and storing mass data in real time.
The technical solution for achieving this aim is as follows:
The system for cleaning and processing mass data in real time according to the invention comprises:
a data storage module, which integrates source data from all parties into original data packets and stores them;
a data sub-packet distribution module, which periodically classifies the original data packets stored in the data storage module into sub-packets, stores the sub-packets, and then deletes the original packets;
a data cleaning module, which periodically parses each classified sub-packet in turn and then verifies and deduplicates the parsed data record by record; and
a distributed storage module, which stores the qualified data cleaned by the data cleaning module.
Preferably, the system further comprises an internet interaction module, connected to the distributed storage module, for human-machine interaction to perform real-time data statistics and real-time data search.
Preferably, the data storage module comprises:
a first folder module, which stores the original data packets; and
a second folder module, which comprises several subfolders storing the classified sub-packets.
The invention also provides a method for cleaning and processing mass data in real time, comprising the following steps:
integrating source data from all parties into original data packets and storing them;
periodically classifying the original data packets into sub-packets, storing the sub-packets, and then deleting the original packets;
periodically parsing each classified sub-packet in turn, and then verifying and deduplicating the parsed data record by record; and
performing distributed storage of the qualified data.
Preferably, the original data packets are stored in one folder, and the classified sub-packets are stored in the subfolders of another folder.
Preferably, parsing a data packet comprises: if the packet does not exist or contains no data, deleting it; otherwise, decompressing it.
Verifying and deduplicating the parsed data record by record comprises the following steps:
checking whether every mandatory field exists; if any is missing, moving on to verify the next record; otherwise, proceeding to the next step;
checking whether every mandatory field conforms to its preset rule; if any does not, moving on to verify the next record; otherwise, proceeding to the next step;
concatenating the several mandatory fields into a character string, encrypting it, and inserting it into a PostgreSQL table; if the insert succeeds, storing the record; if it fails, deleting the record and verifying the next record.
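The verify-and-deduplicate loop above can be sketched in Python. This is an illustrative sketch, not code from the patent: the field names are hypothetical, MD5 stands in for the unspecified one-way encryption, and an in-memory set stands in for the PostgreSQL table whose unique key rejects a duplicate insert.

```python
import hashlib

# Hypothetical mandatory field names; the patent does not name them.
MANDATORY_FIELDS = ("field_a", "field_b", "field_c")

def fingerprint(record):
    """One-way encrypt (here: MD5) the concatenated mandatory fields."""
    joined = "|".join(str(record[f]) for f in MANDATORY_FIELDS)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def clean(records, rules, seen):
    """Record-by-record verification and deduplication.

    `seen` stands in for the PostgreSQL table with a unique key on the
    fingerprint column: `fp in seen` corresponds to a failed insert.
    `rules` maps a field name to a predicate; fields without a rule pass.
    """
    kept = []
    for rec in records:
        # Step 1: every mandatory field must be present.
        if any(f not in rec for f in MANDATORY_FIELDS):
            continue  # drop this record, verify the next one
        # Step 2: every mandatory field must satisfy its preset rule.
        if any(not rules.get(f, lambda v: True)(rec[f])
               for f in MANDATORY_FIELDS):
            continue
        # Step 3: "insert" the fingerprint; a duplicate is dropped.
        fp = fingerprint(rec)
        if fp in seen:
            continue
        seen.add(fp)
        kept.append(rec)
    return kept
```

A usage note: replacing `seen` with a real `INSERT` into a uniquely keyed PostgreSQL table gives the same semantics, with the database arbitrating duplicates across concurrent cleaning tasks.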
Preferably, the method further comprises: entering keywords, and performing real-time statistics and real-time search over the distributed-stored data according to the keywords.
The beneficial effects of the invention are as follows. The invention distributes data into sub-packets and deletes the data from the original directory, dispersing the data and breaking the whole into parts to facilitate cleaning, while the deletion of the original packets avoids repeated reads of the same packet and wasted disk space. The invention uses a PostgreSQL (relational database management system) database with a temporary table for intermediate deduplication. The invention stores data in Elasticsearch (a Lucene-based search server combining a distributed search engine and a data analysis engine), using the _bulk endpoint of its REST API (Representational State Transfer) for batch inserts, achieving tens of thousands of insertions per second and supporting full-text retrieval, thereby achieving data timeliness and real-time search over stored data. The invention deploys Elasticsearch as a distributed cluster, which on the one hand reduces the load on each service, and on the other hand ensures that if one service stops, overall data processing and search are unaffected, giving good fault tolerance.
Drawings
FIG. 1 is a block diagram of a system for real-time cleaning of mass data in accordance with the present invention;
FIG. 2 is a flow chart of a method of the present invention for real-time cleaning of mass data;
FIG. 3 is a flow chart of step S2 of the method for real-time cleaning processing of mass data of the present invention;
FIG. 4 is a flow chart of steps S3 and S4 of the method for real-time cleaning processing of mass data of the present invention;
fig. 5 is a flow chart of the present invention for checking and deduplication of parsed data item by item.
Detailed Description
The invention will be further explained with reference to the drawings.
Referring to fig. 1, the system for real-time cleaning and processing mass data of the present invention includes: the system comprises a data storage module 1, a data sub-packet distribution module 2, a data cleaning module 3, a distributed storage module 4 and an internet interaction module 5.
The data storage module 1 integrates the source data from all parties into original data packets. The data sub-packet distribution module 2 periodically classifies the original data packets stored in the data storage module 1 into sub-packets, has the data storage module 1 store the sub-packets, and then deletes the original packets.
The data cleaning module 3 periodically parses the classified sub-packets in turn, verifies and deduplicates the parsed data record by record, and the qualified data is stored by the distributed storage module 4. Finally, through the internet interaction module 5, real-time data statistics and real-time data search are performed on a web page by clicking a button or entering keywords in an input box; through the interaction of these modules, the original data is cleaned in real time and can be searched in real time.
The data storage module 1 comprises a first folder module and a second folder module. The original data packets are stored in the first folder module. The second folder module comprises several subfolders that store the classified sub-packets.
Referring to fig. 2, a method for real-time cleaning and processing mass data according to the present invention includes the following steps:
and S1, integrating source data of all parties into an original data packet and storing the original data packet. In this embodiment, 3 servers with 2c4G storage disks of 64G are prepared, the operating system is linux centros 7, and the servers are numbered as 1,2 and 3. Selecting a server No. 1 as original data storage, newly building two folders in an opt folder, for example, two folders of data (data) and branchdata (branch data), wherein the data folder is used for storing original data packets, the branchdata is used for storing sub-packet data packets, and 10 sub-folders are newly built under the branchdata, named as No. 1-10, and are used for randomly storing the data packets.
S2: periodically classify the original data packets into sub-packets, store them, and then delete the original packets. In combination with the above, one system timed task is started on server No. 1, executing the cleaning program once every minute. Each run randomly draws 40 packets from the data folder and moves them at random into folders 1-10 under the branchdata directory; after the move completes, the 40 packets are deleted from the data directory. This is shown in fig. 3.
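The random draw-and-move of step S2 can be sketched in Python. Illustrative only: the batch size of 40 and the 1-10 subfolder range come from the embodiment; the function name and signature are assumptions.

```python
import random
import shutil
from pathlib import Path

def distribute(data_dir, branch_dir, batch=40):
    """Step S2: draw up to `batch` packets at random from the data folder
    and move each into a randomly chosen subfolder 1-10 of branchdata.
    shutil.move removes the source file, which matches the patent's
    delete-after-move requirement without a separate delete pass."""
    packets = [p for p in Path(data_dir).iterdir() if p.is_file()]
    chosen = random.sample(packets, min(batch, len(packets)))
    for pkt in chosen:
        target = Path(branch_dir) / str(random.randint(1, 10))
        shutil.move(str(pkt), str(target / pkt.name))
    return len(chosen)
```

Run once per minute (e.g. from cron on server No. 1), this disperses the raw packets so that several cleaning tasks can each lock a different subfolder.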
S3: periodically parse each classified sub-packet in turn, then verify and deduplicate the parsed data record by record. In this embodiment, the open-source relational database PostgreSQL is installed on server No. 1; a table is created containing a field unique_key, which is set as a unique key; the table is configured as a temporary table and write-error logging is disabled. Three system timed tasks are started on server No. 1, each executing once per minute. Each run randomly locks one folder among 1-10 in the branchdata directory and then begins decompressing packets; if a packet decompresses to no data or is an error packet (not a data packet), it is deleted directly and the next packet is decompressed. After the JSON (JavaScript Object Notation) data of a decompressed packet is obtained, it is converted into an array and each check is performed step by step; data that does not conform to the preset rules is deleted directly and its count is recorded. The preset rules are designed per field, for example phone-number checking: the phone number must be an 11-digit number whose first digit is 1, whose second digit is 3, 4, 5, 7, or 8, and whose remaining 9 digits are 0-9; the phone number 16034568790 therefore fails the rule. For qualified data, key fields are selected and concatenated into a character string, for example fields A, B, and C; the string is one-way encrypted and inserted into the PostgreSQL table. If the insert succeeds, the record is not a duplicate; it is appended to array A and its count is recorded.
If the insert returns an error, the record is a duplicate and is dropped. Finally, the data in array A is handed to the storage module and the decompressed packet is deleted. This is shown in figs. 4 and 5.
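The phone-number preset rule described above maps naturally to a regular expression. A minimal Python sketch using only the rule as stated (11 digits, first digit 1, second digit 3, 4, 5, 7, or 8, remaining digits 0-9); the function name is an assumption.

```python
import re

# Preset rule from the embodiment: an 11-digit number starting with 1
# whose second digit is 3, 4, 5, 7 or 8.
PHONE_RULE = re.compile(r"1[34578]\d{9}")

def phone_ok(value):
    """Return True if `value` conforms to the preset phone-number rule."""
    return PHONE_RULE.fullmatch(str(value)) is not None
```

The document's own example behaves as expected: 16034568790 fails because its second digit is 6.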
S4: perform distributed storage of the qualified data. Elasticsearch is installed on servers No. 2 and No. 3, and the two servers are deployed together as a distributed cluster. The data to be stored is inserted into Elasticsearch in random batches on server No. 2 or No. 3, and the data can also be pushed to a third-party data interface.
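Elasticsearch's `_bulk` REST endpoint expects an NDJSON body of alternating action and source lines, which is what enables the batch-insert rates the invention relies on. A minimal Python sketch that builds such a body; the index name is illustrative, and the HTTP call itself (e.g. a POST to the cluster's /_bulk URL) is omitted.

```python
import json

def bulk_body(index, docs):
    """Build the NDJSON body for the Elasticsearch _bulk endpoint:
    one action line ({"index": ...}) followed by one source line
    per document, terminated by a trailing newline as _bulk requires."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

Batching many records into one `_bulk` request, rather than one request per record, is the design choice that makes tens of thousands of insertions per second feasible.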
A web service is built on server No. 3. The user enters a key field on the page, a request is then issued to the distributed cluster, the key data is searched in Elasticsearch, and statistics are computed over the data.
During data collection, the data provider pushes 80-100 data packets per minute, each containing 1000-5000 records, which is equivalent to 80,000-500,000 records pushed per minute; the data contains a large amount of non-conforming data and a large amount of duplicate data, and the client requires the data to be searchable in real time.
The above embodiments are provided only to illustrate the invention and not to limit it. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention; all equivalent technical solutions therefore also fall within the scope of the invention, which should be defined by the claims.

Claims (4)

1. A system for cleaning and processing mass data in real time, characterized by comprising:
a data storage module, which integrates source data from all parties into original data packets and stores them;
a data sub-packet distribution module, which periodically classifies the original data packets stored in the data storage module into sub-packets, stores the sub-packets, and then deletes the original packets;
a data cleaning module, which periodically parses each classified sub-packet in turn and then verifies and deduplicates the parsed data record by record; and
a distributed storage module, which stores the qualified data cleaned by the data cleaning module;
the data storage module comprising:
a first folder module, which stores the original data packets; and
a second folder module, which comprises several subfolders storing the classified sub-packets;
wherein parsing a data packet comprises: if the packet does not exist or contains no data, deleting it; otherwise, decompressing it;
and wherein verifying and deduplicating the parsed data record by record comprises the following steps:
checking whether every mandatory field exists; if any is missing, moving on to verify the next record; otherwise, proceeding to the next step;
checking whether every mandatory field conforms to its preset rule; if any does not, moving on to verify the next record; otherwise, proceeding to the next step;
concatenating the several mandatory fields into a character string, encrypting it, and inserting it into a PostgreSQL table; if the insert succeeds, storing the record; if it fails, deleting the record and verifying the next record.
2. The system for cleaning and processing mass data in real time according to claim 1, characterized by further comprising: an internet interaction module, connected to the distributed storage module, for human-machine interaction to perform real-time data statistics and real-time data search.
3. A method for cleaning and processing mass data in real time, characterized by comprising the following steps:
integrating source data from all parties into original data packets and storing them;
periodically classifying the original data packets into sub-packets, storing the sub-packets, and then deleting the original packets;
periodically parsing each classified sub-packet in turn, and then verifying and deduplicating the parsed data record by record; and
performing distributed storage of the qualified data;
wherein the original data packets are stored in one folder, and the classified sub-packets are stored in the subfolders of another folder;
wherein parsing a data packet comprises: if the packet does not exist or contains no data, deleting it; otherwise, decompressing it;
and wherein verifying and deduplicating the parsed data record by record comprises the following steps:
checking whether every mandatory field exists; if any is missing, moving on to verify the next record; otherwise, proceeding to the next step;
checking whether every mandatory field conforms to its preset rule; if any does not, moving on to verify the next record; otherwise, proceeding to the next step;
concatenating the several mandatory fields into a character string, encrypting it, and inserting it into a PostgreSQL table; if the insert succeeds, storing the record; if it fails, deleting the record and verifying the next record.
4. The method for cleaning and processing mass data in real time according to claim 3, characterized in that the method further comprises: entering keywords, and performing real-time statistics and real-time search over the distributed-stored data according to the keywords.
CN201910231734.3A 2019-03-26 2019-03-26 System and method for cleaning and processing mass data in real time Active CN109947757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910231734.3A CN109947757B (en) 2019-03-26 2019-03-26 System and method for cleaning and processing mass data in real time


Publications (2)

Publication Number Publication Date
CN109947757A CN109947757A (en) 2019-06-28
CN109947757B (en) 2023-03-14

Family

ID=67010913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910231734.3A Active CN109947757B (en) 2019-03-26 2019-03-26 System and method for cleaning and processing mass data in real time

Country Status (1)

Country Link
CN (1) CN109947757B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002099574A (en) * 2000-09-26 2002-04-05 Tsutaya Online:Kk Content search system and collection and control system for content search data
CN106709003A (en) * 2016-12-23 2017-05-24 长沙理工大学 Hadoop-based mass log data processing method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hadoop在气象数据存储中的应用 (Application of Hadoop in meteorological data storage); 任晓炜 et al.; 《气象研究与应用》 (Journal of Meteorological Research and Application); 2019-03-15 (No. 01); full text *


Similar Documents

Publication Publication Date Title
US8918363B2 (en) Data processing service
CN102810089B (en) Short link system and implementation method based on content
US20120016901A1 (en) Data Storage and Processing Service
US20160314160A1 (en) Database system and method
CN105069111B (en) Block level data duplicate removal method based on similitude in cloud storage
CN106452450B (en) Method and system for data compression
US10606816B2 (en) Compression-aware partial sort of streaming columnar data
CN107729399B (en) Data processing method and device
CN102460404A (en) Generating obfuscated data
CN103902698A (en) Data storage system and data storage method
CN104021132A (en) Method and system for verification of consistency of backup data of host database and backup database
CN104573124A (en) Education cloud application statistics method based on parallelized association rule algorithm
US10296614B2 (en) Bulk data insertion in analytical databases
CN107515878A (en) The management method and device of a kind of data directory
CN104584524A (en) Aggregating data in a mediation system
EP2556446A1 (en) Columnar storage representations of records
CN110727406A (en) Data storage scheduling method and device
CN106649602A (en) Way, device and server of processing business object data
CN110109874A (en) A kind of non-stop layer distributed document retrieval method based on block chain
CN104881475A (en) Method and system for randomly sampling big data
CN109947757B (en) System and method for cleaning and processing mass data in real time
Du et al. Deduplicated disk image evidence acquisition and forensically-sound reconstruction
CN113419896A (en) Data recovery method and device, electronic equipment and computer readable medium
Srivastava Learning Elasticsearch 7. x: Index, Analyze, Search and Aggregate Your Data Using Elasticsearch (English Edition)
Cormode et al. Time‐decayed correlated aggregates over data streams

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant