CN111143367A

CN111143367A - Big data processing system and method with enhanced preprocessing

Info

Publication number: CN111143367A
Application number: CN201911373572.3A
Authority: CN
Inventors: 黄玉划; 郭柯卿; 蓝天; 王娜
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2019-12-27
Filing date: 2019-12-27
Publication date: 2020-05-12

Abstract

The invention relates to the technical field of computer systems and discloses a preprocessing-enhanced big data processing system and method. An acquisition module selectively collects data from a large number of internet sources, a big data preprocessing module processes the raw data, and an analysis module screens the result, which improves the efficiency of subsequent processing. Finally, the effective data is passed to a storage module for storage. This facilitates later use, increases the data processing speed, and reduces the storage capacity required, because data is screened before it is stored.

Description

Big data processing system and method with enhanced preprocessing

Technical Field

The invention relates to the technical field of computer systems, and in particular to a preprocessing-enhanced big data processing system and method.

Background

With the rapid development of computers and the internet in China, ever more data floods every platform. Electronic data has gradually become a focus of research, and daily life has become inseparable from data of all kinds, so big data is now a research hot spot.

In this era of data explosion, the total amount of data stored on electronic devices worldwide can hardly be estimated. Data generated by machines in the Internet of Things far exceeds data generated by individuals, and the volume of data published on the internet grows year by year; together these produce enormous amounts of data. Users face a common problem: hard disk capacity keeps increasing, but access speed has not kept pace. Reading and writing data is therefore the underlying problem being addressed, whether by the Hadoop Distributed File System (HDFS), which handles hardware failure, or by the MapReduce programming model, which completes an analysis by combining data from many sources in some way.

The main function of a data processing system is to collect relevant business data from multiple external systems and store it together in the system's database. All raw data is stored in a base library of the database after a series of processing, analysis, and format conversion steps inside the system. Finally, the data is converted into corresponding data sets for thematic analysis or for display by upper-layer data application components.

In the traditional data flow, the following modules are generally present: data collection, data storage, data calculation, data analysis, data presentation, and the like. Because existing big data processing systems have numerous data sources and large data volumes, the hardware requirements for data processing remain high, which limits the further adoption of big data technology. The slow speed, low efficiency, and incomplete functionality of traditional processing systems remain to be solved.

Disclosure of Invention

The invention addresses the following problems: existing big data processing systems have many data sources and large data volumes, face reliability and scalability challenges, may need to store massive amounts of data for users, and must cope with a continuously growing data scale. A preprocessing-enhanced big data processing system and method are therefore provided to overcome the incomplete functionality, poor generality, and low efficiency of existing big data processing systems.

Technical solution: the scheme of the present invention mainly comprises the following.

To achieve fast processing, screened storage, and a more complete system, the invention provides the following technical solution: a preprocessing-enhanced big data processing system comprising an acquisition module, wherein the output of the acquisition module is in one-way signal connection with the input of an input module, the output of the input module is in one-way signal connection with the input of a preprocessing module, the output of the preprocessing module is in one-way signal connection with the input of an analysis module, the output of the analysis module is in one-way signal connection with the input of an output module, and the output of the output module is in one-way signal connection with the input of a storage module.

Based on this preprocessing-enhanced big data processing system, a big data processing method is provided, comprising the following steps:

S1: the acquisition module actively collects the required metadata, such as client data, database data, server data, or third-party data, then packages it and transmits it to the input module;

S2: after the acquisition module has packaged and transmitted the data to the input module in S1, the input module actively forwards the data to the preprocessing module. The transmission mode is selected according to the data type: for streaming data, frameworks such as Kafka and Storm are used; for batch data, the MapReduce batch processing model is used;

S3: after receiving the metadata in S2, the preprocessing module preprocesses it through a series of procedures: parsing, decoding, filling, and error correction;

Parsing: when data is received from the input module, a parsing script is run first to convert the transmitted data into XML or JSON format, after which business processing is performed. When the platform sends data downstream, a script converts it into a format that the receiving module can accept before it is issued to the lower-layer module;

Decoding: in a computer network, resource sharing and data transmission take place over the network. When the two connected parties use different signal forms, for example when the signal form of the communication network differs from that of the transmission module, the signal form must be converted. The conversion performed by the receiving party is decoding;

Filling: missing values are encountered frequently during data processing. A simple approach is to fill continuous variables with the median or mean and discrete variables with the mode. Beyond that, machine learning methods such as K-means imputation or Gaussian mixture imputation can be considered;

Error correction: errors are inevitable when data is entered, and data must be supplemented and corrected as time passes and work progresses rapidly. The integrity and accuracy of data are dynamic, so the correctness of the basic data must be maintained. The key is to establish a mechanism that corrects erroneous data as early as possible, namely auditing, correcting, and feeding back;

S4: after the data has been preprocessed in S3, the processed data is sent to the analysis module for analysis, and the useful data is screened out and transmitted to the output module;

S5: after the data has been collected, input, preprocessed, and analyzed in S1 to S4 and transmitted to the output module, the output module actively transmits it to the storage module for storage. If the data is document-type, a MongoDB document database is selected; if the data is structured, a relational database is used; when the data reaches a large scale, HDFS storage is preferred.
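The selection rules in S2 and S5 can be sketched as two small dispatch functions. This is an illustrative sketch only, not the patented implementation: the names `DataKind`, `DataShape`, `choose_transport`, `choose_storage`, and the threshold `LARGE_SCALE_BYTES` are assumptions introduced here, and the source gives no figure for what counts as "a large scale".

```python
from enum import Enum


class DataKind(Enum):
    STREAMING = "streaming"
    BATCH = "batch"


class DataShape(Enum):
    DOCUMENT = "document"
    STRUCTURED = "structured"


# Hypothetical cutoff: the source only says "a large scale" and gives no number.
LARGE_SCALE_BYTES = 10 * 1024**3


def choose_transport(kind: DataKind) -> str:
    """S2: select the transmission mode from the data type."""
    if kind is DataKind.STREAMING:
        return "kafka/storm"  # streaming frameworks named in S2
    return "mapreduce"        # batch processing model named in S2


def choose_storage(shape: DataShape, size_bytes: int) -> str:
    """S5: select the storage backend from the data format and volume."""
    # Volume is checked first because S5 says HDFS is preferred at large scale.
    if size_bytes >= LARGE_SCALE_BYTES:
        return "hdfs"
    if shape is DataShape.DOCUMENT:
        return "mongodb"
    return "relational"
```

The functions return plain labels rather than client objects so that the routing decision stays separate from any particular Kafka, MongoDB, or HDFS client library.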

The beneficial effects of the invention are as follows. The invention is a computer network system. The acquisition module selectively collects data from a large number of internet sources; the big data preprocessing module applies a series of procedures to the raw data, namely parsing, decoding, filling, and error correction; and the analysis module refines and extracts the data, which reduces the space it occupies and improves the efficiency of subsequent processing. Finally, the effective data is passed to the storage module for storage. This facilitates later use, increases the data processing speed, and reduces the storage capacity required, because data is screened before it is stored.

Preferably, the preprocessing module is divided into four parts: parsing, decoding, filling, and error correction.

Preferably, the preprocessing module receives the user behavior big data collected by the big data acquisition module.
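A minimal sketch of the four preprocessing parts, using only the Python standard library, may make the division concrete. Everything here is an assumption made for illustration: the function names `decode`, `parse`, `fill`, and `correct` are hypothetical, decoding is modeled as byte-to-text conversion, parsing handles JSON only, missing values are represented as `None`, and only the simple median and mode filling is shown (not K-means or Gaussian mixture imputation). In this sketch bytes are decoded before they are parsed, since JSON parsing needs text.

```python
import json
from collections import Counter
from statistics import median


def decode(raw: bytes, encoding: str = "utf-8") -> str:
    """Decoding: convert the received byte form into text."""
    return raw.decode(encoding)


def parse(text: str) -> dict:
    """Parsing: convert the transmitted text into a JSON object."""
    return json.loads(text)


def fill(records, continuous, discrete):
    """Filling: median for continuous fields, mode for discrete fields."""
    filled = [dict(r) for r in records]  # copy so the input is not mutated
    for field in continuous:
        present = [r[field] for r in filled if r.get(field) is not None]
        if present:
            default = median(present)
            for r in filled:
                if r.get(field) is None:
                    r[field] = default
    for field in discrete:
        present = [r[field] for r in filled if r.get(field) is not None]
        if present:
            default = Counter(present).most_common(1)[0][0]
            for r in filled:
                if r.get(field) is None:
                    r[field] = default
    return filled


def correct(records, rules):
    """Error correction as audit, correct, feed back.

    `rules` maps a field name to a pair (is_valid, repair). The returned
    report lists (record index, field, old value, new value) as feedback.
    """
    fixed, report = [], []
    for index, record in enumerate(records):
        record = dict(record)
        for field, (is_valid, repair) in rules.items():
            if field in record and not is_valid(record[field]):  # audit
                old = record[field]
                record[field] = repair(old)                      # correct
                report.append((index, field, old, record[field]))  # feed back
        fixed.append(record)
    return fixed, report
```

A possible call sequence, again with made-up field names:

```python
raw = [b'{"age": 20, "city": "A"}', b'{"age": null, "city": "A"}']
records = [parse(decode(item)) for item in raw]
records = fill(records, continuous=["age"], discrete=["city"])
records, report = correct(records, {"age": (lambda v: v >= 0, abs)})
```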

Description of the Drawings

FIG. 1 is a system framework diagram of the present invention.

Detailed Description

The invention is described in detail below with reference to the figures and the examples.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, a preprocessing-enhanced big data processing system includes an acquisition module. The output of the acquisition module is in one-way signal connection with the input of an input module, the output of the input module is in one-way signal connection with the input of a preprocessing module, the output of the preprocessing module is in one-way signal connection with the input of an analysis module, the output of the analysis module is in one-way signal connection with the input of an output module, and the output of the output module is in one-way signal connection with the input of a storage module.
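The one-way chain of six modules can be sketched as a list of functions in which each stage receives only the output of the stage before it. This is a hedged illustration rather than the implementation shown in FIG. 1: the stage bodies, the sample records, and the names `run`, `make_storage`, and `build_pipeline` are all hypothetical placeholders.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Stage = Callable[[List[Record]], List[Record]]


def acquisition(_: List[Record]) -> List[Record]:
    # Hypothetical sample standing in for client, server, or third-party data.
    return [
        {"user": "a", "clicks": 3},
        {"user": "b", "clicks": None},
        {"user": "c", "clicks": 0},
    ]


def input_module(records: List[Record]) -> List[Record]:
    return list(records)  # forwards data unchanged to preprocessing


def preprocessing(records: List[Record]) -> List[Record]:
    # Placeholder for parsing, decoding, filling, and error correction:
    # here a missing click count is simply filled with 0.
    return [{**r, "clicks": 0 if r["clicks"] is None else r["clicks"]} for r in records]


def analysis(records: List[Record]) -> List[Record]:
    # Screening: keep only records considered useful (assumed: clicks > 0).
    return [r for r in records if r["clicks"] > 0]


def output_module(records: List[Record]) -> List[Record]:
    return records  # hands the screened data to storage


def make_storage(store: List[Record]) -> Stage:
    def storage(records: List[Record]) -> List[Record]:
        store.extend(records)  # only screened data is stored
        return records
    return storage


def build_pipeline(store: List[Record]) -> List[Stage]:
    return [acquisition, input_module, preprocessing, analysis,
            output_module, make_storage(store)]


def run(pipeline: List[Stage]) -> List[Record]:
    data: List[Record] = []
    for stage in pipeline:  # one-way: output of one stage feeds the next
        data = stage(data)
    return data
```

Because the analysis stage runs before storage, only one of the three sample records reaches the store in this sketch, which mirrors the claim that screening reduces the capacity needed for stored data.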

In use, the invention operates as a computer network system. The acquisition module selectively collects data from a large number of internet sources; the big data preprocessing module applies a series of procedures to the raw data, namely parsing, decoding, filling, and error correction; and the analysis module refines and extracts the data, which reduces the space it occupies and improves the efficiency of subsequent processing. Finally, the effective data is passed to the storage module for storage. This facilitates later use, increases the data processing speed, and reduces the storage capacity required, because data is screened before it is stored.

The above describes only the preferred embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to its technical solution and inventive concept, shall fall within the protection scope of the present invention.

Claims

1. A preprocessing-enhanced big data processing system comprising an acquisition module, characterized in that the output of the acquisition module is in one-way signal connection with the input of an input module, the output of the input module is in one-way signal connection with the input of a preprocessing module, the output of the preprocessing module is in one-way signal connection with the input of an analysis module, the output of the analysis module is in one-way signal connection with the input of an output module, and the output of the output module is in one-way signal connection with the input of a storage module.

2. A big data processing method based on the preprocessing-enhanced big data processing system according to claim 1, characterized by comprising the following steps:

3. The preprocessing-enhanced big data processing system and method according to claim 1, wherein the preprocessing module is divided into four parts: parsing, decoding, filling, and error correction.

4. The preprocessing-enhanced big data processing system and method according to claim 1, wherein the preprocessing module receives the user behavior big data collected by the big data acquisition module.