CN111143367A - Big data processing system and method with enhanced preprocessing - Google Patents

Publication number
CN111143367A
Authority
CN
China
Prior art keywords
data
module
input
preprocessing
analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911373572.3A
Other languages
Chinese (zh)
Inventor
黄玉划
郭柯卿
蓝天
王娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-12-27
Filing date
2019-12-27
Publication date
2020-05-12
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN201911373572.3A
Publication of CN111143367A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of computer systems and discloses a preprocessing-enhanced big data processing system and method. The acquisition module selectively collects data from the internet, the big data preprocessing module processes the raw data, and the analysis module refines it to improve subsequent processing efficiency. Finally, the valid data is passed to the storage module for storage. This facilitates later use, increases data processing speed, and, through screening before storage, reduces the capacity required to store the data.

Description

Big data processing system and method with enhanced preprocessing
Technical Field
The invention relates to the technical field of computer systems, in particular to a preprocessing enhanced big data processing system and a preprocessing enhanced big data processing method.
Background
With the rapid development of computers and the internet in China, every platform is flooded with more and more data. Electronic data has gradually become a focus of research, and daily life is inseparable from data of all kinds, so big data has become a current research hotspot.
In this era of data explosion, the amount of data stored on electronic devices worldwide is beyond estimation. Data generated by machines in the Internet of Things far exceeds data generated by individuals, and the data published on the internet grows year by year, all of which produces enormous volumes of data. The problem everyone faces is similar: hard disk storage capacity keeps increasing, but access speed has not kept pace. Both the Hadoop file system HDFS, which addresses hardware failure, and the MapReduce programming model, which completes an analysis by combining data in a certain way, are aimed at this problem of reading and writing data.
The main function of a data processing system is to collect relevant service data from a number of external systems and store it together in the system's database. All raw data is stored in a base library of the database after a series of processing, analysis and format conversion steps inside the system. Finally, a series of data conversions produces the corresponding data sets for thematic analysis or for display by upper-layer data application components.
According to the traditional data processing flow, a system generally has the following modules: data collection, data storage, data calculation, data analysis, data presentation, and so on. Existing big data processing systems have numerous data sources and large data volumes, so the hardware requirements for data processing remain high, which limits further adoption of big data technology. The slow speed, low efficiency and incomplete functionality of traditional processing systems still need to be addressed.
Disclosure of Invention
The invention addresses the following problems: existing big data processing systems have many data sources and large data volumes, face reliability and scalability challenges, may have to store massive amounts of data for users, and see the data scale continuously increasing. A preprocessing-enhanced big data processing system and method are therefore provided to solve the incomplete functionality, poor generality and low efficiency of existing big data processing systems.
Technical solution: the scheme of the present invention mainly includes the following.
To achieve high processing speed, screening before storage, and a more complete system, the invention provides the following technical solution: a preprocessing-enhanced big data processing system comprising an acquisition module, wherein the output of the acquisition module is in one-way signal connection with the input of an input module, the output of the input module is in one-way signal connection with the input of a preprocessing module, the output of the preprocessing module is in one-way signal connection with the input of an analysis module, the output of the analysis module is in one-way signal connection with the input of an output module, and the output of the output module is in one-way signal connection with the input of a storage module.
Based on the preprocessing-enhanced big data processing system, a big data processing method is provided, comprising the following steps:
S1: the acquisition module actively collects the required metadata, such as client data, database data, server data or third-party data, then packs it and transmits it to the input module;
S2: after the acquisition module has packed and transmitted the data to the input module in S1, the input module actively transmits the data to the preprocessing module for preprocessing. The transmission mode is selected according to the type of data: for streaming data, frameworks such as Kafka and Storm are used; for batch data, the MapReduce batch processing model is used;
S3: after the preprocessing module receives the metadata in S2, it preprocesses the data through a series of procedures such as parsing, decoding, filling and error correction;
Parsing: when data is received from the input module, a parsing script is run first to convert the transmitted data into XML or JSON format, and service processing is then performed; when the platform issues data, a script converts the data into a format the receiving module can accept, and the data is then issued to the lower-layer module;
Decoding: in a computer network, resource sharing and data transmission take place over the network, so when the signal forms of the two connected parties differ, for example when the signal form of the communication network in use differs from that of the transmission module, the signal form must be converted; the conversion of the signal form by the receiving party is decoding;
Filling: missing values are often encountered when processing data. A simple approach is to fill continuous variables with the median or mean and discrete variables with the mode; beyond that, more advanced methods such as K-means imputation or Gaussian mixture imputation can be considered;
Error correction: errors are inevitable when data is entered, and data needs to be supplemented and corrected as time passes and work progresses. The completeness and accuracy of data are dynamic, so to keep the basic data correct the key is to establish a mechanism for correcting erroneous data as soon as possible, namely auditing, correcting and feeding back;
S4: after the data has been preprocessed in S3, the processed data is sent to the analysis module for analysis, and the useful data is screened out and transmitted to the output module;
S5: after the data has been collected, input, preprocessed and analyzed in S1, S2, S3 and S4 and transmitted to the output module, the output module actively transmits it to the storage module for storage. If the data format is a document type, the MongoDB document database is selected; if the data is structured, a relational database is used; when the data reaches a large scale, HDFS storage is preferred.
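The two selection rules in S2 and S5 can be sketched as plain decision functions. This is only an illustrative sketch: the function names, the string labels and the `LARGE_SCALE_BYTES` threshold are assumptions made here, since the source does not say what counts as "large scale" or how the choice is implemented.

```python
# Hypothetical threshold: the source gives no number for "large scale".
LARGE_SCALE_BYTES = 10 * 1024**3


def choose_transport(data_kind: str) -> str:
    """S2: pick a transmission mode from the kind of data."""
    if data_kind == "stream":
        return "kafka/storm"   # streaming frameworks named in S2
    if data_kind == "batch":
        return "mapreduce"     # batch processing model named in S2
    raise ValueError(f"unknown data kind: {data_kind!r}")


def choose_storage(data_format: str, size_bytes: int) -> str:
    """S5: pick a storage backend from the data format and scale."""
    if size_bytes >= LARGE_SCALE_BYTES:
        return "hdfs"          # large-scale data prefers HDFS
    if data_format == "document":
        return "mongodb"       # document-type data
    if data_format == "structured":
        return "relational"    # structured data
    raise ValueError(f"unknown data format: {data_format!r}")


transport = choose_transport("stream")
small_doc_store = choose_storage("document", 1024)
huge_table_store = choose_storage("structured", 20 * 1024**3)
```

Checking the size first reflects the wording of S5, where HDFS is preferred once the data is large regardless of its format; a real system would likely weigh further factors.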
Beneficial effects: the invention is a computer network system. The acquisition module selectively collects data from the internet; the big data preprocessing module applies a series of procedures such as parsing, decoding, filling and error correction to the raw data; and the analysis module refines and extracts it, reducing the space the data occupies and improving subsequent processing efficiency. Finally, the valid data is passed to the storage module for storage, which facilitates later use, increases data processing speed, and, through screening before storage, reduces the required storage capacity.
As an optimization, the preprocessing module is divided into four parts: parsing, decoding, filling and error correction.
As a further optimization, the preprocessing module receives the user behavior big data collected by the big data acquisition module.
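The four-part division of the preprocessing module can be pictured as four small functions composed into one call. The sketch below is an assumption-laden illustration, not the patented implementation: the UTF-8 encoding, the JSON wire format, the default values and the whitespace-stripping correction rule are all hypothetical, and the sketch runs decoding before parsing simply because bytes must become text before they can be parsed.

```python
import json


def decode(raw: bytes) -> str:
    # Decoding: convert the received byte form into text (UTF-8 assumed).
    return raw.decode("utf-8")


def parse(text: str) -> dict:
    # Parsing: turn the transmitted text into a JSON object.
    return json.loads(text)


def fill(record: dict, defaults: dict) -> dict:
    # Filling: supply a default for every field that is absent or None.
    missing = {k: v for k, v in defaults.items() if record.get(k) is None}
    return {**record, **missing}


def correct(record: dict) -> dict:
    # Error correction: one hypothetical rule, strip stray whitespace
    # from string fields.
    return {k: v.strip() if isinstance(v, str) else v
            for k, v in record.items()}


def preprocess(raw: bytes, defaults: dict) -> dict:
    # Run the four stages in sequence on one incoming message.
    return correct(fill(parse(decode(raw)), defaults))


result = preprocess(b'{"user": " alice ", "age": null}',
                    {"age": 30, "city": "unknown"})
```

Keeping each stage as a separate function means a stage can be replaced, for example with a different filling strategy, without touching the others.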
Description of the Drawings
FIG. 1 is a system framework diagram of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and the examples.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a big data processing system with enhanced preprocessing includes an acquisition module, an output end of the acquisition module is in one-way signal connection with an input end of an input module, an output end of the input module is in one-way signal connection with an input end of a preprocessing module, an output end of the preprocessing module is in one-way signal connection with an input end of an analysis module, an output end of the analysis module is in one-way signal connection with an input end of an output module, and an output end of the output module is in one-way signal connection with an input end of a storage module.
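The one-way chain of six modules can be sketched as an ordered list of functions, each consuming only the output of the module before it. Everything concrete here is a placeholder assumed for illustration: the sample records, the zero-fill in `preprocess`, the `value > 0` screening rule and the in-memory `STORE` list stand in for behavior the source leaves unspecified.

```python
STORE = []  # stands in for the storage module's backing store


def acquire(_):
    # Acquisition module: hypothetical sample records.
    return [{"id": 1, "value": 10},
            {"id": 2, "value": None},
            {"id": 3, "value": -5}]


def input_module(records):
    # Input module: forwards the packed data unchanged.
    return list(records)


def preprocess(records):
    # Preprocessing module: placeholder fill of missing values with 0.
    return [{**r, "value": 0 if r["value"] is None else r["value"]}
            for r in records]


def analyse(records):
    # Analysis module: placeholder screening rule, keep positive values.
    return [r for r in records if r["value"] > 0]


def output_module(records):
    # Output module: hands the screened data on for storage.
    return records


def storage(records):
    # Storage module: persist the records.
    STORE.extend(records)
    return records


PIPELINE = [acquire, input_module, preprocess, analyse, output_module, storage]


def run(pipeline):
    # Data flows strictly forward: each module sees only its predecessor's output.
    data = None
    for module in pipeline:
        data = module(data)
    return data


stored = run(PIPELINE)
```

Because no module holds a reference to an earlier one, nothing can flow backwards, which is one simple way to model the one-way signal connections.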
Based on the preprocessing-enhanced big data processing system, a big data processing method is provided, comprising the following steps:
S1: the acquisition module actively collects the required metadata, such as client data, database data, server data or third-party data, then packs it and transmits it to the input module;
S2: after the acquisition module has packed and transmitted the data to the input module in S1, the input module actively transmits the data to the preprocessing module for preprocessing. The transmission mode is selected according to the type of data: for streaming data, frameworks such as Kafka and Storm are used; for batch data, the MapReduce batch processing model is used;
S3: after the preprocessing module receives the metadata in S2, it preprocesses the data through a series of procedures such as parsing, decoding, filling and error correction;
Parsing: when data is received from the input module, a parsing script is run first to convert the transmitted data into XML or JSON format, and service processing is then performed; when the platform issues data, a script converts the data into a format the receiving module can accept, and the data is then issued to the lower-layer module;
Decoding: in a computer network, resource sharing and data transmission take place over the network, so when the signal forms of the two connected parties differ, for example when the signal form of the communication network in use differs from that of the transmission module, the signal form must be converted; the conversion of the signal form by the receiving party is decoding;
Filling: missing values are often encountered when processing data. A simple approach is to fill continuous variables with the median or mean and discrete variables with the mode; beyond that, more advanced methods such as K-means imputation or Gaussian mixture imputation can be considered;
Error correction: errors are inevitable when data is entered, and data needs to be supplemented and corrected as time passes and work progresses. The completeness and accuracy of data are dynamic, so to keep the basic data correct the key is to establish a mechanism for correcting erroneous data as soon as possible, namely auditing, correcting and feeding back;
S4: after the data has been preprocessed in S3, the processed data is sent to the analysis module for analysis, and the useful data is screened out and transmitted to the output module;
S5: after the data has been collected, input, preprocessed and analyzed in S1, S2, S3 and S4 and transmitted to the output module, the output module actively transmits it to the storage module for storage. If the data format is a document type, the MongoDB document database is selected; if the data is structured, a relational database is used; when the data reaches a large scale, HDFS storage is preferred.
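The simple filling strategy described under S3 (median for continuous variables, mode for discrete ones) can be sketched with the Python standard library. The function name `fill_missing`, the `kind` labels and the use of `None` to mark a missing value are assumptions of this sketch; the source does not prescribe a representation.

```python
from statistics import median, mode


def fill_missing(values, kind):
    """Replace None entries: median for continuous data, mode for discrete."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to estimate from, leave unchanged
    fill = median(present) if kind == "continuous" else mode(present)
    return [fill if v is None else v for v in values]


ages = fill_missing([21, None, 35, 28, None], "continuous")
cities = fill_missing(["Nanjing", None, "Beijing", "Nanjing"], "discrete")
```

The mean could be substituted for the median by swapping in `statistics.mean`; the median is used here because it is less sensitive to outliers.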
In use, the invention is a computer network system. The acquisition module selectively collects data from the internet; the big data preprocessing module applies a series of procedures such as parsing, decoding, filling and error correction to the raw data; and the analysis module refines and extracts it, reducing the space the data occupies and improving subsequent processing efficiency. Finally, the valid data is passed to the storage module for storage, which facilitates later use, increases data processing speed, and, through screening before storage, reduces the required storage capacity.
The above is only a preferred embodiment of the present invention, and the scope of protection of the invention is not limited to it. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the invention, according to its technical solution and inventive concept, shall fall within the scope of protection of the invention.

Claims (4)

1. A preprocessing-enhanced big data processing system comprising an acquisition module, characterized in that the output of the acquisition module is in one-way signal connection with the input of an input module, the output of the input module is in one-way signal connection with the input of a preprocessing module, the output of the preprocessing module is in one-way signal connection with the input of an analysis module, the output of the analysis module is in one-way signal connection with the input of an output module, and the output of the output module is in one-way signal connection with the input of a storage module.
2. A big data processing method based on the preprocessing-enhanced big data processing system according to claim 1, characterized by comprising the following steps:
S1: the acquisition module actively collects the required metadata, such as client data, database data, server data or third-party data, then packs it and transmits it to the input module;
S2: after the acquisition module has packed and transmitted the data to the input module in S1, the input module actively transmits the data to the preprocessing module for preprocessing. The transmission mode is selected according to the type of data: for streaming data, frameworks such as Kafka and Storm are used; for batch data, the MapReduce batch processing model is used;
S3: after the preprocessing module receives the metadata in S2, it preprocesses the data through a series of procedures such as parsing, decoding, filling and error correction;
Parsing: when data is received from the input module, a parsing script is run first to convert the transmitted data into XML or JSON format, and service processing is then performed; when the platform issues data, a script converts the data into a format the receiving module can accept, and the data is then issued to the lower-layer module;
Decoding: in a computer network, resource sharing and data transmission take place over the network, so when the signal forms of the two connected parties differ, for example when the signal form of the communication network in use differs from that of the transmission module, the signal form must be converted; the conversion of the signal form by the receiving party is decoding;
Filling: missing values are often encountered when processing data. A simple approach is to fill continuous variables with the median or mean and discrete variables with the mode; beyond that, more advanced methods such as K-means imputation or Gaussian mixture imputation can be considered;
Error correction: errors are inevitable when data is entered, and data needs to be supplemented and corrected as time passes and work progresses. The completeness and accuracy of data are dynamic, so to keep the basic data correct the key is to establish a mechanism for correcting erroneous data as soon as possible, namely auditing, correcting and feeding back;
S4: after the data has been preprocessed in S3, the processed data is sent to the analysis module for analysis, and the useful data is screened out and transmitted to the output module;
S5: after the data has been collected, input, preprocessed and analyzed in S1, S2, S3 and S4 and transmitted to the output module, the output module actively transmits it to the storage module for storage. If the data format is a document type, the MongoDB document database is selected; if the data is structured, a relational database is used; when the data reaches a large scale, HDFS storage is preferred.
3. The preprocessing-enhanced big data processing system and method according to claim 1, wherein the preprocessing module is divided into four parts: parsing, decoding, filling and error correction.
4. The preprocessing-enhanced big data processing system and method according to claim 1, wherein the preprocessing module receives the user behavior big data collected by the big data acquisition module.
CN201911373572.3A 2019-12-27 2019-12-27 Big data processing system and method with enhanced preprocessing Pending CN111143367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911373572.3A CN111143367A (en) 2019-12-27 2019-12-27 Big data processing system and method with enhanced preprocessing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911373572.3A CN111143367A (en) 2019-12-27 2019-12-27 Big data processing system and method with enhanced preprocessing

Publications (1)

Publication Number Publication Date
CN111143367A true CN111143367A (en) 2020-05-12

Family

ID=70521239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911373572.3A Pending CN111143367A (en) 2019-12-27 2019-12-27 Big data processing system and method with enhanced preprocessing

Country Status (1)

Country Link
CN (1) CN111143367A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449876A (en) * 2021-06-11 2021-09-28 北京四维图新科技股份有限公司 Processing method, system and storage medium for deep learning training data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354876A (en) * 2016-09-22 2017-01-25 珠海格力电器股份有限公司 Data processing system and method
CN106815338A (en) * 2016-12-25 2017-06-09 北京中海投资管理有限公司 A kind of real-time storage of big data, treatment and inquiry system
CN109271566A (en) * 2018-10-17 2019-01-25 穆逸诚 A kind of computer data processing system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
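Wei Zhiyong et al.: "Research and Implementation of Real-Time Preprocessing of User Behavior Record Data for Recommender Systems" (韦智勇 等: "面向推荐系统的用户行为记录数据实时预处理研究与实现") *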

Similar Documents

Publication Publication Date Title
CN111444236B (en) Mobile terminal user portrait construction method and system based on big data
CN110347719B (en) Enterprise foreign trade risk early warning method and system based on big data
CN111400326B (en) Smart city data management system and method thereof
CN106709012A (en) Method and device for analyzing big data
CN109753502B (en) Data acquisition method based on NiFi
CN103838867A (en) Log processing method and device
CN110362544A (en) Log processing system, log processing method, terminal and storage medium
CN111666490A (en) Information pushing method, device, equipment and storage medium based on kafka
CN108038207A (en) A kind of daily record data processing system, method and server
CN104199879A (en) Data processing method and device
CN111949850B (en) Multi-source data acquisition method, device, equipment and storage medium
CN111352903A (en) Log management platform, log management method, medium, and electronic device
CN112948492A (en) Data processing system, method and device, electronic equipment and storage medium
CN109246219A (en) A kind of working method and system of IoT data collection system
AU2017254506A1 (en) Method, apparatus, computing device and storage medium for data analyzing and processing
CN106844550B (en) Virtualization platform operation recommendation method and device
CN113362118A (en) User electricity consumption behavior analysis method and system based on random forest
CN111143367A (en) Big data processing system and method with enhanced preprocessing
CN106557483B (en) Data processing method, data query method, data processing equipment and data query equipment
CN113010542A (en) Service data processing method and device, computer equipment and storage medium
CN112214494B (en) Retrieval method and device
CN114371884A (en) Method, device, equipment and storage medium for processing Flink calculation task
CN111581254A (en) ETL method and system based on internet financial data
CN112612823A (en) Big data time sequence analysis method based on fusion of Pyspark and Pandas
Hwang A Study on Big Data Platform Architecture-based Conceptual Measurement Model Using Comparative Analysis for Social Commerce

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200512
