CN105512201A - Data collection and processing method and device - Google Patents

Data collection and processing method and device

Info

Publication number
CN105512201A
Authority
CN
China
Prior art keywords
data
collection
node
layout
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510843845.1A
Other languages
Chinese (zh)
Inventor
汤奇峰
汤余
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Original Assignee
ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZAMPLUS ADVERTISING (SHANGHAI) CO Ltd
Priority to CN201510843845.1A
Publication of CN105512201A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications

Abstract

The invention provides a data collection and processing method and device. The method comprises the following steps: defining a data format based on sample data and the format of the sample data; collecting first data; filtering and cleaning the first data according to the data format to eliminate dirty data and obtain second data; storing the second data; and extracting the second data and processing it according to business logic. The method and device have the following beneficial effects: because the data format is defined in advance and the collected data is processed according to that format, the scope of application of data collection is widened; and because the first data is filtered and cleaned to eliminate dirty data, the validity of data collection is improved and data processing efficiency is increased.

Description

Data collection and processing method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a data collection and processing method and device.
Background
In the field of computer technology, analyzing and predicting users' network behavior requires that users' access data be saved in the form of log files. The data in log files is diverse and complex, and building a system that can support all kinds of log data is extremely complicated. In the big data field, the first task is to provide a unified data analysis platform that lets users bring their various data into the platform in some way.
In the prior art, data is collected in the following ways: a web server saves log data to generate log files, which are then uploaded to the Hadoop Distributed File System (HDFS), a highly fault-tolerant system; alternatively, data is pushed as a stream to a message queue, such as that of the distributed publish-subscribe messaging system Kafka, and then uploaded from the message queue by consumers. In different network environments, the production network is usually isolated from the network where the data service resides. Cross-network data transfer is usually enabled through passwordless login over the Secure Shell (SSH) protocol, after which data is transferred with tools such as the remote synchronization tool rsync or the remote file copy tool scp on Linux; alternatively, an HTTP or FTP server is set up and the client uploads files to it.
However, prior-art data collection methods pick up dirty data and junk data during collection, which reduces the validity and accuracy of data collection; moreover, they target a single data type and therefore have a narrow scope of application.
Summary of the invention
The technical problem solved by the present invention is how to improve the validity of data collection and widen its scope of application.
To solve the above technical problem, an embodiment of the present invention provides a data collection and processing method, comprising:
defining a data format according to sample data and the format of the sample data;
collecting first data;
filtering and cleaning the first data according to the data format to reject dirty data and obtain second data;
storing the second data;
extracting the second data and processing it according to business logic.
Optionally, after extracting the second data and processing it according to the business logic, the method further comprises: converting the second data; and reading the converted second data.
Optionally, the converted second data is read in at least one of an HDFS mode and an OLAP mode.
Optionally, collecting the first data according to the data format comprises one or more of the following: collecting the first data, according to the data format, from an uploaded user data file; pulling the user data file from an uploaded FTP address and collecting the first data according to the data format; receiving user data based on Java Database Connectivity (JDBC), Open Database Connectivity (ODBC), the Hypertext Transfer Protocol (HTTP), Thrift, or a remote procedure call (RPC) protocol, and collecting the first data according to the data format.
Optionally, filtering and cleaning the first data to reject dirty data comprises: filtering, modifying, or deleting the first data by converting the timestamp format, converting the character type, normalizing the Host field, or regular-expression matching.
Optionally, storing the second data comprises: encapsulating the second data as events, and storing the events by in-memory storage, file storage, database storage, or Kafka storage.
Optionally, the second data is transmitted and stored as a data stream or at file level.
To solve the above technical problem, an embodiment of the present invention further discloses a data collection and processing device, comprising a data collection node, at least one data processing node, and a consumption node; the data collection node comprises:
a data format definition unit, adapted to define a data format according to uploaded sample data and the format of the sample data;
a data acquisition unit, adapted to collect first data;
a cleaning unit, adapted to filter and clean the first data according to the data format, rejecting dirty data to obtain second data;
a storage unit, adapted to store the second data;
a processing unit, adapted to extract the second data and process it according to business logic.
Optionally, the data processing node is adapted to convert the second data, and the consumption node is adapted to read the converted second data.
Optionally, the data processing nodes are connected in parallel or in series.
Optionally, the consumption node reads the converted second data in at least one of an HDFS mode and an OLAP mode.
Compared with the prior art, the technical solution of the embodiments of the present invention has the following beneficial effects:
In the embodiments of the present invention, a data format is defined according to sample data and the format of the sample data; first data is collected according to the data format; the first data is filtered and cleaned to reject dirty data and obtain second data; and after the second data is stored, it is extracted and processed according to business logic. Because the data format is defined first and the collected data is then processed according to that format, the data collection node, data processing node, and consumption node are easy to extend, easy to monitor, and easy to adjust when problems occur, which widens the scope of application of data collection. Filtering and cleaning the first data to reject dirty data improves the validity of data collection and the convenience of cross-network data transmission.
Further, the second data is transmitted and stored as a data stream or at file level. The file-level mode allows source files to be kept in their original format without disturbing their data, which ensures data accuracy; the data-stream mode allows the nodes for data collection and processing to be organized, configured, and deployed flexibly, which widens the scope of application of data collection.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of a data collection node according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a data processing node according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a consumption node according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of another consumption node according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a data collection and processing device according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of another data collection and processing device according to an embodiment of the present invention;
Fig. 7 is a flowchart of a data collection and processing method according to an embodiment of the present invention.
Detailed description
As described in the background, prior-art data collection methods pick up dirty data and junk data during collection, which reduces the validity and accuracy of data collection; moreover, they target a single data type and therefore have a narrow scope of application.
In the embodiments of the present invention, a data format is defined according to sample data and the format of the sample data; first data is collected according to the data format; the first data is filtered and cleaned to reject dirty data and obtain second data; and after the second data is stored, it is extracted and processed according to business logic. Because the data format is defined first and the collected data is then processed according to that format, the data collection node, data processing node, and consumption node are easy to extend, easy to monitor, and easy to adjust when problems occur, which widens the scope of application of data collection. Filtering and cleaning the first data to reject dirty data improves the validity of data collection and the convenience of cross-network data transmission.
To make the above objects, features, and advantages of the present invention clearer, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of a data collection node according to an embodiment of the present invention.
Referring to Fig. 1, the data collection node 10 comprises:
A data format definition unit 101, adapted to define a data format according to uploaded sample data and the format of the sample data.
In this embodiment, the client uploads the sample data and the format corresponding to the data through a Hypertext Transfer Protocol (HTTP) interface. The data format definition unit 101 defines the data format and returns it to the client for confirmation; once the client confirms, the definition of the data format is complete.
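The define-then-confirm exchange described above can be illustrated with a minimal Python sketch. The `DataFormat` class, the `define_format` and `confirm` functions, the tab delimiter, and the type-guessing rule are all assumptions made for illustration; the patent does not specify how the format is represented.

```python
from dataclasses import dataclass


@dataclass
class DataFormat:
    """Hypothetical representation of a defined data format."""
    name: str
    delimiter: str
    fields: list            # list of (field_name, type_name) pairs
    confirmed: bool = False


def _guess_type(value: str) -> str:
    # Try the narrowest type first; fall back to plain text.
    for type_name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return type_name
        except ValueError:
            continue
    return "str"


def define_format(name, sample_line, field_names, delimiter="\t"):
    """Derive a format from one sample record plus its declared field names."""
    values = sample_line.rstrip("\n").split(delimiter)
    if len(values) != len(field_names):
        raise ValueError("sample does not match the declared field list")
    fields = [(f, _guess_type(v)) for f, v in zip(field_names, values)]
    return DataFormat(name, delimiter, fields)


def confirm(fmt: DataFormat) -> DataFormat:
    """Stand-in for the client's confirmation of the returned format."""
    fmt.confirmed = True
    return fmt
```

In this sketch the format only becomes usable once `confirm` has been called, mirroring the round trip to the client.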
A data acquisition unit 102, adapted to collect first data.
In this embodiment, the client uploads a data file, and the first data is collected from the uploaded user data file according to the data format.
In this embodiment, data may also be collected by providing a corresponding File Transfer Protocol (FTP) address: the user data file is pulled from the FTP address, and the first data is collected according to the data format. When an FTP address is given, the data acquisition unit 102 starts a background thread to pull the file from the FTP address. Data may further be collected through an external interface that uses the Hypertext Transfer Protocol (HTTP) or a remote procedure call (RPC) protocol to receive data sent by other systems or devices, with the first data collected according to the data format.
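As a rough illustration of pulling a file on a background thread, the sketch below hands lines to a queue as they arrive. The `fetch_lines` callable is injected so that the example runs without a network; `fake_fetch` and the address are hypothetical stand-ins, and a real deployment might use an FTP client library in their place.

```python
import queue
import threading


def start_background_pull(fetch_lines, address, out_queue):
    """Pull a remote file on a background thread and queue its lines."""
    def worker():
        for line in fetch_lines(address):
            out_queue.put(line)
        out_queue.put(None)  # sentinel: the pull has finished

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def fake_fetch(address):
    # Hypothetical stand-in for an FTP download; yields two sample records.
    yield from ["u1\t3", "u2\t5"]
```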
It can be understood that data may also be collected based on protocols such as Java Database Connectivity (JDBC), Open Database Connectivity (ODBC), and Thrift; data may also be collected in real time based on a data stream, so that data collection, processing, and consumption all take place in real time. The embodiments of the present invention do not limit this.
A cleaning unit 103, adapted to filter and clean the first data according to the data format, rejecting dirty data to obtain second data.
In this embodiment, after the data acquisition unit 102 collects the first data, the cleaning unit 103 filters and cleans the data through filters and interceptors according to the data format defined in the configuration file. The cleaning unit 103 filters through interceptors (Interceptor): an interceptor intercepts the first data during transmission and applies the required business logic. The first data may also be filtered, modified, or deleted by converting the timestamp format, converting the character type, normalizing the Host field, or regular-expression matching, so that it conforms to the data format defined by the data format definition unit 101. This yields the second data and achieves dynamic recognition and collection of data.
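The four cleaning operations named above (timestamp conversion, character type conversion, Host normalization, and regular-expression matching) might look like the following minimal Python sketch. The field names `time`, `host`, and `status`, the timestamp pattern, the UTC assumption, and the host-validity regex are illustrative assumptions, not details from the patent.

```python
import re
from datetime import datetime, timezone

# Assumed validity rule for a normalized host name.
HOST_RE = re.compile(r"^[a-z0-9.-]+$")


def to_epoch(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Timestamp conversion: formatted text (assumed UTC) to epoch seconds."""
    dt = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def normalize_host(host):
    """Host normalization: trim, lowercase, drop 'www.' and a trailing dot."""
    host = host.strip().lower()
    if host.startswith("www."):
        host = host[4:]
    return host.rstrip(".")


def clean(record):
    """Return the cleaned record, or None when the record is dirty."""
    try:
        ts = to_epoch(record["time"])
        status = int(record["status"])      # character type conversion
    except (KeyError, ValueError):
        return None
    host = normalize_host(record.get("host", ""))
    if not HOST_RE.match(host):              # regular-expression matching
        return None
    return {"ts": ts, "host": host, "status": status}
```

Records that fail any conversion or the regex check are dropped rather than repaired, which is one possible reading of "rejecting dirty data".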
In a specific implementation, filters and interceptors are configured for the corresponding data item. They are packaged as JAR files and placed, as plug-ins, in the directory of the corresponding data item, and the cleaning unit 103 automatically loads the packages and their configuration by hot loading.
In a specific implementation, the cleaning unit 103 supports configuring any number of cleaning interceptors to form a processing chain, which is executed in the configured order.
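A processing chain of interceptors executed in configured order can be sketched as below. The `build_chain` helper and the three sample interceptors are hypothetical; the only convention assumed here is that an interceptor returns the (possibly modified) event, or `None` to reject it.

```python
def build_chain(interceptors):
    """Compose interceptors into one callable that runs them in list order."""
    def run(event):
        for intercept in interceptors:
            event = intercept(event)
            if event is None:       # an interceptor rejected the event
                return None
        return event
    return run


def strip_fields(event):
    # Modify: trim whitespace from every value.
    return {key: value.strip() for key, value in event.items()}


def drop_without_user(event):
    # Filter: reject events that carry no user identifier.
    return event if event.get("user") else None


def tag_source(event):
    # Enrich: add a field required by downstream business logic.
    return {**event, "source": "web"}
```

Because the chain runs in list order, placing `strip_fields` before `drop_without_user` causes a whitespace-only user to be rejected, whereas the reverse order would let it through.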
A storage unit 104, adapted to store the second data.
In this embodiment, the storage unit 104 stores the second data: the second data is encapsulated as events (Event), and the events are stored by in-memory storage, file storage, database storage, or Kafka storage. Storing the second data means placing it in a cache for consumption by subsequent steps.
A processing unit 105, adapted to extract the second data and process it according to business logic.
In this embodiment, the processing unit 105 is configured according to the specific business logic and supports both data-stream consumption and single-file consumption.
In a specific implementation, when multithreading is used to speed up consumption, one or more identical processing units 105 can be configured in the configuration file, each under a different name.
In this embodiment, the second data is transmitted and stored as a data stream or at file level. The data-stream mode is the basic transmission mode: the second data is pushed to a message queue in a stateless manner, and after the processing unit 105 receives it, the data is processed in a unified way. The file-level mode processes data one file at a time and runs a post-processing program after each file has been processed. After processing, the data also needs to be loaded. The file-level mode makes it convenient to keep source files in their original format without disturbing their data.
Fig. 2 is a schematic structural diagram of a data processing node according to an embodiment of the present invention.
Referring to Fig. 2, the data processing node 20 comprises: a data receiving unit 201, a data storage unit 202, and a data processing unit 203.
In this embodiment, the data receiving unit (Source) 201 receives the second data and encapsulates it into events. An Event represents one unit of data with an optional message header; key-value pairs are placed in the message header of the event, and the message content is placed in the message body.
In this embodiment, the data receiving unit 201 pushes Events into the data storage unit (Channel) 202. The storage unit 202 buffers data, acting as temporary staging storage for Events, and stores them by in-memory storage, file storage, database storage, or Kafka storage.
In a specific implementation, Kafka is a high-throughput distributed publish-subscribe messaging system that can handle the activity-stream data of a consumer-scale website. Kafka unifies online and offline message processing through Hadoop's parallel loading mechanism and provides real-time consumption.
The data processing unit (Sink) 203 reads and removes Events from the storage unit 202, converts the data, and finally loads it into the interface of the downstream node.
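The Source, Channel, and Sink roles described above can be sketched in a few lines of Python. This is an in-memory illustration only: the class names follow the terms in the text, but the method names (`receive`, `put`, `take`, `drain`) and the `convert` and `deliver` callables are assumptions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """One unit of data: message body plus optional key-value headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


class MemoryChannel:
    """In-memory staging store between a Source and a Sink."""
    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take(self):
        # Read and remove the oldest event, or return None when empty.
        return self._queue.popleft() if self._queue else None


class Source:
    """Receives data and wraps it in an Event before pushing to the channel."""
    def __init__(self, channel):
        self.channel = channel

    def receive(self, payload, **headers):
        self.channel.put(Event(body=payload, headers=headers))


class Sink:
    """Takes events from the channel, converts them, and hands them downstream."""
    def __init__(self, channel, convert, deliver):
        self.channel = channel
        self.convert = convert
        self.deliver = deliver

    def drain(self):
        count = 0
        while True:
            event = self.channel.take()
            if event is None:
                return count
            self.deliver(self.convert(event))
            count += 1
```

Swapping `MemoryChannel` for a file-, database-, or Kafka-backed class with the same `put` and `take` methods would leave the Source and Sink unchanged, which is the main point of the three-part split.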
In this embodiment, once the data is ready, the original data table may be emptied and the data then loaded; alternatively, data may be appended directly to the table. When an input record duplicates an existing record, the definition must specify whether to discard it or insert a new record. If the primary key of an input record matches the key of an existing record, the matched target record is updated; if the primary key of an input record does not match the key of any existing record, the existing records are retained and the input record is added.
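The three loading behaviors described above (empty-then-load, append with a duplicate policy, and update-or-insert by primary key) can be compared in a small sketch. The table is modeled as a plain list of dictionaries, and the function names, the `on_duplicate` values, and the `id` key are assumptions for illustration.

```python
def full_refresh(table, incoming):
    """Empty the original table, then load the new data."""
    table.clear()
    table.extend(incoming)


def append(table, incoming, on_duplicate="discard"):
    """Append to the table; exact duplicates are discarded or inserted anyway."""
    for record in incoming:
        if record in table and on_duplicate == "discard":
            continue
        table.append(record)


def upsert(table, incoming, key="id"):
    """Update the record whose key matches; otherwise add the input record."""
    index = {row[key]: position for position, row in enumerate(table)}
    for record in incoming:
        if record[key] in index:
            table[index[record[key]]] = record
        else:
            index[record[key]] = len(table)
            table.append(record)
```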
In this embodiment, the data processing node 20 can be integrated with the Morphline tool to provide an ETL processing chain for handling complex ETL (Extract-Transform-Load) processes. ETL refers to the process of extracting data from a source, transforming it, and loading it into a destination. Morphline is an ETL tool.
In this embodiment, after the ETL flow ends, the final result data is stored. The storage modes include in-memory storage and persistent storage; persistent storage is used to reload data on restart or recovery, ensuring that stored data is not lost.
Fig. 3 is a schematic structural diagram of a consumption node according to an embodiment of the present invention.
Referring to Fig. 3, the consumption node 30 is a Hadoop Distributed File System (HDFS) consumption node, comprising: a data receiving unit (Source) 301, a data storage unit (Channel) 302, and an HDFS processing unit (Sink) 303.
In a specific implementation, HDFS is suited to running on commodity hardware. HDFS is highly fault-tolerant, provides high-throughput data access, is suited to large data sets, and supports streaming reads of file data.
In this embodiment, the HDFS consumption node 30 receives data in file form and in data-stream form. Data is stored in HDFS as the source file, or a data stream is sliced into files according to rules such as time and size.
The data receiving unit 301 receives data and encapsulates it into events; key-value pairs are placed in the message header of the event, and the message content is placed in the message body.
The data storage unit 302 buffers the data.
The HDFS processing unit 303 consumes the events in the data storage unit 302, which include source files sent by users or transmitted data streams, and uploads them to HDFS through the HDFS client as source files or sliced files.
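Slicing a data stream into files by time and size, as mentioned above, can be illustrated with a generator that closes the current slice when either limit is reached. The function name, the `(timestamp, bytes)` event shape, and the default limits are assumptions; the actual upload to HDFS is left out.

```python
def slice_stream(events, max_bytes=1024, max_seconds=60):
    """Group (timestamp, bytes) events into slices bounded by size and age.

    Yields the concatenated bytes of each slice, ready to be written as a file.
    """
    chunk, size, started = [], 0, None
    for timestamp, body in events:
        too_big = size + len(body) > max_bytes
        too_old = started is not None and timestamp - started >= max_seconds
        if chunk and (too_big or too_old):
            yield b"".join(chunk)
            chunk, size, started = [], 0, None
        if started is None:
            started = timestamp     # the slice's age starts at its first event
        chunk.append(body)
        size += len(body)
    if chunk:
        yield b"".join(chunk)       # flush the final, partially filled slice
```

Timestamps are passed in with the events rather than read from the clock, which keeps the slicing rule deterministic and easy to test.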
Fig. 4 is a schematic structural diagram of another consumption node according to an embodiment of the present invention.
Referring to Fig. 4, the consumption node 40 is an Online Analytical Processing (OLAP) consumption node, comprising: a data receiving unit (Source) 401, a data storage unit (Channel) 402, and an OLAP processing unit (Sink) 403.
In a specific implementation, OLAP gives management decision-makers deep insight into data through fast, stable, consistent, and interactive access to the many possible views of information. Decision data is multidimensional data, and multidimensional data is the main content of decision-making. OLAP supports complex analysis operations with an emphasis on decision support for decision-makers and senior managers: it can quickly and flexibly process complex queries over large data volumes according to the analyst's requirements and present the query results to decision-makers in an intuitive, easily understood form. Its notable advantages include flexible analysis functions, intuitive data manipulation, and visual presentation of analysis results.
The data receiving unit 401 receives file data or data-stream data and encapsulates it into events; key-value pairs are placed in the message header of the event, and the message content is placed in the message body.
The data storage unit 402 buffers the data.
The OLAP processing unit 403 consumes the events in the data storage unit 402. Data files are imported into OLAP using OLAP's own data-file loading mechanism; data in data-stream form is first sliced into files according to rules such as time and size and then loaded. After the data has been loaded, the OLAP processing unit 403 notifies the post-processing program.
In this embodiment, the OLAP consumption node differs from the HDFS consumption node in that the OLAP processing unit 403 transfers data to OLAP using tools such as data loaders.
It should be noted that the embodiments of the present invention may use the OLAP and HDFS processing modes, or may use consumption nodes with any other implementable processing mode, in order to meet business requirements in different application environments. The embodiments of the present invention do not limit this.
For specific implementations, refer to the corresponding embodiments above; details are not repeated here.
Fig. 5 is a schematic structural diagram of a data collection and processing device according to an embodiment of the present invention.
Referring to Fig. 5, the data collection and processing device comprises: a data collection node 10, data processing nodes 20, an HDFS consumption node 30, and an OLAP consumption node 40.
The data collection node 10 defines the data format, collects data, configures filters or interceptors in plug-in form, and filters and cleans the collected data according to the data format.
The data processing node 20 converts the filtered data and sends it to the corresponding consumption node.
In this embodiment, the HDFS consumption node 30 reads the converted data. Because the HDFS environment is isolated from other environments, the HDFS consumption node 30 can be provided as a separate node, enabling cross-network data transmission and supporting multiple upload protocols.
In this embodiment, the OLAP consumption node 40 reads the converted data. The OLAP consumption node 40 enables cross-network data transmission and supports multiple upload protocols.
In this embodiment, the data processing nodes 20 correspond one-to-one with the HDFS consumption node 30 and the OLAP consumption node 40, so that different consumption nodes are connected to different data processing nodes. The data processing nodes 20 are connected in parallel.
In the embodiments of the present invention, the data collection node 10, the data processing node 20, the HDFS consumption node 30, and the OLAP consumption node 40 can be defined according to practical application requirements, and nodes are connected to one another through interfaces. Each node defines a receiving interface, and the processing unit of the upstream node pushes data to the receiving interface of the next node. Because function logic is divided uniformly and interface types are unified, each node performs a specific function and transmits data in a predefined format, so that nodes can be freely connected and combined.
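The idea that every node exposes one receiving interface and pushes its result to the next node can be sketched as follows. The `Node` class, its `connect` and `receive` methods, and the toy handlers are assumptions for illustration; the wiring mirrors the parallel arrangement of Fig. 5, with one processing node per consumption node.

```python
class Node:
    """A node with a single receiving interface and any number of downstream nodes."""
    def __init__(self, name, handle):
        self.name = name
        self.handle = handle        # the node's specific function logic
        self.downstream = []

    def connect(self, node):
        """Attach a downstream node and return it so chains can be built."""
        self.downstream.append(node)
        return node

    def receive(self, data):
        result = self.handle(data)
        for node in self.downstream:
            node.receive(result)    # push to the next node's receiving interface


# Parallel wiring: collector -> process-a -> hdfs, collector -> process-b -> olap.
hdfs_out, olap_out = [], []
collector = Node("collect", lambda d: d.strip())
proc_a = collector.connect(Node("process-a", lambda d: d.upper()))
proc_b = collector.connect(Node("process-b", lambda d: d.split(",")))
proc_a.connect(Node("hdfs", hdfs_out.append))
proc_b.connect(Node("olap", olap_out.append))
collector.receive(" a,b ")
```

A series arrangement would instead connect one processing node to the next and attach both consumption nodes to the last one, without any change to the `Node` class.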
In a specific implementation, by defining the data flow (DataFlow) of the nodes, the nodes are freely combined in the manner of a workflow and deployed according to the actual network environment and the application scenario of each node. The data processing nodes 20 can be freely combined, or omitted altogether.
An embodiment of the present invention may also provide a Java Management Extensions (JMX) function that supports UI monitoring tools such as Ganglia and Grafana in monitoring the data collection node 10, the data processing node 20, the HDFS consumption node 30, and the OLAP consumption node 40 throughout the device. This enables monitoring of steps such as data collection, filtering, and processing, and makes it easy to adjust when problems occur.
It can be understood that the numbers of data processing nodes 20, HDFS consumption nodes 30, and OLAP consumption nodes 40 in the data collection and processing device can be adjusted to suit the actual application environment.
By deploying processing and loading on different nodes, the embodiments of the present invention improve the convenience of data transmission.
Fig. 6 is a schematic structural diagram of another data collection and processing device according to an embodiment of the present invention.
Referring to Fig. 6, the data collection and processing device comprises: a data collection node 10, data processing nodes 20, an HDFS consumption node 30, and an OLAP consumption node 40.
In this embodiment, the HDFS consumption node 30 and the OLAP consumption node 40 use the same data processing node 20. The data processing nodes 20 are connected in series.
It can be understood that the numbers of data processing nodes 20, HDFS consumption nodes 30, and OLAP consumption nodes 40 in the data collection and processing device can be adjusted to suit the actual application environment.
For specific implementations, refer to the corresponding embodiments above; details are not repeated here.
In the data collection and processing device of the embodiments of the present invention, the data collection node, the data processing node, and the consumption node are each provided with a storage unit for saving data in transit, so that when a node restarts because of external factors, the data held before the restart can be recovered automatically, ensuring data safety during transmission. In addition, data transmission at each stage is protected against interruption and loss even under adverse network conditions.
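Recovery of in-transit data after a restart can be illustrated with a file-backed channel that reloads its pending events on startup. The `FileChannel` class, the JSON-lines file layout, and the rewrite-on-take strategy are assumptions chosen for brevity; they favor clarity over throughput.

```python
import json
import os


class FileChannel:
    """Channel that persists pending events so they survive a node restart."""
    def __init__(self, path):
        self.path = path
        self.pending = []
        if os.path.exists(path):
            # Reload whatever was still pending before the restart.
            with open(path, encoding="utf-8") as handle:
                self.pending = [json.loads(line) for line in handle if line.strip()]

    def put(self, event):
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(event) + "\n")
            handle.flush()
            os.fsync(handle.fileno())   # make sure the event reaches the disk
        self.pending.append(event)

    def take(self):
        if not self.pending:
            return None
        event = self.pending.pop(0)
        self._rewrite()
        return event

    def _rewrite(self):
        # Write the remaining events to a temp file, then swap it in atomically.
        temp_path = self.path + ".tmp"
        with open(temp_path, "w", encoding="utf-8") as handle:
            for event in self.pending:
                handle.write(json.dumps(event) + "\n")
        os.replace(temp_path, self.path)
```

Creating a second `FileChannel` on the same path simulates a restart: events that were put but not yet taken are available again.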
Fig. 7 is a flowchart of a data collection and processing method according to an embodiment of the present invention.
Referring to Fig. 7, the data collection and processing method comprises: step S701, defining a data format according to sample data and the format of the sample data.
Step S702, collecting first data.
Step S703, filtering and cleaning the first data according to the data format to reject dirty data and obtain second data.
Step S704, storing the second data.
Step S705, extracting the second data and processing it according to business logic.
The data collection and processing method further comprises: converting the second data; and reading the converted second data. The converted second data is read in at least one of an HDFS mode and an OLAP mode.
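Steps S701 to S705 can be strung together in one short sketch. The dictionary layout of `data_format`, the comma delimiter, and the `run_pipeline` function are assumptions for illustration; storage is modeled as an in-memory list, and the conversion and consumption steps are omitted.

```python
def run_pipeline(data_format, raw_lines, business_logic):
    """Apply a predefined format (S701) to collected lines and process the result."""
    store = []
    for line in raw_lines:                                   # S702: collect first data
        values = line.rstrip("\n").split(data_format["delimiter"])
        if len(values) != len(data_format["fields"]):
            continue                                         # S703: reject dirty data
        try:
            record = {
                name: cast(value)
                for (name, cast), value in zip(data_format["fields"], values)
            }
        except ValueError:
            continue                                         # S703: reject dirty data
        store.append(record)                                 # S704: store second data
    return business_logic(store)                             # S705: business logic


# S701: a format defined in advance from sample data (hypothetical fields).
click_format = {"delimiter": ",", "fields": [("user", str), ("clicks", int)]}
```

For example, passing the lines `"u1,3"`, `"u2,x"`, `"broken"`, and `"u1,4"` with a business function that sums `clicks` keeps only the first and last records.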
For specific implementations, refer to the corresponding embodiments above; details are not repeated here.
A person of ordinary skill in the art will appreciate that all or some of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may include ROM, RAM, a magnetic disk, an optical disc, or the like.
Although the present invention is disclosed as above, it is not limited thereto. Any person skilled in the art may make various changes or modifications without departing from the spirit and scope of the present invention; the scope of protection of the present invention shall therefore be as defined by the claims.

Claims (11)

1. A data collection and processing method, characterized by comprising:
Defining a data format according to sample data and the format of the sample data;
Collecting first data;
Filtering and cleaning the first data according to the data format, rejecting dirty data, and obtaining second data;
Storing the second data;
Extracting the second data and performing data processing according to business logic.
2. The data collection and processing method according to claim 1, characterized in that, after extracting the second data and performing data processing according to the business logic, the method further comprises: performing data conversion on the second data; and reading the converted second data.
3. The data collection and processing method according to claim 2, characterized in that the converted second data is read in at least one of an HDFS mode and an OLAP mode.
4. The data collection and processing method according to claim 1, characterized in that collecting the first data according to the data format comprises one or more of the following: collecting the first data from an uploaded user data file according to the data format; pulling the user data file according to an uploaded FTP address, and collecting the first data according to the data format; receiving user data based on a database connection, Open Database Connectivity, HTML (Hypertext Markup Language), thrift, or a remote procedure call protocol, and collecting the first data according to the data format.
5. The data collection and processing method according to claim 1, characterized in that filtering and cleaning the data and rejecting dirty data comprises: filtering, modifying, or deleting the first data by converting timestamp formats, converting field types, Host normalization, or regular-expression matching.
6. The data collection and processing method according to claim 1, characterized in that storing the second data comprises: encapsulating the second data as an event, and storing the event by memory storage, file storage, database storage, or Kafka storage.
7. The data collection and processing method according to claim 1, characterized in that the second data is transmitted and stored as a data stream or at the file level.
8. A data collection and processing device, characterized by comprising a data collection node, at least one data processing node, and a consumption node; the data collection node comprises:
A data format definition unit, adapted to define a data format according to uploaded sample data and the format of the sample data;
A data acquisition unit, adapted to collect first data;
A cleaning unit, adapted to filter and clean the first data according to the data format, reject dirty data, and obtain second data;
A storage unit, adapted to store the second data;
A processing unit, adapted to extract the second data and perform data processing according to business logic.
9. The data collection and processing device according to claim 8, characterized in that the data processing node is adapted to perform data conversion on the second data, and the consumption node is adapted to read the converted second data.
10. The data collection and processing device according to claim 9, characterized in that the data processing nodes are connected in parallel or in series.
11. The data collection and processing device according to claim 9, characterized in that the consumption node reads the converted second data in at least one of an HDFS mode and an OLAP mode.
CN201510843845.1A 2015-11-26 2015-11-26 Data collection and processing method and device Pending CN105512201A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510843845.1A CN105512201A (en) 2015-11-26 2015-11-26 Data collection and processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510843845.1A CN105512201A (en) 2015-11-26 2015-11-26 Data collection and processing method and device

Publications (1)

Publication Number Publication Date
CN105512201A true CN105512201A (en) 2016-04-20

Family

ID=55720183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510843845.1A Pending CN105512201A (en) 2015-11-26 2015-11-26 Data collection and processing method and device

Country Status (1)

Country Link
CN (1) CN105512201A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030217055A1 (en) * 2002-05-20 2003-11-20 Chang-Huang Lee Efficient incremental method for data mining of a database
CN101110858A (en) * 2007-08-29 2008-01-23 中兴通讯股份有限公司 Telecommunication report generation system and method thereof
CN102349081A (en) * 2009-02-10 2012-02-08 渣普控股有限公司 Etl builder
CN102880709A (en) * 2012-09-28 2013-01-16 用友软件股份有限公司 Data warehouse management system and data warehouse management method

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156886A (en) * 2016-06-30 2016-11-23 亿阳安全技术有限公司 A kind of method and system based on business system Supplementing Data rule application flow
CN106227862A (en) * 2016-07-29 2016-12-14 浪潮软件集团有限公司 E-commerce data integration method based on distribution
CN106354772A (en) * 2016-08-23 2017-01-25 成都卡莱博尔信息技术股份有限公司 Mass data system with data cleaning function
CN107247721A (en) * 2017-04-24 2017-10-13 江苏曙光信息技术有限公司 Visualize collecting method
WO2018214599A1 (en) * 2017-05-22 2018-11-29 平安科技(深圳)有限公司 Scalable data reporting method and system, and storage medium
CN107688535A (en) * 2017-08-15 2018-02-13 武汉斗鱼网络科技有限公司 A kind of mobile device APP performance data display methods and device
CN108052574A (en) * 2017-12-08 2018-05-18 南京中新赛克科技有限责任公司 Slave ftp server based on Kafka technologies imports the ETL system and implementation method of mass data
CN108829747A (en) * 2018-05-24 2018-11-16 新华三大数据技术有限公司 Data load method and device
CN108984708A (en) * 2018-07-06 2018-12-11 蔚来汽车有限公司 Dirty data recognition methods and device, data cleaning method and device, controller
CN108984708B (en) * 2018-07-06 2022-02-01 蔚来(安徽)控股有限公司 Dirty data identification method and device, data cleaning method and device, and controller
CN111327649B (en) * 2018-12-14 2023-04-21 中国电信股份有限公司 Service data processing method, device, SMF, system and storage medium
CN111327649A (en) * 2018-12-14 2020-06-23 中国电信股份有限公司 Service data processing method, device, SMF, system and storage medium
CN110147391A (en) * 2019-04-08 2019-08-20 顺丰速运有限公司 Data handover method, system, equipment and storage medium
CN111797149B (en) * 2019-06-27 2023-01-31 厦门雅基软件有限公司 Data acquisition method, device, equipment and computer readable storage medium
CN111797149A (en) * 2019-06-27 2020-10-20 厦门雅基软件有限公司 Data acquisition method, device, equipment and computer readable storage medium
CN110502491A (en) * 2019-07-25 2019-11-26 北京神州泰岳智能数据技术有限公司 A kind of Log Collect System and its data transmission method, device
CN110515619A (en) * 2019-08-09 2019-11-29 济南浪潮数据技术有限公司 A kind of theme creation method, device, equipment and readable storage medium storing program for executing
CN110659323A (en) * 2019-09-26 2020-01-07 卓尔购信息科技(武汉)有限公司 Real-time and off-line big data processing system, method, storage medium and terminal
CN112019605A (en) * 2020-08-13 2020-12-01 上海哔哩哔哩科技有限公司 Data distribution method and system of data stream
CN112019605B (en) * 2020-08-13 2023-05-09 上海哔哩哔哩科技有限公司 Data distribution method and system for data stream
CN112347093A (en) * 2020-11-05 2021-02-09 哈尔滨航天恒星数据系统科技有限公司 Method for facilitating cleaning, integrating and storing of mass multi-source heterogeneous data

Similar Documents

Publication Publication Date Title
CN105512201A (en) Data collection and processing method and device
CN103942210A (en) Processing method, device and system of mass log information
CN104090891B (en) Data processing method, Apparatus and system
CN105912609B (en) A kind of data file processing method and device
US10061858B2 (en) Method and apparatus for processing exploding data stream
CN104111996A (en) Health insurance outpatient clinic big data extraction system and method based on hadoop platform
CN108847977A (en) A kind of monitoring method of business datum, storage medium and server
CN105824744A (en) Real-time log collection and analysis method on basis of B2B (Business to Business) platform
CN102780726A (en) Log analysis method and log analysis system based on WEB platform
CN107818120A (en) Data processing method and device based on big data
CN104778188A (en) Distributed device log collection method
CN103970788A (en) Webpage-crawling-based crawler technology
CN105303456A (en) Method for processing monitoring data of electric power transmission equipment
CN102508908A (en) Method for acquiring subordinate financial business data and system for acquiring subordinate financial business data
CN112118174A (en) Software defined data gateway
CN105447146A (en) Massive data collecting and exchanging system and method
CN103577482A (en) Web page collecting method and device as well as browser
CN106330963A (en) Cross-network multi-node log collecting method
CN111400393A (en) Data processing method and device based on multi-application platform and storage medium
CN113312428A (en) Multi-source heterogeneous training data fusion method, device and equipment
CN109831316A (en) Massive logs real-time analyzer, real-time analysis method and readable storage medium storing program for executing
CN114201540A (en) Industrial multi-source data acquisition and storage system
CN104281980A (en) Remote diagnosis method and system for thermal generator set based on distributed calculation
CN105159820A (en) Transmission method and device of system log data
CN108090186A (en) A kind of electric power data De-weight method on big data platform

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160420
