CN105512201A - Data collection and processing method and device - Google Patents
- Publication number
- CN105512201A (application CN201510843845.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- collection
- node
- layout
- mode
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/258—Data format conversion from or to a database
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/069—Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
Abstract
The invention provides a data collection and processing method and device. The method comprises the following steps: defining a data format based on sample data and the format of the sample data; collecting first data; filtering and cleaning the first data according to the data format to eliminate dirty data and obtain second data; storing the second data; and extracting the second data and processing it according to business logic. The method and device have the following beneficial effects: because a data format is defined and collected data is processed according to that format, the scope of application of data collection is widened; and because the first data is filtered and cleaned and dirty data is eliminated, the validity of data collection and the efficiency of data processing are improved.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a data collection and processing method and device.
Background art
In the field of computer technology, analyzing and predicting users' network behavior requires saving users' access data as log files. Log data is diverse and complex, and building a system that can support every kind of log data is extremely difficult. In the big data field, the first task is to provide a unified data analysis platform that lets users feed various kinds of data into the platform in some way.
In the prior art, data is collected in several ways. A web server saves log data to generate log files, which are then uploaded to the Hadoop Distributed File System (HDFS), a highly fault-tolerant system. Data can also be pushed to a message queue as a stream, for example a queue in the distributed publish-subscribe messaging system Kafka, and then uploaded from the message queue by a consumer. In different network environments, the production network and the network hosting the data services are usually isolated. Data is usually transferred across networks by setting up passwordless login over the Secure Shell (SSH) protocol and then using the remote synchronization tool rsync or the remote file copy tool scp on Linux, or by setting up a Hypertext Transfer Protocol (HTTP) or FTP server to which clients upload files.
However, prior-art data collection methods pick up dirty data and junk data during collection, which reduces the validity and accuracy of data collection; they also target a single data type, so their scope of application is narrow.
Summary of the invention
The technical problem solved by the present invention is how to improve the validity of data collection and widen its scope of application.
To solve the above technical problem, an embodiment of the present invention provides a data collection and processing method, comprising:
defining a data format according to sample data and the format of the sample data;
collecting first data;
filtering and cleaning the first data according to the data format to reject dirty data and obtain second data;
storing the second data;
extracting the second data and processing it according to business logic.
Optionally, after the second data is extracted and processed according to the business logic, the method further comprises: performing data conversion on the second data; and reading the converted second data.
Optionally, the converted second data is read in at least one of an HDFS mode and an OLAP mode.
Optionally, collecting the first data according to the data format comprises one or more of the following: collecting the first data from an uploaded user data file according to the data format; pulling the user data file from an uploaded FTP address and collecting the first data according to the data format; and receiving user data via Java Database Connectivity (JDBC), Open Database Connectivity (ODBC), the Hypertext Transfer Protocol (HTTP), Thrift, or a remote procedure call (RPC) protocol, and collecting the first data according to the data format.
Optionally, filtering and cleaning the data and rejecting dirty data comprises: filtering, modifying, or deleting the first data by timestamp format conversion, character type conversion, host normalization, or regular-expression matching.
Optionally, storing the second data comprises: encapsulating the second data as an event, and storing the event by in-memory storage, file storage, database storage, or Kafka storage.
Optionally, the second data is transmitted and stored either as a data stream or at file level.
To solve the above technical problem, an embodiment of the present invention further discloses a data collection and processing device comprising a data collection node, at least one data processing node, and a consumption node. The data collection node comprises:
a data format definition unit, adapted to define a data format according to uploaded sample data and the format of the sample data;
a data acquisition unit, adapted to collect first data;
a cleaning unit, adapted to filter and clean the first data according to the data format, rejecting dirty data to obtain second data;
a storage unit, adapted to store the second data;
a processing unit, adapted to extract the second data and process it according to business logic.
Optionally, the data processing node is adapted to perform data conversion on the second data, and the consumption node is adapted to read the converted second data.
Optionally, the data processing nodes are connected in parallel or in series.
Optionally, the consumption node reads the converted second data in at least one of an HDFS mode and an OLAP mode.
Compared with the prior art, the technical solution of the embodiments of the present invention has the following beneficial effects:
In the embodiments of the present invention, a data format is defined according to sample data and the format of the sample data; first data is collected according to the data format; the first data is filtered and cleaned to reject dirty data and obtain second data; and after the second data is stored, it is extracted and processed according to business logic. Because a data format is defined and collected data is processed according to it, the data collection node, data processing node, and consumption node are easy to extend, easy to monitor, and easy to adjust when problems occur, which widens the scope of application of data collection. Filtering and cleaning the first data to reject dirty data improves the validity of data collection and makes cross-network data transmission more convenient.
Further, the second data is transmitted and stored either as a data stream or at file level. The file-level mode keeps the source file in its original format without disturbing its data, which ensures data accuracy; the data-stream mode lets the nodes for data collection and processing be organized, configured, and deployed flexibly, which widens the scope of application of data collection.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of a data collection node according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a data processing node according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of one consumption node according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of another consumption node according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a data collection and processing device according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of another data collection and processing device according to an embodiment of the present invention;
Fig. 7 is a flowchart of a data collection and processing method according to an embodiment of the present invention.
Detailed description of the embodiments
As described in the background art, prior-art data collection methods pick up dirty data and junk data during collection, which reduces the validity and accuracy of data collection; they also target a single data type, so their scope of application is narrow.
In the embodiments of the present invention, a data format is defined according to sample data and the format of the sample data; first data is collected according to the data format; the first data is filtered and cleaned to reject dirty data and obtain second data; and after the second data is stored, it is extracted and processed according to business logic. Because a data format is defined and collected data is processed according to it, the data collection node, data processing node, and consumption node are easy to extend, easy to monitor, and easy to adjust when problems occur, which widens the scope of application of data collection. Filtering and cleaning the first data to reject dirty data improves the validity of data collection and makes cross-network data transmission more convenient.
To make the above objects, features, and advantages of the present invention clearer, specific embodiments are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of a data collection node according to an embodiment of the present invention.
Referring to Fig. 1, the data collection node 10 comprises:
A data format definition unit 101, adapted to define a data format according to uploaded sample data and the format of the sample data.
In this embodiment, the client uploads sample data and the corresponding format through a Hypertext Transfer Protocol (HTTP) interface. The data format definition unit 101 defines the data format and returns it to the client for confirmation; once the client confirms, the definition of that data format is complete.
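The format-definition step can be illustrated with a minimal Python sketch. It assumes a delimited text record and a client-supplied list of field names and types; the names `Field`, `DataFormat`, and `define_format` are hypothetical and do not come from the patent, which does not specify how a format is represented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Field:
    """One named field and the function that converts its raw text (assumed shape)."""
    name: str
    cast: Callable[[str], object]


@dataclass(frozen=True)
class DataFormat:
    """Hypothetical data format: a delimiter plus an ordered list of typed fields."""
    delimiter: str
    fields: List[Field]

    def parse(self, line: str) -> Dict[str, object]:
        parts = line.rstrip("\n").split(self.delimiter)
        if len(parts) != len(self.fields):
            raise ValueError(f"expected {len(self.fields)} fields, got {len(parts)}")
        # A failing cast (for example int("abc")) also raises ValueError.
        return {f.name: f.cast(p) for f, p in zip(self.fields, parts)}


def define_format(sample: str, delimiter: str,
                  spec: List[Tuple[str, Callable[[str], object]]]) -> DataFormat:
    """Build a format from the declared spec and check it against the sample.

    Returning the format stands in for sending it back to the client for confirmation.
    """
    fmt = DataFormat(delimiter, [Field(name, cast) for name, cast in spec])
    fmt.parse(sample)  # raises ValueError if the sample does not fit the declared format
    return fmt
```

Validating the uploaded sample against the declared format before accepting the definition is one plausible reading of the confirmation round trip; the actual unit 101 may work differently.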
A data acquisition unit 102, adapted to collect first data.
In this embodiment, the client uploads a data file, and the first data is collected from the uploaded user data file according to the data format.
In this embodiment, data can also be collected by supplying a File Transfer Protocol (FTP) address: the user data file is pulled from the FTP address, and the first data is collected according to the data format. When an FTP address is given, the data acquisition unit 102 starts a background thread that pulls the file from that address. Data can also be collected through an external interface that receives data sent by other systems or devices over HTTP or a remote procedure call (RPC) protocol, with the first data collected according to the data format.
It can be understood that data can also be collected over protocols such as Java Database Connectivity (JDBC), Open Database Connectivity (ODBC), and Thrift. Data can also be collected in real time as a data stream, so that collection, processing, and consumption all happen in real time; the embodiments of the present invention place no limitation on this.
A cleaning unit 103, adapted to filter and clean the first data according to the data format, rejecting dirty data to obtain second data.
In this embodiment, after the data acquisition unit 102 collects the first data, the cleaning unit 103 filters and cleans the data through filters and interceptors according to the data format defined in a configuration file. The cleaning unit 103 filters through Interceptors, which intercept the first data in transit and add the required business logic. The first data can also be filtered, modified, or deleted by timestamp format conversion, character type conversion, host normalization, or regular-expression matching so that it conforms to the data format defined by the data format definition unit 101. The result is the second data, which achieves dynamic recognition and collection of data.
In a specific implementation, the cleaning unit 103 configures the corresponding data item for each filter and interceptor. Filters and interceptors are packaged as JAR files and placed in the data item's directory as plug-ins, and the cleaning unit 103 loads the packages and configuration automatically by hot loading.
In a specific implementation, the cleaning unit 103 supports configuring any number of cleaning interceptors to form a processing chain, executed in the configured order.
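The interceptor chain described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: records are plain dictionaries with hypothetical `ts` and `host` keys, each interceptor returns a cleaned record or `None` to reject it as dirty data, and the function names are invented for the example.

```python
import re
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Record = Dict[str, object]
Interceptor = Callable[[Record], Optional[Record]]


def convert_timestamp(record: Record) -> Optional[Record]:
    """Timestamp format conversion: text timestamp to epoch seconds (UTC assumed)."""
    try:
        parsed = datetime.strptime(str(record["ts"]), "%Y-%m-%d %H:%M:%S")
    except (KeyError, ValueError):
        return None  # missing or malformed timestamp: treat as dirty data
    return {**record, "ts": int(parsed.replace(tzinfo=timezone.utc).timestamp())}


def normalize_host(record: Record) -> Optional[Record]:
    """Host normalization: trim whitespace, lowercase, drop a trailing dot."""
    host = str(record.get("host", "")).strip().lower().rstrip(".")
    return {**record, "host": host} if host else None


HOST_PATTERN = re.compile(r"[a-z0-9-]+(\.[a-z0-9-]+)+")


def match_host(record: Record) -> Optional[Record]:
    """Regular-expression matching: keep only records whose host looks like a domain."""
    return record if HOST_PATTERN.fullmatch(str(record["host"])) else None


def run_chain(records: Iterable[Record], chain: List[Interceptor]) -> Iterator[Record]:
    """Apply the interceptors in configured order; a None result rejects the record."""
    for record in records:
        current: Optional[Record] = record
        for step in chain:
            current = step(current)
            if current is None:
                break
        else:
            yield current
```

Calling `run_chain(first_data, [convert_timestamp, normalize_host, match_host])` yields only the records that survive every step, which corresponds to the second data. Because the chain is an ordinary list, reordering or extending it mirrors the configurable processing chain.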
A storage unit 104, adapted to store the second data.
In this embodiment, the storage unit 104 stores the second data. The second data is encapsulated as an event (Event), and the event is stored by in-memory storage, file storage, database storage, or Kafka storage. Storing the second data places it in a buffer for consumption by later steps.
A processing unit 105, adapted to extract the second data and process it according to business logic.
In this embodiment, the processing unit 105 is configured according to the specific business logic and supports both data-stream and single-file consumption.
In a specific implementation, when multithreading is used to speed up consumption, one or more identical processing units 105 can be configured in the configuration file, each under a different name.
In this embodiment, the second data is transmitted and stored either as a data stream or at file level. The data-stream mode is the basic transmission mode: the second data is pushed to a message queue statelessly, and after the processing unit 105 receives it, the data is processed in a uniform way. The file-level mode processes one file at a time and runs a post-processing program after the file has been processed. The data also has to be loaded after processing. The file-level mode makes it convenient to keep the source file in its original format without disturbing its data.
Fig. 2 is a schematic structural diagram of a data processing node according to an embodiment of the present invention.
Referring to Fig. 2, the data processing node 20 comprises: a data receiving unit 201, a data storage unit 202, and a data processing unit 203.
In this embodiment, the data receiving unit (Source) 201 receives the second data and encapsulates it as an event. An Event represents one unit of data and carries an optional message header; key-value pairs are placed in the event's message header and the message content in the message body.
In this embodiment, the data receiving unit 201 pushes the Event into the data storage unit (Channel) 202. The storage unit 202 buffers data as temporary storage for Events in transit, and stores Events by in-memory storage, file storage, database storage, or Kafka storage.
In a specific implementation, Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the activity stream data of a consumer-scale website. Kafka unifies online and offline message processing through Hadoop's parallel loading mechanism and provides real-time consumption.
The data processing unit (Sink) 203 reads and removes Events from the storage unit 202, performs data conversion, and finally loads the result into the interface of the downstream node.
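The Source, Channel, and Sink roles described above can be sketched in Python with an in-memory channel. All class and method names here (`Event`, `MemoryChannel`, `Source.receive`, `Sink.drain`) are assumptions made for illustration; the patent does not define a programming interface, and file, database, or Kafka channels would replace the queue.

```python
import queue
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Event:
    """One unit of data: message body plus optional key-value headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


class MemoryChannel:
    """In-memory buffer between Source and Sink (one of the storage modes named above)."""

    def __init__(self, capacity: int = 1000):
        self._queue = queue.Queue(maxsize=capacity)

    def put(self, event: Event) -> None:
        self._queue.put(event)

    def take(self) -> Optional[Event]:
        """Read and remove the next event, or return None when the channel is empty."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None


class Source:
    """Receives raw data, wraps it as an Event, and pushes it into the channel."""

    def __init__(self, channel: MemoryChannel):
        self.channel = channel

    def receive(self, payload: bytes, **headers: str) -> None:
        self.channel.put(Event(payload, dict(headers)))


class Sink:
    """Takes events from the channel, converts them, and hands them downstream."""

    def __init__(self, channel: MemoryChannel,
                 transform: Callable[[Event], object],
                 deliver: Callable[[object], None]):
        self.channel = channel
        self.transform = transform
        self.deliver = deliver

    def drain(self) -> int:
        """Process every buffered event and return how many were delivered."""
        count = 0
        event = self.channel.take()
        while event is not None:
            self.deliver(self.transform(event))
            count += 1
            event = self.channel.take()
        return count
```

Here `deliver` stands in for the receiving interface of the downstream node, so the same sink can feed either another processing node or a consumption node.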
In this embodiment, once the data is ready, the existing data table can be emptied and the data loaded into it. Alternatively, data can be appended directly to the table; when an input record duplicates an existing record, the load definition must specify whether to discard it or insert it as a new record. If the primary key of an input record matches the key of an existing record, the matching target record is updated; if it does not match the key of any existing record, the existing records are kept and the input record is added.
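The loading strategies in the preceding paragraph can be sketched in Python against a list of dictionaries standing in for the target table. The function names and the `on_duplicate` parameter are hypothetical, and the upsert branch follows the reading that a non-matching key results in an added record.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def load_replace(table: List[Row], rows: Iterable[Row]) -> None:
    """Empty the existing table, then load the new data."""
    table.clear()
    table.extend(rows)


def load_append(table: List[Row], rows: Iterable[Row], on_duplicate: str = "discard") -> None:
    """Append rows; a row identical to an existing one is discarded or inserted again."""
    if on_duplicate not in ("discard", "insert"):
        raise ValueError("on_duplicate must be 'discard' or 'insert'")
    for row in rows:
        if on_duplicate == "discard" and row in table:
            continue
        table.append(row)


def load_upsert(table: List[Row], rows: Iterable[Row], key: str) -> None:
    """Update the row whose key matches; otherwise keep existing rows and add the input."""
    position = {row[key]: i for i, row in enumerate(table)}
    for row in rows:
        if row[key] in position:
            table[position[row[key]]] = row
        else:
            position[row[key]] = len(table)
            table.append(row)
```

A real loader would issue the equivalent statements against the target database; the in-memory list only makes the three behaviours easy to compare.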
In this embodiment, the data processing node 20 can be integrated with the Morphline tool to provide an ETL processing chain for complex ETL (Extract-Transform-Load) processing. ETL refers to extracting data from a source, transforming it, and loading it into a destination. Morphline is one such ETL tool.
In this embodiment, after the ETL flow ends, the final result data is stored. Storage modes include in-memory storage and persistent storage; persistent storage is used to reload data on restart or recovery, ensuring that stored data is not lost.
Fig. 3 is a schematic structural diagram of one consumption node according to an embodiment of the present invention.
Referring to Fig. 3, the consumption node 30 is a Hadoop Distributed File System (HDFS) consumption node comprising: a data receiving unit (Source) 301, a data storage unit (Channel) 302, and an HDFS processing unit (Sink) 303.
In a specific implementation, HDFS is suited to running on commodity hardware. HDFS is a highly fault-tolerant system that provides high-throughput data access, is suited to large data sets, and supports streaming reads of file data.
In this embodiment, the HDFS consumption node 30 receives data in file form and in data-stream form. Files are stored in HDFS as source files, while data streams are sliced into files according to rules such as time and size before being stored.
The data receiving unit 301 receives data and encapsulates it as an event; key-value pairs are placed in the event's message header and the message content in the message body.
The data storage unit 302 buffers the data.
The HDFS processing unit 303 consumes the Events in the data storage unit 302, which carry source files or data streams sent by the user, and uploads them to HDFS through the HDFS client as source files or sliced files.
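Slicing a data stream into files can be sketched as below. The sketch covers only a size rule, with a hypothetical `slice_stream` function and `max_bytes` threshold; a time rule would close the current slice when a clock interval elapses instead. The upload to HDFS itself is left out because the patent does not name a client API.

```python
from typing import Iterable, Iterator, List


def slice_stream(records: Iterable[bytes], max_bytes: int) -> Iterator[List[bytes]]:
    """Group a stream of records into slices of at most max_bytes each.

    A single record larger than max_bytes still forms its own slice rather than being dropped.
    """
    chunk: List[bytes] = []
    size = 0
    for record in records:
        if chunk and size + len(record) > max_bytes:
            yield chunk  # current slice is full: hand it over as one file's content
            chunk, size = [], 0
        chunk.append(record)
        size += len(record)
    if chunk:
        yield chunk  # flush the final, possibly partial slice
```

Each yielded list would be written out as one slice file and then uploaded, which keeps file sizes bounded regardless of how long the stream runs.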
Fig. 4 is a schematic structural diagram of another consumption node according to an embodiment of the present invention.
Referring to Fig. 4, the consumption node 40 is an Online Analytical Processing (OLAP) consumption node comprising: a data receiving unit (Source) 401, a data storage unit (Channel) 402, and an OLAP processing unit (Sink) 403.
In a specific implementation, OLAP lets management decision-makers examine data in depth through fast, stable, consistent, and interactive access to multiple possible views of the information. Decision data is multidimensional, and multidimensional data is the main content of decision-making. OLAP supports complex analytical operations with an emphasis on decision support for decision-makers and senior managers. It can run complex queries over large data volumes quickly and flexibly at the analyst's request and present the results to decision-makers in an intuitive, easily understood form. Its notable advantages include flexible analysis functions, intuitive data manipulation, and visual presentation of analysis results.
The data receiving unit 401 receives file or data-stream data and encapsulates it as an event; key-value pairs are placed in the event's message header and the message content in the message body.
The data storage unit 402 buffers the data.
The OLAP processing unit 403 consumes the Events in the data storage unit 402. Data files are imported into OLAP using OLAP's own data file loading mechanism; data in stream form is first sliced into files according to rules such as time and size and then loaded. After loading the data, the OLAP processing unit 403 notifies the post-processing program.
In this embodiment, the OLAP consumption node differs from the HDFS consumption node in that the OLAP processing unit 403 transfers data to OLAP with tools such as a data loader.
It should be noted that the embodiments of the present invention can use the OLAP and HDFS data processing modes, and can also use consumption nodes with any other practicable processing mode, to meet business needs in different application environments; the embodiments of the present invention place no limitation on this.
For specific implementations, refer to the corresponding embodiments described above; details are not repeated here.
Fig. 5 is a schematic structural diagram of a data collection and processing device according to an embodiment of the present invention.
Referring to Fig. 5, the data collection and processing device comprises: a data collection node 10, data processing nodes 20, an HDFS consumption node 30, and an OLAP consumption node 40.
The data collection node 10 defines the data format, collects data, configures filters or interceptors as plug-ins, and filters and cleans the collected data according to the data format.
The data processing node 20 converts the filtered data and sends it to the corresponding consumption node.
In this embodiment, the HDFS consumption node 30 reads the converted data. Because the HDFS environment is isolated from other environments, the HDFS consumption node 30 can be provided as an independent node that transfers data across networks and supports multiple upload protocols.
In this embodiment, the OLAP consumption node 40 reads the converted data. The OLAP consumption node 40 can transfer data across networks and supports multiple upload protocols.
In this embodiment, the data processing nodes 20 correspond one-to-one with the HDFS consumption node 30 and the OLAP consumption node 40: different consumption nodes connect to different data processing nodes. The data processing nodes 20 are connected in parallel.
In the embodiments of the present invention, the data collection node 10, data processing node 20, HDFS consumption node 30, and OLAP consumption node 40 can be defined according to practical application needs, and nodes are connected to each other through interfaces. Each node defines a receiving interface, and the processing unit of the upstream node pushes data to the receiving interface of the next node. Functional logic is divided in a uniform way and interface types are unified; each node carries out a specific piece of functional logic, and transmitted data follows a predefined format, so nodes can be freely connected and combined.
In a specific implementation, by defining the data flow (DataFlow) of the nodes, nodes are freely combined in the manner of a workflow and deployed with reference to the actual network environment and the application scenario of each node. Data processing nodes 20 can be combined freely or omitted altogether.
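The free combination of nodes through a uniform receiving interface can be sketched in Python. The `Node` class, its `receive` and `connect` methods, and the two builder functions are illustrative assumptions; they model one layout in which each consumption node has its own processing node and one in which both consumption nodes share a processing node.

```python
from typing import Callable, List, Tuple


class Node:
    """Hypothetical node: applies its own logic, then pushes the result downstream."""

    def __init__(self, name: str, process: Callable[[object], object] = lambda data: data):
        self.name = name
        self.process = process
        self.downstream: List["Node"] = []
        self.received: List[object] = []  # kept only so the example can be inspected

    def connect(self, *nodes: "Node") -> "Node":
        """Attach downstream nodes; returns self so wiring calls can be chained."""
        self.downstream.extend(nodes)
        return self

    def receive(self, data: object) -> None:
        """Uniform receiving interface that every upstream node pushes to."""
        result = self.process(data)
        self.received.append(result)
        for node in self.downstream:
            node.receive(result)


def build_parallel() -> Tuple[Node, Node, Node]:
    """One processing node per consumption node."""
    collector, hdfs, olap = Node("collector"), Node("hdfs"), Node("olap")
    to_hdfs = Node("proc-hdfs", lambda data: f"file:{data}").connect(hdfs)
    to_olap = Node("proc-olap", lambda data: f"row:{data}").connect(olap)
    collector.connect(to_hdfs, to_olap)
    return collector, hdfs, olap


def build_shared() -> Tuple[Node, Node, Node]:
    """Both consumption nodes fed by the same processing node."""
    collector, hdfs, olap = Node("collector"), Node("hdfs"), Node("olap")
    shared = Node("proc", str.upper).connect(hdfs, olap)
    collector.connect(shared)
    return collector, hdfs, olap
```

Because every node exposes the same `receive` method, a processing node can be dropped by connecting the collector directly to a consumption node, which matches the option of not using a data processing node at all.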
The embodiments of the present invention can also provide Java Management Extensions (JMX) functionality, which lets UI monitoring tools such as Ganglia and Grafana monitor the data collection node 10, data processing node 20, HDFS consumption node 30, and OLAP consumption node 40 across the whole device. This enables monitoring of steps such as data collection, filtering, and processing, and makes it easier to adjust when problems occur.
It can be understood that the numbers of data processing nodes 20, HDFS consumption nodes 30, and OLAP consumption nodes 40 in the data collection and processing device can be adjusted to suit the actual application environment.
By deploying processing and loading on different nodes, the embodiments of the present invention make data transmission more convenient.
Fig. 6 is a schematic structural diagram of another data collection and processing device according to an embodiment of the present invention.
Referring to Fig. 6, the data collection and processing device comprises: a data collection node 10, data processing nodes 20, an HDFS consumption node 30, and an OLAP consumption node 40.
In this embodiment, the HDFS consumption node 30 and the OLAP consumption node 40 use the same data processing node 20. The data processing nodes 20 are connected in series.
It can be understood that the numbers of data processing nodes 20, HDFS consumption nodes 30, and OLAP consumption nodes 40 in the data collection and processing device can be adjusted to suit the actual application environment.
For specific implementations, refer to the corresponding embodiments described above; details are not repeated here.
In the data collection and processing device of the embodiments of the present invention, the data collection node, data processing nodes, and consumption nodes each have a storage unit that saves data in transit, so that when a node restarts because of external factors, the data from before the restart is recovered automatically, which ensures data safety during transmission. In addition, even when the network is abnormal, data transmission at each stage is not interrupted and no data is lost.
Fig. 7 is a flowchart of a data collection and processing method according to an embodiment of the present invention.
Referring to Fig. 7, the data collection and processing method comprises: step S701, defining a data format according to sample data and the format of the sample data.
Step S702, collecting first data.
Step S703, filtering and cleaning the first data according to the data format, rejecting dirty data to obtain second data.
Step S704, storing the second data.
Step S705, extracting the second data and processing it according to business logic.
The data collection and processing method further comprises performing data conversion on the second data and reading the converted second data. The converted second data is read in at least one of an HDFS mode and an OLAP mode.
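Steps S701 to S705 can be walked through end to end with a deliberately small Python sketch. It assumes comma-separated text records, reduces the data format to a field count, and uses an in-memory queue as the store; the function name `run_pipeline` and its parameters are hypothetical.

```python
from collections import deque
from typing import Callable, Iterable, List


def run_pipeline(sample: str, raw_lines: Iterable[str],
                 business_logic: Callable[[List[str]], object]) -> List[object]:
    """Toy walk-through of S701 to S705 on comma-separated text."""
    # S701: define the data format from the sample (here only the field count).
    field_count = len(sample.split(","))
    # S702: collect the first data.
    first_data = [line.strip() for line in raw_lines]
    # S703: filter and clean; lines that do not fit the format are dirty data.
    second_data = [line.split(",") for line in first_data
                   if line and len(line.split(",")) == field_count]
    # S704: store the second data in a buffer for later consumption.
    buffer = deque(second_data)
    # S705: extract the second data and process it according to business logic.
    results = []
    while buffer:
        results.append(business_logic(buffer.popleft()))
    return results
```

For instance, passing a `business_logic` that sums the numeric fields of each record produces one total per clean record, while blank and malformed lines never reach that step.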
For specific implementations, refer to the corresponding embodiments described above; details are not repeated here.
A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may include ROM, RAM, a magnetic disk, an optical disc, and the like.
Although the present invention is disclosed as above, it is not limited thereto. Any person skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention, so the scope of protection of the present invention shall be that defined by the claims.
Claims (11)
1. A data collection and processing method, characterized by comprising:
defining a data format according to sample data and the format of the sample data;
collecting first data;
filtering and cleaning the first data according to the data format to reject dirty data and obtain second data;
storing the second data;
extracting the second data and processing it according to business logic.
2. The data collection and processing method according to claim 1, characterized in that, after the second data is extracted and processed according to the business logic, the method further comprises: performing data conversion on the second data; and reading the converted second data.
3. The data collection and processing method according to claim 2, characterized in that the converted second data is read in at least one of an HDFS mode and an OLAP mode.
4. The data collection and processing method according to claim 1, characterized in that collecting the first data according to the data format comprises one or more of the following: collecting the first data from an uploaded user data file according to the data format; pulling the user data file from an uploaded FTP address and collecting the first data according to the data format; and receiving user data via Java Database Connectivity (JDBC), Open Database Connectivity (ODBC), the Hypertext Transfer Protocol (HTTP), Thrift, or a remote procedure call (RPC) protocol, and collecting the first data according to the data format.
5. The data collection and processing method according to claim 1, characterized in that filtering and cleaning the data and rejecting dirty data comprises: filtering, modifying, or deleting the first data by timestamp format conversion, character type conversion, host normalization, or regular-expression matching.
6. The data collection and processing method according to claim 1, characterized in that storing the second data comprises: encapsulating the second data as an event, and storing the event by in-memory storage, file storage, database storage, or Kafka storage.
7. The data collection and processing method according to claim 1, characterized in that the second data is transmitted and stored either as a data stream or at file level.
8. A data collection and processing device, characterized by comprising a data collection node, at least one data processing node, and a consumption node; the data collection node comprises:
a data format definition unit, adapted to define a data format according to uploaded sample data and the format of the sample data;
a data collection unit, adapted to collect first data;
a cleaning unit, adapted to filter and clean the first data according to the data format, rejecting dirty data, to obtain second data;
a storage unit, adapted to store the second data;
a processing unit, adapted to extract the second data and process it according to service logic.
9. The data collection and processing device according to claim 8, characterized in that the data processing node is adapted to perform data conversion on the second data, and the consumption node is adapted to read the converted second data.
10. The data collection and processing device according to claim 9, characterized in that the data processing nodes are connected in parallel or in series.
11. The data collection and processing device according to claim 9, characterized in that the consumption node reads the converted second data in at least one of an HDFS mode and an OLAP mode.
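As a rough illustration of claims 6 and 10, the following Python sketch wraps cleaned records in an event, buffers the events in an in-memory store, and runs processing nodes either in series or in parallel. Every name here (`Event`, `MemoryChannel`, `tag_source`, `bytes_to_kb`, `run_in_series`, `run_in_parallel`) is hypothetical; the claims do not prescribe these structures, and file, database, or Kafka storage would replace the in-memory queue in practice.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical envelope around one record of second data."""
    body: dict
    headers: dict = field(default_factory=dict)


class MemoryChannel:
    """In-memory event store; other storage modes could offer the same interface."""

    def __init__(self):
        self._q = queue.Queue()

    def put(self, event):
        self._q.put(event)

    def take_all(self):
        events = []
        while not self._q.empty():
            events.append(self._q.get())
        return events


def tag_source(event):
    """Example processing node: record where the event came from."""
    return Event(body=event.body, headers={**event.headers, "source": "upload"})


def bytes_to_kb(event):
    """Example processing node: data conversion from bytes to kilobytes."""
    body = dict(event.body)
    body["kb"] = body["bytes"] / 1024
    return Event(body=body, headers=event.headers)


def run_in_series(events, nodes):
    """Series connection: each node consumes the previous node's output."""
    for node in nodes:
        events = [node(e) for e in events]
    return events


def run_in_parallel(events, node, workers=2):
    """Parallel connection: several workers apply the same node to different events."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(node, events))  # map preserves input order
```

A consumption node would then read the output of `run_in_series` or `run_in_parallel`; how it does so in an HDFS or OLAP mode is outside the scope of this sketch.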
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510843845.1A CN105512201A (en) | 2015-11-26 | 2015-11-26 | Data collection and processing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510843845.1A CN105512201A (en) | 2015-11-26 | 2015-11-26 | Data collection and processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105512201A (en) | 2016-04-20 |
Family
ID=55720183
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510843845.1A Pending CN105512201A (en) | 2015-11-26 | 2015-11-26 | Data collection and processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105512201A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030217055A1 (en) * | 2002-05-20 | 2003-11-20 | Chang-Huang Lee | Efficient incremental method for data mining of a database |
CN101110858A (en) * | 2007-08-29 | 2008-01-23 | 中兴通讯股份有限公司 | Telecommunication report generation system and method thereof |
CN102349081A (en) * | 2009-02-10 | 2012-02-08 | 渣普控股有限公司 | Etl builder |
CN102880709A (en) * | 2012-09-28 | 2013-01-16 | 用友软件股份有限公司 | Data warehouse management system and data warehouse management method |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106156886A (en) * | 2016-06-30 | 2016-11-23 | 亿阳安全技术有限公司 | A kind of method and system based on business system Supplementing Data rule application flow |
CN106227862A (en) * | 2016-07-29 | 2016-12-14 | 浪潮软件集团有限公司 | E-commerce data integration method based on distribution |
CN106354772A (en) * | 2016-08-23 | 2017-01-25 | 成都卡莱博尔信息技术股份有限公司 | Mass data system with data cleaning function |
CN107247721A (en) * | 2017-04-24 | 2017-10-13 | 江苏曙光信息技术有限公司 | Visualize collecting method |
WO2018214599A1 (en) * | 2017-05-22 | 2018-11-29 | 平安科技(深圳)有限公司 | Scalable data reporting method and system, and storage medium |
CN107688535A (en) * | 2017-08-15 | 2018-02-13 | 武汉斗鱼网络科技有限公司 | A kind of mobile device APP performance data display methods and device |
CN108052574A (en) * | 2017-12-08 | 2018-05-18 | 南京中新赛克科技有限责任公司 | Slave ftp server based on Kafka technologies imports the ETL system and implementation method of mass data |
CN108829747A (en) * | 2018-05-24 | 2018-11-16 | 新华三大数据技术有限公司 | Data load method and device |
CN108984708A (en) * | 2018-07-06 | 2018-12-11 | 蔚来汽车有限公司 | Dirty data recognition methods and device, data cleaning method and device, controller |
CN108984708B (en) * | 2018-07-06 | 2022-02-01 | 蔚来(安徽)控股有限公司 | Dirty data identification method and device, data cleaning method and device, and controller |
CN111327649B (en) * | 2018-12-14 | 2023-04-21 | 中国电信股份有限公司 | Service data processing method, device, SMF, system and storage medium |
CN111327649A (en) * | 2018-12-14 | 2020-06-23 | 中国电信股份有限公司 | Service data processing method, device, SMF, system and storage medium |
CN110147391A (en) * | 2019-04-08 | 2019-08-20 | 顺丰速运有限公司 | Data handover method, system, equipment and storage medium |
CN111797149B (en) * | 2019-06-27 | 2023-01-31 | 厦门雅基软件有限公司 | Data acquisition method, device, equipment and computer readable storage medium |
CN111797149A (en) * | 2019-06-27 | 2020-10-20 | 厦门雅基软件有限公司 | Data acquisition method, device, equipment and computer readable storage medium |
CN110502491A (en) * | 2019-07-25 | 2019-11-26 | 北京神州泰岳智能数据技术有限公司 | A kind of Log Collect System and its data transmission method, device |
CN110515619A (en) * | 2019-08-09 | 2019-11-29 | 济南浪潮数据技术有限公司 | A kind of theme creation method, device, equipment and readable storage medium storing program for executing |
CN110659323A (en) * | 2019-09-26 | 2020-01-07 | 卓尔购信息科技(武汉)有限公司 | Real-time and off-line big data processing system, method, storage medium and terminal |
CN112019605A (en) * | 2020-08-13 | 2020-12-01 | 上海哔哩哔哩科技有限公司 | Data distribution method and system of data stream |
CN112019605B (en) * | 2020-08-13 | 2023-05-09 | 上海哔哩哔哩科技有限公司 | Data distribution method and system for data stream |
CN112347093A (en) * | 2020-11-05 | 2021-02-09 | 哈尔滨航天恒星数据系统科技有限公司 | Method for facilitating cleaning, integrating and storing of mass multi-source heterogeneous data |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN105512201A (en) | Data collection and processing method and device | |
CN103942210A (en) | Processing method, device and system of mass log information | |
CN104090891B (en) | Data processing method, Apparatus and system | |
CN105912609B (en) | A kind of data file processing method and device | |
US10061858B2 (en) | Method and apparatus for processing exploding data stream | |
CN104111996A (en) | Health insurance outpatient clinic big data extraction system and method based on hadoop platform | |
CN108847977A (en) | A kind of monitoring method of business datum, storage medium and server | |
CN105824744A (en) | Real-time log collection and analysis method on basis of B2B (Business to Business) platform | |
CN102780726A (en) | Log analysis method and log analysis system based on WEB platform | |
CN107818120A (en) | Data processing method and device based on big data | |
CN104778188A (en) | Distributed device log collection method | |
CN103970788A (en) | Webpage-crawling-based crawler technology | |
CN105303456A (en) | Method for processing monitoring data of electric power transmission equipment | |
CN102508908A (en) | Method for acquiring subordinate financial business data and system for acquiring subordinate financial business data | |
CN112118174A (en) | Software defined data gateway | |
CN105447146A (en) | Massive data collecting and exchanging system and method | |
CN103577482A (en) | Web page collecting method and device as well as browser | |
CN106330963A (en) | Cross-network multi-node log collecting method | |
CN111400393A (en) | Data processing method and device based on multi-application platform and storage medium | |
CN113312428A (en) | Multi-source heterogeneous training data fusion method, device and equipment | |
CN109831316A (en) | Massive logs real-time analyzer, real-time analysis method and readable storage medium storing program for executing | |
CN114201540A (en) | Industrial multi-source data acquisition and storage system | |
CN104281980A (en) | Remote diagnosis method and system for thermal generator set based on distributed calculation | |
CN105159820A (en) | Transmission method and device of system log data | |
CN108090186A (en) | A kind of electric power data De-weight method on big data platform |
Legal Events
Date | Code | Title | Description |
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160420 |