CN109145040A - A data governance method based on dual message queues - Google Patents

A data governance method based on dual message queues

Info

Publication number
CN109145040A
Authority
CN
China
Prior art keywords
data
message queue
stored
message
method based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810687548.6A
Other languages
Chinese (zh)
Inventor
张宝华
程国艮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese Translation Language Through Polytron Technologies Inc
Original Assignee
Chinese Translation Language Through Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese Translation Language Through Polytron Technologies Inc filed Critical Chinese Translation Language Through Polytron Technologies Inc
Priority to CN201810687548.6A priority Critical patent/CN109145040A/en
Publication of CN109145040A publication Critical patent/CN109145040A/en
Pending legal-status Critical Current


Abstract

The present invention discloses a data governance method based on dual message queues, comprising the following steps: 1) ingesting data from user data sources through a data access tool; 2) storing the ingested data in a message queue; 3) extracting data from the message queue and performing preprocessing operations such as cleaning; 4) storing the preprocessed data in a message queue again; 5) having the various data governance programs each extract data from the message queue, govern it, and store the governance results in the message queue again; 6) having the last governance program extract data from the message queue and, after governance is complete, store the results in a result database for use by subsequent processes. The method inserts a message queue both before and after data governance to buffer the data before governance and the data after governance. This realizes stream processing of the data and optimizes the data processing pipeline as a whole in terms of reliability, availability, scalability, data security, and performance.

Description

A data governance method based on dual message queues
Technical field
The invention belongs to the technical fields of distributed computing and data processing, and in particular relates to a data governance method based on dual message queues.
Background art
Data governance is the process of reading data from one storage medium, passing it through a series of governance stages, and storing it in another storage medium. For large data volumes there are two traditional methods: one reads the data sequentially in a single thread and then writes it sequentially to the target storage medium; the other reads the data in parallel according to some rules and then writes it to the target storage in parallel. Both methods, however, have the following problems during governance:
1. Data governance is delayed: both methods read and write either in batches or on a schedule, so neither achieves real-time reading and writing, and neither suits business scenarios with strict real-time requirements;
2. Governance stages cannot be traced: program interruptions or problems that arise during governance cannot be tracked, so when a problem occurs the data can only be governed again from the start;
3. Data governance performance is low: the traditional methods cannot govern large batches of data in real time, easily produce bottlenecks, and scale poorly;
4. Data security is low: when data cannot be governed in real time because of other factors, there is a risk of data loss.
A "message" is a unit of data transmitted between two computers. A message can be very simple, for example containing only a text string, or more complex, possibly containing embedded objects.
Messages are sent to queues. A "message queue" is a container that holds messages while they are in transit. The message queue manager acts as an intermediary when relaying a message from its source to its target. The main purpose of a queue is to provide routing and to guarantee message delivery: if the recipient is unavailable when a message is sent, the message queue retains the message until it can be delivered successfully.
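The retention behaviour described above can be illustrated with a minimal sketch. It uses Python's standard `queue.Queue` as an in-memory stand-in for a message queue; the message contents are made up for illustration, and a real deployment would use a broker rather than an in-process queue.

```python
import queue

# In-memory stand-in for a message queue (an assumption for illustration).
mq = queue.Queue()

# The sender enqueues messages while no receiver is running.
mq.put("plain text message")                      # a simple text-only message
mq.put({"title": "report", "body": "embedded"})   # a message carrying an object

# Later the receiver becomes available and drains the queue in FIFO order.
received = []
while not mq.empty():
    received.append(mq.get())
```

The messages stay in the queue until the receiver reads them, which is the property the method relies on for buffering.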
Kafka is an open-source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action stream data of a consumer-scale website. Such actions (page views, searches, and other user actions) are a key factor in many social functions on the modern web. Because of throughput requirements, this data is usually handled through log processing and log aggregation. For systems such as Hadoop that handle log data and offline analysis but must also meet real-time processing constraints, Kafka is a feasible solution. The purpose of Kafka is to unify online and offline message processing through Hadoop's parallel loading mechanism, and also to provide real-time messages through a cluster.
The present invention implements a data governance method based on dual message queues. By passing data through dual message queues, it realizes stream processing of the data and optimizes the data processing pipeline as a whole in terms of reliability, availability, scalability, data security, and performance.
Summary of the invention
To solve the problems of existing data governance methods, such as delay, lack of traceability, and low security, the present invention provides a data governance method based on dual message queues. The method inserts a message queue both before and after data governance to buffer the data before governance and the data after governance. This realizes stream processing of the data and optimizes the data processing pipeline as a whole in terms of reliability, availability, scalability, data security, and performance.
To achieve the above objective, the invention adopts the following technical solution:
A data governance method based on dual message queues, in which, after data is ingested from a data source, the data is placed in a message queue; preprocessing operations such as cleaning are then performed on the data in the message queue; the preprocessed data is then, on the one hand, stored in a database as a backup and, on the other hand, stored in a message queue again for consumption by the data governance tools.
A data governance method based on dual message queues, the method comprising the following steps:
1) ingesting data from user data sources through a data access tool;
2) storing the ingested data in a message queue;
3) the data governance program extracts data from the message queue and performs preprocessing operations such as cleaning;
4) the data governance program stores the preprocessed data in a message queue again;
5) the various data governance programs each extract data from the message queue and govern it, then store the governance results in the message queue again;
6) the last governance program extracts data from the message queue and, after governance is complete, stores the results in a result database for use by subsequent processes.
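The six steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `queue.Queue` stands in for the message queues, a plain list stands in for the result database, and the function names (`ingest`, `preprocess`, `govern`), the `text` field, and the cleaning rule are all hypothetical.

```python
import queue

def ingest(source, raw_queue):
    """Steps 1-2: read records from the data source and enqueue them."""
    for record in source:
        raw_queue.put(record)

def preprocess(raw_queue, clean_queue):
    """Steps 3-4: dequeue raw records, clean them, enqueue the cleaned records."""
    while not raw_queue.empty():
        record = raw_queue.get()
        text = record.get("text", "").strip()
        if text:  # hypothetical rule: drop records with no usable text
            clean_queue.put({**record, "text": text})

def govern(clean_queue, stages, result_db):
    """Steps 5-6: pass records queue-to-queue through each stage, then store results."""
    current = clean_queue
    for stage in stages:
        next_queue = queue.Queue()
        while not current.empty():
            next_queue.put(stage(current.get()))
        current = next_queue
    while not current.empty():
        result_db.append(current.get())

# Usage: one governance stage that annotates each record with its text length.
raw_q, clean_q, results = queue.Queue(), queue.Queue(), []
ingest([{"text": "  hello "}, {"text": "   "}], raw_q)
preprocess(raw_q, clean_q)
govern(clean_q, [lambda r: {**r, "length": len(r["text"])}], results)
```

Because each stage only reads from one queue and writes to the next, a stage can be added, removed, or replaced by editing the `stages` list without touching the others.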
Preferably, the message queue is a Kafka cluster.
Preferably, after the data is ingested in step 1), the data is converted to a standard data format.
Standard data formatting includes: field verification, field completion, and data attribution.
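The source does not define the three formatting operations in detail, so the following is only a sketch of one plausible reading: field verification rejects unknown field names, completion fills absent fields with defaults, and data attribution tags each record with its source. The field list and the `source` key are assumptions.

```python
# Hypothetical standard schema: field name -> default value used for completion.
EXPECTED_FIELDS = {
    "url": "",
    "title": "",
    "content": "",
    "author": "",
    "publish_time": "",
}

def standardize(record, source_name):
    """Verify field names, fill in absent fields, and tag the record with its source."""
    unknown = set(record) - set(EXPECTED_FIELDS)
    if unknown:  # field verification
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    standardized = {**EXPECTED_FIELDS, **record}  # completion with defaults
    standardized["source"] = source_name          # data attribution
    return standardized
```

For example, `standardize({"url": "http://example.com/a"}, "news")` returns a record with every expected field present and `source` set to `"news"`.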
Preferably, in step 3), data cleaning distinguishes between news data and social data. Cleaning of news data includes: data format verification and completion, URL deduplication, identification and filtering of garbled text, language identification, domain name analysis and filling, handling of scripts such as JS in the text content, time validity checks, and handling of missing body content, missing URL, missing publication time, and missing author. Missing fields are handled differently depending on whether the field is required or optional: if a required field is missing, the record is discarded directly to an error file; if an optional field is missing, it is filled with an empty value. Among these fields, body content, URL, and publication time are required, and author is optional.
Cleaning of social data includes: data format verification and completion, URL deduplication, identification and filtering of garbled text, language identification, domain name analysis and filling, handling of scripts such as JS in the text content, time validity checks, and handling of missing body content, missing URL, missing publication time, and missing author. The cleaning process differs by media type. For example, the author field is required for social media but need not be filled in for news media; social media has fields such as follower count, following count, and comment count, which news media lacks; news media has a media level field, which social media lacks.
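The required/optional rule and URL deduplication described above can be sketched as follows. The required and optional field sets follow the text (author is optional for news, required for social media); the field names, the `clean` function, and the use of a list as the "error file" are assumptions for illustration, and the other cleaning steps (language identification, script handling, and so on) are omitted.

```python
# Required and optional fields per media type, following the description.
REQUIRED = {
    "news": {"content", "url", "publish_time"},
    "social": {"content", "url", "publish_time", "author"},
}
OPTIONAL = {"news": {"author"}, "social": set()}

def clean(records, media_type):
    """Return (kept, rejected) after missing-field handling and URL deduplication."""
    kept, rejected, seen_urls = [], [], set()
    for record in records:
        missing = [f for f in REQUIRED[media_type] if not record.get(f)]
        if missing:
            rejected.append(record)    # stands in for writing to the error file
            continue
        if record["url"] in seen_urls: # URL deduplication
            continue
        seen_urls.add(record["url"])
        cleaned = dict(record)
        for field in OPTIONAL[media_type]:
            if not cleaned.get(field):
                cleaned[field] = ""    # missing optional field is filled as empty
        kept.append(cleaned)
    return kept, rejected
```

Keeping the per-media-type rules in data (`REQUIRED`, `OPTIONAL`) rather than in branches makes it straightforward to add the media-specific fields the text mentions, such as follower count or media level.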
The advantages and benefits of the present invention are:
1) Reliability: after the acquisition layer ingests data, the data is processed in real time through the message queue. The message queue caches seven days of data, so that any problem arising during data processing can be investigated proactively and the cached data can be governed a second time, which guarantees data reliability;
2) Availability: as the data center's daily data volume grows rapidly, the pressure on the underlying system keeps increasing; the distributed architecture of the message queue increases the availability of the system;
3) Scalability: the data center's governance pipeline must run more than ten algorithms, and the algorithms are still adjusted over successive iterations, so the pipeline must be highly flexible and allow algorithm stages to be adjusted like plug-ins. A message queue architecture makes the code easier to refactor and improves the scalability of the program;
4) Data security: the message queue architecture makes it easier to track the state of the data, so data is safer during processing; the message queue is a distributed cluster, which better safeguards the cached data;
5) Performance: the distributed message queue scales well and can provide the data transfer performance the data center requires.
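The seven-day cache mentioned under reliability can be sketched as a retention window over an append-only log. The source does not say how the retention is configured; in Kafka it would presumably correspond to the topic-level `retention.ms` setting, which is an assumption here. The `RetainedLog` class below is a hypothetical in-memory stand-in, not a Kafka API.

```python
from datetime import timedelta

RETENTION = timedelta(days=7)
RETENTION_MS = int(RETENTION.total_seconds() * 1000)  # seven days in milliseconds

class RetainedLog:
    """Append-only log that keeps messages for a retention window so they can be re-read."""

    def __init__(self, retention_ms=RETENTION_MS):
        self.retention_ms = retention_ms
        self.entries = []  # list of (timestamp_ms, message)

    def append(self, timestamp_ms, message):
        self.entries.append((timestamp_ms, message))

    def expire(self, now_ms):
        """Drop messages older than the retention window."""
        cutoff = now_ms - self.retention_ms
        self.entries = [(t, m) for t, m in self.entries if t >= cutoff]

    def replay(self, from_ms):
        """Re-read retained messages from a point in time, e.g. for a second governance pass."""
        return [m for t, m in self.entries if t >= from_ms]
```

Unlike a queue that deletes a message once it is read, a retained log lets a governance program re-consume earlier data after a failure, which is what makes the secondary governance pass possible.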
Brief description of the drawings
Figure 1 is the process flow diagram of the data governance method based on dual message queues according to the present invention.
Detailed description of embodiments
The invention is further described below with reference to an embodiment.
Embodiment
See Figure 1.
A data governance method based on dual message queues, the method comprising the following steps:
1) ingesting data from user data sources through a data access tool;
2) storing the ingested data in the message queue, a Kafka cluster;
3) the data governance program extracts data from the message queue and performs preprocessing operations such as cleaning;
4) the data governance program writes the preprocessed data to the raw-data store in HBase and at the same time submits the data to the Kafka message queue again;
5) the various data governance programs each extract data from the message queue and govern it, then store the governance results in the message queue (Kafka cluster) again;
6) the last governance program extracts data from the message queue and, after governance is complete, stores the results in the result stores HBase, ElasticSearch, and RabbitMQ for use by subsequent processes.
The user data sources in step 1) include data from news websites, data from social media, and the like.
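Two details of this embodiment are the dual write in step 4 (backup to the raw-data store plus resubmission to the queue) and the fan-out to several result stores in step 6. The sketch below illustrates both with in-memory stand-ins: `ListSink` is a hypothetical placeholder for stores such as HBase or ElasticSearch, `queue.Queue` stands in for Kafka topics, and none of the names are real client APIs.

```python
import queue

class ListSink:
    """In-memory stand-in for a store such as HBase or ElasticSearch."""

    def __init__(self, name):
        self.name = name
        self.rows = []

    def write(self, record):
        self.rows.append(record)

def preprocess_and_forward(raw_topic, raw_store, governance_topic, clean):
    """Step 4: persist each preprocessed record as a backup and resubmit it to the queue."""
    while not raw_topic.empty():
        record = clean(raw_topic.get())
        raw_store.write(record)        # backup copy in the raw-data store
        governance_topic.put(record)   # second queue feeds the governance programs

def store_results(final_topic, sinks):
    """Step 6: write every governed record to each result sink."""
    while not final_topic.empty():
        record = final_topic.get()
        for sink in sinks:
            sink.write(record)

# Usage with a trivial cleaning function that strips whitespace.
raw_topic, governance_topic = queue.Queue(), queue.Queue()
raw_topic.put(" breaking news ")
raw_store = ListSink("hbase_raw")
preprocess_and_forward(raw_topic, raw_store, governance_topic, str.strip)

sinks = [ListSink("hbase_result"), ListSink("elasticsearch")]
store_results(governance_topic, sinks)  # governance stages omitted for brevity
```

In this sketch the backup copy and the queued copy are identical, so the raw-data store can serve as a recovery point if a downstream governance program fails.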
Finally, it should be noted that the above embodiment is merely an example given to illustrate the present invention clearly and is not a limitation on the embodiments. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Obvious changes or variations derived from this description remain within the protection scope of the present invention.

Claims (7)

1. A data governance method based on dual message queues, characterized in that: after data is ingested from a data source, the data is placed in a message queue; preprocessing operations are then performed on the data in the message queue; the preprocessed data is then, on the one hand, stored in a database as a backup and, on the other hand, stored in a message queue again for consumption by the data governance tools.
2. The data governance method based on dual message queues according to claim 1, characterized in that the method comprises the following steps:
1) ingesting data from user data sources through a data access tool;
2) storing the ingested data in a message queue;
3) the data governance program extracts data from the message queue and performs preprocessing operations;
4) the data governance program stores the preprocessed data in a message queue again;
5) the various data governance programs each extract data from the message queue and govern it, then store the governance results in the message queue again;
6) the last governance program extracts data from the message queue and, after governance is complete, stores the results in a result database for use by subsequent processes.
3. The data governance method based on dual message queues according to claim 1 or 2, characterized in that: the message queue is a Kafka cluster.
4. The data governance method based on dual message queues according to claim 2, characterized in that: after the data is ingested in step 1), the data is converted to a standard data format, and standard data formatting includes: field verification, field completion, and data attribution.
5. The data governance method based on dual message queues according to claim 2, characterized in that: in step 3), the preprocessing is data cleaning that distinguishes between news data and social data, and the cleaning of the data includes: data format verification and completion, URL deduplication, identification and filtering of garbled text, language identification, domain name analysis and filling, handling of JS scripts in the text content, time validity checks, and handling of missing body content, missing URL, missing publication time, and missing author.
6. The data governance method based on dual message queues according to claim 5, characterized in that: a missing field is handled differently depending on whether it is a required field or an optional field; if a required field is missing, the record is discarded directly to an error file, and if an optional field is missing, it is filled with an empty value.
7. The data governance method based on dual message queues according to claim 6, characterized in that: among the fields, body content, URL, and publication time are required fields, and author is an optional field.
CN201810687548.6A 2018-06-28 2018-06-28 A kind of data administering method based on double message queues Pending CN109145040A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810687548.6A CN109145040A (en) 2018-06-28 2018-06-28 A kind of data administering method based on double message queues

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810687548.6A CN109145040A (en) 2018-06-28 2018-06-28 A kind of data administering method based on double message queues

Publications (1)

Publication Number Publication Date
CN109145040A true CN109145040A (en) 2019-01-04

Family

ID=64802532

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810687548.6A Pending CN109145040A (en) 2018-06-28 2018-06-28 A kind of data administering method based on double message queues

Country Status (1)

Country Link
CN (1) CN109145040A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160321308A1 (en) * 2015-05-01 2016-11-03 Ebay Inc. Constructing a data adaptor in an enterprise server data ingestion environment
CN105681397A (en) * 2015-12-30 2016-06-15 曙光信息产业(北京)有限公司 Network traffic data storage method and system, query method and device
CN107294801A (en) * 2016-12-30 2017-10-24 江苏号百信息服务有限公司 Stream Processing method and system based on magnanimity real-time Internet DPI data

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110457153A (en) * 2019-07-18 2019-11-15 北京顺丰同城科技有限公司 Data check processing method and processing device
CN110955645A (en) * 2019-10-10 2020-04-03 望海康信(北京)科技股份公司 Big data integration processing method and system
CN110955645B (en) * 2019-10-10 2022-10-11 望海康信(北京)科技股份公司 Big data integration processing method and system
CN111858569A (en) * 2020-07-01 2020-10-30 长江岩土工程总公司(武汉) Mass data cleaning method based on stream computing
CN112308431A (en) * 2020-11-03 2021-02-02 平安普惠企业管理有限公司 Big data index management method, device, equipment and storage medium
CN112308431B (en) * 2020-11-03 2023-11-21 北京国联视讯信息技术股份有限公司 Big data index management method, device, equipment and storage medium
CN112579326A (en) * 2020-12-29 2021-03-30 北京五八信息技术有限公司 Offline data processing method and device, electronic equipment and computer readable medium
CN113031878A (en) * 2021-05-20 2021-06-25 睿至科技集团有限公司 HBase-based data storage optimization method and system
CN113031878B (en) * 2021-05-20 2021-08-06 睿至科技集团有限公司 HBase-based data storage optimization method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190104