CN109145040A - Data governance method based on dual message queues - Google Patents
- Publication number: CN109145040A
- Application number: CN201810687548.6A
- Authority: CN (China)
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a data governance method based on dual message queues, comprising the following steps: 1) data is read from a user data source through a data access tool; 2) the accessed data is stored in a message queue; 3) data is extracted from the message queue for preprocessing operations such as cleaning; 4) the preprocessed data is stored in the message queue again; 5) the various data governance programs each extract data from the message queue, govern it, and store the governance result back in the message queue; 6) the last governance program extracts data from the message queue and, after completing governance, stores the result in a result database for use by subsequent processes. The method inserts a message queue both before and after data governance to buffer the data before governance and the data after governance, realizes stream processing of the data, and optimizes the data processing pipeline as a whole in terms of reliability, availability, elasticity, data security, and performance.
Description
Technical field
The invention belongs to the technical field of distributed computing and data processing, and specifically relates to a data governance method based on dual message queues.
Background art
Data governance is the process of reading data from one storage medium, passing it through a series of governance stages, and then storing it in another storage medium. There are two traditional ways to govern large volumes of data: one reads the data sequentially in a single thread and then writes it sequentially to the target storage medium; the other reads the data in parallel according to some rules and then writes it in parallel to the target storage. Both approaches have the following problems during governance:
1. Data governance is delayed: both methods read and write either in batches or on a timer, so neither achieves real-time reading and writing, and neither suits business scenarios with strict real-time requirements;
2. Governance stages cannot be traced: a program interruption or problem that arises during governance cannot be traced, and when a problem occurs the data can only be governed again from the start;
3. Data governance performance is low: the traditional approaches cannot govern large batches of data in real time, are prone to bottlenecks, and scale poorly;
4. Data security is low: if data cannot be governed in real time because of other factors, there is a risk of data loss.
A "message" is a unit of data transmitted between two computers. A message can be very simple, for example containing only a text string; it can also be more complex and may contain embedded objects.
Messages are sent to queues. A "message queue" is a container that holds messages while they are in transit. The message queue manager acts as an intermediary when relaying a message from its source to its destination. The main purpose of a queue is to provide routing and to guarantee delivery of messages; if the recipient is unavailable when a message is sent, the message queue retains the message until it can be delivered successfully.
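The retain-until-delivered behavior described above can be illustrated with a minimal in-memory sketch. The class `RetainingQueue`, the `receiver` callback, and the `online` flag are hypothetical names chosen for illustration; they are not part of the patent or of any message queue product.

```python
from collections import deque


class RetainingQueue:
    """Toy in-memory queue that keeps a message until delivery succeeds."""

    def __init__(self):
        self._pending = deque()

    def send(self, message):
        # The sender returns immediately; the queue holds the message.
        self._pending.append(message)

    def deliver(self, receiver):
        """Hand pending messages to `receiver`; return how many got through."""
        delivered = 0
        while self._pending:
            message = self._pending[0]
            if not receiver(message):    # receiver unavailable: keep the message
                break
            self._pending.popleft()      # remove only after successful delivery
            delivered += 1
        return delivered

    def __len__(self):
        return len(self._pending)


inbox = []
online = {"up": False}


def receiver(message):
    # Accept a message only while the recipient is available.
    if not online["up"]:
        return False
    inbox.append(message)
    return True


q = RetainingQueue()
q.send("hello")                          # a simple text message
q.send({"text": "world"})                # a message with an embedded object
first_attempt = q.deliver(receiver)      # recipient is down, messages are retained
online["up"] = True
second_attempt = q.deliver(receiver)     # recipient is back, messages arrive in order
```

A production queue would also persist messages and handle concurrent consumers; this sketch only shows the retention rule.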
Kafka is an open-source stream processing platform maintained by the Apache Software Foundation and written in Scala and Java. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all of the activity stream data of a consumer-scale website. Such activity (page views, searches, and other user actions) is a key ingredient of many social features on the modern web. Because of throughput requirements, this data is usually handled by processing logs and by log aggregation. Kafka is a feasible solution for log data and offline analysis systems such as Hadoop that nevertheless need real-time processing. The purpose of Kafka is to unify online and offline message processing through Hadoop's parallel loading mechanism, and also to provide real-time messages through a cluster.
The present invention implements a data governance method based on dual message queues. Passing data through the two message queues realizes stream processing of the data and optimizes the data processing pipeline as a whole in terms of reliability, availability, elasticity, data security, and performance.
Summary of the invention
To solve the problems of existing data governance methods, such as delay, lack of traceability, and low security, the present invention provides a data governance method based on dual message queues. The method inserts a message queue both before and after data governance to buffer the data before governance and the data after governance, realizes stream processing of the data, and optimizes the data processing pipeline as a whole in terms of reliability, availability, elasticity, data security, and performance.
To achieve the above objective, the invention adopts the following technical solution:
A data governance method based on dual message queues: after data is accessed from a data source, the data is placed in a message queue; preprocessing operations such as cleaning are then performed on the data in the message queue; the preprocessed data is then, on the one hand, stored in a database as a backup and, on the other hand, stored in the message queue again for consumption by data governance tools.
A data governance method based on dual message queues, the method comprising the following steps:
1) data is read from a user data source through a data access tool;
2) the accessed data is stored in a message queue;
3) a data governance program extracts data from the message queue and performs preprocessing operations such as cleaning;
4) the data governance program stores the preprocessed data in the message queue again;
5) the various data governance programs each extract data from the message queue, govern it, and then store the governance result in the message queue again;
6) the last governance program extracts data from the message queue and, after completing governance, stores the governance result in a result database for use by subsequent processes.
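The six steps above can be sketched end to end with in-memory queues standing in for the message queue. This is a minimal single-threaded illustration under stated assumptions, not the patented implementation: the function names, the record shape (`id`, `text`), and the use of Python's `queue.Queue` in place of a distributed queue are all hypothetical.

```python
import queue


def ingest(source, raw_q):
    """Steps 1-2: read records from the source and put them on the first queue."""
    for record in source:
        raw_q.put(record)


def preprocess(raw_q, clean_q):
    """Steps 3-4: pull raw records, clean them, put them back on a queue."""
    while not raw_q.empty():
        record = raw_q.get()
        text = record.get("text", "").strip()
        if text:                          # drop records with no usable text
            clean_q.put({**record, "text": text})


def govern(in_q, out_q, step):
    """Step 5: one governance program reads a queue and writes its result back."""
    while not in_q.empty():
        out_q.put(step(in_q.get()))


def store(in_q, result_db):
    """Step 6: the last program writes governed records to the result store."""
    while not in_q.empty():
        result_db.append(in_q.get())


source = [
    {"id": 1, "text": "  Hello "},
    {"id": 2, "text": "   "},
    {"id": 3, "text": "World"},
]
raw_q, clean_q, governed_q = queue.Queue(), queue.Queue(), queue.Queue()
result_db = []

ingest(source, raw_q)
preprocess(raw_q, clean_q)
govern(clean_q, governed_q, lambda r: {**r, "length": len(r["text"])})
store(governed_q, result_db)
```

Because each stage only talks to a queue, a further governance program can be added by chaining another `govern` call between two queues without touching the other stages.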
Preferably, the message queue is a Kafka cluster.
Preferably, after the data is accessed in step 1), the data format is standardized. Data format standardization includes field verification, completion, and data attribution.
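One possible reading of the three standardization operations is sketched below. The field names in `STANDARD_FIELDS`, the `source` key, and the interpretation of "data attribution" as tagging each record with the source it came from are assumptions made for illustration; the patent does not define them.

```python
# Hypothetical standard record shape; the patent does not list the fields.
STANDARD_FIELDS = {
    "url": "",
    "title": "",
    "content": "",
    "author": "",
    "publish_time": "",
}


def standardize(record, source_name):
    """Return a copy of `record` in the standard shape, or raise if invalid."""
    # Field verification: reject non-dict input and unknown field names.
    if not isinstance(record, dict):
        raise TypeError("record must be a dict")
    unknown = set(record) - set(STANDARD_FIELDS)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    # Completion: fields absent from the record get their default value.
    standardized = {**STANDARD_FIELDS, **record}
    # Data attribution (assumed meaning): record which source the data came from.
    standardized["source"] = source_name
    return standardized


example = standardize({"url": "http://example.com/a", "content": "body"}, "news")
```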
Preferably, in step 3), data cleaning distinguishes between news data and social data. The cleaning of news data includes: data format verification and completion, URL de-duplication, identification and filtering of garbled text, language identification, domain name analysis and filling, handling of scripts such as js in the text content, time validity verification, and handling of missing body text, missing URL, missing publication time, and missing author. A missing field is handled differently depending on whether it is a required attribute or an optional attribute: if a required field is missing, the record is discarded directly to an error file; if an optional field is missing, it is filled with an empty value. Of the above fields, body text, URL, and publication time are required fields, and author is an optional field.
The cleaning of social data includes: data format verification and completion, URL de-duplication, identification and filtering of garbled text, language identification, domain name analysis and filling, handling of scripts such as js in the text content, time validity verification, and handling of missing body text, missing URL, missing publication time, missing author, and so on. The cleaning process differs by media type. For example, the author field is a required field for social media but an optional field for news media; social media has fields such as number of followers, number of accounts followed, and number of comments, which news media does not have; news media has a media level field, which social media does not have.
The advantages and beneficial effects of the invention are:
1) Reliability: after the acquisition layer accesses data, the data is processed and tallied in real time through the message queue. The message queue caches 7 days of data, so any problem that occurs during data processing can be traced back, and the cached data can be governed a second time, which ensures data reliability;
2) Availability: as the data center's daily data volume grows rapidly, the pressure on the lower layers of the system keeps increasing; the distributed architecture of the message queue increases the availability of the system;
3) Elasticity: the data center's governance flow needs to run more than 10 algorithms, and the algorithms are still adjusted over successive iterations, so the governance flow must be highly elastic and allow algorithm stages to be adjusted like plug-ins. A message queue architecture makes the code easier to refactor and strengthens the elasticity of the program;
4) Data security: a message queue architecture makes it easier to track the state of the data, so the data is safer during processing; the message queue is a distributed cluster, which better safeguards the cached data;
5) Performance: the distributed message queue scales well and can provide the data flow performance that the data center requires.
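The reliability argument rests on the queue retaining data for 7 days so that it can be read again. A minimal sketch of that idea is an append-only log read by offset, with a retention window; the class `RetainedLog` and its methods are hypothetical and only illustrate the principle, not how any particular message queue implements it.

```python
import time

RETENTION_SECONDS = 7 * 24 * 3600    # 7 days, the retention period named in the text


class RetainedLog:
    """Append-only log: consumers read by offset, so old data can be re-read."""

    def __init__(self, retention=RETENTION_SECONDS):
        self.retention = retention
        self._entries = []            # list of (offset, timestamp, message)
        self._next_offset = 0

    def append(self, message, now=None):
        now = time.time() if now is None else now
        self._entries.append((self._next_offset, now, message))
        self._next_offset += 1
        return self._next_offset - 1

    def expire(self, now=None):
        """Drop entries older than the retention period."""
        now = time.time() if now is None else now
        self._entries = [e for e in self._entries if now - e[1] <= self.retention]

    def read_from(self, offset):
        """Return all retained messages at or after `offset` (used for replay)."""
        return [m for (o, _, m) in self._entries if o >= offset]


DAY = 24 * 3600
log = RetainedLog()
log.append("a", now=0)
log.append("b", now=5 * DAY)
log.append("c", now=6 * DAY)
replay = log.read_from(1)             # re-govern everything from offset 1 onward
log.expire(now=8 * DAY)               # "a" is now 8 days old and falls out of retention
after_expiry = log.read_from(0)
```

Reading does not remove entries, which is what allows a failed governance run to be repeated from an earlier offset while the data is still inside the retention window.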
Brief description of the drawings
Figure 1 is a flow chart of the data governance method based on dual message queues according to the invention.
Detailed description of the embodiments
The invention is further described below with reference to an embodiment.
Embodiment
See Figure 1.
A data governance method based on dual message queues, the method comprising the following steps:
1) data is read from a user data source through a data access tool;
2) the accessed data is stored in the message queue, a Kafka cluster;
3) a data governance program extracts data from the message queue and performs preprocessing operations such as cleaning;
4) the data governance program loads the preprocessed data into the raw HBase store and, at the same time, submits the data to the Kafka message queue again;
5) the various data governance programs each extract data from the message queue, govern it, and then store the governance result in the message queue (the Kafka cluster) again;
6) the last governance program extracts data from the message queue and, after completing governance, stores the governance result in the result database (HBase, Elasticsearch, RabbitMQ) for use by subsequent processes.
The user data source in step 1) includes data from news websites, data from social media, and the like.
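Two details of this embodiment differ from the generic steps: step 4 writes each preprocessed record both to a raw store and back to the queue, and step 6 writes results to several stores. The sketch below illustrates both with plain Python lists standing in for HBase, Elasticsearch, and RabbitMQ and `queue.Queue` standing in for Kafka; no client library for those systems is used, and all function names are hypothetical.

```python
import queue


def preprocess_and_backup(raw_q, raw_store, clean_q, clean_fn):
    """Step 4: persist each preprocessed record and republish it to the queue."""
    while not raw_q.empty():
        record = clean_fn(raw_q.get())
        raw_store.append(record)      # stands in for the raw HBase store
        clean_q.put(record)           # stands in for the Kafka topic read in step 5


def fan_out(final_q, sinks):
    """Step 6: write each governed record to every result sink."""
    while not final_q.empty():
        record = final_q.get()
        for sink in sinks.values():
            sink.append(record)


raw_q, clean_q = queue.Queue(), queue.Queue()
raw_q.put({"id": 1, "text": " a "})
raw_store = []
preprocess_and_backup(raw_q, raw_store, clean_q,
                      lambda r: {**r, "text": r["text"].strip()})

# Step 5 (the governance programs) is skipped here for brevity.
sinks = {"hbase": [], "elasticsearch": [], "rabbitmq": []}
fan_out(clean_q, sinks)
```

The backup written in step 4 means that downstream governance can be repeated from the raw store even after the queue's own retention window has passed.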
Finally, it should be noted that the above embodiment is merely an example given to illustrate the invention clearly and does not limit the embodiments. A person of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively. Obvious changes or variations derived from the above remain within the scope of protection of the invention.
Claims (7)
1. A data governance method based on dual message queues, characterized in that: after data is accessed from a data source, the data is placed in a message queue; a preprocessing operation is then performed on the data in the message queue; the preprocessed data is then, on the one hand, stored in a database as a backup and, on the other hand, stored in the message queue again for consumption by data governance tools.
2. The data governance method based on dual message queues according to claim 1, characterized in that the method comprises the following steps:
1) data is read from a user data source through a data access tool;
2) the accessed data is stored in a message queue;
3) a data governance program extracts data from the message queue and performs a preprocessing operation;
4) the data governance program stores the preprocessed data in the message queue again;
5) the various data governance programs each extract data from the message queue, govern it, and then store the governance result in the message queue again;
6) the last governance program extracts data from the message queue and, after completing governance, stores the governance result in a result database for use by subsequent processes.
3. The data governance method based on dual message queues according to claim 1 or 2, characterized in that: the message queue is a Kafka cluster.
4. The data governance method based on dual message queues according to claim 2, characterized in that: after the data is accessed in step 1), the data format is standardized; data format standardization includes field verification, completion, and data attribution.
5. The data governance method based on dual message queues according to claim 2, characterized in that: in step 3), the preprocessing is data cleaning, which distinguishes between news data and social data; the cleaning of data includes: data format verification and completion, URL de-duplication, identification and filtering of garbled text, language identification, domain name analysis and filling, handling of js scripts in the text content, time validity verification, and handling of missing body text, missing URL, missing publication time, and missing author.
6. The data governance method based on dual message queues according to claim 5, characterized in that: a missing field is handled differently depending on whether it is a required attribute or an optional attribute; if a required field is missing, the record is discarded directly to an error file; if an optional field is missing, it is filled with an empty value.
7. The data governance method based on dual message queues according to claim 6, characterized in that: among the fields, body text, URL, and publication time are required fields, and author is an optional field.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810687548.6A CN109145040A (en) | 2018-06-28 | 2018-06-28 | Data governance method based on dual message queues |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109145040A true CN109145040A (en) | 2019-01-04 |
Family
ID=64802532
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810687548.6A Pending CN109145040A (en) | 2018-06-28 | 2018-06-28 | Data governance method based on dual message queues |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109145040A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190104 |