CN109145040A

CN109145040A - Data governance method based on dual message queues

Info

Publication number: CN109145040A
Application number: CN201810687548.6A
Authority: CN
Inventors: 张宝华; 程国艮
Original assignee: Chinese Translation Language Through Polytron Technologies Inc
Current assignee: Chinese Translation Language Through Polytron Technologies Inc
Priority date: 2018-06-28
Filing date: 2018-06-28
Publication date: 2019-01-04

Abstract

The present invention discloses a data governance method based on dual message queues, comprising the following steps: 1) data is ingested from a user data source through a data access tool; 2) the ingested data is stored in a message queue; 3) data is extracted from the message queue for preprocessing operations such as cleaning; 4) the preprocessed data is stored in a message queue again; 5) the various data governance programs each extract data from the message queue, govern it, and store the governance results in the message queue again; 6) the last governance program extracts data from the message queue and, after completing governance, stores the results in a result database for use by subsequent processes. The method inserts a message queue both before and after data governance, buffering the data before governance and the data after governance. This realizes stream processing of the data and optimizes the data processing pipeline as a whole in terms of reliability, availability, scalability, data security, and performance.

Description

A data governance method based on dual message queues

Technical field

The invention belongs to the technical fields of distributed computing and data processing, and in particular relates to a data governance method based on dual message queues.

Background art

Data governance is the process of reading data from one storage medium, passing it through a series of governance stages, and storing it in another storage medium. For large data volumes there are two traditional approaches: one reads the data sequentially in a single thread and then writes it sequentially to the target storage medium; the other reads the data in parallel according to some rules and then writes it in parallel to the target storage. Both approaches, however, suffer from the following problems during governance:

1. Data governance is delayed: both methods read and write either in batches or on a schedule, so neither achieves real-time reading and writing, and neither suits business scenarios with strict real-time requirements;

2. Governance stages cannot be traced: program interruptions or problems that arise during governance cannot be tracked, so when a problem occurs the data can only be governed again from the start;

3. Data governance performance is low: the traditional approaches cannot govern large batches of data in real time, are prone to bottlenecks, and scale poorly;

4. Data security is low: when data cannot be governed in real time because of other factors, there is a risk of data loss.

A "message" is a unit of data transmitted between two computers. A message can be very simple, for example containing only a text string, or more complex, possibly containing embedded objects.

Messages are sent to queues. A "message queue" is a container that holds messages while they are in transit. The message queue manager acts as an intermediary when relaying a message from its source to its destination. The main purpose of a queue is to provide routing and to guarantee delivery: if the recipient is unavailable when a message is sent, the queue retains the message until it can be delivered successfully.
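The retention behavior described above can be illustrated with a minimal sketch. It uses Python's standard `queue.Queue` as an in-memory stand-in for a message queue; the function names `send` and `receive_all` are hypothetical and are not taken from the patent or from any broker API.

```python
import queue

# In-memory stand-in for a message queue; a real deployment would use a broker.
mq = queue.Queue()

def send(message: str) -> None:
    """Producer side: enqueue without waiting for any receiver."""
    mq.put(message)

def receive_all() -> list:
    """Consumer side: drain whatever has accumulated, in arrival order."""
    received = []
    while True:
        try:
            received.append(mq.get_nowait())
        except queue.Empty:
            return received

# The receiver is "unavailable" while these are sent; the queue retains them.
send("plain text message")
send('{"type": "embedded", "payload": [1, 2, 3]}')

# The receiver comes online later and still gets every message.
delivered = receive_all()
```

The sender never blocks on the receiver, which is the decoupling the rest of the method relies on.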

Kafka is an open-source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all of the action stream data of a consumer-scale website. Such actions (page views, searches, and other user actions) are a key ingredient of many social features on the modern web. Because of throughput requirements, this data is usually handled through log processing and log aggregation. For log data and offline analysis systems such as Hadoop that nevertheless require real-time processing, Kafka is a feasible solution. The purpose of Kafka is to unify online and offline message processing through Hadoop's parallel loading mechanism, and also to provide real-time messages through a cluster.

The present invention provides a data governance method based on dual message queues. By passing data through two message queues, it realizes stream processing of the data and optimizes the data processing pipeline as a whole in terms of reliability, availability, scalability, data security, and performance.

Summary of the invention

To solve the problems of existing data governance methods, such as delay, lack of traceability, and low security, the present invention provides a data governance method based on dual message queues. The method inserts a message queue both before and after data governance, buffering the data before governance and the data after governance. This realizes stream processing of the data and optimizes the data processing pipeline as a whole in terms of reliability, availability, scalability, data security, and performance.

To achieve the above objective, the invention adopts the following technical solution:

A data governance method based on dual message queues: after data is accessed from a data source, it is placed in a message queue; preprocessing operations such as cleaning are then performed on the data in the message queue; the preprocessed data is then, on the one hand, stored in a database as a backup and, on the other hand, stored in a message queue again for consumption by data governance tools.
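As a rough illustration of this two-queue arrangement, the sketch below uses `collections.deque` objects as stand-ins for the two message queues and a plain list as a stand-in for the backup database. The names `raw_queue`, `clean_queue`, `backup_store`, and the whitespace-trimming `preprocess` function are assumptions made for the example, not details from the patent.

```python
from collections import deque

raw_queue = deque()      # first queue: data as accessed from the source
clean_queue = deque()    # second queue: preprocessed data for governance tools
backup_store = []        # stand-in for the backup database

def preprocess(record: dict) -> dict:
    """Hypothetical cleaning step: trim whitespace in every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def ingest(records) -> None:
    """Place accessed data in the first queue."""
    raw_queue.extend(records)

def run_preprocessing() -> None:
    """Drain the first queue; back up each cleaned record and re-queue it."""
    while raw_queue:
        cleaned = preprocess(raw_queue.popleft())
        backup_store.append(cleaned)   # path 1: backup copy in the database
        clean_queue.append(cleaned)    # path 2: hand-off to governance tools

ingest([{"url": " http://example.com/a ", "title": " Hello "}])
run_preprocessing()
```

Because every cleaned record lands in both places, a governance tool that fails downstream can be rerun from either the second queue or the backup.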

A data governance method based on dual message queues, comprising the following steps:

1) data is ingested from a user data source through a data access tool;

2) the ingested data is stored in a message queue;

3) a data governance program extracts data from the message queue and performs preprocessing operations such as cleaning;

4) the data governance program stores the preprocessed data in a message queue again;

5) the various data governance programs each extract data from the message queue, govern it, and then store the governance results in the message queue again;

6) the last governance program extracts data from the message queue and, after completing governance, stores the results in a result database for use by subsequent processes.
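The six steps can be sketched as a chain of stages, each reading from one queue and writing to the next, with the last stage writing to a result store. This is a minimal in-process sketch using `queue.Queue` and threads in place of a real broker; the programs `clean`, `flag_ascii`, and `count_words` and the helper `start_stage` are invented for illustration.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the stream

def start_stage(fn, inbox, emit, on_stop=None):
    """Start a thread that applies fn to each message from inbox and emits the result."""
    def worker():
        while True:
            item = inbox.get()
            if item is STOP:
                if on_stop is not None:
                    on_stop()          # pass the shutdown signal downstream
                return
            emit(fn(item))
    thread = threading.Thread(target=worker)
    thread.start()
    return thread

# Hypothetical programs, invented for illustration.
def clean(rec):
    return {**rec, "text": rec["text"].strip()}

def flag_ascii(rec):
    return {**rec, "ascii": rec["text"].isascii()}

def count_words(rec):
    return {**rec, "words": len(rec["text"].split())}

def run_pipeline(source_records):
    q_raw, q_clean, q_governed = queue.Queue(), queue.Queue(), queue.Queue()
    result_db = []  # stand-in for the result database
    threads = [
        # steps 3-4: preprocess, then store in a queue again
        start_stage(clean, q_raw, q_clean.put, lambda: q_clean.put(STOP)),
        # step 5: a governance program reads a queue and writes a queue
        start_stage(flag_ascii, q_clean, q_governed.put, lambda: q_governed.put(STOP)),
        # step 6: the last program writes to the result database
        start_stage(count_words, q_governed, result_db.append),
    ]
    for rec in source_records:   # steps 1-2: access data and enqueue it
        q_raw.put(rec)
    q_raw.put(STOP)
    for t in threads:
        t.join()
    return result_db

results = run_pipeline([{"text": "  hello world "}, {"text": "café au lait"}])
```

Each stage starts working as soon as its inbox has a message, so records flow through the chain without waiting for a whole batch to finish.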

Preferably, the message queue is a Kafka cluster.

Preferably, after the data is accessed in step 1), the data is converted to a standard data format. Standardization includes field verification, completion of missing fields, and data attribution.
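One possible reading of these three standardization operations is sketched below. The field names, the `REQUIRED_TYPES` and `DEFAULTS` tables, and the `source_id` tag are all assumptions chosen for the example; the patent does not specify a schema.

```python
# Assumed schema for illustration only.
REQUIRED_TYPES = {"url": str, "content": str, "publish_time": str}
DEFAULTS = {"author": "", "title": ""}

def standardize(record: dict, source_id: str) -> dict:
    """Verify field types, fill in absent fields, and tag the record's origin."""
    # Field verification: each expected field must be present with the right type.
    for name, expected in REQUIRED_TYPES.items():
        if not isinstance(record.get(name), expected):
            raise ValueError(f"field {name!r} missing or not {expected.__name__}")
    # Completion: absent fields receive a default value.
    out = {**DEFAULTS, **record}
    # Data attribution: record which source the data belongs to.
    out["source_id"] = source_id
    return out

example = standardize(
    {"url": "http://example.com/a", "content": "body", "publish_time": "2018-06-28 10:00:00"},
    source_id="news-site-1",
)
```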

Preferably, in step 3), data cleaning distinguishes between news data and social data. Cleaning of news data includes: data format verification and completion, URL deduplication, garbled-text detection and filtering, language identification, domain name analysis and filling, handling of scripts such as JS in the text content, time validity verification, and handling of missing body content, missing URL, missing publication time, and missing author. A missing field is handled differently depending on whether it is a required or an optional attribute: when a required field is missing, the record is discarded directly to an error file; when an optional field is missing, it is filled with an empty value. Among the above fields, body content, URL, and publication time are required fields, and author is an optional field.
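Three of the listed cleaning operations (URL deduplication, script removal, and time validity verification) can be sketched as follows. The timestamp format, the regular expression, and the function names are assumptions for illustration; a production cleaner would likely use an HTML parser and a persistent deduplication store.

```python
import re
from datetime import datetime

# Matches <script ...> ... </script> blocks, case-insensitively, across lines.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)

def strip_scripts(text: str) -> str:
    """Remove embedded script blocks from the text content."""
    return SCRIPT_RE.sub("", text)

def is_valid_time(value: str, now: datetime) -> bool:
    """Accept 'YYYY-MM-DD HH:MM:SS' timestamps that parse and are not in the future."""
    try:
        return datetime.strptime(value, "%Y-%m-%d %H:%M:%S") <= now
    except ValueError:
        return False

def clean_batch(records, now: datetime):
    """Drop duplicate URLs and invalid timestamps; strip scripts from the rest."""
    seen_urls = set()
    kept = []
    for rec in records:
        if rec["url"] in seen_urls:
            continue  # URL deduplication
        if not is_valid_time(rec["publish_time"], now):
            continue  # time validity verification
        seen_urls.add(rec["url"])
        kept.append({**rec, "content": strip_scripts(rec["content"])})
    return kept

batch = [
    {"url": "http://a", "publish_time": "2018-06-01 08:00:00",
     "content": "Hello <script>alert(1)</script>world"},
    {"url": "http://a", "publish_time": "2018-06-01 08:00:00", "content": "dup"},
    {"url": "http://b", "publish_time": "2999-01-01 00:00:00", "content": "future"},
    {"url": "http://c", "publish_time": "not a date", "content": "bad"},
]
cleaned = clean_batch(batch, now=datetime(2018, 6, 28))
```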

Cleaning of social data includes: data format verification and completion, URL deduplication, garbled-text detection and filtering, language identification, domain name analysis and filling, handling of scripts such as JS in the text content, time validity verification, and handling of missing body content, missing URL, missing publication time, missing author, and so on. The cleaning process differs by media type. For example, the author field is required for social media but optional for news media; social media has fields such as follower count, following count, and comment count, which news media does not have; and news media has a media level field, which social media does not have.
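The rule for missing fields, combined with the per-media-type differences, might look like the sketch below. The field names are hypothetical, and treating the social counters and the news media level as optional is an assumption; the text only states which fields each media type has. A list stands in for the error file.

```python
# Assumed field sets; only the required/optional status of content, URL,
# publication time, and author is stated in the text.
REQUIRED = {
    "news": {"content", "url", "publish_time"},
    "social": {"content", "url", "publish_time", "author"},
}
OPTIONAL = {
    "news": {"author", "media_level"},
    "social": {"followers", "following", "comments"},
}

def handle_missing(record: dict, media_type: str, error_file: list):
    """Return the completed record, or None if it was discarded to the error file."""
    missing = [f for f in sorted(REQUIRED[media_type]) if not record.get(f)]
    if missing:
        # A required field is missing: discard the record to the error file.
        error_file.append({"record": record, "missing": missing})
        return None
    out = dict(record)
    for field in OPTIONAL[media_type]:
        if out.get(field) is None:
            out[field] = ""   # An optional field is missing: fill with empty.
    return out

errors = []
base = {"content": "text", "url": "http://example.com/p", "publish_time": "2018-06-28 10:00:00"}
news_result = handle_missing(base, "news", errors)
social_result = handle_missing(base, "social", errors)
```

The same record passes as news but is rejected as social data, because only the social schema requires an author.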

The advantages and beneficial effects of the present invention are:

1) Reliability: after the acquisition layer accesses the data, the data is processed and counted in real time through the message queue. The message queue caches 7 days of data, so that any problem arising during data processing can be actively investigated and the cached data can be governed a second time, which guarantees data reliability;

2) Availability: as the daily volume of new data in the data center grows rapidly, the pressure on the underlying system keeps increasing; adopting the distributed architecture of the message queue increases system availability;

3) Scalability: the data center's governance flow needs to run more than 10 algorithms, and the algorithms are further adjusted over multiple iterations. This requires a highly flexible governance flow in which algorithm steps can be adjusted in a plug-in manner. A message queue architecture makes the code easier to refactor and strengthens the program's scalability;

4) Data security: the message queue architecture makes it easier to track the state of the data, so that data is safer during processing; because the message queue is a distributed cluster, the safety of cached data is better guaranteed;

5) Performance: a distributed message queue scales well and can provide the data flow performance required by the data center.
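Two of these points, the 7-day cache that permits a second round of governance and the plug-in adjustment of the algorithm flow, can be sketched together. The registry, the decorator, and the two toy algorithms below are hypothetical. In Kafka, a seven-day window would typically be expressed through a retention setting such as the topic-level `retention.ms`; here it is modeled with a plain `timedelta`.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List

RETENTION = timedelta(days=7)                      # cache window stated in the text
ALGORITHMS: Dict[str, Callable[[dict], dict]] = {} # hypothetical plug-in registry

def algorithm(name: str):
    """Decorator that registers a governance algorithm under a name."""
    def register(fn):
        ALGORITHMS[name] = fn
        return fn
    return register

@algorithm("lowercase_title")
def lowercase_title(rec: dict) -> dict:
    return {**rec, "title": rec["title"].lower()}

@algorithm("title_length")
def title_length(rec: dict) -> dict:
    return {**rec, "title_length": len(rec["title"])}

def govern(rec: dict, flow: List[str]) -> dict:
    """Apply the named algorithms in order; the flow itself is plain configuration."""
    for name in flow:
        rec = ALGORITHMS[name](rec)
    return rec

def regovern_cached(cache, flow: List[str], now: datetime) -> List[dict]:
    """Replay cached (timestamp, record) pairs that are still inside the retention window."""
    return [govern(rec, flow) for ts, rec in cache if now - ts <= RETENTION]

cache = [
    (datetime(2018, 7, 1), {"title": "Old"}),          # 9 days old: expired
    (datetime(2018, 7, 8), {"title": "Dual Queues"}),  # 2 days old: replayable
]
replayed = regovern_cached(cache, ["lowercase_title", "title_length"], now=datetime(2018, 7, 10))
```

Changing the flow means editing a list of names rather than the stage code, which is one way to read the "plug-in" adjustment described above.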

Brief description of the drawings

Figure 1 is a process flow diagram of the data governance method based on dual message queues according to the present invention.

Specific embodiment

The invention is further described below with reference to an embodiment.

Embodiment

Refer to Figure 1.

1) data is ingested from a user data source through a data access tool;

2) the ingested data is stored in the message queue, a Kafka cluster;

4) the data governance program loads the preprocessed data into the HBase raw data store and at the same time commits the data to the Kafka message queue again;

5) the various data governance programs each extract data from the message queue, govern it, and then store the governance results in the message queue (the Kafka cluster) again;

6) the last governance program extracts data from the message queue and, after completing governance, stores the results in the result stores HBase, ElasticSearch, and RabbitMQ for use by subsequent processes.
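The final step can be pictured as a fan-out from the last queue to several result stores. The sketch below assumes that every result is written to every store, which the text does not state explicitly, and uses a hypothetical in-memory `MemorySink` class in place of real HBase, ElasticSearch, or RabbitMQ clients.

```python
from collections import deque

class MemorySink:
    """In-memory stand-in for a result store; not a real client for any system."""
    def __init__(self, name: str):
        self.name = name
        self.rows = []

    def write(self, record: dict) -> None:
        self.rows.append(record)

def store_results(final_queue: deque, sinks) -> int:
    """Drain the last queue and write each governed record to every sink."""
    written = 0
    while final_queue:
        record = final_queue.popleft()
        for sink in sinks:
            sink.write(record)
        written += 1
    return written

sinks = [MemorySink("hbase"), MemorySink("elasticsearch"), MemorySink("rabbitmq")]
governed = deque([{"id": 1, "score": 0.9}, {"id": 2, "score": 0.4}])
count = store_results(governed, sinks)
```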

The user data sources in step 1) include data from news websites, data from social media, and the like.

Finally, it should be noted that the above embodiment is merely an example given to illustrate the present invention clearly and is not a limitation on its implementation. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Obvious changes or variations derived from the above remain within the scope of protection of the present invention.

Claims

1. A data governance method based on dual message queues, characterized in that: after data is accessed from a data source, the data is placed in a message queue; a preprocessing operation is then performed on the data in the message queue; the preprocessed data is then, on the one hand, stored in a database as a backup and, on the other hand, stored in a message queue again for consumption by data governance tools.

2. The data governance method based on dual message queues according to claim 1, characterized in that the method comprises the following steps:

1) data is ingested from a user data source through a data access tool;

2) the ingested data is stored in a message queue;

3) a data governance program extracts data from the message queue and performs a preprocessing operation;

3. The data governance method based on dual message queues according to claim 1 or 2, characterized in that: the message queue is a Kafka cluster.

4. The data governance method based on dual message queues according to claim 2, characterized in that: after the data is accessed in step 1), the data is converted to a standard data format, and standardization includes field verification, completion of missing fields, and data attribution.

5. The data governance method based on dual message queues according to claim 2, characterized in that: in step 3), the preprocessing is data cleaning, which distinguishes between news data and social data; the cleaning of data includes: data format verification and completion, URL deduplication, garbled-text detection and filtering, language identification, domain name analysis and filling, handling of JS scripts in the text content, time validity verification, and handling of missing body content, missing URL, missing publication time, and missing author.

6. The data governance method based on dual message queues according to claim 5, characterized in that: a missing field is handled differently depending on whether it is a required attribute or an optional attribute; if a required field is missing, the record is discarded directly to an error file; if an optional field is missing, it is filled with an empty value.

7. The data governance method based on dual message queues according to claim 6, characterized in that: among the fields, body content, URL, and publication time are required fields, and author is an optional field.