CN112131209A - Hive-based Flume data verification statistical method and device - Google Patents
- Publication number
- CN112131209A (application number CN202010923083.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- G06F16/2282—Tablespace storage structures; Management thereof
- G06F16/2365—Ensuring data consistency and integrity

(All under G06F16/20—Information retrieval of structured data, e.g. relational data.)
Abstract
The invention discloses a Hive-based Flume data verification statistical method and device. The method comprises the following steps: initializing Hive to obtain the configuration parameters of Hive Sink and the metadata information of the fields in the Hive table; acquiring from the Channel the data to be verified, which was collected by Source and packaged as Events; verifying, based on the configuration parameters, whether each Event of the data to be verified is error data, and storing each Event determined to be error data in an error data record file; determining, based on the configuration parameters, whether to enable the dirty data filtering mode; in response to the dirty data filtering mode being enabled, segmenting each Event that is not error data into fields, verifying, based on the configuration parameters, whether each field is dirty data that does not match the metadata information, and storing each Event with a field determined to be dirty data in a dirty data record file; and writing each Event recorded in neither the error data record file nor the dirty data record file to the Hive table. The invention avoids interruption of the collection process and data loss, reduces resource occupation, improves data quality, and facilitates data analysis.
Description
Technical Field
The present invention relates to the field of data transmission, and more particularly, to a Hive-based Flume data verification statistical method and device.
Background
Hive is a Hadoop-based data warehouse tool used for data extraction, transformation and loading; it provides a mechanism for storing, querying and analyzing large-scale data stored in Hadoop. In the actual data writing process, if the number of fields obtained by splitting a record does not match the number of Hive table fields, or the separator differs from the configured one, the record cannot be written to the Hive table; the program then retries endlessly, and subsequent data cannot be written either. Meanwhile, data collection does not stop, so a large amount of cached data accumulates and occupies substantial resources; the process returns to normal only after a restart, and a restart in turn loses the cached data. When the type of a collected field is inconsistent with the configured field type in the Hive table (for example, the ID field in the Hive table is of Int type while the collected field is of Double type), the data can still be written, but the field in the table will be empty, producing dirty data that interferes with subsequent data analysis.
Aiming at the problems of process interruption, resource occupation, data loss and difficulty of data analysis caused by writing error data or dirty data into a Hive table in the prior art, no effective solution is currently available.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a method and an apparatus for Flume data verification statistics based on Hive, which can avoid interruption of an acquisition process and data loss, reduce resource occupation, improve data quality, and facilitate data analysis.
Based on the above purpose, a first aspect of the embodiments of the present invention provides a Hive-based Flume data verification statistical method, comprising the following steps:
initializing Hive to obtain the configuration parameters of Hive Sink and the metadata information of the fields in the Hive table;
acquiring from the Channel the data to be verified, which was collected by Source and packaged as Events;
verifying, based on the configuration parameters, whether each Event of the data to be verified is error data, and storing each Event determined to be error data in an error data record file;
determining, based on the configuration parameters, whether to enable the dirty data filtering mode;
in response to the dirty data filtering mode being enabled, segmenting each Event that is not error data into fields, verifying, based on the configuration parameters, whether each field is dirty data that does not match the metadata information, and storing each Event with a field determined to be dirty data in a dirty data record file;
writing each Event recorded in neither the error data record file nor the dirty data record file to the Hive table.
In some embodiments, the configuration parameters are stored in Flume's configuration file, and the Hive table is stored in a Hive database.
In some embodiments, the configuration parameters include information of a Hive table, information of an error data record file, information of a dirty data record file, information of whether a dirty data filtering mode is on, and information of a delimiter.
In some embodiments, verifying whether each Event of the data to be verified is error data based on the configuration parameters, and storing each Event determined to be error data in an error data record file, comprises:
checking, based on the separator information, whether the number of separators in each Event equals the number of fields in the Hive table minus one;
determining each Event whose number of separators is not equal to the number of fields in the Hive table minus one to be error data;
storing the error data in the error data record file based on the information of the error data record file.
In some embodiments, the metadata information includes the number and type of fields in the Hive table.
In some embodiments, segmenting each Event that is not error data into fields, verifying, based on the configuration parameters, whether each field is dirty data that does not match the metadata information, and storing each Event with a field determined to be dirty data in a dirty data record file, comprises:
segmenting each Event into fields based on the separator information;
verifying, based on the metadata information, whether each field matches the type of the corresponding field in the Hive table;
storing each Event with a field determined to be dirty data in the dirty data record file based on the information of the dirty data record file.
In some embodiments, the method further comprises: in response to storing an Event determined to be error data in the error data record file, incrementing the error data accumulator by one and writing its value to the error data record file.
In some embodiments, the method further comprises: in response to storing an Event with a field determined to be dirty data in the dirty data record file, incrementing the dirty data accumulator by one and writing its value to the dirty data record file.
In some embodiments, the method further comprises: in response to writing each Event recorded in neither the error data record file nor the dirty data record file to the Hive table, incrementing the correct data accumulator by one and writing its value to the statistics file.
In view of the above object, a second aspect of the embodiments of the present invention provides a Hive-based Flume data verification statistical apparatus, comprising:
a processor; and
a memory storing program code executable by the processor, the program code when executed performing the steps of:
initializing Hive to obtain the configuration parameters of Hive Sink and the metadata information of the fields in the Hive table;
acquiring from the Channel the data to be verified, which was collected by Source and packaged as Events;
verifying, based on the configuration parameters, whether each Event of the data to be verified is error data, and storing each Event determined to be error data in an error data record file;
determining, based on the configuration parameters, whether to enable the dirty data filtering mode;
in response to the dirty data filtering mode being enabled, segmenting each Event that is not error data into fields, verifying, based on the configuration parameters, whether each field is dirty data that does not match the metadata information, and storing each Event with a field determined to be dirty data in a dirty data record file;
writing each Event recorded in neither the error data record file nor the dirty data record file to the Hive table.
The invention has the following beneficial technical effects: the Hive-based Flume data verification statistical method and device provided by the embodiments of the present invention initialize Hive to obtain the configuration parameters of Hive Sink and the metadata information of the fields in the Hive table; acquire from the Channel the data to be verified, which was collected by Source and packaged as Events; verify, based on the configuration parameters, whether each Event of the data to be verified is error data, and store each Event determined to be error data in an error data record file; determine, based on the configuration parameters, whether to enable the dirty data filtering mode; in response to the dirty data filtering mode being enabled, segment each Event that is not error data into fields, verify, based on the configuration parameters, whether each field is dirty data that does not match the metadata information, and store each Event with a field determined to be dirty data in a dirty data record file; and write each Event recorded in neither the error data record file nor the dirty data record file to the Hive table. This technical scheme avoids interruption of the collection process and data loss, reduces resource occupation, improves data quality, and facilitates data analysis.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a Hive-based Flume data verification statistical method provided by the invention;
FIG. 2 is a schematic diagram of a Flume data acquisition structure of the method for verifying and counting Flume data based on Hive according to the present invention;
FIG. 3 is a detailed flowchart of an embodiment of a Hive-based Flume data verification statistical method provided by the present invention;
FIG. 4 is a detailed flowchart of an embodiment of a Hive-based Flume data verification statistical method provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters with the same name. "First" and "second" are merely for convenience of description and should not be construed as limitations on the embodiments of the present invention; subsequent embodiments do not describe them further.
Based on the above purpose, a first aspect of the embodiments of the present invention provides an embodiment of a Hive-based Flume data verification statistical method, which can avoid interruption of an acquisition process and data loss, reduce resource occupation, improve data quality, and facilitate data analysis. FIG. 1 is a flow chart diagram of a Hive-based Flume data verification statistical method provided by the invention.
The method for Flume data verification statistics based on Hive is shown in fig. 1, and comprises the following steps:
step S101: initializing Hive to obtain the configuration parameters of Hive Sink and the metadata information of the fields in the Hive table;
step S103: acquiring from the Channel the data to be verified, which was collected by Source and packaged as Events;
step S105: verifying, based on the configuration parameters, whether each Event of the data to be verified is error data, and storing each Event determined to be error data in an error data record file;
step S107: determining, based on the configuration parameters, whether to enable the dirty data filtering mode;
step S109: in response to the dirty data filtering mode being enabled, segmenting each Event that is not error data into fields, verifying, based on the configuration parameters, whether each field is dirty data that does not match the metadata information, and storing each Event with a field determined to be dirty data in a dirty data record file;
step S111: writing each Event recorded in neither the error data record file nor the dirty data record file to the Hive table.
Before data is written to Hive, the metadata of the target table is obtained through Hive SQL, and the data to be written is verified and counted against that metadata; error data and dirty data are intercepted, recorded, and written to local files, which facilitates tracing and statistics. This avoids the problem that one piece of error data prevents subsequently collected correct data from being written and causes data loss; it also avoids the problem that one malformed record causes a large amount of data to accumulate in the Channel and occupy substantial host resources; and intercepting dirty data also reduces the pressure and cost of subsequent data computation.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware, and that the program can be stored in a computer-readable storage medium; when executed, it can include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the corresponding method embodiments described above.
In some embodiments, the configuration parameters are stored in Flume's configuration file, and the Hive table is stored in a Hive database.
In some embodiments, the configuration parameters include information of a Hive table, information of an error data record file, information of a dirty data record file, information of whether a dirty data filtering mode is on, and information of a delimiter.
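For concreteness, such parameters might appear in a Flume properties file roughly as follows. The first block mirrors the stock Flume Hive Sink options; the keys in the second block (and the table name and file paths) are hypothetical names invented here for the extensions this document describes, not real Flume options:

```properties
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore = thrift://metastore-host:9083
a1.sinks.k1.hive.database = default
a1.sinks.k1.hive.table = user_info
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = ","

# Hypothetical extensions for this method:
a1.sinks.k1.errorDataFile = /var/log/flume/error_data.log
a1.sinks.k1.dirtyDataFile = /var/log/flume/dirty_data.log
a1.sinks.k1.statisticsFile = /var/log/flume/statistics.log
a1.sinks.k1.dirtyDataFilter = true
```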
In some embodiments, verifying whether each Event of the data to be verified is error data based on the configuration parameters, and storing each Event determined to be error data in an error data record file, comprises:
checking, based on the separator information, whether the number of separators in each Event equals the number of fields in the Hive table minus one;
determining each Event whose number of separators is not equal to the number of fields in the Hive table minus one to be error data;
storing the error data in the error data record file based on the information of the error data record file.
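As a rough illustration of this separator-count check (the actual sink would be a Java Flume component; the Python below is only a logic sketch, and the function name and parameters are invented for illustration): an Event is error data when its separator count differs from the table's field count minus one.

```python
def is_error_data(event_body: str, separator: str, num_hive_fields: int) -> bool:
    """Return True when the Event body cannot line up with the Hive
    table's columns: a row with N delimited fields contains exactly
    N - 1 separators, so any other count marks the Event as error data."""
    return event_body.count(separator) != num_hive_fields - 1

# A table with fields (ID, name, age) expects exactly 2 commas.
print(is_error_data("1,zhangsan,18", ",", 3))  # False: valid row
print(is_error_data("1,zhangsan", ",", 3))     # True: a field is missing
```

Events flagged this way would go to the error data record file instead of the Hive table.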
In some embodiments, the metadata information includes the number and type of fields in the Hive table.
In some embodiments, segmenting each Event that is not error data into fields, verifying, based on the configuration parameters, whether each field is dirty data that does not match the metadata information, and storing each Event with a field determined to be dirty data in a dirty data record file, comprises:
segmenting each Event into fields based on the separator information;
verifying, based on the metadata information, whether each field matches the type of the corresponding field in the Hive table;
storing each Event with a field determined to be dirty data in the dirty data record file based on the information of the dirty data record file.
In some embodiments, the method further comprises: in response to storing an Event determined to be error data in the error data record file, incrementing the error data accumulator by one and writing its value to the error data record file.
In some embodiments, the method further comprises: in response to storing an Event with a field determined to be dirty data in the dirty data record file, incrementing the dirty data accumulator by one and writing its value to the dirty data record file.
In some embodiments, the method further comprises: in response to writing each Event recorded in neither the error data record file nor the dirty data record file to the Hive table, incrementing the correct data accumulator by one and writing its value to the statistics file.
Flume is a widely used data collection tool, applied in many scenarios. As shown in fig. 2, in the Flume data collection architecture, Source can collect data from multiple data sources such as log files, network ports and Kafka, package it into Events, and write them to the Channel. After the data is successfully written to the Channel, Sink actively pulls the data from the Channel and writes it to big data components such as HDFS, HBase, Hive and ES. The Channel is a passive buffer responsible for temporarily storing data. The Source, Sink and Channel modules together form a Flume Agent process; Source and Sink run as two threads in the Agent, respectively responsible for collecting data from the data source and writing data to the destination. The configuration of Source, Sink and Channel in the Flume Agent is obtained by reading Flume's configuration file.
A conventional Hive Sink obtains Events from the Channel, serializes the data, and writes it to the Hive table through the Hive API. The serialization scheme used here is the delimited scheme: the serialized data is divided by a designated separator, which may be configured as a comma, a semicolon, and so on. For example, with a comma as the separator, collected data may take the form "1,zhangsan,18", and it is written to the Hive table by mapping to the fields "ID", "name" and "age".
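The comma-delimited example can be sketched as follows (illustrative Python; the real serialization happens inside the Java sink):

```python
row = "1,zhangsan,18"              # Event body with "," as the separator
columns = ["ID", "name", "age"]    # fields of the target Hive table
record = dict(zip(columns, row.split(",")))
print(record)  # {'ID': '1', 'name': 'zhangsan', 'age': '18'}
```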
Before Sink is initialized, the number of fields in the Hive table and their corresponding types are queried through Hive SQL to obtain the metadata information. When data is acquired, it is format-checked: first verify that the data's separator is consistent with the configuration file, then verify that the number of split fields equals the number of Hive table fields. If not, the data is judged to be error data: the error data accumulator is incremented by one, the data is written to the error data file specified by the configuration file, and the accumulator value is updated in the record file. If the check passes, the configuration information determines whether dirty data can be tolerated. If it can, the data is written to the Hive table, the successful-write accumulator is incremented by one, and its value is updated in the record file. If it cannot, each field is parsed according to the separator and its type is compared with the corresponding field type in the Hive table. If any field does not match, the data is judged to be dirty data: it is written to the dirty data file, the dirty data accumulator is incremented by one, and its value is updated in the record file. If every field matches, the data is written to the Hive table; after a successful write, the successful-write accumulator is incremented by one and its value is updated in the record file.
The flow of the above method is detailed in fig. 3. Data whose number of fields differs from the number of Hive table fields is defined as error data, and data whose field types differ from the Hive table field types is defined as dirty data. Three files are added: an error data record file, a dirty data record file and a statistics file. The whole method comprises the following steps:
Step 1: initialize Hive Sink and obtain the Hive metadata information.
Step 2: acquire the collected data from the Channel.
Step 3: filter the collected data according to the metadata information of the Hive table; write error data to the error file, record it, and return to step 2. If dirty data filtering is enabled, execute step 4; otherwise execute step 5.
Step 4: filter dirty data according to the metadata information; write dirty data to the dirty data file, record it, and return to step 2.
Step 5: write the data to Hive; record it after the write succeeds, and return to step 2.
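The steps above might be sketched as a single loop (Python, for illustration only; the real sink is a Java Flume component, the type check is reduced to an Int-vs-String test, and the Hive table is modeled as an in-memory list):

```python
def _field_ok(value: str, hive_type: str) -> bool:
    """Minimal type check: only 'int' is modeled; other types pass."""
    if hive_type == "int":
        return value.lstrip("-").isdigit()
    return True

def process_events(events, separator, column_types, filter_dirty):
    """Route each Event into one of three buckets, mirroring steps 2-5:
    the error file, the dirty data file, or the Hive table."""
    buckets = {"error": [], "dirty": [], "hive": []}
    for body in events:
        # Step 3: separator/field-count check against the table metadata.
        if body.count(separator) != len(column_types) - 1:
            buckets["error"].append(body)
            continue
        # Step 4: optional dirty data (type) check on each field.
        if filter_dirty:
            fields = body.split(separator)
            if any(not _field_ok(v, t) for v, t in zip(fields, column_types)):
                buckets["dirty"].append(body)
                continue
        # Step 5: write to Hive (modeled as an in-memory list here).
        buckets["hive"].append(body)
    return buckets

result = process_events(
    ["1,zhangsan,18", "2,lisi", "x,wangwu,20"],
    ",", ["int", "string", "int"], filter_dirty=True)
print({k: len(v) for k, v in result.items()})  # {'error': 1, 'dirty': 1, 'hive': 1}
```

Note that a rejected Event only skips that one record; the loop continues with the next Event, which is exactly what keeps the collection process from stalling.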
To illustrate the implementation of the present invention more clearly, the following embodiment, shown in fig. 4, is further described:
1. Initialize the configuration and obtain the Hive Sink configuration parameters, such as the name of the Hive table to write to, the error data storage path, and whether to filter dirty data.
2. Query, through Hive SQL, the metadata information of the Hive table to be written to, such as the type corresponding to each column name.
3. Acquire from the Channel an Event collected and packaged by Source.
4. According to the configured separator, judge whether the number of separators in the Event equals the number of Hive table fields minus one; if not, execute step 5.1; if so, execute step 6.
5.1 Write the data into the error data record file.
5.2 Increment the error data accumulator by one.
5.3 Write the error data accumulator's value to the record file, and return to step 3.
6. Check from the configuration file whether the dirty data filtering mode is enabled; if it is, execute step 7.1, otherwise execute step 8.
7.1 Obtain each field of the data according to the separator and judge whether each field's type conforms to the Hive table's metadata; if all conform, execute step 8, otherwise execute step 7.2.
7.2 Write the data into the dirty data file.
7.3 Increment the dirty data accumulator by one.
7.4 Write the dirty data accumulator's value to the record file, and return to step 3.
8. Write the data into the Hive database through the Hive API; after the write succeeds, increment the successful-write accumulator by one.
9. Update the successful-write accumulator's value in the record file, and return to step 3.
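Steps 5.1-5.3, 7.2-7.4 and 8-9 all follow the same pattern (append the record, increment a counter, persist the counter), which might be sketched like this in Python; the class, file names and record layout are invented for illustration:

```python
import json
import os
import tempfile

class RecordingCounter:
    """Appends rejected records to a log file and persists the running
    count after every update, so the tallies survive a process restart
    (the update-the-record-file steps above)."""
    def __init__(self, data_path, stats_path, key):
        self.data_path, self.stats_path, self.key = data_path, stats_path, key
        self.count = 0

    def record(self, line):
        with open(self.data_path, "a") as f:   # e.g. step 5.1 / 7.2
            f.write(line + "\n")
        self.count += 1                        # step 5.2 / 7.3
        self._persist()                        # step 5.3 / 7.4

    def _persist(self):
        stats = {}
        if os.path.exists(self.stats_path):
            with open(self.stats_path) as f:
                stats = json.load(f)
        stats[self.key] = self.count
        with open(self.stats_path, "w") as f:
            json.dump(stats, f)

d = tempfile.mkdtemp()
errors = RecordingCounter(os.path.join(d, "error_data.log"),
                          os.path.join(d, "stats.json"), "error")
errors.record("2,lisi")
errors.record("3,zhao")
with open(os.path.join(d, "stats.json")) as f:
    print(json.load(f))  # -> {'error': 2}
```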
It can be seen from the foregoing embodiments that the Hive-based Flume data verification statistical method provided by the embodiments of the present invention initializes Hive to obtain the configuration parameters of Hive Sink and the metadata information of the fields in the Hive table; acquires from the Channel the data to be verified, which was collected by Source and packaged as Events; verifies, based on the configuration parameters, whether each Event of the data to be verified is error data, and stores each Event determined to be error data in an error data record file; determines, based on the configuration parameters, whether to enable the dirty data filtering mode; in response to the dirty data filtering mode being enabled, segments each Event that is not error data into fields, verifies, based on the configuration parameters, whether each field is dirty data that does not match the metadata information, and stores each Event with a field determined to be dirty data in a dirty data record file; and writes each Event recorded in neither the error data record file nor the dirty data record file to the Hive table. This technical scheme avoids interruption of the collection process and data loss, reduces resource occupation, improves data quality, and facilitates data analysis.
It should be particularly noted that the steps in the foregoing embodiments of the Hive-based Flume data verification statistical method may be interleaved, replaced, added or deleted; such reasonable permutations and combinations should also belong to the scope of the present invention, and the scope of the present invention should not be limited to the described embodiments.
In view of the above, a second aspect of the embodiments of the present invention provides an embodiment of a Hive-based Flume data verification statistical apparatus, which is capable of avoiding interruption of an acquisition process and data loss, reducing resource occupation, improving data quality, and facilitating data analysis. The Hive-based Flume data verification statistical device comprises:
a processor; and
a memory storing program code executable by the processor, the program code when executed performing the steps of:
initializing Hive to obtain the configuration parameters of Hive Sink and the metadata information of the fields in the Hive table;
acquiring from the Channel the data to be verified, which was collected by Source and packaged as Events;
verifying, based on the configuration parameters, whether each Event of the data to be verified is error data, and storing each Event determined to be error data in an error data record file;
determining, based on the configuration parameters, whether to enable the dirty data filtering mode;
in response to the dirty data filtering mode being enabled, segmenting each Event that is not error data into fields, verifying, based on the configuration parameters, whether each field is dirty data that does not match the metadata information, and storing each Event with a field determined to be dirty data in a dirty data record file;
writing each Event recorded in neither the error data record file nor the dirty data record file to the Hive table.
It can be seen from the foregoing embodiments that the Hive-based Flume data verification statistical apparatus provided by the embodiments of the present invention initializes Hive to obtain the configuration parameters of Hive Sink and the metadata information of the fields in the Hive table; acquires from the Channel the data to be verified, which was collected by Source and packaged as Events; verifies, based on the configuration parameters, whether each Event of the data to be verified is error data, and stores each Event determined to be error data in an error data record file; determines, based on the configuration parameters, whether to enable the dirty data filtering mode; in response to the dirty data filtering mode being enabled, segments each Event that is not error data into fields, verifies, based on the configuration parameters, whether each field is dirty data that does not match the metadata information, and stores each Event with a field determined to be dirty data in a dirty data record file; and writes each Event recorded in neither the error data record file nor the dirty data record file to the Hive table. This technical scheme avoids interruption of the collection process and data loss, reduces resource occupation, improves data quality, and facilitates data analysis.
It should be particularly noted that the embodiment of the Hive-based Flume data verification statistical apparatus uses the embodiment of the Hive-based Flume data verification statistical method to describe the working process of each module in detail, and those skilled in the art will readily appreciate that these modules can be applied to other embodiments of the method. Of course, since the steps in the embodiment of the Hive-based Flume data verification statistical method may be intersected, replaced, added, or deleted, these reasonable permutations, combinations, and transformations also belong to the scope of the present invention, and the scope of the present invention should not be limited to the embodiment.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only, and is not intended to imply that the scope of the disclosure of embodiments of the invention, including the claims, is limited to these examples; within the idea of the embodiments of the invention, technical features in the above embodiments or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like made within the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.
Claims (10)
1. A Hive-based Flume data verification statistical method, characterized by comprising the following steps:
initializing Hive to obtain configuration parameters of Hive Sink and metadata information of fields in a Hive table;
acquiring, from the Channel, data to be verified that was collected by the Source and encapsulated as Events;
verifying, based on the configuration parameters, whether each Event of the data to be verified is error data, and storing each Event determined to be error data in an error data record file;
determining, based on the configuration parameters, whether to enable a dirty data filtering mode;
in response to the dirty data filtering mode being enabled, splitting each Event that is not error data into fields, verifying, based on the configuration parameters, whether each field is dirty data that does not match the metadata information, and storing each Event having a field determined to be dirty data in a dirty data record file;
writing each Event recorded in neither the error data record file nor the dirty data record file to the Hive table.
2. The method of claim 1, wherein the configuration parameters are stored in a Flume configuration file; the Hive table is stored in a Hive database.
3. The method of claim 1, wherein the configuration parameters comprise information of the Hive table, information of the error data record file, information of the dirty data record file, information of whether the dirty data filtering mode is on, and information of a separator.
4. The method of claim 3, wherein verifying whether each Event of the data to be verified is error data based on the configuration parameters, and storing each Event determined to be error data in the error data record file comprises:
checking, based on the separator information, whether the number of separators in each Event is equal to the number of fields in the Hive table minus one;
determining each Event whose number of separators is not equal to the number of fields in the Hive table minus one to be error data;
storing the error data in the error data record file based on the information of the error data record file.
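The separator-count check of claim 4 reduces to a one-line predicate. A minimal sketch, with illustrative names (the function and parameter names are not from the patent):

```python
def is_error_event(event_body: str, separator: str, num_table_fields: int) -> bool:
    """An Event is error data when its separator count differs from
    the number of Hive table fields minus one."""
    return event_body.count(separator) != num_table_fields - 1
```

For example, a three-field table expects exactly two separators per Event, so `"a,b,c"` passes while `"a,b"` is flagged as error data.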
5. The method of claim 3, wherein the metadata information comprises the number and type of fields in the Hive table.
6. The method of claim 5, wherein splitting each Event that is not error data into fields, verifying, based on the configuration parameters, whether each field is dirty data that does not match the metadata information, and storing each Event having a field determined to be dirty data in a dirty data record file comprises:
splitting each Event into fields based on the separator information;
verifying, based on the metadata information, whether each field matches the type to be written in the Hive table;
storing each Event having a field determined to be dirty data in the dirty data record file based on the information of the dirty data record file.
7. The method of claim 1, further comprising: in response to storing an Event determined to be error data in the error data record file, incrementing an error data accumulator by one and updating the error data accumulator to the error data record file.
8. The method of claim 1, further comprising: in response to storing an Event having a field determined to be dirty data in the dirty data record file, incrementing a dirty data accumulator by one and updating the dirty data accumulator to the dirty data record file.
9. The method of claim 1, further comprising: in response to writing each Event recorded in neither the error data record file nor the dirty data record file to the Hive table, incrementing a correct data accumulator by one and updating the correct data accumulator to a statistics file.
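The three accumulators of claims 7 through 9 can be sketched as a small counter object that is flushed to its file after each update. The class name, method names, and JSON file layout are assumptions; the claims only require that each accumulator be incremented and persisted.

```python
import json

class VerificationAccumulators:
    """Counters for error, dirty, and correct Events (claims 7-9 sketch)."""
    def __init__(self):
        self.error_count = 0
        self.dirty_count = 0
        self.correct_count = 0

    def on_error_recorded(self):
        self.error_count += 1

    def on_dirty_recorded(self):
        self.dirty_count += 1

    def on_written_to_hive(self):
        self.correct_count += 1

    def flush(self, path):
        # Persist the counters, since each claim updates its accumulator
        # to the corresponding record/statistics file.
        with open(path, "w") as f:
            json.dump({"error": self.error_count,
                       "dirty": self.dirty_count,
                       "correct": self.correct_count}, f)
```

These counters give the "statistical" half of the method: after a batch, the statistics file reports how many Events were rejected, filtered, and written, without interrupting collection.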
10. A Hive-based Flume data verification statistical apparatus, characterized by comprising:
a processor; and
a memory storing program code executable by the processor, the program code when executed performing the steps of:
initializing Hive to obtain configuration parameters of Hive Sink and metadata information of fields in a Hive table;
acquiring, from the Channel, data to be verified that was collected by the Source and encapsulated as Events;
verifying, based on the configuration parameters, whether each Event of the data to be verified is error data, and storing each Event determined to be error data in an error data record file;
determining, based on the configuration parameters, whether to enable a dirty data filtering mode;
in response to the dirty data filtering mode being enabled, splitting each Event that is not error data into fields, verifying, based on the configuration parameters, whether each field is dirty data that does not match the metadata information, and storing each Event having a field determined to be dirty data in a dirty data record file;
writing each Event recorded in neither the error data record file nor the dirty data record file to the Hive table.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010923083.7A CN112131209A (en) | 2020-09-04 | 2020-09-04 | Hive-based Flume data verification statistical method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112131209A true CN112131209A (en) | 2020-12-25 |
Family
ID=73847237
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010923083.7A Withdrawn CN112131209A (en) | 2020-09-04 | 2020-09-04 | Hive-based Flume data verification statistical method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112131209A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104111996A (en) * | 2014-07-07 | 2014-10-22 | 山大地纬软件股份有限公司 | Health insurance outpatient clinic big data extraction system and method based on hadoop platform |
CN105677842A (en) * | 2016-01-05 | 2016-06-15 | 北京汇商融通信息技术有限公司 | Log analysis system based on Hadoop big data processing technique |
CN106547658A (en) * | 2016-10-28 | 2017-03-29 | 合网络技术(北京)有限公司 | A kind of automated testing method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111259004B (en) | Method for indexing data in storage engine and related device | |
CN111475517A (en) | Data updating method and device, computer equipment and storage medium | |
CN111737203A (en) | Database history log backtracking method, device, system, equipment and storage medium | |
CN112199935A (en) | Data comparison method and device, electronic equipment and computer readable storage medium | |
CN111797104A (en) | Method and device for acquiring data change condition and electronic equipment | |
CN109582504A (en) | A kind of data reconstruction method and device for apple equipment | |
CN109902070B (en) | WiFi log data-oriented analysis storage search method | |
CN115061924A (en) | Automatic test case generation method and generation device | |
CN103678314A (en) | Mass data processing system, equipment and method based on association rule extraction | |
CN111966339B (en) | Buried point parameter input method and device, computer equipment and storage medium | |
CN113553295A (en) | Data preprocessing system supporting multiple file formats | |
CN111026736B (en) | Data blood margin management method and device and data blood margin analysis method and device | |
CN110309206B (en) | Order information acquisition method and system | |
CN112131209A (en) | Hive-based Flume data verification statistical method and device | |
CN114356454B (en) | Reconciliation data processing method, device, storage medium and program product | |
CN116401229A (en) | Database data verification method, device and equipment | |
CN106599326B (en) | Recorded data duplication eliminating processing method and system under cloud architecture | |
CN114817171A (en) | Buried point data quality control method | |
CN110851437A (en) | Storage method, device and equipment | |
CN113407378A (en) | Fragmentation information backup method and device for distributed database | |
CN112764888A (en) | Distributed transaction checking and judging method and system based on log analysis | |
CN113032368A (en) | Data migration method and device, storage medium and platform | |
CN112131291A (en) | JSON data-based structured analysis method, device, equipment and storage medium | |
US8326808B2 (en) | Information processing apparatus, and information processing method, program, and recording medium | |
CN110517010A (en) | A kind of data processing method, system and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20201225 |