CN106446092A - Flume-based method for analyzing data of semi-structured text file - Google Patents

Flume-based method for analyzing data of semi-structured text file

Info

Publication number
CN106446092A
CN106446092A CN201610819060.5A CN201610819060A CN106446092A CN 106446092 A CN106446092 A CN 106446092A CN 201610819060 A CN201610819060 A CN 201610819060A CN 106446092 A CN106446092 A CN 106446092A
Authority
CN
China
Prior art keywords
flume
data
text file
class
event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610819060.5A
Other languages
Chinese (zh)
Inventor
周庆勇
陈娟妮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Co Ltd
Original Assignee
Inspur Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Co Ltd
Priority to CN201610819060.5A
Publication of CN106446092A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84 Mapping; Conversion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Flume-based method for parsing the data of semi-structured text files. When file data is collected with Flume's spooldir source, one business record is read at a time according to business rules, and each record is parsed and converted. A new Flume event serialization class is created that implements the Flume interface EventDeserializer. In this class, each record of the semi-structured text file is read according to the rule configured in Flume's conf file, and each record is then parsed and converted according to the data-parsing rule configured in the conf file, so that a record meeting the business need is finally output. The method maximizes reuse, reduces development effort, and is generally applicable to scenarios in which Flume collects, parses, and converts the data of semi-structured text files.

Description

A Flume-based method for parsing the data of semi-structured text files
Technical field
The present invention relates to the technical field of computer software applications, and in particular to a Flume-based method for parsing the data of semi-structured text files. By defining business-specific record-reading rules and data-parsing rules, Flume is applied to the collection, parsing, and conversion of semi-structured text files, and Flume's ability to customize various data receivers is used to send the processed text data, record by record, to those receivers.
Background
Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data. Flume supports customizing various data senders in a logging system for collecting data; it can also perform simple processing on the data and write it to various (customizable) data receivers. Flume provides the source Spooling Directory Source, abbreviated spooldir, whose main function is to read the text files under a specified directory, convert them into Flume events, and send them through a channel to various data receivers.
The spooldir source provided by Flume only supports: a) reading a file line by line; b) reading an entire text file at once; c) reading Avro files.
In actual business scenarios, however, business data is often semi-structured, for example:
a) in the text file, one line is one record, and the record is in JSON format;
b) in the text file, one line is one record, and a record consists of multiple fields separated by a delimiter;
c) in the text file, multiple lines form one record; a record starts with a special marker, each line within a record is one field, and each field consists of a Key and a Value, of which the Value is the effective content.
Semi-structured text files in such formats cannot be handled with the functions and parsing rules of the existing Flume spooldir source. Applying a general Flume-based method for parsing the data of semi-structured text files makes it easy to collect business-specific semi-structured text data with Flume and to reuse Flume's ability to customize various data receivers, so that the parsed data can easily be sent to big data stores (such as HBase, Hive, ElasticSearch) for storage, computation, and indexing.
Summary of the invention
The technical problem to be solved by the present invention is: in view of the above problems, to provide a Flume-based method for parsing the data of semi-structured text files.
The technical solution adopted by the present invention is:
A Flume-based method for parsing the data of semi-structured text files: when file data is collected with Flume's spooldir, one business record is read at a time according to business rules, and each business record is parsed and converted, so that common file-parsing rules can easily be customized according to the characteristics of the business data;
A new Flume event serialization class is created that implements the Flume interface EventDeserializer. In this class, each record of the semi-structured text file is read according to the rule configured in Flume's conf file; each record is then parsed and converted by the data-parsing rule class configured in Flume's conf file, and a record that meets the business need is finally output.
For example: a) in the text file, one line is one record, and the record is in JSON format; b) in the text file, one line is one record, and a record consists of multiple fields separated by a delimiter; c) in the text file, multiple lines form one record; a record starts with a special marker, each line within a record is one field, and each field consists of a Key and a Value, of which the Value is the effective content.
The operating steps of the method are as follows:
1) define a custom interface IBdeEventParser;
2) according to the rules of the business data, define a custom text-file parsing class that implements the custom interface IBdeEventParser;
3) create a new Flume event serialization class BdeLineDeserializer that implements the Flume interface EventDeserializer.
The interface IBdeEventParser is described as follows:
public interface IBdeEventParser{
public void build(Context context);
public Event handleEvent(Event event);
}
Here, build(Context context) obtains the parsing rules configured in Flume's conf file from the Flume context Context;
handleEvent(Event event) parses and converts each Flume record according to the parsing rules configured in Flume's conf file.
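As an illustration of how a parsing class might fill in these two methods, the following is a minimal sketch for format (b), one record per line with delimiter-separated fields. It is not the patent's implementation. To stay self-contained it uses the stand-in types SketchContext and SketchEvent in place of Flume's Context and Event, and the names SketchEventParser, DelimitedLineParserSketch, and the configuration key inputDelimiter are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Stand-in for Flume's Context (assumption: only string lookup with a default is needed).
class SketchContext {
    private final Map<String, String> values = new HashMap<>();
    SketchContext put(String key, String value) { values.put(key, value); return this; }
    String getString(String key, String defaultValue) {
        return values.getOrDefault(key, defaultValue);
    }
}

// Stand-in for Flume's Event (assumption: only the body is needed).
class SketchEvent {
    private byte[] body;
    SketchEvent(String text) { this.body = text.getBytes(StandardCharsets.UTF_8); }
    void setBody(byte[] body) { this.body = body; }
    String text() { return new String(body, StandardCharsets.UTF_8); }
}

// Same two methods as IBdeEventParser, expressed against the stand-in types.
interface SketchEventParser {
    void build(SketchContext context);
    SketchEvent handleEvent(SketchEvent event);
}

// Hypothetical parser: splits a line on one delimiter and re-joins it with another.
class DelimitedLineParserSketch implements SketchEventParser {
    private String inputDelimiter;
    private String fieldDelimiter;

    @Override
    public void build(SketchContext context) {
        // "inputDelimiter" is an assumed key; "fieldDelimiter" is the output separator.
        inputDelimiter = context.getString("inputDelimiter", ",");
        fieldDelimiter = context.getString("fieldDelimiter", "\t");
    }

    @Override
    public SketchEvent handleEvent(SketchEvent event) {
        // Quote the delimiter so characters such as '|' are treated literally.
        String[] fields = event.text().split(Pattern.quote(inputDelimiter), -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) out.append(fieldDelimiter);
            out.append(fields[i].trim());
        }
        event.setBody(out.toString().getBytes(StandardCharsets.UTF_8));
        return event;
    }
}
```

Reading all rules once in build and keeping handleEvent free of configuration lookups follows the split the interface suggests: initialization happens when the class is constructed, and only the conversion runs per record.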
The construction process of BdeLineDeserializer(Context context, ResettableInputStream in) is as follows:
1) read the text-file parsing class configured in Flume's conf file; the configuration item is eventParser, and its value is an implementation class of the custom interface IBdeEventParser;
2) instantiate the parsing class configured by eventParser and call its method build(Context context) to initialize it;
3) in readEvents(int numEvents), call readEvent() to obtain a Flume event; if a text-file parsing class is configured in Flume, call that class's method handleEvent(Event event) to parse and convert each record; if no text-file parsing class is configured, do not parse;
4) in readEvent(), read the regular expression for a record configured in Flume's conf file (configuration item filePattern); if a regular expression is configured, read one record according to the regular expression; if no regular expression is configured, read one line as one record; each record is serialized into a Flume event Event.
The beneficial effects of the present invention are:
The present invention maximizes reuse, reduces development effort, and is generally applicable to scenarios in which Flume is used to collect, parse, and convert the data of semi-structured text files.
Description of the drawings
Fig. 1 is a schematic diagram of the custom list class of the present invention;
Fig. 2 is a schematic diagram of the custom interface class of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawings and specific embodiments:
Embodiment 1:
A Flume-based method for parsing the data of semi-structured text files: when file data is collected with Flume's spooldir, one business record is read at a time according to business rules, and each business record is parsed and converted, so that common file-parsing rules can easily be customized according to the characteristics of the business data;
A new Flume event serialization class is created that implements the Flume interface EventDeserializer. In this class, each record of the semi-structured text file is read according to the rule configured in Flume's conf file; each record is then parsed and converted by the data-parsing rule class configured in Flume's conf file, and a record that meets the business need is finally output.
For example: a) in the text file, one line is one record, and the record is in JSON format;
b) in the text file, one line is one record, and a record consists of multiple fields separated by a delimiter;
c) in the text file, multiple lines form one record; a record starts with a special marker, each line within a record is one field, and each field consists of a Key and a Value, of which the Value is the effective content.
Embodiment 2
On the basis of embodiment 1, the operating steps of the method in this embodiment are as follows:
1) define a custom interface IBdeEventParser;
2) according to the rules of the business data, define a custom text-file parsing class that implements the custom interface IBdeEventParser;
3) create a new Flume event serialization class BdeLineDeserializer that implements the Flume interface EventDeserializer.
Embodiment 3
As shown in Fig. 1, on the basis of embodiment 2, the interface IBdeEventParser in this embodiment is described as follows:
public interface IBdeEventParser{
public void build(Context context);
public Event handleEvent(Event event);
}
Here, build(Context context) obtains the parsing rules configured in Flume's conf file from the Flume context Context;
handleEvent(Event event) parses and converts each Flume record according to the parsing rules configured in Flume's conf file.
Embodiment 4
On the basis of embodiment 2, the construction process of BdeLineDeserializer(Context context, ResettableInputStream in) in this embodiment is as follows:
1) read the text-file parsing class configured in Flume's conf file; the configuration item is eventParser, and its value is an implementation class of the custom interface IBdeEventParser, as shown in Fig. 2;
2) instantiate the parsing class configured by eventParser and call its method build(Context context) to initialize it;
3) in readEvents(int numEvents), call readEvent() to obtain a Flume event; if a text-file parsing class is configured in Flume, call that class's method handleEvent(Event event) to parse and convert each record; if no text-file parsing class is configured, do not parse;
4) in readEvent(), read the regular expression for a record configured in Flume's conf file (configuration item filePattern); if a regular expression is configured, read one record according to the regular expression; if no regular expression is configured, read one line as one record; each record is serialized into a Flume event Event.
Embodiment 5
Example scenario
In the text file, multiple lines form one record; a record starts with a special marker, each line within a record is one field, and each field consists of a Key and a Value, of which the Value is the effective content.
One record in the text file looks like the following example:
&&&&&&&&&&&&&&&&&&
【Check-in date】2016-05-07
【Province】Shandong
【City】Jinan
【Hotel name】XX hotel
【Address】High and new technology industrial development zone XX
The multi-line record above is expected to be read correctly as one record, with the fields of the record separated by tab characters; the parsed content is as follows:
2016-05-07	Shandong	Jinan	XX hotel	High and new technology industrial development zone XX
For the example scenario, Flume's conf configuration file is as follows:
Test.sources.s1.type=spooldir
Test.sources.s1.spoolDir=/test/data
Test.sources.s1.deserializer=BdeLineDeserializer$Builder
Test.sources.s1.deserializer.filePattern=^((&&&&&&&&&&&&&&&&&&)(\\s|\\S)).*
Test.sources.s1.deserializer.eventParser=MultilineEventParser
Test.sources.s1.deserializer.eventParser.ignorePatternLine=true
Test.sources.s1.deserializer.eventParser.needParseLine=true
Test.sources.s1.deserializer.eventParser.lineKVDelimiter="】"
Test.sources.s1.deserializer.eventParser.fieldDelimiter="\t"
Record-reading rule for the text file: the multiple lines starting with &&&&&&&&&&&&&&&&&& are read as one record.
The parsing class for each record of the file, MultilineEventParser, implements the custom interface IBdeEventParser and parses and converts each record according to the configured rules. The methods of MultilineEventParser are implemented as follows:
a) in method build(Context context), obtain the parsing rules configured in Flume's conf file:
Configuration item eventParser.ignorePatternLine: the value is true or false and configures whether the marker line should be discarded. In the example scenario it is discarded.
Configuration item eventParser.needParseLine: the value is true or false and configures whether each line of data needs to be parsed. In the example scenario the effective value must be parsed out of each line.
Configuration item eventParser.lineKVDelimiter: configures the delimiter between Key and Value in each line of data. In the example scenario the effective value is the content after 】.
Configuration item eventParser.fieldDelimiter: configures the delimiter between effective values. In the example scenario the tab character is used as the delimiter.
b) in method handleEvent(Event event), convert the Flume event Event according to the parsing rules configured in Flume's conf file, join the converted effective values with the specified delimiter, and generate a new Flume event.
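The conversion performed in handleEvent can be sketched as a single function over the record text. This is a minimal illustration of the four configuration items, not the MultilineEventParser of the patent. The class name MultilineParseSketch and its method signature are hypothetical, the record is handled as a String rather than a Flume event body, and the marker line is assumed to be the first line of the record.

```java
import java.util.ArrayList;
import java.util.List;

class MultilineParseSketch {
    // U+3010 and U+3011 are the full-width brackets around each key in the sample record.
    static final String OPEN = "\u3010";
    static final String CLOSE = "\u3011";

    static String parse(String record, boolean ignorePatternLine, boolean needParseLine,
                        String lineKVDelimiter, String fieldDelimiter) {
        String[] lines = record.split("\\r?\\n");
        List<String> values = new ArrayList<>();
        for (int i = 0; i < lines.length; i++) {
            // Assumption: the marker line is always the first line of the record.
            if (i == 0 && ignorePatternLine) continue;
            String line = lines[i];
            if (line.isEmpty()) continue;
            if (needParseLine) {
                // Keep only the text after the Key/Value delimiter, if it is present.
                int at = line.indexOf(lineKVDelimiter);
                if (at >= 0) line = line.substring(at + lineKVDelimiter.length());
            }
            values.add(line);
        }
        return String.join(fieldDelimiter, values);
    }

    public static void main(String[] args) {
        String record = String.join("\n",
                "&&&&&&&&&&&&&&&&&&",
                OPEN + "Check-in date" + CLOSE + "2016-05-07",
                OPEN + "Province" + CLOSE + "Shandong",
                OPEN + "City" + CLOSE + "Jinan");
        // Prints the three values on one line, separated by tab characters.
        System.out.println(parse(record, true, true, CLOSE, "\t"));
    }
}
```

Because only the text after the Key/Value delimiter is kept, the keys themselves never reach the output, which matches the statement that the Value is the effective content.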
The embodiments are intended only to illustrate the present invention and not to limit it. Those of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore also fall within the scope of the present invention, and the scope of patent protection of the present invention shall be defined by the claims.

Claims (4)

1. A Flume-based method for parsing the data of semi-structured text files, characterized in that: when file data is collected with Flume's spooldir, one business record is read at a time according to business rules, and each business record is parsed and converted; a new Flume event serialization class is created that implements the Flume interface EventDeserializer; in this class, each record of the semi-structured text file is read according to the rule configured in Flume's conf file; each record is then parsed and converted by the data-parsing rule class configured in Flume's conf file, and a record that meets the business need is finally output.
2. The Flume-based method for parsing the data of semi-structured text files according to claim 1, characterized in that the operating steps of the method are as follows:
1) define a custom interface IBdeEventParser;
2) according to the rules of the business data, define a custom text-file parsing class that implements the custom interface IBdeEventParser;
3) create a new Flume event serialization class BdeLineDeserializer that implements the Flume interface EventDeserializer.
3. The Flume-based method for parsing the data of semi-structured text files according to claim 2, characterized in that the interface IBdeEventParser is described as follows:
public interface IBdeEventParser {
public void build(Context context);
public Event handleEvent(Event event);
}
Here, build(Context context) obtains the parsing rules configured in Flume's conf file from the Flume context Context;
handleEvent(Event event) parses and converts each Flume record according to the parsing rules configured in Flume's conf file.
4. The Flume-based method for parsing the data of semi-structured text files according to claim 2, characterized in that the construction process of BdeLineDeserializer(Context context, ResettableInputStream in) is as follows:
1) read the text-file parsing class configured in Flume's conf file; the configuration item is eventParser, and its value is an implementation class of the custom interface IBdeEventParser;
2) instantiate the parsing class configured by eventParser and call its method build(Context context) to initialize it;
3) in readEvents(int numEvents), call readEvent() to obtain a Flume event; if a text-file parsing class is configured in Flume, call that class's method handleEvent(Event event) to parse and convert each record; if no text-file parsing class is configured, do not parse;
4) in readEvent(), read the regular expression for a record configured in Flume's conf file (configuration item filePattern); if a regular expression is configured, read one record according to the regular expression; if no regular expression is configured, read one line as one record; each record is serialized into a Flume event Event.
CN201610819060.5A 2016-09-12 2016-09-12 Flume-based method for analyzing data of semi-structured text file Pending CN106446092A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610819060.5A CN106446092A (en) 2016-09-12 2016-09-12 Flume-based method for analyzing data of semi-structured text file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610819060.5A CN106446092A (en) 2016-09-12 2016-09-12 Flume-based method for analyzing data of semi-structured text file

Publications (1)

Publication Number Publication Date
CN106446092A true CN106446092A (en) 2017-02-22

Family

ID=58167726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610819060.5A Pending CN106446092A (en) 2016-09-12 2016-09-12 Flume-based method for analyzing data of semi-structured text file

Country Status (1)

Country Link
CN (1) CN106446092A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885881A (en) * 2017-11-29 2018-04-06 顺丰科技有限公司 Business datum real-time report, acquisition methods, device, equipment and its storage medium
CN108710694A (en) * 2018-05-22 2018-10-26 浪潮软件集团有限公司 Method and device for storing data as file based on flash
CN109460219A (en) * 2018-09-28 2019-03-12 西南电子技术研究所(中国电子科技集团公司第十研究所) The method of rapid serial Interface Control File
CN109685375A (en) * 2018-12-26 2019-04-26 重庆誉存大数据科技有限公司 A kind of business risk regulation engine operation method based on semi-structured text data
CN109710413A (en) * 2018-12-29 2019-05-03 重庆誉存大数据科技有限公司 A kind of integral Calculation Method of the rule engine system of semi-structured text data
CN111324688A (en) * 2020-02-24 2020-06-23 南京莱斯网信技术研究院有限公司 Semi-structured data and unstructured data acquisition system based on events
CN116644039A (en) * 2023-05-25 2023-08-25 安徽继远软件有限公司 Automatic acquisition and analysis method for online capacity operation log based on big data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105005549A (en) * 2015-07-31 2015-10-28 山东蚁巡网络科技有限公司 User-defined chained log analysis device and method
CN105653662A (en) * 2015-12-29 2016-06-08 中国建设银行股份有限公司 Flume based data processing method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
暗痛: "flume监控", 《博客园:HTTPS://WWW.CNBLOGS.COM/BREG/P/5649363.HTML》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885881A (en) * 2017-11-29 2018-04-06 顺丰科技有限公司 Business datum real-time report, acquisition methods, device, equipment and its storage medium
CN108710694A (en) * 2018-05-22 2018-10-26 浪潮软件集团有限公司 Method and device for storing data as file based on flash
CN109460219A (en) * 2018-09-28 2019-03-12 西南电子技术研究所(中国电子科技集团公司第十研究所) The method of rapid serial Interface Control File
CN109460219B (en) * 2018-09-28 2021-09-03 西南电子技术研究所(中国电子科技集团公司第十研究所) Method for quickly serializing interface control file
CN109685375A (en) * 2018-12-26 2019-04-26 重庆誉存大数据科技有限公司 A kind of business risk regulation engine operation method based on semi-structured text data
CN109710413A (en) * 2018-12-29 2019-05-03 重庆誉存大数据科技有限公司 A kind of integral Calculation Method of the rule engine system of semi-structured text data
CN109710413B (en) * 2018-12-29 2020-09-08 重庆誉存大数据科技有限公司 Integral calculation method of rule engine system of semi-structured text data
CN111324688A (en) * 2020-02-24 2020-06-23 南京莱斯网信技术研究院有限公司 Semi-structured data and unstructured data acquisition system based on events
CN116644039A (en) * 2023-05-25 2023-08-25 安徽继远软件有限公司 Automatic acquisition and analysis method for online capacity operation log based on big data
CN116644039B (en) * 2023-05-25 2023-12-19 安徽继远软件有限公司 Automatic acquisition and analysis method for online capacity operation log based on big data

Similar Documents

Publication Publication Date Title
CN106446092A (en) Flume-based method for analyzing data of semi-structured text file
CN109284334B (en) Real-time database synchronization method and device, electronic equipment and storage medium
US9105178B2 (en) Remote dynamic configuration of telemetry reporting through regular expressions
CN107169069B (en) Distributed hierarchical extraction multi-application method and data extraction applicator
CN109753502B (en) Data acquisition method based on NiFi
CN110532466A (en) Processing method, device, storage medium and the equipment of platform training data is broadcast live
CN101300843A (en) Digital broadcast system, receiving device and sending device
CN103544298B (en) The log analysis method and analytical equipment of component
CN111222547A (en) Traffic feature extraction method and system for mobile application
Chow Understanding SONET/SDH: Standards and Applications
CN104182541A (en) Method for showing smart phone data information
KR20150081126A (en) Big data service system based on web server and big data cluster using API driver
CN110275817A (en) A kind of journal file automatic generation method based on model-driven
CN102289445A (en) Method and device for analyzing XML (Extensible Markup Language) file and terminal
CN109241498A (en) XML file processing method, equipment and storage medium
CN105786529B (en) One type Managed Code calls the Parameters design of the labyrinth of C/C++ style function
CN108133017A (en) A kind of multi-data source acquisition configuration method and device
CN117319527A (en) Time sequence data processing method, device and medium based on identification analysis gateway
Yang et al. A programmable ROADM system for SDM/WDM networks
CN102999626B (en) A kind of data compression/decompression compression apparatus and method, system
CN116136801B (en) Cloud platform data processing method and device, electronic equipment and storage medium
CN116028574A (en) Government full life cycle big data management system and method thereof
CN111913821B (en) Method for realizing cross-data-source real-time data stream production consumption
CN110795480B (en) Traffic operation data processing method and device
CN114500676A (en) Information interaction method and device among industrial internet devices and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170222

RJ01 Rejection of invention patent application after publication