CN106446092A - Flume-based method for analyzing data of semi-structured text file - Google Patents
- Publication number
- CN106446092A (application CN201610819060.5A)
- Authority
- CN
- China
- Prior art keywords
- flume
- data
- text file
- class
- event
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
- G06F16/84—Mapping; Conversion
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a Flume-based method for parsing the data of semi-structured text files. When Flume's spooldir source collects file data, one business record is read at a time according to business rules, and each record is parsed and converted. A new Flume event serialization class is created that implements the Flume interface EventDeserializer. In this class, each record of the semi-structured text file is read according to the rule configured in Flume's conf file; each record is then parsed and converted by the data parsing rule class configured in the conf file, and finally one record that meets the business need is output. The method maximizes reuse, reduces development effort, and is generally applicable wherever Flume is used to collect, parse, and convert semi-structured text file data.
Description
Technical field
The present invention relates to the technical field of computer software applications, and in particular to a Flume-based method for parsing the data of semi-structured text files. By defining business-specific data reading rules and data parsing rules, Flume is applied to the collection, parsing, and conversion of semi-structured text files. Using Flume's ability to customize various data receivers, the processed text data is sent to those receivers record by record.
Background
Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transmitting massive volumes of logs. Flume supports customizing various data senders in a logging system for collecting data. It also provides the ability to perform simple processing on the data and write it to various (customizable) data receivers. Flume provides the Spooling Directory Source, abbreviated spooldir, whose main function is to read the text files in a specified directory, convert them into Flume events, and send them through a channel to various data receivers.
The spooldir source provided by Flume supports only: a) reading a file line by line; b) reading an entire text file at once; c) reading Avro files.
In actual business scenarios, however, business data is often semi-structured, for example:
a) in the text file, one line is one record, and the record is in JSON format;
b) in the text file, one line is one record, and a record consists of multiple fields separated by a delimiter;
c) in the text file, multiple lines form one record; a record begins with a special identifier, each line in the record is one field, a field consists of a Key and a Value, and the effective content is the Value.
Semi-structured text files of these kinds cannot be handled with the functions and parsing rules of the existing Flume spooldir source. Applying a general Flume-based method for parsing semi-structured text file data makes it easy to collect business-specific semi-structured text data with Flume, to reuse Flume's ability to customize various data receivers, and to send the parsed data to big data stores (such as HBase, Hive, or ElasticSearch) for storage, computation, and indexing.
Summary of the invention
The technical problem to be solved by the present invention is: in view of the above problems, to provide a Flume-based method for parsing the data of semi-structured text files.
The technical solution adopted by the present invention is:
A Flume-based method for parsing the data of semi-structured text files, in which, when Flume's spooldir collects file data, one business record is read according to business rules and each business record is parsed and converted, so that common file parsing rules can easily be customized to the characteristics of the business data;
a new Flume event serialization class is created that implements the Flume interface EventDeserializer; in this class, each record of the semi-structured text file is read according to the rule configured in Flume's conf file, and each record is then parsed and converted by the data parsing rule class configured in Flume's conf file, finally outputting one record that meets the business need.
For example: a) in the text file, one line is one record, and the record is in JSON format; b) in the text file, one line is one record, and a record consists of multiple fields separated by a delimiter; c) in the text file, multiple lines form one record; a record begins with a special identifier, each line in the record is one field, a field consists of a Key and a Value, and the effective content is the Value.
The steps of the method are as follows:
1) define a custom interface IBdeEventParser;
2) according to the rules of the business data, define a custom text file parsing class that implements the custom interface IBdeEventParser;
3) create a new Flume event serialization class BdeLineDeserializer that implements the Flume interface EventDeserializer.
The interface IBdeEventParser is defined as follows:
import org.apache.flume.Context;
import org.apache.flume.Event;

public interface IBdeEventParser {
    public void build(Context context);
    public Event handleEvent(Event event);
}
Here build(Context context) obtains, from Flume's Context, the parsing rules configured in Flume's conf file;
handleEvent(Event event) parses and converts each Flume record according to the parsing rules configured in Flume's conf file.
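To make the contract concrete, the following is a minimal sketch of one possible implementation for format b) (delimiter-separated fields). It is not taken from the patent. So that it runs without the Flume library, `SimpleEvent` and a plain `Map<String, String>` stand in for Flume's `Event` and `Context`; the names `EventParserSketch`, `DelimitedEventParser`, `inDelimiter`, and `outDelimiter` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Stand-in for org.apache.flume.Event (assumption: only the body matters here).
class SimpleEvent {
    final byte[] body;
    SimpleEvent(byte[] body) { this.body = body; }
    String text() { return new String(body, StandardCharsets.UTF_8); }
}

// Same shape as IBdeEventParser, with a Map standing in for Flume's Context.
interface EventParserSketch {
    void build(Map<String, String> context);
    SimpleEvent handleEvent(SimpleEvent event);
}

// Hypothetical parser for format b): re-delimits the fields of a one-line record.
class DelimitedEventParser implements EventParserSketch {
    private String inDelimiter = ",";   // hypothetical config key "inDelimiter"
    private String outDelimiter = "\t"; // hypothetical config key "outDelimiter"

    @Override
    public void build(Map<String, String> context) {
        // Read the parsing rules once, before any record is handled.
        inDelimiter = context.getOrDefault("inDelimiter", inDelimiter);
        outDelimiter = context.getOrDefault("outDelimiter", outDelimiter);
    }

    @Override
    public SimpleEvent handleEvent(SimpleEvent event) {
        // Pattern.quote treats the delimiter literally; -1 keeps trailing empty fields.
        String[] fields = event.text().split(Pattern.quote(inDelimiter), -1);
        String joined = String.join(outDelimiter, fields);
        return new SimpleEvent(joined.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        DelimitedEventParser parser = new DelimitedEventParser();
        Map<String, String> context = new HashMap<>();
        context.put("inDelimiter", "|");
        parser.build(context);
        SimpleEvent in = new SimpleEvent("2016-05-07|Shandong|Jinan".getBytes(StandardCharsets.UTF_8));
        System.out.println(parser.handleEvent(in).text());
    }
}
```

Splitting the work into `build` and `handleEvent` means the configuration is read once, while the per-record method stays cheap and free of configuration lookups.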
The construction of BdeLineDeserializer(Context context, ResettableInputStream in) and its processing proceed as follows:
1) read the text file parsing class configured in Flume's conf file; the configuration item is eventParser, and its value is an implementation class of the custom interface IBdeEventParser;
2) instantiate the parsing class configured by eventParser, and call the method build(Context context) to initialize it;
3) in readEvents(int numEvents), call readEvent() to obtain one Flume event; if a text file parsing class is configured in Flume, call its method handleEvent(Event event) to parse and convert each record; if no text file parsing class is configured, no parsing is performed;
4) in readEvent(), read the record regular expression configured in Flume's conf file; the configuration item is filePattern. If a regular expression is configured, one record is read according to it; if no regular expression is configured, one line is read as one record. Each record is serialized into a Flume event (Event).
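The record-reading logic of step 4) can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the patent does not spell out how filePattern is applied, so the sketch assumes the expression must match the whole first line of a record. The class `RecordReaderSketch` and its methods are hypothetical, and a `BufferedReader` stands in for Flume's `ResettableInputStream`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the record-reading part of BdeLineDeserializer.
class RecordReaderSketch {
    private final BufferedReader reader;
    private final Pattern recordStart; // null: no filePattern configured
    private String pendingLine;        // first line of the next record, already read

    RecordReaderSketch(BufferedReader reader, String filePattern) {
        this.reader = reader;
        this.recordStart = (filePattern == null) ? null : Pattern.compile(filePattern);
    }

    private String nextLine() {
        try {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the next record, or null at end of input.
    String readRecord() {
        if (recordStart == null) {
            return nextLine(); // no pattern: one line is one record
        }
        StringBuilder record = null;
        String line = (pendingLine != null) ? pendingLine : nextLine();
        pendingLine = null;
        while (line != null) {
            if (recordStart.matcher(line).matches()) {
                if (record != null) {
                    pendingLine = line; // next record begins: keep its first line for later
                    break;
                }
                record = new StringBuilder(line);
            } else if (record != null) {
                record.append('\n').append(line);
            }
            // Lines before the first start marker are skipped.
            line = nextLine();
        }
        return (record == null) ? null : record.toString();
    }

    // Mirrors readEvents(int numEvents): returns at most numRecords records.
    List<String> readRecords(int numRecords) {
        List<String> records = new ArrayList<>();
        String record;
        while (records.size() < numRecords && (record = readRecord()) != null) {
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "&&&&\nA=1\nB=2\n&&&&\nA=3\n";
        RecordReaderSketch r =
                new RecordReaderSketch(new BufferedReader(new StringReader(text)), "&+");
        for (String rec : r.readRecords(10)) {
            System.out.println("[" + rec.replace("\n", " | ") + "]");
        }
    }
}
```

Because a record ends only when the next start marker (or end of input) is seen, the reader has to look one line ahead; `pendingLine` holds that line so it is not lost between calls.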
The beneficial effects of the present invention are:
The present invention maximizes reuse and reduces development effort, and is generally applicable to scenarios in which Flume is used to collect, parse, and convert semi-structured text file data.
Description of the drawings
Fig. 1 is a schematic diagram of the custom list class of the present invention;
Fig. 2 is a schematic diagram of the custom interface class of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawings and specific embodiments:
Embodiment 1:
A Flume-based method for parsing the data of semi-structured text files, in which, when Flume's spooldir collects file data, one business record is read according to business rules and each business record is parsed and converted, so that common file parsing rules can easily be customized to the characteristics of the business data;
a new Flume event serialization class is created that implements the Flume interface EventDeserializer; in this class, each record of the semi-structured text file is read according to the rule configured in Flume's conf file, and each record is then parsed and converted by the data parsing rule class configured in Flume's conf file, finally outputting one record that meets the business need.
For example: a) in the text file, one line is one record, and the record is in JSON format;
b) in the text file, one line is one record, and a record consists of multiple fields separated by a delimiter;
c) in the text file, multiple lines form one record; a record begins with a special identifier, each line in the record is one field, a field consists of a Key and a Value, and the effective content is the Value.
Embodiment 2
On the basis of Embodiment 1, the steps of the method in this embodiment are as follows:
1) define a custom interface IBdeEventParser;
2) according to the rules of the business data, define a custom text file parsing class that implements the custom interface IBdeEventParser;
3) create a new Flume event serialization class BdeLineDeserializer that implements the Flume interface EventDeserializer.
Embodiment 3
As shown in Fig. 1, on the basis of Embodiment 2, the interface IBdeEventParser in this embodiment is defined as follows:
public interface IBdeEventParser {
    public void build(Context context);
    public Event handleEvent(Event event);
}
Here build(Context context) obtains, from Flume's Context, the parsing rules configured in Flume's conf file;
handleEvent(Event event) parses and converts each Flume record according to the parsing rules configured in Flume's conf file.
Embodiment 4
On the basis of Embodiment 2, the construction of BdeLineDeserializer(Context context, ResettableInputStream in) and its processing in this embodiment proceed as follows:
1) read the text file parsing class configured in Flume's conf file; the configuration item is eventParser, and its value is an implementation class of the custom interface IBdeEventParser, as shown in Fig. 2;
2) instantiate the parsing class configured by eventParser, and call the method build(Context context) to initialize it;
3) in readEvents(int numEvents), call readEvent() to obtain one Flume event; if a text file parsing class is configured in Flume, call its method handleEvent(Event event) to parse and convert each record; if no text file parsing class is configured, no parsing is performed;
4) in readEvent(), read the record regular expression configured in Flume's conf file; the configuration item is filePattern. If a regular expression is configured, one record is read according to it; if no regular expression is configured, one line is read as one record. Each record is serialized into a Flume event (Event).
Embodiment 5
Example scenario
In the text file, multiple lines form one record; a record begins with a special identifier, each line in the record is one field, a field consists of a Key and a Value, and the effective content is the Value.
One record in the text file looks like this:
&&&&&&&&&&&&&&&&&&
【Check-in date】2016-05-07
【Province】Shandong
【City】Jinan
【Hotel name】XX hotel
【Address】High-tech Industrial Development Zone XX
The expectation is that the lines above are correctly read as one record, with the fields of the record separated by tab characters. The parsed content is as follows:
2016-05-07	Shandong	Jinan	XX hotel	High-tech Industrial Development Zone XX
For the example scenario, Flume's conf configuration file is as follows:
Test.sources.s1.type=spooldir
Test.sources.s1.spoolDir=/test/data
Test.sources.s1.deserializer=BdeLineDeserializer$Builder
Test.sources.s1.deserializer.filePattern=^((&&&&&&&&&&&&&&&&&&)(\\s|\\S)).*
Test.sources.s1.deserializer.eventParser=MultilineEventParser
Test.sources.s1.deserializer.eventParser.ignorePatternLine=true
Test.sources.s1.deserializer.eventParser.needParseLine=true
Test.sources.s1.deserializer.eventParser.lineKVDelimiter="】"
Test.sources.s1.deserializer.eventParser.fieldDelimiter="\t"
Reading rule for each record of the text file: the lines starting from a &&&&&&&&&&&&&&&&&& line form one record.
The parsing class for each record of the file, MultilineEventParser, implements the custom interface IBdeEventParser and parses and converts each record according to the configured rules. The methods of MultilineEventParser are implemented as follows:
a) In the method build(Context context), the parsing rules configured in Flume's conf file are obtained:
Configuration item eventParser.ignorePatternLine: the value is true or false, and configures whether the identifier line should be discarded. In the example scenario it is discarded.
Configuration item eventParser.needParseLine: the value is true or false, and configures whether each line of data needs to be parsed. In the example scenario the effective value must be parsed out of each line.
Configuration item eventParser.lineKVDelimiter: configures the delimiter between the Key and the Value in each line. In the example scenario the effective value is the content after 】.
Configuration item eventParser.fieldDelimiter: configures the delimiter between effective values. The example scenario uses the tab character as the delimiter.
b) In the method handleEvent(Event event), the Flume event is converted according to the parsing rules configured in Flume's conf file, and the converted effective values are joined with the specified delimiter to generate a new Flume event.
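The behaviour described for MultilineEventParser can be sketched as follows. This is a minimal illustration and not the patent's code: it works on plain strings instead of Flume events, a `Map<String, String>` stands in for Flume's `Context`, and the class name `MultilineParserSketch`, the method `handleRecord`, and the default values are assumptions. The four configuration keys are the ones named in the text, without the eventParser prefix.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Flume-free sketch of the rules described for MultilineEventParser.
class MultilineParserSketch {
    private boolean ignorePatternLine;
    private boolean needParseLine;
    private String lineKVDelimiter = "";
    private String fieldDelimiter = "\t";

    // Corresponds to build(Context): read the configured parsing rules once.
    void build(Map<String, String> conf) {
        ignorePatternLine = Boolean.parseBoolean(conf.getOrDefault("ignorePatternLine", "false"));
        needParseLine = Boolean.parseBoolean(conf.getOrDefault("needParseLine", "false"));
        lineKVDelimiter = conf.getOrDefault("lineKVDelimiter", "");
        fieldDelimiter = conf.getOrDefault("fieldDelimiter", "\t");
    }

    // Corresponds to handleEvent(Event): turn one multi-line record into one output line.
    String handleRecord(String record) {
        String[] lines = record.split("\r?\n");
        List<String> values = new ArrayList<>();
        for (int i = 0; i < lines.length; i++) {
            if (i == 0 && ignorePatternLine) {
                continue; // drop the identifier line that starts the record
            }
            String line = lines[i];
            if (needParseLine && !lineKVDelimiter.isEmpty()) {
                int idx = line.indexOf(lineKVDelimiter);
                if (idx >= 0) {
                    line = line.substring(idx + lineKVDelimiter.length()); // keep only the Value
                }
            }
            values.add(line);
        }
        return String.join(fieldDelimiter, values);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("ignorePatternLine", "true");
        conf.put("needParseLine", "true");
        conf.put("lineKVDelimiter", "\u3011"); // the closing bracket used in the example data
        conf.put("fieldDelimiter", "\t");
        MultilineParserSketch parser = new MultilineParserSketch();
        parser.build(conf);
        String record = "&&&&&&&&&&&&&&&&&&\n"
                + "\u3010Check-in date\u3011" + "2016-05-07\n"
                + "\u3010Province\u3011" + "Shandong\n"
                + "\u3010City\u3011" + "Jinan";
        System.out.println(parser.handleRecord(record));
    }
}
```

Lines that do not contain the Key/Value delimiter are passed through unchanged in this sketch; whether the original implementation drops or keeps such lines is not stated in the text.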
The embodiments serve only to illustrate the present invention and do not limit it. A person of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention, so all equivalent technical solutions also fall within the scope of the invention, and the scope of patent protection of the present invention shall be defined by the claims.
Claims (4)
1. A Flume-based method for parsing the data of semi-structured text files, characterized in that: when Flume's spooldir collects file data, one business record is read according to business rules and each business record is parsed and converted; a new Flume event serialization class is created that implements the Flume interface EventDeserializer; in this class, each record of the semi-structured text file is read according to the rule configured in Flume's conf file, and each record is then parsed and converted by the data parsing rule class configured in Flume's conf file, finally outputting one record that meets the business need.
2. The Flume-based method for parsing the data of semi-structured text files according to claim 1, characterized in that the steps of the method are as follows:
1) define a custom interface IBdeEventParser;
2) according to the rules of the business data, define a custom text file parsing class that implements the custom interface IBdeEventParser;
3) create a new Flume event serialization class BdeLineDeserializer that implements the Flume interface EventDeserializer.
3. The Flume-based method for parsing the data of semi-structured text files according to claim 2, characterized in that the interface IBdeEventParser is defined as follows:
public interface IBdeEventParser {
    public void build(Context context);
    public Event handleEvent(Event event);
}
where build(Context context) obtains, from Flume's Context, the parsing rules configured in Flume's conf file;
handleEvent(Event event) parses and converts each Flume record according to the parsing rules configured in Flume's conf file.
4. The Flume-based method for parsing the data of semi-structured text files according to claim 2, characterized in that the construction of BdeLineDeserializer(Context context, ResettableInputStream in) and its processing proceed as follows:
1) read the text file parsing class configured in Flume's conf file; the configuration item is eventParser, and its value is an implementation class of the custom interface IBdeEventParser;
2) instantiate the parsing class configured by eventParser, and call the method build(Context context) to initialize it;
3) in readEvents(int numEvents), call readEvent() to obtain one Flume event; if a text file parsing class is configured in Flume, call its method handleEvent(Event event) to parse and convert each record; if no text file parsing class is configured, no parsing is performed;
4) in readEvent(), read the record regular expression configured in Flume's conf file; the configuration item is filePattern; if a regular expression is configured, one record is read according to it; if no regular expression is configured, one line is read as one record; each record is serialized into a Flume event (Event).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610819060.5A CN106446092A (en) | 2016-09-12 | 2016-09-12 | Flume-based method for analyzing data of semi-structured text file |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106446092A true CN106446092A (en) | 2017-02-22 |
Family
ID=58167726
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610819060.5A Pending CN106446092A (en) | 2016-09-12 | 2016-09-12 | Flume-based method for analyzing data of semi-structured text file |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106446092A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105005549A (en) * | 2015-07-31 | 2015-10-28 | 山东蚁巡网络科技有限公司 | User-defined chained log analysis device and method |
CN105653662A (en) * | 2015-12-29 | 2016-06-08 | 中国建设银行股份有限公司 | Flume based data processing method and apparatus |
Non-Patent Citations (1)
Title |
---|
暗痛: "flume监控", 《博客园:HTTPS://WWW.CNBLOGS.COM/BREG/P/5649363.HTML》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107885881A (en) * | 2017-11-29 | 2018-04-06 | 顺丰科技有限公司 | Business datum real-time report, acquisition methods, device, equipment and its storage medium |
CN108710694A (en) * | 2018-05-22 | 2018-10-26 | 浪潮软件集团有限公司 | Method and device for storing data as file based on flash |
CN109460219A (en) * | 2018-09-28 | 2019-03-12 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | The method of rapid serial Interface Control File |
CN109460219B (en) * | 2018-09-28 | 2021-09-03 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Method for quickly serializing interface control file |
CN109685375A (en) * | 2018-12-26 | 2019-04-26 | 重庆誉存大数据科技有限公司 | A kind of business risk regulation engine operation method based on semi-structured text data |
CN109710413A (en) * | 2018-12-29 | 2019-05-03 | 重庆誉存大数据科技有限公司 | A kind of integral Calculation Method of the rule engine system of semi-structured text data |
CN109710413B (en) * | 2018-12-29 | 2020-09-08 | 重庆誉存大数据科技有限公司 | Integral calculation method of rule engine system of semi-structured text data |
CN111324688A (en) * | 2020-02-24 | 2020-06-23 | 南京莱斯网信技术研究院有限公司 | Semi-structured data and unstructured data acquisition system based on events |
CN116644039A (en) * | 2023-05-25 | 2023-08-25 | 安徽继远软件有限公司 | Automatic acquisition and analysis method for online capacity operation log based on big data |
CN116644039B (en) * | 2023-05-25 | 2023-12-19 | 安徽继远软件有限公司 | Automatic acquisition and analysis method for online capacity operation log based on big data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170222 |