CN108399199A - Spark-based collection and service processing system and method for application software running logs - Google Patents

Spark-based collection and service processing system and method for application software running logs

Info

Publication number
CN108399199A
Authority
CN
China
Prior art keywords
data
log
log data
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810091898.6A
Other languages
Chinese (zh)
Inventor
应时
程国力
张骁
张威
李宇航
贾向阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201810091898.6A priority Critical patent/CN108399199A/en
Publication of CN108399199A publication Critical patent/CN108399199A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/18 - File system types
    • G06F16/1805 - Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 - Journaling file systems
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 - Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 - Management of faults, events, alarms or notifications
    • H04L41/069 - Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 - Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/50 - Network service management, e.g. ensuring proper service fulfilment according to agreements
    • H04L41/5041 - Network service management, e.g. ensuring proper service fulfilment according to agreements, characterised by the time relationship between creation and deployment of a service

Abstract

The present invention relates to a Spark-based collection and service processing system and method for application software running logs. It provides log services for the different resources at all levels: a log collection service in the log data resource layer, a log data storage service in the log data service layer, and a user log-data-acquisition service in the log data application layer. After the log collection service of the log data resource layer gathers the raw data, the data is pushed to the log data service layer for preprocessing, stored by the log data storage service, and finally served to users in the log data application layer. The invention collects log data with a distributed collection strategy, defines a multi-level data storage structure for log data, and provides users with a log data query service, so that users can obtain useful application software running-log data and improve fault-diagnosis efficiency through the collected logs.

Description

Spark-based collection and service processing system and method for application software running logs
Technical field
The invention belongs to the field of log big data, and more particularly relates to the collection and service processing of application software running logs.
Background art
In information system applications, every operation leaves a trace: the log. Each log file consists of log records, and every log record describes a single event. Inside a complete information system, the logging subsystem is a very important functional component: it records all behavior produced by the system and expresses it according to a certain specification. Logs record necessary, valuable information about IT resources such as servers, workstations, firewalls and application software; these records are highly important for system monitoring, querying, security auditing and troubleshooting.
Most existing large-scale software is developed by many people, or assembled from software of various origins, integrating large amounts of open-source community code. The coding styles inside such software are inconsistent in many ways, so the internal logic of the logs is complex and errors are difficult to analyze afterwards; users find it hard to obtain effective information from the logs quickly, and the goal of improving fault-diagnosis efficiency is hard to reach. At present many development communities add log processing methods to their log management systems, but problems remain, for example: log data storage is chaotic; user-oriented services receive little attention, so users can hardly customize log data to their needs. How to process logs quickly and effectively and return useful log information to users is therefore a question that current log-service research must consider.
As application software running logs keep growing, more and more researchers focus on the research of log collection and service processing, and a large amount of software and systems for this research has appeared. Scribe, open-sourced by Facebook, is a log collection system that has found many applications inside Facebook; it can collect logs from various log sources. Logstash is a platform for transporting, processing, managing and searching application logs; it can uniformly collect and manage application logs and provides a web interface for querying and statistics. Flume, provided by Cloudera, is a highly available, highly reliable, distributed system for massive log acquisition, aggregation and transport; Flume supports customizing various kinds of data senders in the log system for collecting data. However, most of this software only leans toward the logs themselves: after collection there is no subsequent processing, so it can hardly meet users' need for refined log data.
Summary of the invention
Against the above research background and problems, the present invention provides a Spark-based collection and service processing framework for application software running logs. A layered log data service framework is proposed that provides log services for the different resources at all levels, including a log collection service in the log data resource layer, a log data storage service in the log data service layer, and a user log-data-acquisition service in the log data application layer. After raw data is collected by the log collection service of the log data resource layer, it is pushed to the log data service layer for log data preprocessing, then stored by the log data storage service, and finally served to users in the log data application layer.
The technical scheme of the present invention is as follows:
A Spark-based collection and service processing system for application software running logs, characterized by comprising a log collection service unit in the log data resource layer, a log data storage service unit and a log data preprocessing service unit in the log data service layer, and a user log-data-acquisition service unit in the log data application layer, wherein:
Log collection service unit: collects the raw log data generated while application software runs on the log data resource layer;
Log data preprocessing service unit: removes unnecessary information from the raw log data according to demand, keeping the messages users need; it performs three kinds of preprocessing on the raw data: data filtering, data de-duplication and log record segmentation;
Log data storage service unit: is responsible for storing the raw data and the preprocessed data;
User log-data-acquisition service unit: provides a multi-condition query service interface and supplies query services to users.
A Spark-based collection and service processing method for application software running logs, characterized by comprising the following steps:
Step 1: the log collection service collects raw log data using a distributed collection strategy;
Step 2: the log data preprocessing service preprocesses the collected raw log data;
Step 3: the log data storage service receives the raw log data and the preprocessed log data and stores them in separate databases;
Step 4: the user log-data-acquisition service provides a multi-condition query service interface and supplies query services to users.
In the above Spark-based collection and service processing method for application software running logs, said Step 1 comprises the following steps:
Step 1.1: the log collection service connects to log service nodes in failover mode and automatically selects an available node to connect to. When a node in the log service node cluster fails, the collected log data is passed to another service node; when a cluster message service node is unavailable, failover automatically selects another available message service node to take over.
Step 1.2: when collecting log data, the log collection module first sets the log file path. Collection runs on each child node, and after collection the partial results converge into one large log data set; to meet users' demands on the log data, a filter is set on each child node.
Step 1.3: before collection starts, the data sources of the logs to collect are determined. It is then checked whether the master node has started; if not, it is started by modifying the configuration file. If it has started, the master nodes are selected, and one or more master nodes are used according to system requirements. After the master nodes start, the agent nodes are set up; users customize the agent nodes to their own needs in three aspects: the source, i.e. the real-time data source; the channel, i.e. the buffer for real-time log data; and the sink, i.e. the real-time data output. The settings include the name, type and attributes of each source, channel and sink. After configuration the agents are connected, and once that completes all agent nodes start collecting log data.
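The patent describes the source/channel/sink customization and the failover connection mode without giving a concrete configuration. As a minimal illustrative sketch only, in Apache Flume's standard properties format (the agent name a1, the collector host names collector-1/collector-2, the port and the log path are all hypothetical), an agent on a child node might be configured like this:

```
# Hypothetical agent "a1": tail an application log, buffer it in a memory
# channel, and forward it to two collector nodes with failover (step 1.1).
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1 k2

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector-1
a1.sinks.k1.port = 4141
a1.sinks.k1.channel = c1

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = collector-2
a1.sinks.k2.port = 4141
a1.sinks.k2.channel = c1

# Failover sink processor: k1 is preferred; k2 takes over if k1's node fails.
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
```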
In the above Spark-based collection and service processing method for application software running logs, said Step 2 comprises the following steps:
Step 2.1: the log data preprocessing service preprocesses the log data, performing three kinds of preprocessing on the raw data: data filtering, data de-duplication and log record segmentation.
Step 2.2: after this simple preprocessing, the log data is classified using a text classification algorithm: a TF-IDF algorithm builds a VSM (vector space model) to vectorize the text, and a KNN algorithm then classifies the data.
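The patent names TF-IDF vectorization followed by KNN classification but gives no code. The following is a minimal sketch in Scala with Spark MLlib; the class labels, the sample records and k = 3 are invented for illustration, and since MLlib ships TF-IDF but no KNN, the neighbour vote is written out by hand:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector

object LogKnnSketch {
  // cosine similarity between two TF-IDF vectors
  def cosine(a: Vector, b: Vector): Double = {
    val (xs, ys) = (a.toArray, b.toArray)
    val dot = xs.zip(ys).map { case (x, y) => x * y }.sum
    val na  = math.sqrt(xs.map(x => x * x).sum)
    val nb  = math.sqrt(ys.map(y => y * y).sum)
    if (na == 0 || nb == 0) 0.0 else dot / (na * nb)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("log-knn").setMaster("local[*]"))
    // (hypothetical class label, tokenised log record body)
    val labelled = sc.parallelize(Seq(
      ("ERROR_DB",  Seq("connection", "refused", "mysql")),
      ("ERROR_NET", Seq("remote", "service", "timeout")),
      ("INFO_REQ",  Seq("request", "completed", "ok"))))

    // TF-IDF: build the vector space model (VSM) over the record bodies
    val tf     = new HashingTF(1 << 16)
    val tfVecs = labelled.map { case (l, toks) => (l, tf.transform(toks)) }
    val idf    = new IDF().fit(tfVecs.map(_._2))
    val train  = tfVecs.map { case (l, v) => (l, idf.transform(v)) }.collect()

    // KNN: classify one new record by majority vote of its k nearest neighbours
    val k     = 3
    val query = idf.transform(tf.transform(Seq("mysql", "connection", "timeout")))
    val label = train.map { case (l, v) => (l, cosine(query, v)) }
                     .sortBy { case (_, s) => -s }.take(k)
                     .groupBy(_._1).maxBy(_._2.length)._1
    println(s"predicted class: $label")
    sc.stop()
  }
}
```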
In the above Spark-based collection and service processing method for application software running logs, the log data preprocessing of step 2.1 is divided into three parts, specifically:
Step A: data filtering. A raw log data set contains many unnecessary records, such as system control bit records or user requests for static resources (URLs); these are very common in logs but of no use to ordinary users, so the raw log data must be filtered, which effectively relieves the pressure of subsequent log processing.
Step B: log de-duplication. A raw log data set contains many repeated records; for example, when a remote service request is interrupted, the same log record may be returned repeatedly. Of these records only the first one returned matters; the repeats do not help users' subsequent log data analysis, so the repeated log records must be removed.
Step C: log record classification. Raw log data generally has the format "time - log level - service name - event", which users cannot read directly; the log records must therefore be classified, with different classes carrying different meanings.
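As a minimal sketch only (Scala over a Spark RDD, assuming an existing SparkContext sc; the static-resource extensions, the HDFS path and the " - " delimiter are assumptions, since the patent does not fix them), steps A-C might look like:

```scala
// Step 0: load the raw logs from a hypothetical HDFS path
val raw = sc.textFile("hdfs:///logs/raw/")

// Step A: filter out static-resource requests and other unneeded records
val staticExts = Seq(".css", ".js", ".png", ".gif")
val filtered = raw.filter(line => !staticExts.exists(line.contains))

// Step B: remove repeated records
val deduped = filtered.distinct()

// Step C: segment each record into time, log level, service name and event
val segmented = deduped.map(_.split(" - ", 4))
```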
In the above Spark-based collection and service processing method for application software running logs, said Step 3 comprises the following steps:
Step 3.1: log data is divided into two kinds. One is raw log data, whose value density is generally low but which is very valuable if mined further; this kind of raw log data is archived here, and since it will not be accessed frequently and has no strict real-time requirement, it is stored in a MySQL database. The other kind is log data that has been extracted, cleaned, filtered and preprocessed; it is directly relevant to subsequent user analysis and will be accessed frequently, so it is stored in the high-performance distributed NoSQL database HBase.
Step 3.2: raw log data is an irregular, semi-structured data form whose format differs across data types. To let the two different kinds of database cooperate, the log data must be normalized. Normalized formatting serves two purposes: scalability and simplification. Scalability means accommodating application logs of different types without constraining the log format; simplification means that normalized log data improves users' efficiency when they do log analysis.
Step 3.3: the normalized raw log data format is divided into three parts: log index, log record body and log level. The log index is the core of the whole log data storage format: a service ID is defined for each log service and used as a primary-key index, so that log records can be located quickly, improving the efficiency of both preprocessing and later query services. The log record body stores the information of the log data itself; it is transparent to the log system, which performs fault diagnosis by analyzing this information. The log level is divided into three grades: INFO, ERROR and DEBUG.
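A minimal Scala sketch of this three-part normalized record follows; the field names, the " | " delimiter and the parse helper are illustrative assumptions, not the patent's exact layout:

```scala
case class LogRecord(
  serviceId: String, // log index: per-service ID used as the primary key
  body: String,      // log record body: the message content itself
  level: String      // log level: "INFO", "ERROR" or "DEBUG"
)

// e.g. parsing one hypothetical normalised line
// "svc-42 | ERROR | remote service timeout"
def parse(line: String): LogRecord = {
  val Array(id, lvl, msg) = line.split(" \\| ", 3)
  LogRecord(id, msg, lvl)
}
```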
In the above Spark-based collection and service processing method for application software running logs, said Step 4 comprises the following steps:
Step 4.1: a multi-condition query service is provided to users, who can use multi-condition queries to look up log data according to their different needs.
Step 4.2: the log data query module first builds an index over the log data gathered by the log collection module. Through the interface the query module provides, a user can retrieve all the log data on a given server with a simple multi-condition query and fetch the required log data from the log data store in real time. From the message of each log data record the user can generally check whether the application has gone wrong, and can also learn which server the log data came from, among other information.
The present invention collects log data with a distributed collection strategy, defines a multi-level data storage structure to store log data, and provides users with a log data query service, so that users obtain useful application software running-log data and the collected logs improve the efficiency of fault diagnosis.
Description of the drawings
Fig. 1 is the log data service framework diagram of the present invention.
Fig. 2 is the technical realization diagram of the log data service framework of the present invention.
Fig. 3 is the data collection flow chart of the present invention.
Fig. 4 is the log data query flow chart of the present invention.
Detailed description of the embodiments
To help those of ordinary skill in the art understand and implement the present invention, it is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the examples described here serve only to illustrate and explain the invention, not to limit it.
As shown in Fig. 1, after raw log data is generated on a child node, the Agent in that node passes it to the Collector. Each Agent consists of three parts: source, channel and sink. The source is the data source, the channel is the conduit through which data is transported, and the sink delivers the data to the designated place; the three are triggered and coordinated by events. After the Collector receives the data it aggregates it into one larger data stream, which is finally delivered into the database.
Log data generally falls into two classes. The first is the raw log data acquired from each piece of application software; its information volume is very large but disordered. This method archives such data, and since it will not be accessed frequently and has no strict real-time requirement, it can be stored in a MySQL cluster after normalization. The second class is the preprocessed log data, which is of great significance to log analysis work such as fault diagnosis and will therefore be accessed frequently; this method stores it in the higher-performance distributed database HBase. How to combine the relational database with the non-relational one, and how to design a suitable normalized data structure for log storage, is the core of the log data storage service.
As shown in Fig. 2, the raw data is first stored in the relational database cluster of the service layer: the log data aggregation service on the resource layer imports the raw log data into the service layer, where the log data storage service stores it into the MySQL cluster; this method uses MySQL Cluster to build the database cluster.
After the log data collection service has collected the raw log data, this method sends it to the MySQL database cluster so that it can be conveniently queried and modified. The format of raw log data is generally "time - log level - event", which is very irregular; preprocessing such raw data directly would waste a great deal of time and make the whole log data processing service inefficient, so the raw log data format must be normalized. This method divides the log data format into three parts: log index, log record body and log level.
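As an illustrative sketch only (Scala over plain JDBC; the cluster URL, credentials and table schema are assumptions, and the MySQL Connector/J driver is assumed to be on the classpath), writing one normalized raw record into the MySQL cluster might look like:

```scala
import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:mysql://mysql-node1:3306/logs", "loguser", "secret")
val ps = conn.prepareStatement(
  "INSERT INTO raw_logs (service_id, level, body) VALUES (?, ?, ?)")
ps.setString(1, "svc-42")                  // log index: service ID
ps.setString(2, "ERROR")                   // log level
ps.setString(3, "remote service timeout")  // log record body
ps.executeUpdate()
ps.close(); conn.close()
```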
As can be seen in Fig. 2, after entering the MySQL cluster the normalized log data is distributed across different ndbd (data) nodes and stored by the NDB engine. The management node (also called the management server) mainly manages the data nodes and the SQL nodes, together with the cluster configuration file and the cluster log file; it monitors the working state of the other nodes and can start, stop or restart a node. The data nodes store the data. The SQL nodes are like ordinary MySQL servers, through which SQL operations can be performed.
After the raw data is stored in the MySQL cluster, it is immediately imported into Spark, where the RDD modules in Spark filter, de-duplicate and segment the raw data; the preprocessed results are then imported into HBase for storage.
In HBase, the large-scale distributed database is managed by dividing it into regions. The preprocessed data is stored record by record; here a customized MapReduce job imports the preprocessed data into HBase. Since HBase's underlying file storage system is HDFS, it still enjoys HDFS's high fault tolerance, and HBase additionally provides an indexing mechanism that makes it convenient for other applications to access the log data.
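A minimal Scala sketch of writing one preprocessed record into HBase with the standard client API (the table name "logs", the column family "d" and the rowkey layout are assumptions, not taken from the patent):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("logs"))

// Hypothetical rowkey: service ID + timestamp, echoing the primary-key index idea.
val put = new Put(Bytes.toBytes("svc-42_20180130120000"))
put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("body"),  Bytes.toBytes("remote service timeout"))
put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("level"), Bytes.toBytes("ERROR"))
table.put(put)

table.close(); conn.close()
```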
After the preprocessed log data is stored into the HBase database, users need to query the log database to obtain the log data they require. HBase itself only supports rowkey-based queries, but users do not know the rowkeys of the data they need and can only query log data by keyword. To meet this multi-condition query demand of users, this method serves them with Solr-based HBase multi-condition queries: Solr encapsulates the data in the HBase database under different conditions, and users can run conditional queries over the preprocessed log data according to their different needs to obtain the log data they require.
As shown in Fig. 4, this method indexes the condition-filter fields and the rowkey of the HBase tables in Solr. A Solr multi-condition query quickly obtains the rowkey values that satisfy the filter conditions; these rowkeys are then used for direct rowkey queries in HBase, and the resulting data set is finally returned to the user. Because log data comes in large volumes, a traditional single-key index no longer suits log data queries, so this method builds a composite key index on Solr: given the characteristics of the normalized log data, indexes are built on the three key fields Time, Skype and ID, sorted in descending order by Time.
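As a minimal sketch only (Scala with the SolrJ client; the collection URL, the query field names and the stored "rowkey" field are assumptions), the two-step query might look like:

```scala
import scala.collection.JavaConverters._
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient

val solr = new HttpSolrClient.Builder("http://localhost:8983/solr/logs").build()

// Multi-condition filter resolved by Solr, sorted descending by Time as above.
val q = new SolrQuery("Level:ERROR AND ID:svc-42")
q.addSort("Time", SolrQuery.ORDER.desc)

val rowkeys = solr.query(q).getResults.asScala
  .map(_.getFieldValue("rowkey").toString)

// Each rowkey is then fetched directly from HBase, e.g.
//   table.get(new org.apache.hadoop.hbase.client.Get(Bytes.toBytes(rowkey)))
solr.close()
```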
To better illustrate the present invention, a case study was deployed on an integrated disaster-reduction spatial information service application system. To verify the feasibility of the log data collection and processing method proposed here, the test environment consists of three physical machines forming one Spark cluster, with two machines as worker nodes and one as the master node; all hosts run Ubuntu. The master node only schedules and distributes tasks, while the worker nodes perform the actual computation. The master node is a DELL PowerEdge M630 with E5-2609 v3 processors (6 cores each, 1.9 GHz, 15 MB cache), 64 GB of DDR4 memory and two 300 GB 10K 2.5'' SAS disks. Each worker node is a DELL PowerEdge M630 with Xeon E5-2640 v3 processors (8 cores each, 2.6 GHz, 20 MB cache), 128 GB of DDR4 memory and two 300 GB 10K 2.5'' SAS disks. The machines are connected by 10-gigabit network cards. The same three physical machines also host a MySQL cluster that provides the relational database service.
The data used here are the log data of the integrated disaster-reduction spatial information service application system. The goal of that application system is to visualize the risks and losses of natural disasters along the dimensions of space and time, to provide intuitive information for every stage of disaster management work, and to offer product, technology and decision services, thereby supporting effective disaster prevention and reduction. The data provided come in three sets: sample data (100 MB), one month of data (700 MB) and one full year of data (4.96 GB).
It should be understood that the parts not elaborated in this specification belong to the prior art.
It should be understood that the above description of the preferred embodiments is relatively detailed and should not therefore be regarded as limiting the scope of patent protection of the invention. Under the inspiration of the present invention, and without departing from the scope protected by the claims, those of ordinary skill in the art may make substitutions or variations, all of which fall within the protection scope of the invention; the claimed scope of the invention is determined by the appended claims.

Claims (7)

1. A Spark-based collection and service processing system for application software running logs, characterized by comprising a log collection service unit in the log data resource layer, a log data storage service unit and a log data preprocessing service unit in the log data service layer, and a user log-data-acquisition service unit in the log data application layer, wherein:
the log collection service unit collects the raw log data generated while application software runs on the log data resource layer;
the log data preprocessing service unit removes unnecessary information from the raw log data according to demand, keeping the messages users need, and performs three kinds of preprocessing on the raw data: data filtering, data de-duplication and log record segmentation;
the log data storage service unit is responsible for storing the raw data and the preprocessed data;
the user log-data-acquisition service unit provides a multi-condition query service interface and supplies query services to users.
2. A Spark-based collection and service processing method for application software running logs, characterized by comprising the following steps:
Step 1: the log collection service collects raw log data using a distributed collection strategy;
Step 2: the log data preprocessing service preprocesses the collected raw log data;
Step 3: the log data storage service receives the raw log data and the preprocessed log data and stores them in separate databases;
Step 4: the user log-data-acquisition service provides a multi-condition query service interface and supplies query services to users.
3. The Spark-based collection and service processing method for application software running logs according to claim 2, characterized in that said Step 1 comprises the following steps:
Step 1.1: the log collection service connects to log service nodes in failover mode and automatically selects an available node to connect to; when a node in the log service node cluster fails, the collected log data is passed to another service node; when a cluster message service node is unavailable, failover automatically selects another available message service node to take over;
Step 1.2: when collecting log data, the log collection module first sets the log file path; collection runs on each child node, and after collection the partial results converge into one large log data set; to meet users' demands on the log data, a filter is set on each child node;
Step 1.3: before collection starts, the data sources of the logs to collect are determined; it is then checked whether the master node has started, and if not, it is started by modifying the configuration file; if it has started, the master nodes are selected, and one or more master nodes are used according to system requirements; after the master nodes start, the agent nodes are set up, and users customize the agent nodes to their own needs in three aspects: the source, i.e. the real-time data source; the channel, i.e. the buffer for real-time log data; and the sink, i.e. the real-time data output; the settings include the name, type and attributes of each source, channel and sink; after configuration the agents are connected, and once that completes all agent nodes start collecting log data.
4. The Spark-based collection and service processing method for application software running logs according to claim 2, characterized in that said Step 2 comprises the following steps:
Step 2.1: the log data preprocessing service preprocesses the log data, performing three kinds of preprocessing on the raw data: data filtering, data de-duplication and log record segmentation;
Step 2.2: after this simple preprocessing, the log data is classified using a text classification algorithm: a TF-IDF algorithm builds a VSM model to vectorize the text, and a KNN algorithm then classifies the data.
5. The Spark-based collection and service processing method for application software running logs according to claim 2, characterized in that in step 2.1 the log data preprocessing is divided into three parts, specifically:
Step A: data filtering; a raw log data set contains many unnecessary records, so the raw log data must be filtered;
Step B: log de-duplication; a raw log data set contains many repeated records, for example when a remote service request is interrupted the same log record may be returned repeatedly; of these records only the first one returned matters, the repeats do not help users' subsequent log data analysis, and the repeated log records must therefore be removed;
Step C: log record classification; raw log data generally has the format "time - log level - service name - event", which users cannot read directly; the log records must therefore be classified, with different classes carrying different meanings.
6. The Spark-based collection and service processing method for application software running logs according to claim 2, characterized in that said Step 3 comprises the following steps:
Step 3.1: log data is divided into two kinds: one is raw log data, whose value density is generally low but which is very valuable if mined further; this kind is archived, and since it will not be accessed frequently and has no strict real-time requirement, it is stored in a MySQL database; the other kind is log data that has been extracted, cleaned, filtered and preprocessed, which is directly relevant to subsequent user analysis and will be accessed frequently, so it is stored in the high-performance distributed NoSQL database HBase;
Step 3.2: raw log data is an irregular, semi-structured data form whose format differs across data types; to let the two different kinds of database cooperate, the log data must be normalized; normalized formatting serves two purposes, scalability and simplification: scalability means accommodating application logs of different types without constraining the log format, and simplification means that normalized log data improves users' efficiency when they do log analysis;
Step 3.3: the normalized raw log data format is divided into three parts: log index, log record body and log level; the log index is the core of the whole log data storage format, where a service ID defined for each log service is used as a primary-key index so that log records can be located quickly, improving the efficiency of both preprocessing and later query services; the log record body stores the information of the log data itself and is transparent to the log system, which performs fault diagnosis by analyzing this information; the log level is divided into three grades: INFO, ERROR and DEBUG.
7. The Spark-based collection and service processing method for application software running logs according to claim 2, characterized in that said Step 4 comprises the following steps:
Step 4.1: a multi-condition query service is provided to users, who can use multi-condition queries to look up log data according to their different needs;
Step 4.2: the log data query module first builds an index over the log data gathered by the log collection module; through the interface the query module provides, a user can retrieve all the log data on a given server with a simple multi-condition query and fetch the required log data from the log data store in real time; from the message of each log data record the user can generally check whether the application has gone wrong, and can also learn which server the log data came from, among other information.
CN201810091898.6A 2018-01-30 2018-01-30 Spark-based collection and service processing system and method for application software running logs Pending CN108399199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810091898.6A CN108399199A (en) Spark-based collection and service processing system and method for application software running logs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810091898.6A CN108399199A (en) Spark-based collection and service processing system and method for application software running logs

Publications (1)

Publication Number Publication Date
CN108399199A true CN108399199A (en) 2018-08-14

Family

ID=63095380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810091898.6A Pending CN108399199A (en) Spark-based collection and service processing system and method for application software running logs

Country Status (1)

Country Link
CN (1) CN108399199A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109120445A (en) * 2018-08-22 2019-01-01 公安部第三研究所 A kind of network log data synchronous system and method
CN109542733A (en) * 2018-12-05 2019-03-29 焦点科技股份有限公司 A kind of highly reliable real-time logs collection and visual modeling technique method
CN109992569A (en) * 2019-02-19 2019-07-09 平安科技(深圳)有限公司 Cluster log feature extracting method, device and storage medium
CN110209518A (en) * 2019-04-26 2019-09-06 福州慧校通教育信息技术有限公司 A kind of multi-data source daily record data, which is concentrated, collects storage method and device
CN110503131A (en) * 2019-07-22 2019-11-26 北京工业大学 Wind-driven generator health monitoring systems based on big data analysis
CN111177193A (en) * 2019-12-13 2020-05-19 航天信息股份有限公司 Flink-based log streaming processing method and system
CN111917600A (en) * 2020-06-12 2020-11-10 贵州大学 Spark performance optimization-based network traffic classification device and classification method
CN112035353A (en) * 2020-08-28 2020-12-04 北京浪潮数据技术有限公司 Log recording method, device, equipment and computer readable storage medium
CN112632020A (en) * 2020-12-25 2021-04-09 中国电子科技集团公司第三十研究所 Log information type extraction method and mining method based on spark big data platform
CN113010483A (en) * 2020-11-20 2021-06-22 云智慧(北京)科技有限公司 Mass log management method and system
CN114301769A (en) * 2021-12-29 2022-04-08 杭州迪普信息技术有限公司 Method and system for processing original flow data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636494A (en) * 2015-03-04 2015-05-20 浪潮电子信息产业股份有限公司 Spark-based log auditing and reversed checking system for big data platforms
EP3179387A1 (en) * 2015-12-07 2017-06-14 Ephesoft Inc. Analytic systems, methods, and computer-readable media for structured, semi-structured, and unstructured documents
CN106227832A (en) * 2016-07-26 2016-12-14 浪潮软件股份有限公司 The Internet big data technique framework application process in operational analysis in enterprise

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张骁 (Zhang Xiao) et al.: "应用软件运行日志的收集与服务处理框架" (Collection and Service Processing Framework for Application Software Running Logs), 《计算机工程与应用》 (Computer Engineering and Applications) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109120445B (en) * 2018-08-22 2021-11-26 公安部第三研究所 Network log data synchronization system and method
CN109120445A (en) * 2018-08-22 2019-01-01 公安部第三研究所 A kind of network log data synchronous system and method
CN109542733A (en) * 2018-12-05 2019-03-29 焦点科技股份有限公司 A kind of highly reliable real-time logs collection and visual modeling technique method
CN109542733B (en) * 2018-12-05 2020-05-01 焦点科技股份有限公司 High-reliability real-time log collection and visual retrieval method
CN109992569A (en) * 2019-02-19 2019-07-09 平安科技(深圳)有限公司 Cluster log feature extracting method, device and storage medium
CN110209518A (en) * 2019-04-26 2019-09-06 福州慧校通教育信息技术有限公司 A kind of multi-data source daily record data, which is concentrated, collects storage method and device
CN110503131A (en) * 2019-07-22 2019-11-26 北京工业大学 Wind-driven generator health monitoring systems based on big data analysis
CN110503131B (en) * 2019-07-22 2023-10-10 北京工业大学 Wind driven generator health monitoring system based on big data analysis
CN111177193A (en) * 2019-12-13 2020-05-19 航天信息股份有限公司 Flink-based log streaming processing method and system
CN111917600A (en) * 2020-06-12 2020-11-10 贵州大学 Spark performance optimization-based network traffic classification device and classification method
CN112035353A (en) * 2020-08-28 2020-12-04 北京浪潮数据技术有限公司 Log recording method, device, equipment and computer readable storage medium
CN112035353B (en) * 2020-08-28 2022-06-17 北京浪潮数据技术有限公司 Log recording method, device and equipment and computer readable storage medium
CN113010483A (en) * 2020-11-20 2021-06-22 云智慧(北京)科技有限公司 Mass log management method and system
CN112632020B (en) * 2020-12-25 2022-03-18 中国电子科技集团公司第三十研究所 Log information type extraction method and mining method based on spark big data platform
CN112632020A (en) * 2020-12-25 2021-04-09 中国电子科技集团公司第三十研究所 Log information type extraction method and mining method based on spark big data platform
CN114301769A (en) * 2021-12-29 2022-04-08 杭州迪普信息技术有限公司 Method and system for processing original flow data

Similar Documents

Publication Publication Date Title
CN108399199A (en) Spark-based collection and service processing system and method for application software running logs
US11755628B2 (en) Data relationships storage platform
CN108038222B (en) System of entity-attribute framework for information system modeling and data access
Zhang et al. Multi-database mining
CN106095862B (en) Storage method of centralized extensible fusion type multi-dimensional complex structure relation data
CN110300963A (en) Data management system in large-scale data repository
CN110168515A (en) System for analyzing data relationship to support query execution
CN109997125A (en) System for importing data to data storage bank
CN108255712A (en) The test system and test method of data system
CN112199433A (en) Data management system for city-level data middling station
US9123006B2 (en) Techniques for parallel business intelligence evaluation and management
Bellini et al. Data flow management and visual analytic for big data smart city/IOT
JP5535062B2 (en) Data storage and query method for time series analysis of weblog and system for executing the method
US11615076B2 (en) Monolith database to distributed database transformation
CN112148578A (en) IT fault defect prediction method based on machine learning
CN114218218A (en) Data processing method, device and equipment based on data warehouse and storage medium
CN111126852A (en) BI application system based on big data modeling
Theeten et al. Chive: Bandwidth optimized continuous querying in distributed clouds
Tsai et al. Data Partitioning and Redundancy Management for Robust Multi-Tenancy SaaS.
CN111459900B (en) Big data life cycle setting method, device, storage medium and server
Batini et al. A survey of data quality issues in cooperative information systems
Hadzhiev et al. A Hybrid Model for Structuring, Storing and Processing Distributed Data on the Internet
Jadhav et al. A Practical approach for integrating Big data Analytics into E-governance using hadoop
Prashanthi et al. Generating analytics from web log
Wei et al. A method and application for constructing a authentic data space

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180814