CN109033196A

CN109033196A - Distributed data scheduling system and method

Info

Publication number: CN109033196A
Application number: CN201810689485.8A
Authority: CN
Inventors: 王肖磊; 王志超; 李敬轩
Original assignee: Beijing Qihoo Technology Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2018-06-28
Filing date: 2018-06-28
Publication date: 2018-12-18

Abstract

The present invention provides a distributed data scheduling system and method. The system includes at least one scheduling component. The scheduling component is adapted to obtain offline logs to be processed from a file system and to divide them into multiple sub-logs; it is further adapted to distribute the sub-logs to multiple data mining units, each of which mines log metadata from its sub-log according to preset rules; and it is further adapted to store the log metadata and other process information in a storage component that includes multiple preset databases. This solves the single-point problem in the prior art caused by a central device handling all tasks: when a task fails or the executing device breaks down, another device node continues the task, which achieves automatic multi-machine retry of failed tasks and keeps tasks running promptly and correctly. If other problems occur during task execution, the system can also send an alert notification to the user.

Description

Distributed data scheduling system and method

Technical field

The present invention relates to the field of computer technology, and in particular to a distributed data scheduling system and method.

Background

Heimdall is a massive-data mining and analysis system with fully independent intellectual property rights. It mines and processes massive data and provides easy-to-use tools for data mining personnel and operations analysts. At present, when analysts query files with the system, what they find is usually the original log, so they still have to process and analyze the original log themselves. This increases their workload and lowers their efficiency. For the convenience of analysts, the original logs therefore need to be further extracted and refined directly within the Heimdall system.

However, data mining and data extraction tasks in the system are currently chained together by scripts, a computing platform, or hard coding. In this approach, all data scheduling tasks are completed by a single central device, and the scheduling information for all data processing tasks converges on that one management node, which congests the information flow. If the management node fails, data processing across the whole system is affected. Moreover, data processing in the current system is inefficient and cannot meet the need to further extract and refine original logs directly within the system.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a distributed data scheduling system and a corresponding method that overcome, or at least partially solve, the above problems.

According to one aspect of the present invention, a distributed data scheduling system is provided, including at least one scheduling component, wherein:

the scheduling component is adapted to obtain offline logs to be processed from a file system, and to divide the offline logs to be processed into multiple sub-logs;

the scheduling component is further adapted to distribute the multiple sub-logs to multiple data mining units, each data mining unit mining log metadata from its sub-log according to preset rules;

the scheduling component is further adapted to store the log metadata and other process information in a storage component that includes multiple preset databases.

Optionally, the scheduling component is further adapted to:

divide the offline logs to be processed into multiple sub-logs according to the source of the offline logs, and distribute the sub-logs to the corresponding data mining units according to the running state of each data mining unit.

Optionally, the data mining unit is further adapted to:

mine the sub-log based on the MapReduce model using the Spark engine, and extract the log metadata.

Optionally, the offline logs to be processed obtained from the file system include at least one of the following:

logs generated by client access to the server, and logs generated by sample rescan behavior.

Optionally, the content of the log metadata includes at least one of the following:

user identification information, log type.

Optionally, the scheduling component is further adapted to:

monitor how each data mining unit mines its sub-log, and, when any mining process is found to be running abnormally, automatically start another data mining unit to continue mining the corresponding sub-log.

Optionally, the scheduling component is further adapted to:

if the multiple preset databases in the storage component include a mysql database, an etcd database, and a redis database, store the log metadata in the mysql database and/or the etcd database, and store the other process information in the redis database.

Optionally, the system further includes:

at least one front-end unit, adapted to display the mining process of each data mining unit, to monitor the display state, and to issue an alert notification when the display state is abnormal.

Optionally, the front-end unit is further adapted to:

receive a user trigger to pause a mining process that is being executed, or to start a mining process that needs to be executed.

Optionally, the system further includes:

a data processing unit, adapted to classify and merge the offline logs to generate corresponding virtual tables, and to compute statistics over the log metadata to obtain corresponding statistical information.

Optionally, the data processing unit is further adapted to:

classify and merge the offline logs according to at least one item of the log metadata content to generate the corresponding virtual tables.

Optionally, the data processing unit is further adapted to:

perform an aggregation calculation on the offline logs according to preset rules to obtain logs in a specific format;

classify and merge the logs in the specific format to generate the corresponding virtual tables.

Optionally, the preset rules include:

the data processing unit performs the aggregation calculation on the received offline logs at a preset time interval to obtain the logs in the specific format.

According to another aspect of the present invention, a distributed data scheduling method is also provided, including:

obtaining offline logs to be processed from a file system, and dividing the offline logs to be processed into multiple sub-logs;

distributing the multiple sub-logs to multiple data mining units, each data mining unit mining log metadata from its sub-log according to preset rules;

storing the log metadata and other process information in a storage component that includes multiple preset databases.

Optionally, dividing the offline logs to be processed into multiple sub-logs includes:

dividing the offline logs to be processed into multiple sub-logs according to the source of the offline logs.

Optionally, distributing the multiple sub-logs to multiple data mining units includes:

distributing the multiple sub-logs to the corresponding data mining units according to the running state of each data mining unit.

Optionally, mining log metadata from the sub-log according to preset rules includes:

Optionally, the multiple offline logs pre-stored in the file system include at least one of the following:

Optionally, the content of the log metadata includes at least one of the following:

user identification information, log type.

Optionally, the method further includes:

Optionally, storing the log metadata and other process information in the storage component that includes multiple preset databases includes:

Optionally, the method further includes:

displaying the mining process of each data mining unit, monitoring the display state, and issuing an alert notification when the display state is abnormal.

Optionally, the method further includes:

classifying and merging the offline logs to generate corresponding virtual tables, and computing statistics over the log metadata to obtain corresponding statistical information.

Optionally, classifying and merging the offline logs to generate corresponding virtual tables includes:

Optionally, classifying and merging the offline logs to generate corresponding virtual tables further includes:

Optionally, the preset rules include:

According to another aspect of the present invention, a computer storage medium is also provided. The computer storage medium stores computer program code which, when run on a computer, causes the computer to execute the distributed data scheduling method described in any of the above.

According to another aspect of the present invention, a computing device is also provided, including: a processor; and a memory storing computer program code. When the computer program code is run by the processor, it causes the computing device to execute the distributed data scheduling method described in any of the above.

The distributed data scheduling system of the present invention includes at least one scheduling component, multiple data mining units, and a storage component that includes multiple preset databases. First, the scheduling component obtains offline logs to be processed from the file system and divides them into multiple sub-logs. The scheduling component then distributes the sub-logs to multiple data mining units; after a data mining unit receives its sub-log, it mines log metadata from the sub-log according to preset rules. A data mining unit also generates other process information while mining a sub-log, and during mining the scheduling component can store the obtained log metadata and other process information, in real time, in the storage component that includes multiple preset databases. Thus, through a distributed architecture of at least one scheduling component, multiple data mining units, and a storage component with multiple preset databases, the present invention divides the obtained offline logs into multiple sub-logs and has multiple data mining units execute the mining tasks in parallel, mining the corresponding log metadata from each sub-log. This solves the single-point problem in the prior art caused by a central device handling all tasks. Because tasks are executed by multiple nodes in parallel, when a data processing task fails or the executing device breaks down, another device node can continue the task, which achieves automatic multi-machine retry of failed tasks and keeps tasks running promptly and correctly. In addition, by storing log metadata and other process information separately by category, the present invention makes it easier for data analysts or other data processing devices to look up and obtain the corresponding information later, which indirectly improves the timeliness of data processing tasks. The distributed data scheduling system provided by the present invention can also automatically classify and dispatch the code of data extraction tasks, which simplifies the handling of each data task. The user does not need to care how the code runs: once the code is uploaded to the data scheduling system of the present invention, the system automatically classifies it and schedules it onto a suitable machine, which further supports automatic retry of failed tasks. Furthermore, if other problems occur while a data task is executing, the system can send an alert notification to the user.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.

From the following detailed description of specific embodiments of the invention in conjunction with the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages, and features of the invention.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the present invention. Throughout the drawings, the same reference numbers denote the same parts. In the drawings:

Fig. 1 is a structural schematic diagram of a distributed data scheduling system according to an embodiment of the present invention;

Fig. 2 is another structural schematic diagram of a distributed data scheduling system according to an embodiment of the present invention;

Fig. 3 is a further structural schematic diagram of a distributed data scheduling system according to an embodiment of the present invention;

Fig. 4 is a design structure schematic diagram of a distributed scheduling system according to an embodiment of the present invention; and

Fig. 5 is a flow chart of a distributed data scheduling method according to an embodiment of the present invention.

Detailed description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed fully to those skilled in the art.

To solve the above technical problem, an embodiment of the present invention provides a distributed data scheduling system. Fig. 1 shows a structural schematic diagram of a distributed data scheduling system according to an embodiment of the present invention. Referring to Fig. 1, the distributed data scheduling system of this embodiment includes at least one scheduling component 10, multiple data mining units 20, and a storage component 30 that includes multiple preset databases.

The function of each part of the distributed data scheduling system of this embodiment, and the connections between the parts, are introduced below:

the scheduling component 10 is adapted to obtain offline logs to be processed from the file system, to divide the obtained offline logs into multiple sub-logs, and to distribute the sub-logs to the data mining units 20;

the data mining unit 20 is coupled to the scheduling component 10 and is adapted to receive the sub-log distributed to it by the scheduling component 10 and to mine log metadata from the sub-log according to preset rules;

the storage component 30 is coupled to the scheduling component 10 and is adapted to store, in the multiple preset databases, the log metadata and other process information that the scheduling component 10 obtains from the data mining units 20.

It should be noted that, for convenience and clarity, Fig. 1 shows only the connections between one scheduling component 10 and the other parts. Every other scheduling component 10 in the distributed data scheduling system of this embodiment has the same structure and function as the scheduling component 10 described here, and its connections are the same as in the example of Fig. 1. They are therefore not described again, and the connections of the other scheduling components 10 are not shown in the figure.

The present invention is by least one scheduling component 10, multiple data mining unit 20 and includes multiple preset data The distributed structure/architecture of the storage assembly 30 in library is divided into multiple sub- logs by the offline logs that scheduling component 10 will acquire, and by more A data mining unit 20 executes mining task parallel, and corresponding log metadata is excavated from each sub- log, is solved existing In technology as central equipment focus on task and caused by single-point problem, also, due to executing task using multi-node parallel Structure, when data processing task failure or execute equipment break down when, this can be continued to execute by other equipment node Business, realizes the automatic multimachine of mission failure and retries, ensure that timely, the correct operation of task.On the other hand, the present invention by pair Log metadata and other procedural informations carry out classification storage, are convenient for subsequent data analysis personnel or other data processing equipments It searches and obtains corresponding informance, the timeliness of data processing task is further improved from side.In addition, distribution provided by the invention Formula data dispatch system can also facilitate the processing of each data task automatically by extracted data task code classification granting, Further be concerned about how code runs without user, only need to be uploaded to data dispatch system of the invention, can will needed for The code of operation, which is classified automatically and is dispatched on suitable machine, to be run, and the function that mission failure retries automatically is further realized. More, when occur in data task implementation procedure other problems can also by the system to user carry out alert notice.

Specifically, in an embodiment of the present invention, the scheduling component 10 first obtains the offline logs to be processed from the file system. In this embodiment, the file system can be one that stores massive logs, such as hdfs (Hadoop Distributed File System) or S3 (Simple Storage Service); other file systems can of course also be used. The massive logs pre-stored in the file system may include, for example, logs generated by client access to the server and logs generated by sample rescan behavior. The embodiment of the present invention places no specific restriction on the type of log.

Further, after the scheduling component 10 has obtained offline logs from the file system, such as logs generated by client access to the server and logs generated by sample rescan behavior, it can process them further. In an embodiment of the present invention, the scheduling component 10 can divide the offline logs to be processed into multiple sub-logs according to their source, and distribute the sub-logs to the corresponding data mining units 20 according to the running state of each data mining unit 20. Specifically, the logs generated by client access to the server can be taken directly as one sub-log, the logs generated by sample rescan behavior as another sub-log, and the remaining offline logs can be divided into multiple sub-logs according to their sources. In an alternative embodiment, the logs generated by client access to the server can be further divided into multiple sub-logs according to other rules, as can the logs generated by sample rescan behavior. The scheduling component 10 can also classify the offline logs to be processed according to any other feasible rule. Dividing the offline logs into multiple sub-logs is a substantial improvement over chaining the whole flow together by scripts, a computing platform, or hard coding: the data processing flow is split into modules that can run in parallel, which improves processing efficiency, avoids a single point of failure in the system, and lets data processing tasks run promptly and stably.
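Splitting by source can be sketched as follows. This is only a minimal illustration, assuming each log record is a dictionary carrying a hypothetical "source" field; the function name split_by_source and the source labels are not taken from the patent.

```python
from collections import defaultdict


def split_by_source(records):
    """Group raw log records into sub-logs keyed by their source (assumed field)."""
    sub_logs = defaultdict(list)
    for record in records:
        sub_logs[record["source"]].append(record)
    return dict(sub_logs)


# Hypothetical records: two from client access, one from sample rescan.
records = [
    {"source": "client_access", "line": "GET /a"},
    {"source": "sample_rescan", "line": "scan 1"},
    {"source": "client_access", "line": "GET /b"},
]
sub_logs = split_by_source(records)
```

Each resulting sub-log can then be handed to a separate data mining unit.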

In an embodiment of the present invention, after the offline logs to be processed have been divided into multiple sub-logs, the scheduling component 10 can distribute the sub-logs to the corresponding data mining units 20 according to the running state of each data mining unit 20, and the data mining units 20 then mine their sub-logs. Specifically, when distributing sub-logs according to the state of each data mining unit 20, the scheduling component 10 can first determine whether each data mining unit 20 is currently executing a task. If the distributed data scheduling system has just started, most data mining units 20 are normally idle. In that case, after the scheduling component 10 has obtained the offline logs from the file system and divided them into sub-logs, it can arbitrarily choose the required number of data mining units 20 from the idle ones to receive the sub-logs. Alternatively, the data mining units 20 can be sorted by address information or another unique identifier, and the scheduling component 10 then distributes the sub-logs to a preset number of data mining units 20 in that order. The above is only an example and does not limit the present invention.
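One possible way to distribute sub-logs by running state is sketched below. It assumes each data mining unit is described by a hypothetical dictionary with "address" and "state" fields, and it uses round-robin over the sorted idle units, which is an illustrative choice rather than something the text prescribes.

```python
def assign_sub_logs(sub_log_ids, workers):
    """Map each sub-log to an idle worker, taking idle workers in address order."""
    idle = sorted(w["address"] for w in workers if w["state"] == "idle")
    if not idle:
        raise RuntimeError("no idle data mining unit available")
    # Round-robin is an assumption; any preset rule could be substituted here.
    return {sid: idle[i % len(idle)] for i, sid in enumerate(sub_log_ids)}


# Hypothetical worker states.
workers = [
    {"address": "10.0.0.3", "state": "idle"},
    {"address": "10.0.0.1", "state": "busy"},
    {"address": "10.0.0.2", "state": "idle"},
]
assignment = assign_sub_logs(["sub-a", "sub-b", "sub-c"], workers)
```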

After a data mining unit 20 receives the sub-log distributed to it by the scheduling component 10, it can process the sub-log further. Specifically, in an embodiment of the present invention, the data mining unit 20 can first create a running environment for the received sub-log and then mine log metadata from the sub-log according to preset rules. For example, the data mining unit 20 can mine the sub-log based on the MapReduce model using the Spark engine and extract the log metadata. MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB), and Spark is a fast, general-purpose computing engine designed for large-scale data processing. In this embodiment, the log metadata may include the user's identification information and the type of the log; depending on the user's needs, it may also include content such as the time at which the log was generated.
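The map and reduce steps of metadata extraction can be illustrated in plain Python, without Spark, to keep the example self-contained. The tab-separated line layout (user id, log type, payload) and the function names are assumptions made for this sketch only.

```python
from collections import Counter
from functools import reduce


def map_line(line):
    """Map step: extract metadata from one line of the assumed format."""
    user_id, log_type, _payload = line.split("\t", 2)
    return {"user_id": user_id, "log_type": log_type}


def reduce_counts(acc, meta):
    """Reduce step: count log lines per (user, log type) pair."""
    acc[(meta["user_id"], meta["log_type"])] += 1
    return acc


# Hypothetical sub-log content.
sub_log = ["u1\taccess\tGET /a", "u2\trescan\tfile.exe", "u1\taccess\tGET /b"]
metadata = [map_line(line) for line in sub_log]
counts = reduce(reduce_counts, metadata, Counter())
```

In a real deployment the same two functions would typically be passed to the engine's map and reduce operators instead of being run in a local loop.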

Further, while the data mining units 20 are mining data, the scheduling component 10 can monitor and record how each data mining unit 20 processes its sub-log. When it detects that the processing of any sub-log is abnormal, it can automatically start another data mining unit 20 to continue mining that sub-log. Specifically, when starting another data mining unit 20 automatically, any data mining unit 20 can be started at random to continue processing the sub-log; a suitable data mining unit 20 can be selected according to the state information of the other data mining units 20; or a data mining unit 20 can be chosen according to a preset rule. This embodiment only requires that, when the processing of a sub-log in any data mining unit 20 becomes abnormal, another data mining unit 20 is started automatically to continue processing the sub-log, so that the sub-log is processed promptly and correctly.
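The failover behavior can be sketched as a loop that hands the sub-log to the next candidate unit when the current one fails. The helper names run_with_failover and flaky_mine, and the use of exceptions to signal an abnormal mining process, are assumptions for illustration.

```python
def run_with_failover(sub_log, workers, mine):
    """Try each worker in turn until one mines the sub-log successfully."""
    errors = []
    for worker in workers:
        try:
            return worker, mine(worker, sub_log)
        except Exception as exc:  # an abnormal mining process, in this sketch
            errors.append((worker, exc))
    raise RuntimeError(f"all data mining units failed: {errors}")


def flaky_mine(worker, sub_log):
    """Hypothetical mining function in which worker-1 is broken."""
    if worker == "worker-1":
        raise ConnectionError("worker-1 is down")
    return len(sub_log)


worker, result = run_with_failover(["a", "b"], ["worker-1", "worker-2"], flaky_mine)
```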

In addition, after a data mining unit 20 has mined log metadata from a sub-log, the scheduling component 10 can store the mined log metadata and other process information in the storage component 30 that includes multiple preset databases. Specifically, in this embodiment, the preset databases of the storage component 30 can be an etcd database, a mysql database, or a redis database. The etcd database is a highly available key-value storage system used mainly for shared configuration and service discovery. The mysql database is used mainly to store metadata information about data, for example the results of statistics over log metadata. The redis database is an open-source, log-structured key-value database written in ANSI C that supports networking and can run in memory or persist to disk. In this embodiment, when the storage component 30 includes an etcd database, a mysql database, and a redis database, the log metadata is stored in the etcd database and/or the mysql database, and the other process information is stored in the redis database. Storing the generated log metadata and other process information separately by category makes the data processing tasks executed by the distributed data scheduling system more clearly defined and makes it easier for analysts or data processing devices to query and extract the corresponding information later.
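The storage routing rule can be sketched with in-memory stand-ins for the three databases. No real mysql, etcd, or redis client is used here; the class StorageComponent and its method names are hypothetical and only show which kind of record goes to which store.

```python
class StorageComponent:
    """In-memory stand-in for a storage component with three preset databases."""

    def __init__(self):
        self.mysql = []   # stand-in for a mysql table of log metadata
        self.etcd = {}    # stand-in for an etcd key-value store
        self.redis = {}   # stand-in for a redis key-value store

    def store_metadata(self, key, metadata):
        # Log metadata goes to mysql and/or etcd; this sketch writes to both.
        self.mysql.append(metadata)
        self.etcd[key] = metadata

    def store_process_info(self, key, info):
        # Other process information goes to redis.
        self.redis[key] = info


storage = StorageComponent()
storage.store_metadata("sub-a/0", {"user_id": "u1", "log_type": "access"})
storage.store_process_info("sub-a/progress", {"state": "running", "done": 1})
```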

Further, Fig. 2 shows another structural schematic diagram of a distributed data scheduling system according to an embodiment of the present invention. As shown in Fig. 2, the distributed data scheduling system of this embodiment further includes at least one front-end unit 40. The front-end unit 40 is coupled to the scheduling component 10 and is adapted to let data processing personnel develop code and upload the developed code to the file system for storage.

In addition, in an embodiment of the present invention, the front-end unit 40 is further adapted to display how each data mining unit 20 mines its sub-log, to monitor the display state, and to issue an alert notification to the user when the display state is abnormal. The front-end unit 40 can also receive a user trigger to pause the mining task that a data mining unit 20 is executing on any sub-log, or to start any mining task that needs to be executed.

In an embodiment of the present invention, Fig. 3 shows a further structural block diagram of a distributed data scheduling system according to an embodiment of the present invention. As shown in Fig. 3, the distributed data scheduling system further includes a data processing unit 50. The data processing unit 50 is coupled to the storage component 30 and is adapted to classify and merge the offline logs in the storage component 30 to generate corresponding virtual tables, and to compute statistics over the log metadata in the storage component 30 to obtain corresponding statistical information. When classifying and merging the offline logs into virtual tables, the data processing unit 50 can work on the basis of a proprietary framework. For example, it can classify and merge the logs according to their formats, fields, and so on, where the fields can be organized into whatever structured data form is required. As another example, it can classify and merge the logs according to at least one item of the log metadata content: when the log metadata includes the log generation time, user identification information, and log type, the data processing unit 50 can classify and merge the logs according to at least one of these items to generate the corresponding virtual tables.
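Classifying logs into virtual tables by a metadata field, and deriving simple statistics from them, might look like the sketch below. Representing a virtual table as a list of rows in a dictionary, and the name build_virtual_tables, are assumptions of this example.

```python
from collections import defaultdict


def build_virtual_tables(logs, field):
    """Group logs into one virtual table per distinct value of a metadata field."""
    tables = defaultdict(list)
    for log in logs:
        tables[log[field]].append(log)
    return dict(tables)


# Hypothetical logs with metadata already attached.
logs = [
    {"user_id": "u1", "log_type": "access", "ts": 100},
    {"user_id": "u2", "log_type": "rescan", "ts": 120},
    {"user_id": "u1", "log_type": "rescan", "ts": 130},
]
tables = build_virtual_tables(logs, "log_type")
# Simple statistics over the metadata: number of rows per virtual table.
row_counts = {name: len(rows) for name, rows in tables.items()}
```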

In this embodiment, the virtual tables may include the following:

Basic, basic sample information table, containing key sample information from the logs such as historical query volume, first-seen time, and rank; it gives a quick view of how important a sample is and also supports fast sample queries later;

Specimen_detail, sample details table;

Specimen_cloud_detail, sample cloud-lookup static attribute information table;

Upload, file upload traceability information table;

Cloud_info, sample cloud-lookup information table, containing cloud-lookup information for the samples in the logs, such as file path, historical rank, PV (Page View), and UV (Unique Visitor);

Network_behavior, network behavior table, containing information about the network behavior of the samples in the logs;

Proc_behavior, process behavior table;

Proc_chain, process chain information table;

Dropped_files, dropped file information table;

Scan_log, scanning information table; Scan_info, scanning information table; pestring; mid2ip; file_relations, file relationship table; Specimen, collected sample information table; pe_info, executable information table, containing executable information related to the samples; and so on.

In addition, in an embodiment of the present invention, when classifying and merging the offline logs into virtual tables, the data processing unit 50 can first perform an aggregation calculation on the offline logs to be processed according to preset rules to obtain logs in a specific format, and then classify and merge the logs in that format to generate the corresponding virtual tables. Specifically, the data processing unit 50 in this embodiment is capable of data extraction and aggregation: driven by a simple configuration file, it can extract logs from different data sources, perform an aggregation calculation on them, and finally produce logs in a specific format. For example, the logs in the specific format can be logs in json format; the aggregated logs can of course also be in other formats.

In this embodiment, when the aggregation calculation is performed on the offline logs according to preset rules, it can be performed at a preset time interval. The preset time interval can be, for example, once a day, once every two days, or once a month.
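Aggregating logs at a daily interval into json records can be sketched as follows. The input fields "ts" (a Unix timestamp) and "log_type", the UTC day boundary, and the output layout are assumptions of this example rather than details given in the text.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def aggregate_daily(logs):
    """Count logs per (UTC day, log type) and emit one json string per group."""
    counts = Counter()
    for log in logs:
        day = datetime.fromtimestamp(log["ts"], tz=timezone.utc).strftime("%Y-%m-%d")
        counts[(day, log["log_type"])] += 1
    return [
        json.dumps({"day": day, "log_type": log_type, "count": n}, sort_keys=True)
        for (day, log_type), n in sorted(counts.items())
    ]


# Hypothetical logs: two on the first day, one on the second.
logs = [
    {"ts": 0, "log_type": "access"},
    {"ts": 3600, "log_type": "access"},
    {"ts": 86400, "log_type": "rescan"},
]
rows = aggregate_daily(logs)
```

Changing the strftime pattern, for instance to a year and month only, would give a monthly interval instead.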

The distributed data scheduling system in the foregoing embodiments is in effect the core equipment for processing offline logs offline. It is mainly based on a distributed data extraction and aggregation framework and can process massive offline logs in a file system: it can run dozens of tasks per day, process hundreds of TB of data, and extract features on the order of trillions.

The distributed data scheduling system of the present invention is described in detail below by means of a specific embodiment.

The distributed data scheduling system of this embodiment is implemented on top of a distributed scheduling system, which is described first. The core of the distributed scheduling system is distributed task scheduling; a task may be a data conversion task or some other task, and the system is an infrastructure component. The distributed scheduling system is designed around a master-slave structure, automatically recovers and retries after a task fails, and supports multiple task types, such as extracting log metadata from the offline logs of a file system based on the MapReduce model using the Spark engine, scheduling and downloading files, and loading data stored in hdfs. Referring to Fig. 4, the specific working process of the distributed scheduling system is now introduced. Because the distributed data scheduling system is built on the distributed scheduling system and the two work in similar ways, the working process of the distributed scheduling system is introduced first.

In the distributed scheduling system, the etcd cluster is the master cluster (a master being the scheduling component of the present invention), and the master cluster contains multiple masters. Any master can extract pending tasks from file storage (S3/hdfs) and distribute them to the corresponding worker nodes (a worker being the data mining unit of the present invention). For example, the master leader can distribute the extracted tasks to 4 corresponding worker nodes, and these 4 worker nodes can execute the tasks in parallel.
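The distribution of tasks from a master to several workers running in parallel can be sketched as follows. This is an illustration only: the round-robin assignment, the thread pool standing in for worker nodes, and the names `run_on_worker` and `dispatch` are assumptions, not details given in the embodiment.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_worker(worker_id, task):
    """Stand-in for a worker node executing one task."""
    return {"worker": worker_id, "task": task, "status": "done"}


def dispatch(tasks, worker_ids):
    """Assign tasks to workers round-robin and run them in parallel."""
    with ThreadPoolExecutor(max_workers=len(worker_ids)) as pool:
        futures = [
            pool.submit(run_on_worker, worker_ids[i % len(worker_ids)], task)
            for i, task in enumerate(tasks)
        ]
        # Collect results in submission order.
        return [f.result() for f in futures]


# One master handing five tasks to four workers.
results = dispatch(["t1", "t2", "t3", "t4", "t5"], ["w1", "w2", "w3", "w4"])
```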

While the worker nodes execute tasks, the master can record the tasks currently being executed and the execution process of each corresponding worker node. In addition, the master can store the task metadata generated during task execution (such as log metadata) in the etcd and mysql databases, and store other generated records (such as the number of tasks) and temporary information in the memory/redis database.

When data processing tasks are scheduled with the distributed scheduling system of the embodiment of the present invention, other nodes can re-execute a task after it fails on one node, which effectively avoids the single-point problem. It also greatly simplifies data processing tasks: the user need not be concerned with how a task is executed. Once a task is uploaded to the distributed scheduler, it is automatically scheduled to run on a suitable machine and can be retried on failure.
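The retry-on-another-node behavior can be illustrated with the following sketch. It assumes a simple policy of trying candidate nodes in order; the names `run_with_retry` and `flaky_execute` and the simulated failure are hypothetical and are not taken from the embodiment.

```python
def run_with_retry(task, nodes, execute):
    """Try the task on each node in turn until one succeeds.

    `execute(node, task)` is a caller-supplied function that raises on failure.
    Returns (node, result, attempts), where attempts records what was tried.
    """
    attempts = []
    for node in nodes:
        try:
            result = execute(node, task)
        except Exception as exc:
            attempts.append((node, f"failed: {exc}"))
            continue
        attempts.append((node, "ok"))
        return node, result, attempts
    raise RuntimeError(f"task {task!r} failed on all nodes: {attempts}")


def flaky_execute(node, task):
    # Simulated failure: pretend node "n1" is down.
    if node == "n1":
        raise ConnectionError("node unreachable")
    return f"{task} finished on {node}"


node, result, attempts = run_with_retry(
    "extract-logs", ["n1", "n2", "n3"], flaky_execute
)
```

Here the task fails on `n1` and is re-executed on `n2`, so no single node is a point of failure.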

An embodiment of the present invention also provides a distributed data scheduling system, whose workflow is described in detail below. The core function of the distributed data scheduling system is to schedule and convert data. The distributed scheduling system described above can be used to schedule data analysis tasks; such a task may be a conversion task or some other task. That is, after the distributed scheduling system schedules a task, for example the offline logs to be processed, the distributed data scheduling system performs further data processing on the scheduled task, for example by providing elastic/programmable processing flows, data stream monitoring, easy-to-use storage, and modular data processing flows. The distributed data scheduling system can be designed on the basis of the data processing framework described above.

In this embodiment, the distributed data scheduling system redefines a task in terms of node, rdd, and meta. A node represents one way of processing data; the output of one node can serve as the input of the next node. The nodes are logically independent of one another but can be strung together through configuration/xml. A node may be of one of the following types:

filter, a node of the filter type, which processes the input rdd with custom filter conditions;

event, a node of the event type, which extracts results according to custom events;

fill, a node of the fill type, which processes the input rdd with custom fill rules;

map/reduce, a node that processes data with a map/reduce program;

spark, a node that processes data with a spark program;

script, a node that processes data with a script.

rdd is derived from the spark concept of a resilient distributed dataset. The result set of a node is an rdd. The storage of an rdd can be customized, or chosen automatically according to a defined data volume. In addition, an rdd can define trimming rules that trim its output.

meta is metadata: the data type that each node can process. For example, when samples are processed, the data structure of a sample can be defined in the form of the virtual tables described above.

It follows that the core function of the distributed data scheduling system is to execute, within each individual node, the business logic specified by that node's configuration.
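The idea of configuration-driven nodes whose output feeds the next node can be sketched as follows. Only two of the node types (filter and fill) are modelled, plain Python lists stand in for an rdd, and the configuration keys (`type`, `field`, `equals`, `defaults`) are hypothetical; the embodiment does not specify a configuration schema.

```python
def filter_node(rows, cfg):
    """filter: keep rows whose field equals the configured value."""
    return [r for r in rows if r.get(cfg["field"]) == cfg["equals"]]


def fill_node(rows, cfg):
    """fill: add missing fields using the configured defaults."""
    return [{**cfg["defaults"], **r} for r in rows]


# Registry mapping a node type in the configuration to its implementation.
NODE_TYPES = {"filter": filter_node, "fill": fill_node}


def run_pipeline(rows, node_configs):
    """Feed each node's output to the next node, in configuration order."""
    for cfg in node_configs:
        rows = NODE_TYPES[cfg["type"]](rows, cfg)
    return rows


pipeline = [
    {"type": "filter", "field": "log_type", "equals": "scan"},
    {"type": "fill", "defaults": {"level": "unknown"}},
]
rows = [
    {"log_type": "scan", "md5": "aaa", "level": "high"},
    {"log_type": "network", "md5": "bbb"},
    {"log_type": "scan", "md5": "ccc"},
]
output = run_pipeline(rows, pipeline)
```

Because each node only sees its input rows and its own configuration, nodes stay logically independent and can be reordered or run on their own.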

A data extraction task is described below with a simple example, in which the task is to extract Baidu-related samples.

First, md5 (message-digest algorithm 5) and sha1 (secure hash algorithm) values are extracted from the cloud-scan logs. The daily pv/uv of the corresponding samples is then calculated from the extracted md5 and sha1 values, and the parent_url (parent uniform resource locator) is obtained for samples whose pv/uv is greater than 100w (that is, 1,000,000). If the parent_url of a sample contains Baidu, all child processes of the sample are obtained, and the details of the first hundred child processes are extracted for display.

The specific steps for executing the above data extraction task with the distributed data scheduling system of the embodiment of the present invention may be:

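Step 1: monitor samples whose pv is greater than 1,000,000. The specific code may be (code listing not reproduced in this text).

Step 2: pull the parent_url attribute of the samples. The specific code may be (code listing not reproduced in this text).

Step 3: filter out the samples whose parent_url contains Baidu. The specific code may be (code listing not reproduced in this text).

Step 5: display the first hundred child processes of the samples in the front end.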
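The steps above can be sketched as follows. This is not the code of the embodiment: the function names, the record fields (`md5`, `pv`, `parent_url`), and the lookup tables for parent URLs and child processes are all assumptions introduced for illustration.

```python
def hot_samples(samples, threshold=1_000_000):
    """Keep samples whose daily pv exceeds the threshold."""
    return [s for s in samples if s["pv"] > threshold]


def with_parent_url(samples, parent_urls):
    """Attach the parent_url attribute, looked up by md5."""
    return [{**s, "parent_url": parent_urls.get(s["md5"], "")} for s in samples]


def from_baidu(samples):
    """Keep samples whose parent_url contains 'baidu'."""
    return [s for s in samples if "baidu" in s["parent_url"]]


def first_children(sample, children, limit=100):
    """Take at most `limit` child processes of a sample for display."""
    return children.get(sample["md5"], [])[:limit]


# Hypothetical inputs standing in for the aggregated log data.
samples = [
    {"md5": "aaa", "pv": 2_500_000},
    {"md5": "bbb", "pv": 500},
    {"md5": "ccc", "pv": 1_200_000},
]
parent_urls = {"aaa": "http://www.baidu.com/dl", "ccc": "http://example.com/x"}
children = {"aaa": [f"child-{i}" for i in range(150)]}

selected = from_baidu(with_parent_url(hot_samples(samples), parent_urls))
shown = first_children(selected[0], children)
```

Each function corresponds to one node, so the chain of calls mirrors how the output of one node becomes the input of the next.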

In the embodiment of the present invention, the scheduling core that implements the logic of the distributed data scheduling system can string the above steps together according to the overall configuration, manage the storage associated with each rdd, and distribute the task of each node to the respective node for execution. In this embodiment, a single node can also be executed independently.

In addition, the data processing system can provide a visual editing function through a front-end page; the configuration produced by visual editing is a conf (configuration file) in json format, which is submitted to the scheduling core. The front-end page can not only display the progress of each node but also allow the user to manually start or stop the task of a single node.
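As an illustration of what such a json conf might look like, the sketch below serializes an ordered list of nodes and wires each node's input to the previous node's output. The schema (`nodes`, `name`, `type`, `input`) and the function `build_conf` are hypothetical; the embodiment does not define the conf layout.

```python
import json


def build_conf(nodes):
    """Serialize an ordered node list as a json conf.

    Each node's 'input' is set to the name of the previous node;
    None marks the first node, which reads from the source logs.
    """
    conf = {"nodes": []}
    previous = None
    for node in nodes:
        entry = dict(node)
        entry["input"] = previous
        conf["nodes"].append(entry)
        previous = node["name"]
    return json.dumps(conf, indent=2)


conf_text = build_conf(
    [
        {"name": "hot_samples", "type": "filter"},
        {"name": "parent_url", "type": "fill"},
        {"name": "baidu_only", "type": "filter"},
    ]
)
parsed = json.loads(conf_text)
```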

Based on the same inventive concept, an embodiment of the present invention also provides a distributed data scheduling method. Fig. 5 shows a flowchart of the distributed data scheduling method according to an embodiment of the present invention. As shown in Fig. 5, the distributed data scheduling method includes at least steps S502 to S506:

Step S502: obtain the offline logs to be processed from the file system, and divide the offline logs to be processed into multiple sub-logs;

Step S504: distribute the multiple sub-logs to multiple data mining units, and have the data mining units mine log metadata from the sub-logs according to preset rules;

Step S506: store the log metadata and other process information in a storage component that includes multiple preset databases.

In an embodiment of the present invention, when step S502 is executed, after the offline logs to be processed are obtained, they may be divided into multiple sub-logs according to their source. Further, when step S504 is executed to distribute the multiple sub-logs to multiple data mining units, the sub-logs may be distributed to the corresponding data mining units according to the operating status of each data mining unit.
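Splitting by source and distributing by operating status can be sketched as below. The `source` field, the status values `idle` and `busy`, and the least-loaded assignment rule are assumptions made for the example; the embodiment only states that source and operating status are taken into account.

```python
from collections import defaultdict


def split_by_source(log_lines):
    """Group log lines into sub-logs keyed by their source field."""
    sub_logs = defaultdict(list)
    for line in log_lines:
        sub_logs[line["source"]].append(line)
    return dict(sub_logs)


def assign(sub_logs, worker_status):
    """Give each sub-log to the idle worker with the fewest assignments."""
    idle = [w for w, status in worker_status.items() if status == "idle"]
    if not idle:
        raise RuntimeError("no idle data mining unit available")
    load = {w: 0 for w in idle}
    plan = {}
    for source in sorted(sub_logs):
        worker = min(idle, key=lambda w: load[w])
        plan[source] = worker
        load[worker] += 1
    return plan


log_lines = [
    {"source": "client", "msg": "a"},
    {"source": "scan", "msg": "b"},
    {"source": "cloud", "msg": "c"},
    {"source": "client", "msg": "d"},
]
sub_logs = split_by_source(log_lines)
plan = assign(sub_logs, {"w1": "idle", "w2": "busy", "w3": "idle"})
```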

Further, when step S504 is executed, after a data mining unit receives its sub-log, the data mining unit can mine log metadata from the sub-log according to preset rules. In an embodiment of the present invention, the sub-log may be mined based on the MapReduce model using the Spark engine to extract the log metadata. The multiple offline logs pre-stored in the file system of this embodiment may include logs generated when a client accesses the server, logs generated by sample rescan behavior, and so on. The content of the log metadata may include user identity information, log type, and so on.

In an embodiment of the present invention, while the above steps are executed, the scheduling component may also monitor each data mining unit's mining of its sub-log and, when any mining process is found to be running abnormally, automatically start another data mining unit to continue mining that sub-log, thereby providing automatic retry of failed tasks in the system.

Further, after a data mining unit extracts log metadata from a sub-log, the extracted log metadata and the other process information generated during mining can be stored in the storage component that includes multiple preset databases. Specifically, if the multiple preset databases in the storage component include a mysql database, an etcd database, and a redis database, the log metadata is stored in the mysql database and/or the etcd database, and the other process information is stored in the redis database.
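The routing rule described above can be illustrated with an in-memory stand-in, in which plain dictionaries play the role of the three preset databases. The class `StorageComponent` and its method names are hypothetical, and no real mysql, etcd, or redis client is used.

```python
class StorageComponent:
    """In-memory stand-in for the storage component.

    Each dictionary in `backends` plays the role of one preset database.
    """

    def __init__(self):
        self.backends = {"mysql": {}, "etcd": {}, "redis": {}}

    def save_metadata(self, key, metadata, targets=("mysql",)):
        """Store log metadata in mysql and/or etcd only."""
        for name in targets:
            if name not in ("mysql", "etcd"):
                raise ValueError(f"log metadata cannot be stored in {name!r}")
        for name in targets:
            self.backends[name][key] = metadata

    def save_process_info(self, key, info):
        """Store other process information in redis."""
        self.backends["redis"][key] = info


store = StorageComponent()
store.save_metadata(
    "sublog-1", {"log_type": "scan", "user": "u42"}, targets=("mysql", "etcd")
)
store.save_process_info("sublog-1", {"task_count": 3})
```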

In addition, while the above steps are executed, a front-end unit may display the mining process of each data mining unit and monitor the displayed status, and may issue an alert notification to the user when an abnormal status is detected. The front-end unit may also receive a trigger from the user to pause a mining process that is running or to start a mining process that needs to be executed.

Further, after the above steps have been executed, the data processing unit may classify and merge the offline logs to generate the corresponding virtual tables, and compute statistics over the log metadata to obtain the corresponding statistical information. Specifically, when the offline logs are classified and merged to generate the corresponding virtual tables, they may be classified and merged according to at least one item of the log content metadata.
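A minimal sketch of classifying logs into virtual tables by one metadata field and computing simple statistics is given below. Using `log_type` as the classification key, and the table and function names, are assumptions for the example.

```python
from collections import Counter, defaultdict


def build_virtual_tables(logs, key="log_type"):
    """Classify log entries by one metadata field; each group is one table."""
    tables = defaultdict(list)
    for entry in logs:
        tables[entry[key]].append(entry)
    return dict(tables)


def metadata_stats(logs, key="log_type"):
    """Count how many log entries carry each value of the metadata field."""
    return Counter(entry[key] for entry in logs)


logs = [
    {"log_type": "scan_log", "md5": "aaa"},
    {"log_type": "proc_behavior", "md5": "aaa"},
    {"log_type": "scan_log", "md5": "bbb"},
]
tables = build_virtual_tables(logs)
stats = metadata_stats(logs)
```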

On the other hand, when the offline logs are classified and merged to generate the corresponding virtual tables, the data processing unit may first aggregate the offline logs according to preset rules to obtain logs in a specific format, and then classify and merge the logs in the specific format to generate the corresponding virtual tables. In this embodiment, the preset rules may include: the data processing unit aggregates the received offline logs at a preset time interval to obtain logs in a specific format.

An embodiment of the present invention also provides a computer storage medium storing computer program code which, when run on a computing device, causes the computing device to execute the distributed data scheduling method of any of the above embodiments.

In addition, an embodiment of the present invention also provides a computing device, including a processor and a memory storing computer program code; when the computer program code is run by the processor, it causes the computing device to execute the distributed data scheduling method of any of the above embodiments.

According to any one of the above preferred embodiments or a combination of several of them, the embodiments of the present invention can achieve the following beneficial effects:

The present invention uses a distributed architecture comprising at least one scheduling component, multiple data mining units, and a storage component that includes multiple preset databases. The scheduling component divides the obtained offline logs into multiple sub-logs, and the multiple data mining units execute the mining tasks in parallel, mining the corresponding log metadata from each sub-log. This solves the single-point problem that arises in the prior art when a central device processes all tasks. Moreover, because tasks are executed by multiple nodes in parallel, when a data processing task fails or the executing device breaks down, the task can be continued by another device node, which provides automatic multi-machine retry of failed tasks and ensures that tasks run correctly and on time. In addition, the present invention stores the log metadata and other process information separately by category, which makes it easier for data analysts or other data processing equipment to look up the corresponding information later and thus indirectly improves the timeliness of data processing tasks. Furthermore, the distributed data scheduling system provided by the present invention can automatically classify and distribute the code of the extracted data tasks, which simplifies the handling of each data task: the user need not be concerned with how the code runs and only has to upload it to the data scheduling system of the present invention, which automatically classifies the code to be run and schedules it onto a suitable machine, and which also retries failed tasks automatically. Finally, when other problems occur during the execution of a data task, the system can issue an alert notification to the user.

It will be apparent to those skilled in the art that, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; for brevity, they are not repeated here.

In addition, the functional units in the embodiments of the present invention may be physically independent of one another, two or more functional units may be integrated together, or all functional units may be integrated in one processing unit. The integrated functional units may be implemented in hardware, or in software or firmware.

Those of ordinary skill in the art will understand that if the integrated functional units are implemented in software and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in whole or in part, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that, when run, cause a computing device (such as a personal computer, a server, or a network device) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

Alternatively, all or part of the steps of the foregoing method embodiments may be carried out by hardware (such as a computing device, for example a personal computer, a server, or a network device) under the control of program instructions. The program instructions may be stored in a computer-readable storage medium, and when they are executed by the processor of the computing device, the computing device performs all or part of the steps of the methods of the embodiments of the present invention.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that, within the spirit and principles of the present invention, the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications or replacements do not cause the corresponding technical solutions to depart from the scope of protection of the present invention.

According to one aspect of the present invention, there is provided A1. a distributed data scheduling system, including at least one scheduling component,

A2. The system according to A1, wherein the scheduling component is further adapted to:

A3. The system according to A2, wherein the data mining unit is further adapted to:

A4. The system according to any one of A1-A3, wherein the offline logs to be processed that are obtained from the file system include at least one of the following:

A5. The system according to any one of A1-A3, wherein the content of the log metadata includes at least one of the following:

User identity information, log type.

A6. The system according to any one of A1-A3, wherein the scheduling component is further adapted to:

A7. The system according to any one of A1-A3, wherein the scheduling component is further adapted to:

A8. The system according to any one of A1-A3, further including:

A9. The system according to A8, wherein the front-end unit is further adapted to:

A10. The system according to any one of A1-A3, further including:

A11. The system according to A10, wherein the data processing unit is further adapted to:

A12. The system according to any one of A1-A3, wherein the data processing unit is further adapted to:

A13. The system according to A12, wherein the preset rules include:

According to another aspect of the present invention, there is also provided B14. a distributed data scheduling method, including:

B15. The method according to B14, wherein dividing the offline logs to be processed into multiple sub-logs includes:

B16. The method according to B15, wherein distributing the multiple sub-logs to multiple data mining units includes:

B17. The method according to B14, wherein mining log metadata from the sub-logs according to preset rules includes:

B18. The method according to any one of B14-B17, wherein the multiple offline logs pre-stored in the file system include at least one of the following:

B19. The method according to any one of B14-B17, wherein the content of the log metadata includes at least one of the following:

User identity information, log type.

B20. The method according to any one of B14-B17, further including:

B21. The method according to any one of B14-B17, wherein storing the log metadata and other process information in a storage component that includes multiple preset databases includes:

B22. The method according to any one of B14-B17, further including:

B23. The method according to B22, further including:

B24. The method according to any one of B14-B17, further including:

B25. The method according to B24, wherein classifying and merging the offline logs to generate the corresponding virtual tables includes:

B26. The method according to any one of B14-B17, wherein classifying and merging the offline logs to generate the corresponding virtual tables further includes:

B27. The method according to B26, wherein the preset rules include:

Aggregating, by the data processing unit, the received offline logs at a preset time interval to obtain logs in a specific format.

According to yet another aspect of the present invention, there is also provided C28. a computer storage medium storing computer program code which, when run on a computer, causes the computer to execute the distributed data scheduling method according to any one of B14-B27.

According to a further aspect of the present invention, there is also provided D29. a computing device, including: a processor; and a memory storing computer program code; wherein, when the computer program code is run by the processor, it causes the computing device to execute the distributed data scheduling method according to any one of B14-B27.

Claims

1. A distributed data scheduling system, including at least one scheduling component,

the scheduling component being adapted to obtain offline logs to be processed from a file system and to divide the offline logs to be processed into multiple sub-logs;

the scheduling component being further adapted to distribute the multiple sub-logs to multiple data mining units, the data mining units mining log metadata from the sub-logs according to preset rules;

the scheduling component being further adapted to store the log metadata and other process information in a storage component that includes multiple preset databases.

2. The system according to claim 1, wherein the scheduling component is further adapted to:

divide the offline logs to be processed into multiple sub-logs according to the source of the offline logs to be processed, and distribute the multiple sub-logs to the corresponding data mining units according to the operating status of each data mining unit.

3. The system according to claim 2, wherein the data mining unit is further adapted to:

mine the sub-log based on the MapReduce model using the Spark engine, and extract the log metadata.

4. The system according to any one of claims 1-3, wherein the offline logs to be processed that are obtained from the file system include at least one of the following:

5. The system according to any one of claims 1-3, wherein the content of the log metadata includes at least one of the following:

User identity information, log type.

6. The system according to any one of claims 1-3, wherein the scheduling component is further adapted to:

monitor each data mining unit's mining of its corresponding sub-log and, when any mining process is found to be running abnormally, automatically start another data mining unit to continue mining the corresponding sub-log.

7. The system according to any one of claims 1-3, wherein the scheduling component is further adapted to:

if the multiple preset databases in the storage component include a mysql database, an etcd database, and a redis database, store the log metadata in the mysql database and/or the etcd database, and store the other process information in the redis database.

8. A distributed data scheduling method, including:

distributing the multiple sub-logs to multiple data mining units, and mining, by the data mining units, log metadata from the sub-logs according to preset rules;

9. A computer storage medium storing computer program code which, when run on a computer, causes the computer to execute the distributed data scheduling method according to claim 8.

10. A computing device, including: a processor; and a memory storing computer program code; wherein, when the computer program code is run by the processor, it causes the computing device to execute the distributed data scheduling method according to claim 8.