CN104298771B - Method for querying and analyzing massive web log data - Google Patents

Method for querying and analyzing massive web log data

Info

Publication number
CN104298771B
CN104298771B (application CN201410596395.6A / CN201410596395A)
Authority
CN
China
Prior art keywords
data
hive
log
analysis
massive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410596395.6A
Other languages
Chinese (zh)
Other versions
CN104298771A (en)
Inventor
马廷淮
瞿晶晶
田伟
薛羽
曹杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhirong Shidai Information Technology Co ltd
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN201410596395.6A priority Critical patent/CN104298771B/en
Publication of CN104298771A publication Critical patent/CN104298771A/en
Application granted granted Critical
Publication of CN104298771B publication Critical patent/CN104298771B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a method for querying and analyzing massive web log data based on Hadoop and Hive, exploiting the high reliability, scalability, efficiency and fault tolerance of the Hadoop/Hive distributed computing platform. The method comprises the following steps: parsing the data of each data source; loading the data into a data warehouse; receiving HiveQL statements; optimizing the received statements to obtain preliminary map results; converting the received statements into MapReduce tasks, executing them, and storing the query results; partitioning the data; performing analysis and mining on the data; and loading the results into a MySQL database. Aimed at massive web log data, the invention achieves accurate querying and analysis, provides the scalability and efficiency needed for storing, querying and analyzing massive data, and avoids the overall performance degradation caused by uneven job distribution under data skew.

Description

Method for querying and analyzing massive web log data
Technical field
The invention belongs to the technical field of computer information processing, and in particular relates to a method for querying and analyzing massive web log data based on Hadoop and Hive.
Background technology
With the rapid development of Internet technology, a large number of applications and services running on the Internet have emerged, and the era of big data has arrived. Each website is an independent information system in itself; once these websites are interconnected through the network, the entire Internet becomes one enormous information system. Visitors leave traces as they browse websites, and these traces are preserved in the form of web log files. Collecting the logs of systems, programs, operations and transactions has become increasingly important, because logs are essential evidence for operations such as system recovery, error tracking and security auditing.
Because data sources are numerous and each system has many users operating frequently, terabytes or even petabytes of web log data can be produced every day. Owing to limitations in scalability and processing performance, traditional databases can no longer meet the storage and analysis requirements of data volumes that now routinely reach tens or hundreds of gigabytes, or even terabytes. Moreover, with so many unstructured log files, how to retrieve data quickly, how to find useful data, and how to perform statistical analysis on logs have become urgent problems. Existing big-data query methods can only perform simple row-key searches directly through HBase, or retrieval via Hive's HQL; the retrieval latency is large and the analysis results are inaccurate, so current demands cannot be met.
Summary of the invention
To solve the above problems, the present invention exploits the high reliability, scalability, efficiency and fault tolerance of the Hadoop/Hive distributed computing platform and discloses a method for querying and analyzing massive web log data based on Hadoop and Hive.
The open-source framework Hadoop is a widely used and distinctive tool. Users write their own MapReduce programs, and the scheduler divides a job into many fine-grained subtasks, distributes them to different nodes in the cluster, and runs them in parallel, so that even on large data sets results are obtained within a time the user can accept. Hadoop lets users who know nothing about distributed computing still enjoy its benefits. Hive was first open-sourced by Facebook in 2008 and became very popular as soon as it was released; Hadoop users can develop with Hive according to their own data-processing needs. Hive defines a simple SQL-like query language called HiveQL, which allows users familiar with SQL to query the data. The language also allows developers familiar with MapReduce to plug in custom mappers and reducers for complex analysis work that the built-in mappers and reducers cannot handle. Hive mainly consists of user interfaces, a metadata store, an interpreter, a compiler, an optimizer and an executor. The plans generated by the interpreter, compiler and optimizer are stored in the Hadoop distributed file system HDFS, and the executor invokes MapReduce programs to execute and analyze the statements.
Aiming at the massive nature of web log data, the present invention queries and analyzes massive web log data according to actual conditions, uses optimized HiveQL as the main means of querying, and analyzes the massive log data with a combination of data partitioning and a genetic algorithm, achieving efficient mining of big data.
In order to achieve the above object, the present invention provides the following technical scheme:
A method for querying and analyzing massive web log data comprises the following steps:
Step (1): parse the data of each data source with the ETL facilities in Hive; the parsing process comprises the four steps of extraction, cleaning, transformation and loading, and when the data are cleaned, the useful information in them is extracted in a distributed manner by MapReduce programs;
Step (2): load the extracted data into the data warehouse;
Step (3): the Driver component of Hive receives HiveQL statements;
Step (4): optimize the received statements for skewed data and obtain preliminary map results after performing table join operations;
Step (5): convert the received HiveQL statements into MapReduce tasks, execute them, and store the query results;
Step (6): partition the massive web log data;
Step (7): perform analysis and mining on the data with a highly parallel, globally randomized genetic algorithm;
Step (8): load the data produced by the query and analysis stages into a MySQL database.
Further, the optimization in step (4) includes joining data tables with a map join for skewed data and with a common join for non-skewed data.
Further, in step (5) a combiner function is introduced during the map phase to perform local aggregation by key: the keys output by the map are sorted and their values are iterated over.
Further, the combiner function is set to run before or after the merge operation on the results produced by the map.
Compared with the prior art, the invention has the following advantages and beneficial effects:
Aimed at massive web log data, the present invention takes into account the scalability of systems that store massive data and the unstructured nature of the data, as well as the strengths and weaknesses of existing data-processing methods. Based on the high-performance computing of Hadoop/Hive distributed systems and on data analysis built on data partitioning and a genetic algorithm, it supports querying and analysis of massive web log data and achieves accurate queries and analysis. For example, the log data of a search-engine website can be analyzed to obtain the order of user clicks and the ranking of URLs. The method optimizes Hive, remedying the large retrieval latency of earlier approaches that simply performed row-key searches through HBase or retrieval via Hive's HQL. Meanwhile, by analyzing partitioned data and analyzing the records in the log data with a genetic algorithm, the analysis results become more accurate. Combining the two achieves the scalability and efficiency of massive data storage, querying and analysis, and avoids the overall performance degradation caused by uneven job distribution under data skew. Compared with traditional log-data query and analysis methods, a company or client performing log analysis can accurately understand the state of its web presence; for example, popular websites can be found from the order of user clicks and the ranking of URLs, enabling targeted advertisement placement for merchants. The invention realizes data mining over big data and can be applied, for example, to web page recommendation and e-commerce marketing.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the steps of the method of the invention;
Fig. 2 is the linked-list structure of web pages.
Embodiment
The technical scheme provided by the present invention is described in detail below with reference to specific embodiments. It should be understood that the following specific embodiments only illustrate the invention and do not limit its scope.
Visitors leave traces as they browse websites, and these traces are preserved in the form of web log files. This example targets such data and uses the ETL facilities in Hive, optimized Hive SQL queries, MapReduce with a combiner function, and a genetic algorithm based on data partitioning to provide accurate log-data query and analysis results. As shown in Fig. 1, the method proceeds as follows:
Step 10: parse the data of each data source with the ETL facilities in Hive. The ETL process comprises the four steps of extracting, cleaning, transforming and loading the data. In the extraction phase, the source data are parsed and stored into Hive, and Hadoop and Hive programs extract the potentially useful data from the source data into the Transform layer. In the cleaning phase, Hive programs extract the fields that may be used later into the Load layer and discard unused and duplicated data. In the loading phase, the processed data are stored into tables in Hive and the source data are deleted. However, extracting data with ETL tools alone cannot meet the speed requirement, so the present invention uses MapReduce programming for the cleaning process: each record is read and its fields are extracted. When the raw data are processed, distributed cleaning is performed with MapReduce programs: one NameNode (JobTracker) is set up in the cluster to serve as the data-distribution server, and DataNodes (TaskTrackers) are set up to store and process the data distributed by the NameNode. The NameNode divides the data to be processed into 128 MB blocks, each block is given two replicas, and according to its placement algorithm the Hadoop system stores the blocks on the DataNodes, i.e. the data-processing servers, for further processing. This step involves the application of Hive and MapReduce, built on top of the distributed file system HDFS; the subsequent data warehouse models the massive data along multiple dimensions and queries or analyzes the data as required.
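The distributed cleaning described in step 10 can be sketched as a map-side function that keeps only the useful fields and discards records it cannot parse. This is an illustrative sketch, not the patent's code: the Apache Common Log Format regex and the choice of retained fields are assumptions based on the field list given in step 20.

```python
import re

# Common Log Format pattern (assumed layout; the patent only lists the
# fields: visitor IP, visitor identifier, user name, access time, method,
# requested document).
LOG_RE = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<doc>\S+)[^"]*"'
)

def clean_map(line):
    """Map-side cleaning: emit the useful fields, drop malformed records."""
    m = LOG_RE.match(line)
    if m is None:
        return None  # discarded, as in the cleaning phase
    return (m.group('ip'), m.group('time'), m.group('method'), m.group('doc'))

lines = [
    '1.2.3.4 - alice [30/Oct/2014:10:00:00 +0800] "GET /index.html HTTP/1.1"',
    'garbage line that cannot be parsed',
]
records = [r for r in (clean_map(l) for l in lines) if r is not None]
```

In the real cluster each map task would run this function over one 128 MB block; here the two sample lines stand in for a block.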
Step 20: store the data parsed in step 10 into the tables designed and built in the data warehouse Hive. The tables in Hive are designed according to the actual conditions of the data; several tables need to be created, and each table is created in essentially the same way. For example, a table storing Apache-format logs has fields such as visitor IP, visitor identifier, user name, access time, access method, and the document requested. The present invention also sets up a relational database, the metastore, dedicated to storing the metadata of the tables.
Step 30: the Driver that comes with the Hive system receives the HiveQL statements and governs their life cycle, including the compilation, optimization and execution of the HiveQL statements. The detailed process is as follows:
Step 40: for the data-skew problem, optimize the received statements and obtain preliminary map results after the table join (join) operations. Invalid ids cause data skew during joins. For example, in roughly two billion site-wide log entries per day, the visitor IP serving as the key can be lost during log collection, leaving the key null; if the visitor IP is then joined against the visitor identifier, data skew is encountered. The reason is that in Hive all records whose key is null are treated as the same key and dispatched to the same map computation, creating a computational bottleneck. Since the distribution of the data follows the usual statistical laws, there will not be too many skewed keys. The present invention therefore optimizes Hive join statements: skewed data use a map join, i.e. the keys of the skewed data are split so that a skewed key is not dispatched entirely to a single computation and a distributed table join is performed; non-skewed data use a common join, i.e. the tables are joined directly on the key; finally the two partial results are merged into the complete result.
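The skew-handling join of step 40 can be sketched as follows: rows with a skewed key are joined map-join style against a broadcast copy of the small table, while the remaining rows go through an ordinary hash (common) join, and the two partial results are merged. This is a single-process illustration of the idea only; the name `skew_aware_join` and the toy tables are invented for the example.

```python
from collections import defaultdict

def skew_aware_join(logs, visitors, skewed_keys):
    """Join logs with visitors on visitor IP, treating skewed keys separately.

    Non-skewed rows go through an ordinary hash (common) join; skewed rows are
    handled map-join style against a 'broadcast' dict of the small table, so a
    hot key is never funneled into a single reducer.
    """
    vis = dict(visitors)                       # small table, broadcast copy
    normal = [r for r in logs if r[0] not in skewed_keys]
    skewed = [r for r in logs if r[0] in skewed_keys]

    # common join: build a hash table on the key, probe with each log row
    buckets = defaultdict(list)
    for ip, name in visitors:
        buckets[ip].append(name)
    out = [(ip, page, name) for ip, page in normal for name in buckets.get(ip, [])]

    # map join for the skewed part: per-row lookup against the broadcast dict
    out += [(ip, page, vis[ip]) for ip, page in skewed if ip in vis]
    return out

logs = [('1.1.1.1', '/a'), ('1.1.1.1', '/b'), ('2.2.2.2', '/c'), (None, '/d')]
visitors = [('1.1.1.1', 'u1'), ('2.2.2.2', 'u2')]
joined = skew_aware_join(logs, visitors, skewed_keys={'1.1.1.1', None})
```

Note how the null-keyed row simply drops out of the map-join branch instead of stalling one reducer, which mirrors the bottleneck the text describes.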
Step 50: from the preliminary map results obtained in step 40, the Driver invokes the compiler, which converts the received HiveQL statements into a plan consisting of a DAG of MapReduce tasks; the plan is composed of metadata operations and HDFS operations, and the tasks are finally submitted to the execution engine in topological order to complete the analytical computation, i.e. a distributed query according to the query conditions. The input of MapReduce comes from files already imported into the HDFS cluster; these files are evenly distributed over all nodes. Running a MapReduce program first runs map tasks on some or all of the nodes; all map tasks are equivalent, and no map task exchanges information with, or is even aware of, any other map task. After the map phase, the intermediate key-value pairs generated on the nodes may be exchanged, and pairs with the same key, e.g. the same visitor IP, are delivered to the same reducer; in the whole of MapReduce, communication between nodes can only happen in this step. Like map tasks, reduce tasks do not communicate with other reduce tasks. Hadoop MapReduce guarantees reliable task execution by performing the data transfer automatically and restarting tasks on failed nodes. On this basis, after the map phase and before the reduce phase, a combiner function may be introduced to optimize the data output by the map phase: it performs local aggregation by key, sorting the keys output by the map and iterating over their values. The data produced during the map phase undergo a merge operation that merges them by key, and the combiner function can be set to run before or after this merge of the map output as needed. Especially for large results, this greatly reduces the data copied from the map tasks to the reduce tasks.
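The combiner of step 50 performs local aggregation of a single map task's output before anything crosses the network. A minimal sketch, assuming the familiar counting use case (hits per visitor IP); the data are invented for illustration:

```python
from itertools import groupby

def combiner(map_output):
    """Local aggregation on one map task's output: sort by key, then fold the
    values of each key into a single pair before anything is sent to reducers."""
    combined = []
    for key, group in groupby(sorted(map_output), key=lambda kv: kv[0]):
        combined.append((key, sum(v for _, v in group)))
    return combined

# one map task counted hits per visitor IP as (ip, 1) pairs
map_output = [('1.2.3.4', 1), ('5.6.7.8', 1), ('1.2.3.4', 1), ('1.2.3.4', 1)]
local = combiner(map_output)
# without the combiner 4 pairs cross the network; with it only 2 do
```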
Step 60: data partitioning. First the sample set is divided into M equal parts (InputFormat is responsible for dividing the data into InputSplits), and the data format is unified as <id, <X, Y>>, where id denotes a number composed of the visitor IP and the access date, Y denotes the page the user currently accesses, and X denotes the referrer, i.e. the page the user was on before visiting page Y. The map operation then scans each input record and initializes the data set into the above format. After the map operation, the intermediate result <<X, Y>, 1> is obtained, i.e. one user went from page X to page Y. The reduce operation then merges the intermediate results by identical <X, Y> page-transition patterns and outputs <<X, Y>, n>, where n denotes the frequency of the access path X->Y. Next, the reduce output of each sub-group (i.e. each data block obtained by the earlier partitioning) is converted into a linked-list structure whose head stores the value k. The linked-list structure is shown in Fig. 2, where k denotes the chromosome linked-list length and X, Y, Z, R denote web pages.
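The partition-and-count of step 60 can be sketched as a map phase that emits ((X, Y), 1) per transition and a reduce phase that merges identical (X, Y) patterns into ((X, Y), n). The record layout below mirrors the <id, <X, Y>> format of the text; the concrete ids and pages are invented for illustration:

```python
from collections import Counter

def map_phase(records):
    """Each record is (id, (X, Y)): referrer X, current page Y.
    Emit ((X, Y), 1) for every observed transition."""
    return [((x, y), 1) for _, (x, y) in records]

def reduce_phase(pairs):
    """Merge identical (X, Y) transition patterns into ((X, Y), n)."""
    counts = Counter()
    for key, one in pairs:
        counts[key] += one
    return dict(counts)

records = [
    ('1.2.3.4|2014-10-30', ('/x', '/y')),
    ('5.6.7.8|2014-10-30', ('/x', '/y')),
    ('1.2.3.4|2014-10-30', ('/y', '/z')),
]
freq = reduce_phase(map_phase(records))
# freq[('/x', '/y')] == 2: two users followed the path /x -> /y
```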
Step 70: inside each sub-group, perform genetic evolutionary operations such as selection and crossover with the highly parallel, globally randomized genetic algorithm. First, 2 chromosomes are randomly selected from the parent chromosomes, and then an insertion position Ins, a deletion position Del, and an insertion/deletion length Len are generated at random. The 2 chromosome segments are then compared for equal length. If they are of equal length, it is checked whether their ends overlap; if so, they are connected to generate a new chromosome, otherwise no child chromosome is generated. If they are of unequal length, it is checked whether the 2 inserted and deleted gene segments are identical; if so, the chromosomes are merged into a new chromosome, otherwise no child chromosome is generated. Whenever the generation number is a multiple of 50, a marriage operation is performed between the sub-populations. Each sub-group repeats the above operations until the value of k no longer changes, at which point the genetic algorithm exits. The above operations yield the page access paths, and the size of the web log files processed does not affect the validity of the algorithm.
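The chromosome-merging rule of step 70 is described only informally, so the following is a loose sketch of one reading of it: chromosomes are page paths (the linked lists of Fig. 2), and two equal-length parents whose ends coincide are connected into a longer child path. The unequal-length insert/delete comparison and the marriage operation are omitted for brevity; all names here are invented for the example.

```python
import random

def try_merge(a, b):
    """Crossover sketch: if the two chromosomes have equal length and the
    tail of one coincides with the head of the other, connect them into a
    longer child path; otherwise generate no child (one reading of the
    patent's informal description)."""
    if len(a) != len(b):
        return None
    for k in range(len(a) - 1, 0, -1):
        if a[-k:] == b[:k]:            # overlapping ends
            return a + b[k:]
    return None

def select_parents(population, rng):
    """Randomly select 2 parent chromosomes, as in step 70."""
    return rng.sample(population, 2)

rng = random.Random(0)
population = [['/x', '/y'], ['/y', '/z'], ['/z', '/r']]
parents = select_parents(population, rng)
child = try_merge(['/x', '/y'], ['/y', '/z'])  # ends overlap at '/y'
```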
Steps 60 and 70 above combine data partitioning and the genetic algorithm in the data analysis process, specifically for web log analysis in a Hadoop/Hive cluster environment.
Step 80: load the data produced by the query and analysis stages into a MySQL database, and present the results of the data analysis to users in a friendly interface as required. For example, the number of accesses to a website, a page or a data center, analyses of visitor behavior, the proportion of failed accesses to a certain web page during some past period, or the order of user clicks and the ranking of URLs can all be queried and analyzed.
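The final loading step can be sketched as plain SQL inserts followed by the kind of ranking query the text mentions. The patent loads into MySQL; sqlite3 stands in below so the sketch is self-contained (with MySQL the same statements would run through a driver), and the table and column names are invented:

```python
import sqlite3

# In-memory stand-in for the MySQL database of step 80.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE page_transitions (src TEXT, dst TEXT, hits INTEGER)'
)
results = [('/x', '/y', 2), ('/y', '/z', 1)]   # output of the analysis stage
conn.executemany('INSERT INTO page_transitions VALUES (?, ?, ?)', results)
conn.commit()

# a friendly front end would now query this table, e.g. a URL ranking:
ranking = conn.execute(
    'SELECT dst, SUM(hits) AS n FROM page_transitions '
    'GROUP BY dst ORDER BY n DESC'
).fetchall()
```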
The technical means disclosed in the scheme of the present invention are not limited to those disclosed in the above embodiments, but also include technical schemes composed of any combination of the above technical features. It should be pointed out that, for those skilled in the art, several improvements and modifications can be made without departing from the principles of the invention, and these improvements and modifications are also regarded as falling within the protection scope of the present invention.

Claims (4)

1. A method for querying and analyzing massive web log data, characterized by comprising the following steps:
Step (1): parse the data of each data source with the ETL facilities in Hive, the parsing process comprising the four steps of extraction, cleaning, transformation and loading; in the extraction phase, the source data are parsed and stored into Hive, and Hadoop and Hive programs extract the potentially useful data from the source data into the Transform layer; in the cleaning phase, Hive programs extract the fields that may be used later into the Load layer and discard unused and duplicated data; in the loading phase, the processed data are stored into tables in Hive and the source data are deleted; when the data are cleaned, the useful information in them is extracted in a distributed manner by MapReduce programs;
Step (2): load the extracted data into the data warehouse;
Step (3): the Driver component of Hive receives HiveQL statements;
Step (4): optimize the received statements for skewed data and obtain preliminary map results after performing table join operations;
Step (5): convert the received HiveQL statements into MapReduce tasks, execute them, and store the query results;
Step (6): partition the massive web log data;
Step (7): perform analysis and mining on the data with a highly parallel, globally randomized genetic algorithm: first randomly select 2 chromosomes from the parent chromosomes, then randomly generate an insertion position Ins, a deletion position Del, and an insertion/deletion length Len; then compare the 2 chromosome segments for equal length; if they are of equal length, check whether their ends overlap, and if so, connect them to generate a new chromosome, otherwise generate no child chromosome; if they are of unequal length, check whether the 2 inserted and deleted gene segments are identical, and if so, merge the chromosomes into a new chromosome, otherwise generate no child chromosome; whenever the generation number is a multiple of 50, perform a marriage operation between the sub-populations; each sub-group repeats the above operations until the value of k no longer changes;
Step (8): load the data produced by the query and analysis stages into a MySQL database.
2. The method for querying and analyzing massive web log data according to claim 1, characterized in that the optimization in step (4) includes joining data tables with a map join for skewed data and with a common join for non-skewed data.
3. The method for querying and analyzing massive web log data according to claim 1 or 2, characterized in that in step (5) a combiner function is introduced during the map phase to perform local aggregation by key: the keys output by the map are sorted and their values are iterated over.
4. The method for querying and analyzing massive web log data according to claim 3, characterized in that the combiner function is set to run before or after the merge operation on the results produced by the map.
CN201410596395.6A 2014-10-30 2014-10-30 Method for querying and analyzing massive web log data Expired - Fee Related CN104298771B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410596395.6A CN104298771B (en) 2014-10-30 2014-10-30 Method for querying and analyzing massive web log data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410596395.6A CN104298771B (en) 2014-10-30 2014-10-30 Method for querying and analyzing massive web log data

Publications (2)

Publication Number Publication Date
CN104298771A CN104298771A (en) 2015-01-21
CN104298771B true CN104298771B (en) 2017-09-05

Family

ID=52318496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410596395.6A Expired - Fee Related CN104298771B (en) 2014-10-30 2014-10-30 Method for querying and analyzing massive web log data

Country Status (1)

Country Link
CN (1) CN104298771B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809231A (en) * 2015-05-11 2015-07-29 浪潮集团有限公司 Mass web data mining method based on Hadoop
CN104866608B (en) * 2015-06-05 2018-01-09 中国人民大学 Enquiring and optimizing method based on join index in a kind of data warehouse
CN105512315B (en) * 2015-12-12 2019-04-30 天津南大通用数据技术股份有限公司 A kind of distributed data base SQL execute in INNER JOIN intelligent evaluation method
EP3182288B1 (en) 2015-12-15 2019-02-13 Tata Consultancy Services Limited Systems and methods for generating performance prediction model and estimating execution time for applications
CN106897293B (en) * 2015-12-17 2020-09-11 中国移动通信集团公司 Data processing method and device
CN105608203B (en) * 2015-12-24 2019-09-17 Tcl集团股份有限公司 A kind of Internet of Things log processing method and device based on Hadoop platform
CN105677842A (en) * 2016-01-05 2016-06-15 北京汇商融通信息技术有限公司 Log analysis system based on Hadoop big data processing technique
CN105787009A (en) * 2016-02-23 2016-07-20 浪潮软件集团有限公司 Hadoop-based mass data mining method
CN106874322A (en) * 2016-06-27 2017-06-20 阿里巴巴集团控股有限公司 A kind of data table correlation method and device
CN106301892A (en) * 2016-08-02 2017-01-04 浪潮电子信息产业股份有限公司 Hue service arrangement based on Apache Ambari and configuration and surveillance method
CN106547883B (en) * 2016-11-03 2021-02-19 北京集奥聚合科技有限公司 Method and system for processing User Defined Function (UDF) running condition
CN106599244B (en) * 2016-12-20 2024-01-05 飞狐信息技术(天津)有限公司 General original log cleaning device and method
CN106709029A (en) * 2016-12-28 2017-05-24 上海斐讯数据通信技术有限公司 File hierarchical processing method and processing system based on Hadoop and MySQL
CN107818181A (en) * 2017-11-27 2018-03-20 深圳市华成峰科技有限公司 Indexing means and its system based on Plcient interactive mode engines
CN108182596A (en) * 2017-12-22 2018-06-19 合肥天源迪科信息技术有限公司 One kind is based on enterprise marketing management method under big data environment
CN108133043B (en) * 2018-01-12 2022-07-29 福建星瑞格软件有限公司 Structured storage method for server running logs based on big data
CN108520071A (en) * 2018-04-13 2018-09-11 航天科技控股集团股份有限公司 A kind of log searching system and method based on recorder platform
CN108509648A (en) * 2018-04-13 2018-09-07 航天科技控股集团股份有限公司 A kind of log searching system based on recorder platform
CN108595578A (en) * 2018-04-17 2018-09-28 曙光信息产业(北京)有限公司 Data processing method, device and the storage system of high-performance calculation Historical Jobs data
CN108664657A (en) * 2018-05-20 2018-10-16 湖北九州云仓科技发展有限公司 A kind of big data method for scheduling task, electronic equipment, storage medium and platform
CN109918349B (en) * 2019-02-25 2021-05-25 网易(杭州)网络有限公司 Log processing method, log processing device, storage medium and electronic device
CN111125149B (en) * 2019-12-19 2024-01-26 广州品唯软件有限公司 Hive-based data acquisition method, hive-based data acquisition device and storage medium
CN112346672B (en) * 2020-11-06 2023-01-03 深圳市同行者科技有限公司 Log dyeing method, device, equipment and storage medium
CN113434376B (en) * 2021-06-24 2023-04-11 山东浪潮科学研究院有限公司 Web log analysis method and device based on NoSQL
CN113836431A (en) * 2021-10-19 2021-12-24 中国平安人寿保险股份有限公司 User recommendation method, device, equipment and medium based on user duration
CN116644039B (en) * 2023-05-25 2023-12-19 安徽继远软件有限公司 Automatic acquisition and analysis method for online capacity operation log based on big data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9754050B2 (en) * 2012-02-28 2017-09-05 Microsoft Technology Licensing, Llc Path-decomposed trie data structures

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"基于hive的性能优化方法的研究与实践";叶文宸;《中国优秀硕士学位论文全文数据库 信息科技辑》;20111015(第10期);第17页第3.1.1节,第43页第4.5.11节 *
"基于海量查询日志的数据挖掘及用户行为分析";周婷婷;《中国优秀硕士学位论文全文数据库 信息科技辑》;20131115(第11期);第13-18页第2.3-2.4节,第30-31页第4.2节,图2-5 *

Also Published As

Publication number Publication date
CN104298771A (en) 2015-01-21

Similar Documents

Publication Publication Date Title
CN104298771B (en) Method for querying and analyzing massive web log data
Rao et al. The big data system, components, tools, and technologies: a survey
CN106649455B (en) Standardized system classification and command set system for big data development
Zaharia et al. Fast and interactive analytics over Hadoop data with Spark
EP2780834B1 (en) Processing changes to distributed replicated databases
CN105989150B (en) A kind of data query method and device based on big data environment
EP2572289B1 (en) Data storage and processing service
CN101593200A (en) Chinese Web page classification method based on the keyword frequency analysis
Hariharakrishnan et al. Survey of pre-processing techniques for mining big data
CN109086573B (en) Multi-source biological big data fusion system
Nikhil et al. A survey on text mining and sentiment analysis for unstructured web data
Savitha et al. Mining of web server logs in a distributed cluster using big data technologies
Sethy et al. Big data analysis using Hadoop: a survey
Benny et al. Hadoop framework for entity resolution within high velocity streams
Nagdive et al. Web server log analysis for unstructured data using apache flume and pig
Ennaji et al. Social intelligence framework: Extracting and analyzing opinions for social CRM
CN103488741A (en) Online semantic excavation system of Chinese polysemic words and based on uniform resource locator (URL)
KR20140076010A (en) A system for simultaneous and parallel processing of many twig pattern queries for massive XML data and method thereof
Ravichandran Big Data processing with Hadoop: a review
Sudha et al. A survey paper on map reduce in big data
De Bonis et al. Graph-based methods for Author Name Disambiguation: a survey
He et al. The high-activity parallel implementation of data preprocessing based on MapReduce
Priya et al. Entity resolution for high velocity streams using semantic measures
Mangla et al. IPB-Implementation of Parallel Mining for Big Data
Vissamsetti et al. Twitter Data Analysis for Live Streaming by Using Flume Technology

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20181128

Address after: 412300 Room 807, Youzhou Internet Financial Innovation Center, Yongjia Community Jiayuan Group, Lianxing Street, Youxian County, Zhuzhou City, Hunan Province

Patentee after: Zhixin Financial Information Service (Youxian) Co.,Ltd.

Address before: 210044 Ning six road, Nanjing, Jiangsu Province, No. 219

Patentee before: Nanjing University of Information Science and Technology

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20200715

Address after: No.383, commercial building 3, building 1, jianxiyuan Zhongli, Haidian District, Beijing 100043

Patentee after: BEIJING ZHIRONG SHIDAI INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 412300 Room 807, Youzhou Internet Financial Innovation Center, Yongjia Community Jiayuan Group, Lianxing Street, Youxian County, Zhuzhou City, Hunan Province

Patentee before: Zhixin Financial Information Service (Youxian) Co.,Ltd.

TR01 Transfer of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170905

Termination date: 20211030

CF01 Termination of patent right due to non-payment of annual fee