CN106484813B

CN106484813B - Big data analysis system and method

Info

Publication number: CN106484813B
Application number: CN201610848904.9A
Authority: CN
Inventors: 韦天瀚; 刘国庆; 李海威; 黄震廷; 吴华
Original assignee: GUANGDONG GANG XIN SCIENCE AND TECHNOLOGY Co Ltd
Current assignee: GUANGDONG GANG XIN SCIENCE AND TECHNOLOGY Co Ltd
Priority date: 2016-09-23
Filing date: 2016-09-23
Publication date: 2017-10-31
Anticipated expiration: 2036-09-23
Also published as: CN106484813A

Abstract

The invention discloses a big data analysis system and method. The big data analysis system includes a data retrieval module, a data filtering module, a data clustering module, and an information extraction module. The data retrieval module performs data retrieval: it separates the data attributes from the attribute values in the data set and builds a two-layer index structure. The data retrieval module first builds an upper-layer index over the attributes of the data in the data set; it then builds an index over the data values corresponding to each upper-layer attribute, using a B+ tree index structure for numeric data and an inverted index for character data. The k-means clustering method of the present invention, based on improved predicted intensity, produces cluster results for the big data in the example that are credible and practically meaningful.

Description

A big data analysis system and method

Technical field

The present invention relates to the field of computer science and technology, and more particularly to a big data analysis system and method.

Background art

At present, the Internet connects all networked computers and has fundamentally changed how people work and live; it is currently the first choice for obtaining all kinds of data. The pattern by which a client obtains data from a server over the Internet can be summarized as "request" + "response". This is the basic pattern of Internet application protocols.

Clicking the mouse sends a command and then performs an access, and every person's access record is logged in detail in the browsing log, including specific data such as the time, the requested content, and the address. The data on the Internet are made up of these access records linked together. In the same way that a hunter catches prey by following its tracks, access logs hold enormous value. They are therefore one of the important sources of big data.

Several of the world's largest Internet companies, such as Google, Amazon, Facebook, and Twitter, dominate the global Internet industry. One common factor behind their success is their exceptional data analysis capability. These companies analyze and process large amounts of data every day, using big data as a means of uncovering business opportunities, and Google is the most typical representative among them. According to statistics, Google handles more than one hundred billion searches per month and analyzes and processes the search information; the volume of data processed reaches 600 PB (1 PB = 1 million GB, said to be equivalent to the total information content of one million years of morning newspapers). All content and data searched through the Google search engine can be used in its analysis. For example, when a keyword is typed into the Google search box, information related to the search content is displayed; if "big data" is entered, the search results suggest content such as "big data concept", "big data era", and "big data technology". This is the result of applying big data technology to a large volume of historical search information. In addition, if the input contains an error, or is entered directly in pinyin, Google automatically corrects the search content and offers a correct suggestion; this function relies on the same search principle.

Compared with traditional enterprise operating data, big data differs in two respects.

First, the data volume is huge. Moreover, unlike traditional data such as sales volume and inventory, the data generated by website clicks, which Internet companies such as Google and Facebook process, require very different analysis and management methods. The core of big data processing is not structured data but the website clickstream data mentioned above, the data generated on social networks, and sensor data; such data cannot be stored in a database and are called unstructured data.

Second, in terms of the types of business that process data, those that truly master massive data storage and analysis technology are not traditional brick-and-mortar industries but emerging Internet companies (Google), social networks (Facebook), and e-commerce companies (Amazon). The former can entrust the latter to provide big data analysis and processing services for them.

Facebook can generate 30 PB of data, whereas Wal-Mart generates only 2.5 PB; the two differ greatly not only in data volume but also in the diversity of the data and the speed at which the data are generated. As can be seen from the above, during the boom of the Internet, large Internet companies were able to develop low-cost storage and processing technology in time for data whose value other companies tended to overlook, extract the valuable information from it, and integrate it into their business processes, gradually building their own competitive advantage and standing out among Internet companies. At present, as the influence of these Internet companies grows, more companies are beginning to pay attention to big data analysis, using big data to provide new services that improve customer satisfaction and thereby strengthen their competitive advantage.

In just two or three years, big data has rapidly penetrated different industries and fields, greatly increasing production efficiency; the development trend of big data is closely tied to improvements in productivity.

Data volume is growing exponentially. The joint findings of many research institutions show that the total amount of global data will grow exponentially over the next several years. According to estimates by the U.S. consulting firm McKinsey, enterprises worldwide stored more than 7 EB of new data in 2010, and consumers stored more than 6 EB of new data on personal computers.

The intensity and content of big data vary across industries. The volume of data stored differs from industry to industry, and as big data grows, the types of data produced and stored also differ by industry. The fields with the largest stored data volumes include securities, investment consulting, and banks and other financial institutions; the data produced by sectors such as communications carriers, media, and government public institutions are also very large in scale. Industries that hold such data assets have great potential value in applying big data.

Existing trends will continue to drive data growth. Across different regions and industries, companies are collecting data at an accelerating pace, which has also driven the growth of traditional transactional databases; the wide use of multimedia in public-welfare fields such as health care has greatly increased the generation of big data; and the widespread use of social networking, together with the broad application of the Internet of Things in production and daily life, keeps pushing big data to grow. The cross-application of these technologies across industries further stimulates the growth of big data and the rapid expansion of data pools.

Big data is a new frontier technology for driving future productivity. For big data to deliver stronger competitiveness, productivity, and innovation capability, appropriate policies are needed to promote it; this is also a key element in creating consumer surplus. In the health care industry, making full use of big data can reduce operating costs, avoid unnecessary treatment, lower the probability of medical accidents, and improve the quality of medical services. In public administration, tax authorities can use big data to advance tax collection work and improve the efficiency of the relevant tax departments. In retail, the efficiency of the industry can be improved through big data applications in the supply chain and in business operations. In marketing, making full use of big data helps consumers find products that meet their needs at more suitable prices and increases the added value of services.

Today, data is also an asset that stands alongside physical assets and human capital, and it is likewise a factor of production. With the development of emerging industries such as multimedia and the Internet of Things, enterprises will collect more information from these media, leading to a rapid increase in data. Big data can unlock enormous potential in commercial services and in creating value for consumers.

Summary of the invention

The technical problem to be solved by the present invention is to provide a big data analysis system and method. In the big data analysis method of the present invention, the hybrid index combines and inherits the advantages of both the B+ tree and the inverted index while avoiding their respective shortcomings. It improves the speed of index construction and the utilization of storage space while also supporting range queries on numeric data. The data filtering of the present invention extracts the rating features of each item by compressing the item vector, which effectively addresses the sparsity problem in recommender systems and greatly improves the efficiency of computing item similarity. Finally, experiments verify the effect of the improvement to the equal model; the results show that the improved equal model of the present invention gives better recommendations for items with few ratings and better meets the needs of real systems.

To solve the above technical problem, the present invention provides a big data analysis system comprising a data retrieval module, a data filtering module, a data clustering module, and an information extraction module.

The data retrieval module is used for data retrieval: it separates the data attributes from the attribute values in the data set and builds a two-layer index structure.

The data retrieval module first builds an upper-layer index over the attributes of the data in the data set;

it then builds an index over the data values corresponding to each upper-layer attribute, using a B+ tree index structure for numeric data and an inverted index for character data.

The data filtering module filters the data after data retrieval. The data filtering uses the following equal model transformation: suppose the rating vector of the item i to be transformed is I_i = {r_1i, r_2i, r_3i, ..., r_mi}. After the equal model transformation, the vector I_i is converted to the equal model representation:

I′_i = {t₀, (t₁₀, t₁₁), (t₂₀, t₂₁, t₂₂, t₂₃), (t₃₀, t₃₁, ...), ...};

where t₀ is the single element of layer 0 of the equal model, (t₁₀, t₁₁) are the two elements of layer 1, and (t₂₀, t₂₁, t₂₂, t₂₃) are the four elements of layer 2; by analogy, the item rating vector is converted into an equal model with the specified number of layers.

The data clustering module performs cluster analysis on the data after data filtering;

The cluster analysis uses the predicted intensity analysis method, which is as follows:

(1) The original data to be clustered are randomly divided into a training set and a test set;

(2) With the number of clusters set to k, each of the two subsets is clustered, and the result is recorded as the type I clustering;

(3) The test set is classified using the clustering result of the training set, and the result is recorded as the type II clustering;

(4) Within each cluster that the test set itself forms, check for every pair of sample points i and i′ whether they are wrongly assigned to different clusters in the type II clustering, and record the proportion of correctly assigned pairs;

(5) Among these k proportions, the smallest is the predicted intensity for the current number of clusters k.

To solve the above technical problem, the present invention also provides a big data analysis method comprising a data retrieval step, a data filtering step, a data clustering step, and an information extraction step.

The data retrieval step is used for data retrieval: it separates the data attributes from the attribute values in the data set and builds a two-layer index structure.

The data retrieval step first builds an upper-layer index over the attributes of the data in the data set;

The data filtering step filters the data after data retrieval. The data filtering uses the following equal model transformation: suppose the rating vector of the item i to be transformed is I_i = {r_1i, r_2i, r_3i, ..., r_mi}. After the equal model transformation, the vector I_i is converted to the equal model representation I′_i = {t₀, (t₁₀, t₁₁), (t₂₀, t₂₁, t₂₂, t₂₃), (t₃₀, t₃₁, ...), ...};

The data clustering step performs cluster analysis on the data after data filtering;

The beneficial technical effects of the present invention are:

(1) The hybrid index of the present invention combines and inherits the advantages of both the B+ tree and the inverted index while avoiding their respective shortcomings. It improves the speed of index construction and the utilization of storage space while also supporting range queries on numeric data.

(2) The equal model data filtering of the present invention extracts the rating features of each item by compressing the item vector, which effectively addresses the sparsity problem in recommender systems and greatly improves the efficiency of computing item similarity. Finally, experiments verify the effect of the improvement to the equal model; the results show that the improved equal model of the present invention gives better recommendations for items with few ratings and better meets the needs of real systems.

(3) The k-means clustering method of the present invention, based on improved predicted intensity, produces cluster results for the big data in the example that are credible and practically meaningful. Building on the k-means clustering algorithm, the improved predicted intensity is introduced and used to determine the clustering variables and the number of clusters. Cluster analysis of the mean dwell time on the columns of a big data website shows that the results of this improved clustering method have a clearer practical meaning, and that the clustering method of the present invention is better suited to cluster analysis of big data than conventional clustering methods.

Brief description of the drawings

Fig. 1 is a structure diagram of the two-layer hybrid big data index according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of vector compression of the user-item rating matrix according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of vector compression of the dimensionality-reduced user-item rating matrix according to an embodiment of the present invention;

Fig. 4 is a flow chart of the equal model vector conversion according to an embodiment of the present invention;

Fig. 5 is the model algorithm evaluation chart (100K) according to an embodiment of the present invention;

Detailed description of the embodiments

Embodiments of the present invention are described in detail below with reference to examples, so that how the present invention applies technical means to solve the technical problem and achieve the technical effect can be fully understood and implemented accordingly.

It should be noted that, to keep the specification concise and avoid unnecessary repetition, the embodiments of this application and the features within them may be combined with one another provided they do not conflict.

1. Data retrieval

The present invention proposes a hybrid index structure based on the inverted index and the B+ tree. The leaf nodes of a B+ tree are ordered, which gives it a clear advantage for range retrieval over numeric data; it can sustain heavy workloads and has relatively stable I/O cost. The inverted index does not support range retrieval over numeric data well, but it is relatively simple to implement and fast to query, and a lookup locates its target in a single step, so it supports index construction over character data well.

Building on traditional indexes, the idea of hierarchical indexing is introduced: the data attributes and the attribute values in the data set are separated, and a two-layer index structure is built. First, an upper-layer index is built over the attributes of the data in the data set. Second, an index is built over the data values corresponding to each upper-layer attribute: a B+ tree index structure for numeric data, and an inverted index for character data. In this way, not all data are placed in a tree index, which reduces the storage space wasted by node splitting and also reduces the extra storage taken up by the temporary nodes produced during splitting; this speeds up index construction and improves storage utilization. When a range query is performed on numeric data, it is completed by going directly to the lower-layer tree index, which reduces query time and cost.

The hybrid index designed in the present invention combines and inherits the advantages of both the B+ tree and the inverted index while avoiding their respective shortcomings. It improves the speed of index construction and the utilization of storage space while also supporting range queries on numeric data.

The two-layer hybrid big data index structure of the present invention is shown in Fig. 1.

The upper-layer tree index is built mainly over the attributes contained in the data set. In this layer, the specific attributes of the data are all stored in the non-leaf nodes, while every leaf node of the B+ tree stores three pieces of information, A_i, PType, and Pointer, whose meanings are as follows:

(1) A_i is a specific attribute of the indexed data set, where n is the number of all attributes and i ∈ [1, n];

(2) PType represents the pointer type; the specific types are PType { Inverted_index, B+ tree };

(3) Pointer is the pointer to the lower-layer index. Depending on the data type, it points to a different index structure, namely the header of the inverted list or the root node of the B+ tree.

The second-layer index is built over the data values corresponding to the first-layer attributes, and comprises the B+ tree index structures built for numeric data and the inverted list indexes built for character data. Specific data values are stored in the non-leaf nodes of the B+ tree index structure, while the leaf nodes are arranged in order and contain three pieces of index file information, A_RV_S, Loc, and Doc, whose meanings are as follows:

(1) A_RV_S is the S-th attribute value of the R-th attribute, R ∈ [1, n₂], S ∈ [1, p], where n₂ is the number of numeric attributes contained in the data set and p is the number of data values of the R-th attribute.

(2) Loc is the location of the file that contains this attribute value.

(3) Doc is the number of the document that contains the query keyword; Doc is unique.

The inverted index is divided into two parts. One is the "dictionary", an index table made up of the distinct index terms, which records the different Chinese keywords and their related information. The other is the "record table", which records the set of documents in which each index term appears, along with related information such as their storage addresses. The second-layer inverted index structure contains four pieces of information, A_iV_j, Doc, Loc, and F, whose meanings are as follows:

(1) A_iV_j is the j-th attribute value of the i-th attribute, i ∈ [1, n₁], j ∈ [1, m], where n₁ is the number of character attributes and m is the number of attribute values contained in the i-th attribute.

(2) Doc is the number of the document that contains the query keyword; Doc is unique.

(3) Loc is the location of the file that contains the query keyword.

(4) F is the frequency with which the query keyword occurs in the data set.

Index creation process:

Step1: First analyze the data for which an index is to be built. If the index built so far has no entry for the data's attribute, create a new index node in the first layer of the hybrid index.

Step2: Determine the attribute value type of the newly added data. For numeric data, create a B+ tree index for it; for a character attribute, build an inverted index structure for it.

Step3: Repeat Step1. If the current attribute already exists in the index built so far, do not add a new node to the first layer; only add the data of that attribute to the corresponding second-layer index.

Step4: Repeat the above steps until the index has been built for all data.
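The creation steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `HybridIndex` and its methods are hypothetical, a plain dictionary stands in for the upper-layer tree, and a sorted list searched with `bisect` stands in for the lower-layer B+ tree (it keeps keys ordered and supports range queries, but has no pages or node splitting).

```python
import bisect
from collections import defaultdict


class HybridIndex:
    """Two-layer index sketch: attribute -> (PType, lower-layer index)."""

    def __init__(self):
        # Upper layer: attribute name -> (ptype, pointer to lower-layer index).
        self.upper = {}

    def add(self, doc_id, attribute, value):
        if attribute not in self.upper:
            # Step1/Step2: new attribute, choose the lower layer by value type.
            if isinstance(value, (int, float)):
                # Assumption: sorted parallel lists stand in for a B+ tree.
                self.upper[attribute] = ("B+ tree", {"keys": [], "docs": []})
            else:
                # Inverted list: attribute value -> set of document numbers.
                self.upper[attribute] = ("Inverted_index", defaultdict(set))
        ptype, lower = self.upper[attribute]
        # Step3: the attribute is known, so only the second layer grows.
        if ptype == "B+ tree":
            pos = bisect.bisect_right(lower["keys"], value)
            lower["keys"].insert(pos, value)
            lower["docs"].insert(pos, doc_id)
        else:
            lower[value].add(doc_id)

    def range_query(self, attribute, low, high):
        """Documents whose numeric value lies in [low, high], in key order."""
        ptype, lower = self.upper[attribute]
        if ptype != "B+ tree":
            raise TypeError("range queries need a numeric attribute")
        start = bisect.bisect_left(lower["keys"], low)
        stop = bisect.bisect_right(lower["keys"], high)
        return lower["docs"][start:stop]

    def lookup(self, attribute, value):
        """Exact-match query on either kind of lower-layer index."""
        ptype, lower = self.upper[attribute]
        if ptype == "B+ tree":
            return set(self.range_query(attribute, value, value))
        return set(lower.get(value, set()))


index = HybridIndex()
for doc_id, attribute, value in [
    (1, "price", 30), (2, "price", 10), (3, "price", 20),
    (1, "color", "red"), (3, "color", "red"),
]:
    index.add(doc_id, attribute, value)  # Step4: repeat for all data
```

With this sample data, `index.range_query("price", 10, 20)` goes straight to the ordered lower layer for the numeric attribute, while `index.lookup("color", "red")` is answered by the inverted list.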

Index query method:

First, the query condition is analyzed to obtain the keywords, and the query keywords are passed to the index dictionary. If the index flag is False, a null value is returned to indicate that the data being queried do not exist in the index file. If it is True, the data type of the result for the query word is determined, the appropriate index is located according to that type, and the number of the term and the numbers of the documents containing it are read; from these, the information relevant to the query condition is obtained. The contents of the B+ tree index or the inverted index are then read according to the term number, the retrieved contents are merged, and finally their relevance to the search condition is compared and the ranked final results are returned to the user. The key value term_id in the data table is the input of the search algorithm, and the output is a Boolean value. The specific process is as follows:

(1-1) With root, term_id, and layer as input parameters, call the lookup function treeSearch(root, term_id, layer) and assign the lookup result to the leaf page record record.

(1-2) If record is empty, return a null value directly; otherwise, return the actual lookup result rid.

The lookup function treeSearch takes the current page currentPage, the search key key, and the initial layer number layer as input, and outputs the leaf record leafRecord that may contain the search key key. The specific process is as follows:

(2-1) If the current page is a leaf page, search for the key key using the binary search algorithm and return the lookup result.

(2-2) If the current page is not a leaf page, perform steps (2-3) to (2-6).

(2-3) Using currentPage and the key value, select the subtree that contains the key value and obtain the page number pageNo of the child node.

(2-4) Read the child node page subTreePage from the buffer according to the page number.

(2-5) If the child node page found is a leaf page, return to (2-1).

(2-6) If the child node page is a branch page, use subTreePage, key, and layer minus 1 as the new input, call the function recursively, and return the output.
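A minimal Python sketch of the lookup steps (1-1) to (2-6) is given below. The `Page` class and its fields are hypothetical stand-ins for buffer pages, and the separator convention (child j holds keys that are at least the (j-1)-th separator and less than the j-th) is an assumption, since the text does not specify it. Child pages are held as direct references rather than read from a buffer by page number.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Page:
    keys: list                                     # sorted keys (separators in a branch page)
    children: list = field(default_factory=list)   # child pages; empty for a leaf page
    rids: list = field(default_factory=list)       # record ids, aligned with keys in a leaf page

    @property
    def is_leaf(self):
        return not self.children


def tree_search(current_page, key, layer):
    if current_page.is_leaf:
        # (2-1) Binary search inside the leaf page.
        pos = bisect.bisect_left(current_page.keys, key)
        if pos < len(current_page.keys) and current_page.keys[pos] == key:
            return current_page.rids[pos]
        return None
    # (2-3) Select the subtree that may contain the key.
    child_no = bisect.bisect_right(current_page.keys, key)
    # (2-4) In a real system this page would be read from the buffer by page number.
    sub_tree_page = current_page.children[child_no]
    # (2-5)/(2-6) Recurse one layer down; a leaf page is handled by (2-1).
    return tree_search(sub_tree_page, key, layer - 1)


def search(root, term_id, layer):
    # (1-1) Look up the leaf record; (1-2) None plays the role of the null value.
    record = tree_search(root, term_id, layer)
    return record


leaf_a = Page(keys=[1, 3], rids=["r1", "r3"])
leaf_b = Page(keys=[5, 7], rids=["r5", "r7"])
root = Page(keys=[5], children=[leaf_a, leaf_b])
```

Here `search(root, 7, 1)` descends to `leaf_b` and returns its record id, while a key that is absent, such as 4, yields `None`.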

Validation of the hybrid index

The quality of index construction directly affects how well the data are organized and how efficiently queries return results. To validate the two-layer hybrid index structure proposed by the present invention, its index construction time performance is compared and analyzed.

Time performance analysis and comparison

Let n₁ and n₂ be, respectively, the number of numeric attributes in the data set and the average number of their attribute values, and let n₃ and n₄ be, respectively, the number of character attributes and the average number of their attribute values. The total number of attribute values is then N = n₁×n₂ + n₃×n₄. Assume the first layer is a B+ tree index of order k and the second layer is a B+ tree index of order m.

The height of the first-layer B+ tree of the hybrid index structure is log_k(n₁+n₃), assuming that every node of the B+ tree other than the leaf nodes has k child nodes. The number of nodes in the first-layer B+ tree index that need to be split is then FB_div, calculated by formula (3-1):

The height of the second-layer B+ tree is log_m n₂, assuming that every node of the B+ tree index other than the leaf nodes has m child nodes. The number of nodes in this B+ tree that need to be split is then SB_div, calculated by formula (3-2):

Then

The number of all division nodes is a total of：

If the entire index of the data set used a traditional B+ tree structure, that is, if a tree index were built over all attribute values, the total number of split nodes would be:

Comparing formula (3-3) with formula (3-4) shows that the hybrid index structure of the present invention has a clear advantage in index creation time over a single index structure.
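To make the quantities in this analysis concrete, the short Python sketch below evaluates N and the tree heights for one set of sample values. All the numbers are hypothetical and chosen only for illustration, and rounding the logarithm up to a whole number of levels is an assumption (the text states the heights as log_k(n₁+n₃) and log_m n₂ without rounding). The sketch does not reproduce formulas (3-1) to (3-4).

```python
def total_values(n1, n2, n3, n4):
    """N = n1*n2 + n3*n4, the total number of attribute values."""
    return n1 * n2 + n3 * n4


def tree_height(entries, order):
    """Smallest h with order**h >= entries, i.e. log_order(entries) rounded up."""
    height, capacity = 0, 1
    while capacity < entries:
        capacity *= order
        height += 1
    return height


# Hypothetical data set: 8 numeric attributes with 10000 values each on average,
# 8 character attributes with 500 values each on average.
n1, n2, n3, n4 = 8, 10000, 8, 500
k, m = 4, 10  # assumed orders of the first-layer and second-layer B+ trees

N = total_values(n1, n2, n3, n4)
first_layer_height = tree_height(n1 + n3, k)   # tree over the attributes only
second_layer_height = tree_height(n2, m)       # one tree per numeric attribute
single_tree_height = tree_height(N, m)         # one B+ tree over every value
```

With these sample values the single tree over all N values is one level taller than each second-layer tree, and the character attribute values never enter a tree at all, which is the intuition behind the comparison in the text.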

2. Data filtering

Although traditional collaborative filtering recommendation algorithms have achieved good results in practice, they still suffer from problems such as data sparsity and low computational efficiency. The present invention proposes a data filtering algorithm based on the equal model. Starting from the fact that item vectors are too long, the algorithm proposes a way of representing item vectors with the equal model, which effectively shortens the time needed to compute item similarity, improves the efficiency with which the recommender system processes big data, and is better suited to large-scale data sets.

Basic principle of the equal model

The essence of the equal model is to extract the main rating features of an item through layered rating averages, compressing the length of the item rating vector while preserving recommendation accuracy, and thereby greatly improving recommendation efficiency. The equal model's compression of the user-item rating matrix is shown in Fig. 2 and Fig. 3, where m >> t.

Definition 3.1: The equal model is a vector transformation model that extracts item rating features through layered averages; it takes the form of an ordered complete binary tree. When the item has no rating information, the tree is empty; otherwise, in this binary tree every left child node is smaller than its parent node, every right child node is larger than its parent node, and every subtree satisfies the same rule.

Definition 3.2: In the layering of the equal model, the root node of the binary tree is layer 0 of the equal model. It is the overall mean of the item rating vector, represents the overall level of users' ratings of the item, and is regarded as the main rating feature of the item. By analogy, the averages at the other layers of the equal model represent the finer rating features of the item.

Equal model transformation:

Suppose the rating vector of the item i to be transformed is I_i = {r_1i, r_2i, r_3i, ..., r_mi}. After the equal model transformation, the vector I_i is converted to the equal model representation:

I′_i = {t₀, (t₁₀, t₁₁), (t₂₀, t₂₁, t₂₂, t₂₃), (t₃₀, t₃₁, ...), ...}.

where t₀ is the single element of layer 0 of the equal model, (t₁₀, t₁₁) are the two elements of layer 1, and (t₂₀, t₂₁, t₂₂, t₂₃) are the four elements of layer 2. By analogy, the item rating vector can be converted into an equal model with the specified number of layers.

The equal model transformation formula is:

where F_k is the conversion formula for layer k (k ≥ 0) and card(I_i) is the number of ratings of item i. The equal model vector conversion flow is shown in Fig. 4.

Equal model conversion algorithm

Input: the original item rating vector I_i = {r_1i, r_2i, r_3i, ..., r_mi} and the number of conversion layers k.

Output: the equal model item vector I′_i = {t₀, (t₁₀, t₁₁), (t₂₀, t₂₁, t₂₂, t₂₃), (t₃₀, t₃₁, ...), ...}.

Steps:

(1) Layer 0: t₀ is the mean of all ratings in I_i.

(2) First, split the vector I_i into two vectors according to t₀:

(3) Then compute the two elements of layer 1 of the equal model:

(4) Similarly, according to t₁₀ and t₁₁, split the vectors I¹⁰_i and I¹¹_i into the vectors I²⁰_i, I²¹_i and I²²_i, I²³_i respectively, and then compute the four elements t₂₀, t₂₁, t₂₂, t₂₃ of layer 2 of the equal model;

(5) By analogy, the equal model vector is obtained:

I′_i = {t₀, (t₁₀, t₁₁), (t₂₀, t₂₁, t₂₂, t₂₃), (t₃₀, t₃₁, ...), ...}.
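The conversion steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented procedure: the function name `equal_model` is hypothetical, ratings equal to a layer mean are sent to the right-hand sub-vector (the text only says left children are smaller and right children larger than the parent), unrated entries are represented by `None` and dropped, and an empty sub-vector yields `None` at that position.

```python
def equal_model(ratings, layers):
    """Return [[t0], [t10, t11], [t20, t21, t22, t23], ...] for layers 0..layers."""
    # Keep only actual ratings; None marks a user who did not rate the item.
    groups = [[r for r in ratings if r is not None]]
    model = []
    for _ in range(layers + 1):
        # One mean per sub-vector at the current layer (None if it is empty).
        means = [sum(g) / len(g) if g else None for g in groups]
        model.append(means)
        next_groups = []
        for group, mean in zip(groups, means):
            if mean is None:
                next_groups += [[], []]
            else:
                # Left part: ratings below the layer mean.
                next_groups.append([r for r in group if r < mean])
                # Right part: ratings at or above it (tie rule is an assumption).
                next_groups.append([r for r in group if r >= mean])
        groups = next_groups
    return model


# A rating vector with two unrated entries, converted to three layers (0 to 2).
compressed = equal_model([1, None, 2, 4, None, 5], layers=2)
```

Whatever the number of users m, the result has 1 + 2 + 4 = 7 elements for layers 0 to 2, which is the compression that shortens item similarity computation.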

To fully verify the effect of the model of the present invention, comparative experiments on the improvement to the equal model were run on two classic data sets, MovieLens 100K and MovieLens 1M (see Table 1). During the experiments each data set was randomly divided into 5 equal parts, and five-fold cross-validation was used.

Table 1: Experimental data sets

The embodiment of the present invention uses three evaluation metrics, MAE, recall, and NDCG, to evaluate respectively the prediction accuracy, the classification accuracy, and the ranking accuracy of the model of the present invention (Improved MM).
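For readers unfamiliar with the three metrics, the Python sketch below shows one common way to compute them. The text does not give the exact definitions used in the experiments, so the specifics here are assumptions: recall is computed over one recommendation list against one set of relevant items, and NDCG uses a linear gain with a log2 position discount. The function names and sample values are hypothetical.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def recall(recommended, relevant):
    """Share of the relevant items that appear in the recommendation list."""
    relevant = set(relevant)
    return len(relevant.intersection(recommended)) / len(relevant)


def ndcg(relevances):
    """NDCG of a ranked list, given the relevance of each item in ranked order."""
    def dcg(rels):
        # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), ...
        return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(rels))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical values: three predicted ratings, one recommendation list,
# and two rankings of the same three items.
error = mae([4, 3, 5], [5, 3, 3])
hit_rate = recall([1, 2, 3], [2, 3, 4, 5])
best_ranking = ndcg([3, 2, 1])
worst_ranking = ndcg([1, 2, 3])
```

Lower MAE is better, while higher recall and NDCG are better; a ranking that already places the most relevant items first scores an NDCG of 1.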

First, the MAE values are used to compare the recommendation accuracy of the two-layer equal model (level1, comprising layers 0 and 1) and the three-layer equal model (level2, comprising layers 0 to 2) before and after the improvement. Then, recall and NDCG are used to compare how the improved model (Improved MM), the cloud model (Cloud_Model), and the classic Cosine algorithm perform within the IBCF algorithm, providing a supplementary, multi-angle evaluation of Improved MM.

As shown in Fig. 5, level1_Improved MM and level2_Improved MM show a clear improvement in recommendation accuracy over the unimproved equal models with the same number of layers. On the 1M data set, however, the improvement is relatively small, and level1_Improved MM gives almost the same recommendation results as level1_MM. The experimental results show that the improvement to the equal model of the present invention is evident on the 100K data set but weaker on the 1M data set.

3rd, data clusters

Cluster is the major issue in data mining, is also core also one of problem of big data analysis.Means clustering algorithm It is a kind of simple and effective distance algorithm, thus application is quite varied.Different from hierarchical clustering algorithm, changing algorithm needs every time The distance between any two points are calculated, so it has faster convergence rate than hierarchy distance.But k- means clustering algorithms have Two defects, one is to need that cluster numbers are determined in advance, and two be larger by also being influenceed in initial clustering.

The present invention proposes a method for determining the number of clusters based on prediction strength.

Prediction strength is defined in terms of the following quantities.

X_tr and X_te denote the training set and the test set obtained by randomly partitioning the original data; C(X_tr, k) denotes the clustering of the training set into k classes; A_k1, A_k2, …, A_kk denote the k classes into which the test set itself is clustered; i and i′ are sample points in the same class; n_kj is the number of sample points in A_kj; D[C(X_tr, k), X_te] denotes a k x k matrix whose element in row i and column i′ takes the value 0 or 1, where 0 means that i and i′ are not in the same class and 1 means that the training-set clustering places i and i′ in the same class; ps(k) denotes the prediction strength of the clustering result when the number of clusters is k, with values in the interval [0, 1].
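The defining formula itself is not reproduced in this text. As a sketch, assuming the definition follows the standard prediction strength statistic that the symbols above match, it can be written as:

```latex
\[
ps(k) = \min_{1 \le j \le k} \frac{1}{n_{kj}\,(n_{kj}-1)}
        \sum_{i \ne i' \in A_{kj}} D\big[C(X_{tr},k),\, X_{te}\big]_{ii'}
\]
```

Under this reading, each test-set class A_kj contributes the fraction of its sample pairs that the training-set clustering also places together, and ps(k) is the smallest such fraction over the k classes.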

The prediction strength is calculated as follows:

Clearly, the intuitive meaning of prediction strength is the ability of the current clustering result to correctly predict new sample points. In practice, prediction strength can be taken as the objective function, with the number of clusters and the variable subset as the factors that influence it; prediction strength is then maximized by selecting an appropriate number of clusters and an appropriate variable subset.

Because the training set and the test set are divided at random when prediction strength is calculated, chance factors may strongly affect the result. To reduce this effect, the present invention calculates prediction strength with an improved method: the data set is first randomly divided into several equal parts, each part is used in turn as the test set, the prediction strength is obtained for each, and the average of these values is taken as the prediction strength for that number of clusters.
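The averaging procedure above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the standard pairwise form of prediction strength, uses a plain Lloyd-style k-means written with the standard library only, and skips test-set classes with fewer than two points because the pair fraction is undefined for them. All function names and the defaults `folds=5` and `iters=50` are assumptions.

```python
import random
from collections import Counter


def dist2(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: dist2(p, centroids[j]))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd k-means; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty group keeps its previous centroid.
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def prediction_strength(train, test, k, seed=0):
    """Prediction strength for a single train/test split."""
    train_centroids = kmeans(train, k, seed=seed)
    test_centroids = kmeans(test, k, seed=seed)
    own = [nearest(p, test_centroids) for p in test]        # test set's own classes
    by_train = [nearest(p, train_centroids) for p in test]  # classes predicted from training set
    fractions = []
    for j in range(k):
        members = [i for i, label in enumerate(own) if label == j]
        n = len(members)
        if n < 2:
            continue  # assumption: classes without any pair are skipped
        counts = Counter(by_train[i] for i in members)
        together = sum(c * (c - 1) for c in counts.values())  # ordered pairs kept together
        fractions.append(together / (n * (n - 1)))
    return min(fractions) if fractions else 0.0


def averaged_prediction_strength(data, k, folds=5, seed=0):
    """Use each part in turn as the test set and average the results."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    parts = [shuffled[i::folds] for i in range(folds)]
    values = []
    for f in range(folds):
        train = [p for g, part in enumerate(parts) if g != f for p in part]
        values.append(prediction_strength(train, parts[f], k, seed=seed))
    return sum(values) / folds
```

To choose the number of clusters, `averaged_prediction_strength` would be evaluated for each candidate k and the k with the highest value preferred. Each part must contain at least k points for the sampling of initial centroids to succeed.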

The k-means clustering method based on improved prediction strength gives clustering results for big data that are credible in the example and of practical significance. On the basis of the k-means clustering algorithm, the improved prediction strength is introduced and used to determine the clustering variables and the number of clusters. A cluster analysis of the average dwell time on the columns of a big data website shows that the clustering results of this improved big data clustering method have a clearer practical meaning, and that the method of the invention is better suited to cluster analysis of big data than conventional clustering methods.

4. Information extraction

Information extraction here is what is commonly called Information Extraction (IE): the information in the data source to be extracted is given some structured processing and organized into a form that people can easily query and use. In real life and work, information sources are widespread and take many varied and complicated forms; especially in this era of big data, it is often impossible to use information sources correctly and make decisions from them. It is therefore necessary to extract information effectively from these complex sources.

For web page sources that have been clustered, tags that are useless to the user, such as comment tags and script content like "<script>", are removed first, and erroneous or irregular tags are repaired and tidied. Most web pages today are laid out with TABLE or DIV tags, so when processing the data the present invention builds a tree from one of these two kinds of tag: the HTML file is the root node of the tree, and the web page blocks corresponding to these two kinds of tag are the child nodes.

Next, the semantics contained in each part of the content are analyzed. The first step is to collect data from the DIV or TABLE nodes contained in the tag tree under the root node; at this stage only the node content of that layer is extracted.

The tags extracted at the same level then need to be examined further. That is, if semantic detection of an extracted child tag, or inspection of the content it contains, shows that it has little or nothing to do with what the user requested, it can be regarded as redundant information and discarded directly.

Next comes the separator detection step. Because the tags are processed layer by layer, the data unrelated to the user's expectations has already been deleted, so relatively few data blocks remain to be examined, which improves working efficiency and data processing speed.

After the above steps, the web page content has been divided by the DIV or TABLE tags into relatively non-uniform semantic blocks. If these semantic blocks need deeper processing, they must be converted into complete DOM tree form, and data is then extracted step by step, recursively, from the DOM trees containing their respective content.

When the main content of a data block is extracted, all tags contained in the DOM tree are traversed using a word-frequency co-occurrence method. If, during the traversal, some block is found to have little relation to the data the user expects, that is, to be redundant information, it is removed, and the data the user expects to obtain is retained.
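The block-level part of this procedure can be sketched as follows. This is a simplified illustration under stated assumptions: it uses Python's standard `html.parser` module, treats each outermost DIV or TABLE as one block, drops script and style content, and scores relevance by simple term overlap with the user's query. The term-overlap score is a stand-in chosen for brevity; the text names a word-frequency co-occurrence method without specifying it. The names `BlockCollector`, `relevant_blocks`, and `min_hits` are hypothetical.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}   # tags that delimit web page blocks
SKIP_TAGS = {"script", "style"}  # content treated as useless to the user


class BlockCollector(HTMLParser):
    """Collects the text of each outermost DIV/TABLE block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._depth = 0  # nesting depth inside DIV/TABLE
        self._skip = 0   # nesting depth inside script/style
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag in BLOCK_TAGS and self._depth:
            self._depth -= 1
            if self._depth == 0:  # an outermost block has just closed
                text = " ".join(self._buf)
                if text:
                    self.blocks.append(text)
                self._buf = []

    def handle_data(self, data):
        # HTML comments never reach this method, so they are dropped implicitly.
        if self._depth and not self._skip and data.strip():
            self._buf.append(data.strip())


def relevant_blocks(html, query_terms, min_hits=1):
    """Keep blocks whose text shares at least `min_hits` words with the query."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    terms = {t.lower() for t in query_terms}
    kept = []
    for text in parser.blocks:
        hits = sum(1 for word in text.lower().split() if word in terms)
        if hits >= min_hits:
            kept.append(text)
    return kept
```

For example, with the query term "phone", a block containing "ads banner" would be discarded as redundant, while blocks mentioning "phone" would be retained for deeper DOM-level processing.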

The above is only a primary implementation of this intellectual property and does not limit other forms of implementing this new product and/or new method. Those skilled in the art may use this information and modify the above to achieve similar implementations. However, all modifications or adaptations based on the new product of the present invention fall within the reserved rights.

Claims

1. A big data analysis system, characterized by comprising: a data retrieval module, a data filtering module, a data clustering module, and an information extraction module; the data retrieval module is used for data retrieval, separating the data attributes and attribute values in a data set and building a two-layer index structure; the data retrieval module first builds an upper-layer index for the attributes of the data in the data set;

it then builds an index on the data values corresponding to each upper-layer attribute, building a B+ tree index structure for numeric data and an inverted index for character data;

The index creation process specifically comprises:

Step1: first analyze the data for which an index is to be built; if the data is not present in the index already built, build a new index node in the first layer of the hybrid index;

Step2: determine the attribute value type of the newly added data; if it is numeric data, create a B+ tree index for it; if it is a character-type attribute, build an inverted index structure for it;

Step3: repeat Step1; if the current attribute already exists in the index built so far, no new node is added to the first layer of the index, and the data of that attribute is simply added to the corresponding second-layer index;

Step4: repeat the above steps until the index has been built for all data.

2. The big data analysis system according to claim 1, characterized in that the data filtering module is used to filter the data after data retrieval; the filtering applies the following mean-model transformation: suppose the rating vector of item i to be transformed is I_i = {r_1i, r_2i, r_3i, ..., r_mi}; through the mean-model transformation, the vector I_i is converted to the mean-model representation:

I_i' = {t₀, (t₁₀, t₁₁), (t₂₀, t₂₁, t₂₂, t₂₃), (t₃₀, t₃₁, ...), ...};

where t₀ is the only element of layer 0 of the mean model, (t₁₀, t₁₁) are the two elements of layer 1, and (t₂₀, t₂₁, t₂₂, t₂₃) are the four elements of layer 2; by analogy, the item rating vector is converted into a mean model with the specified number of layers.

3. The big data analysis system according to claim 1, characterized in that the data clustering module is used for cluster analysis of the data after data filtering;

(4) for each class into which the test set itself is clustered, check whether any pair of sample points i and i′ is wrongly assigned to different classes by the training-set clustering, and record the proportion correctly assigned;

4. A big data analysis method, characterized by comprising: a data retrieval step, a data filtering step, a data clustering step, and an information extraction step; the data retrieval step is used for data retrieval, separating the data attributes and attribute values in a data set and building a two-layer index structure; the data retrieval step first builds an upper-layer index for the attributes of the data in the data set;

The index creation process specifically comprises:

Step1: first analyze the data for which an index is to be built; if the data is not present in the index already built, build a new index node in the first layer of the hybrid index;

Step4: repeat the above steps until the index has been built for all data.

5. The big data analysis method according to claim 4, characterized in that the data filtering step is used to filter the data after data retrieval; the filtering applies the following mean-model transformation: suppose the rating vector of item i to be transformed is I_i = {r_1i, r_2i, r_3i, ..., r_mi}; through the mean-model transformation, the vector I_i is converted to the mean-model representation:

6. The big data analysis method according to claim 4, characterized in that the data clustering step is used for cluster analysis of the data after data filtering;