CN106484813A

CN106484813A - Big data analysis system and method

Info

Publication number: CN106484813A
Application number: CN201610848904.9A
Authority: CN
Inventors: 韦天瀚; 刘国庆; 李海威; 黄震廷; 吴华
Original assignee: GUANGDONG GANG XIN SCIENCE AND TECHNOLOGY Co Ltd
Current assignee: GUANGDONG GANG XIN SCIENCE AND TECHNOLOGY Co Ltd
Priority date: 2016-09-23
Filing date: 2016-09-23
Publication date: 2017-03-08
Anticipated expiration: 2036-09-23
Also published as: CN106484813B

Abstract

The invention discloses a big data analysis system and method. The big data analysis system comprises a data retrieval module, a data filtering module, a data clustering module, and an information extraction module. The data retrieval module is used for data retrieval: it separates the data attributes in a data set from the attribute values and builds a two-layer index structure. The data retrieval module first builds an upper-layer index over the attributes of the data in the data set, and then builds an index over the data values corresponding to each upper-layer attribute: a B+ tree index structure for numeric data and an inverted index for character data. In the example, the k-means clustering method of the invention, based on improved prediction strength, produces clustering results for big data that are credible and of practical significance.

Description

Big data analysis system and method

Technical field

The present invention relates to the field of computer science and technology, and in particular to a big data analysis system and method.

Background art

At present, the Internet connects all networked computers and has fundamentally changed the way people work and live; it is the first choice for obtaining all kinds of data. The way a client obtains data from a server over the Internet can be summarized as a "request" plus "response" pattern, which is the basic model of Internet application protocols.

Clicking the mouse sends a command and then performs an access. Every user's access record is written in detail to the browsing log, including specifics such as time, request content, and address. The data on the Internet is made up of these access records taken together. Just as a hunter catches prey by following its tracks, access logs hold enormous value, and they are therefore one of the important sources of big data.

The world's largest Internet companies, such as Google, Amazon, Facebook, and Twitter, dominate the global Internet industry. Their success has one factor in common: very strong data analysis capability. These companies analyze and process large amounts of data every day and use big data as a means of uncovering business opportunities; Google is the most typical example. According to statistics, Google handles more than one hundred billion searches per month, and the data volume processed in analyzing that search information reaches 600 PB (1 PB = 1 million GB; this amount of information is said to be equivalent to 1,000,000 years of morning newspapers). All content searched through the Google search engine can be used in its analysis. For example, when a keyword is typed into the Google search box, information related to the search content is displayed: if "big data" is entered, the results suggest items such as "big data concept", "big data era", and "big data technology". This is the result of applying big data technology to a large body of historical search information. In addition, if the input is erroneous, or is entered directly in pinyin, Google automatically corrects the search content and offers a correct suggestion; this function relies on the same search principle.

Compared with traditional enterprise operation data, big data has two differences.

First, the data volume is huge, but unlike traditional data such as sales volume and inventory quantity, the data generated by clicks on websites is analyzed and managed by Internet companies such as Google and Facebook in a very different way. The core of big data processing is not structured data but the data produced by the website clickstreams mentioned above and by social networks, as well as sensor data. Such data cannot be stored in a database and is called unstructured data.

Second, in terms of the type of business doing the data processing, those that have truly mastered massive data storage and analysis technology are not traditional brick-and-mortar industries but emerging Internet companies (Google), social networks (Facebook), and e-commerce companies (Amazon). The former can commission the latter to provide big data analysis and processing services.

Facebook generates 30 PB of data, while Wal-Mart generates only 2.5 PB; the two differ greatly not only in data volume but also in the diversity of the data and the speed at which it is generated. It follows that, during the period when the Internet was flourishing and other companies tended to overlook the value of data, the large Internet companies developed low-cost storage and processing technology in good time, extracted the valuable information, and integrated it into their business processes. They gradually built their own competitive advantage and came to stand out. At present, as the influence of these Internet companies grows, more and more companies are paying attention to big data analysis and using big data to provide new services, improve customer satisfaction, and thereby strengthen their competitive advantage.

In just two or three years, big data has rapidly penetrated different industries and fields, and its rapid development has greatly improved production efficiency; the development trend of big data is closely tied to productivity growth.

The data volume grows exponentially. The joint findings of many research institutions show that the total amount of data worldwide will grow exponentially over the next several years. According to estimates by the U.S. consultancy McKinsey, in 2010 enterprises worldwide stored more than 7 EB of new data, and consumers stored more than 6 EB of new data on personal computers.

The intensity and content of big data differ from industry to industry. The amount of data each industry stores is different, and as big data grows, the types of data produced and stored also differ by industry. The sectors with the largest data storage are securities, investment consulting, and banks and other financial institutions; communications carriers, media, and government and public institutions also produce data on a very large scale. Industries that hold such data assets have great potential value in the use of big data.

Existing trends will continue to drive data growth. Across regions and industries, enterprises are collecting data at an accelerating pace, which also drives the growth of traditional transactional databases. The wide use of multimedia in fields that affect people's livelihood, such as health care, greatly increases the generation of big data. The widespread use of social networking and of the Internet of Things in production and daily life also drives the continued growth of big data, and the cross-application of these industries further stimulates the growth of big data and the rapid expansion of the data pool.

Big data is the next frontier for driving productivity growth. For big data to deliver stronger competitiveness, productivity, and innovation, suitable policies are needed to promote it; this is also a key element in creating consumer surplus. In the health care industry, making full use of big data can reduce operating costs, avoid unnecessary treatment, lower the probability of treatment accidents, and improve the quality of medical services. In public administration, tax authorities can use big data to advance tax collection work and improve the efficiency of the departments concerned. In retail, industry efficiency can be improved by applying big data across supply chains and suppliers. In marketing, making full use of big data helps consumers find products that meet their needs at a more suitable price and increases the added value of services.

Data is now an asset on a par with physical assets and human capital, and it is also a factor of production. As emerging industries such as multimedia and the Internet of Things develop, enterprises will collect more information from these media, bringing a rapid increase in data. Big data has great potential for commercial services and for creating value for consumers.

Summary of the invention

The technical problem to be solved by the present invention is to provide a big data analysis system and method. In the big data analysis method of the invention, a hybrid index combines and retains the advantages of both the B+ tree and the inverted index while avoiding the shortcomings of each. It improves the speed of index construction and space utilization while also supporting range queries on numeric data. The data filtering of the invention extracts the rating features of items by compressing the item vectors, which effectively addresses the sparsity problem in recommender systems and greatly improves the efficiency of computing item similarity. Finally, the effect of the improvement to the mean model is verified experimentally; the results show that the improved mean model of the invention gives better recommendations for items with few ratings and better meets the application requirements of real systems.

To solve the above technical problem, the invention provides a big data analysis system comprising a data retrieval module, a data filtering module, a data clustering module, and an information extraction module.

The data retrieval module is used for data retrieval; it separates the data attributes in a data set from the attribute values and builds a two-layer index structure.

The data retrieval module first builds an upper-layer index over the attributes of the data in the data set;

it then builds an index over the data values corresponding to each upper-layer attribute: a B+ tree index structure for numeric data and an inverted index for character data.

The data filtering module is used to filter the data after data retrieval. The data filtering uses the following mean model transformation: assume the rating vector of item i to be transformed is I_i = {r_1i, r_2i, r_3i, ..., r_mi}. After the mean model transformation, the vector I_i is converted to the mean model representation:

I′_i = {t₀, (t₁₀, t₁₁), (t₂₀, t₂₁, t₂₂, t₂₃), (t₃₀, t₃₁, ...), ...};

where t₀ is the single element of layer 0 of the mean model, (t₁₀, t₁₁) are the two elements of layer 1, and (t₂₀, t₂₁, t₂₂, t₂₃) are the four elements of layer 2; by analogy, the item rating vector is converted to a mean model with the specified number of layers.

The data clustering module is used for cluster analysis of the filtered data;

the cluster analysis uses the prediction strength analysis method, which proceeds as follows:

(1) randomly divide the original data to be clustered into a training set and a test set;

(2) taking the number of clusters as k, cluster each of the two subsets; the clustering result is denoted type I clustering;

(3) use the clustering result of the training set to classify the test set; the result is denoted type II clustering;

(4) for each class into which the test set itself was clustered, check for every pair of sample points i and i′ whether they were wrongly assigned to different classes in the type II clustering, and record the proportion of pairs that were correctly assigned;

(5) the smallest of these k proportions is the prediction strength for the current number of clusters k.

To solve the above technical problem, the invention also provides a big data analysis method comprising a data retrieval step, a data filtering step, a data clustering step, and an information extraction step.

The data retrieval step is used for data retrieval; it separates the data attributes in a data set from the attribute values and builds a two-layer index structure.

The data retrieval step first builds an upper-layer index over the attributes of the data in the data set;

The data filtering step is used to filter the data after data retrieval. The data filtering uses the following mean model transformation: assume the rating vector of item i to be transformed is I_i = {r_1i, r_2i, r_3i, ..., r_mi}. After the mean model transformation, the vector I_i is converted to the mean model representation:

I′_i = {t₀, (t₁₀, t₁₁), (t₂₀, t₂₁, t₂₂, t₂₃), (t₃₀, t₃₁, ...), ...};

The data clustering step is used for cluster analysis of the filtered data;

The beneficial technical effects of the invention are:

(1) The hybrid index of the invention combines and retains the advantages of both the B+ tree and the inverted index while avoiding the shortcomings of each. It improves the speed of index construction and space utilization while also supporting range queries on numeric data.

(2) The mean model data filtering of the invention extracts the rating features of items by compressing the item vectors, which effectively addresses the sparsity problem in recommender systems and greatly improves the efficiency of computing item similarity. The effect of the improvement to the mean model is verified experimentally; the results show that the improved mean model of the invention gives better recommendations for items with few ratings and better meets the application requirements of real systems.

(3) The k-means clustering method of the invention, based on improved prediction strength, produces clustering results for big data that are credible and practically meaningful in the example. On the basis of the k-means clustering algorithm, an improved prediction strength is introduced and used to determine the clustering variables and the number of clusters. A cluster analysis of the average dwell time on the columns of a big data website shows that the clustering results of this improved method have a clearer practical meaning, and that the clustering method of the invention is better suited to cluster analysis of big data than conventional clustering methods.

Brief description of the drawings

Fig. 1 is a structure diagram of the two-layer hybrid big data index according to an embodiment of the invention;

Fig. 2 is a schematic diagram of vector compression of the user-item rating matrix according to an embodiment of the invention;

Fig. 3 is a schematic diagram of vector compression of the dimension-reduced user-item rating matrix according to an embodiment of the invention;

Fig. 4 is a flow chart of the mean model vector transformation according to an embodiment of the invention;

Fig. 5 is an evaluation chart of the mean model algorithm (100K) according to an embodiment of the invention.

Detailed description

Embodiments of the invention are described in detail below with reference to examples, so that the process by which the invention applies technical means to solve the technical problem and achieve the technical effect can be fully understood and carried out.

It should be noted that, to keep the description concise and avoid unnecessary repetition, the embodiments in this application and the features in the embodiments may be combined with one another provided they do not conflict.

1. Data retrieval

The invention proposes a hybrid index structure based on the inverted index and the B+ tree. The leaf nodes of a B+ tree are ordered, which gives it a clear advantage for range retrieval over numeric data; it can sustain a heavy workload with relatively stable I/O cost. The inverted index does not support range retrieval over numeric data well, but it is relatively simple to implement, fast to query, and can locate a result in a single lookup, so it supports index construction for character data well.

On top of a traditional index, the idea of a hierarchical index is introduced: the data attributes in the data set are separated from the attribute values and a two-layer index structure is built. First, an upper-layer index is built over the attributes of the data in the data set. Second, an index is built over the data values corresponding to each upper-layer attribute: a B+ tree index structure for numeric data and an inverted index for character data. Because tree indexes are not built for all data, less storage space is wasted through node splitting; in addition, less extra storage is taken up by the temporary nodes produced during node splitting, which speeds up index construction and improves storage utilization. A range query on numeric data goes directly to the lower-layer tree index, which reduces query time and cost.

The hybrid index designed in the invention combines and retains the advantages of both the B+ tree and the inverted index while avoiding the shortcomings of each. It improves the speed of index construction and space utilization while also supporting range queries on numeric data.

The two-layer hybrid big data index structure of the invention is shown in Fig. 1:

The upper-layer tree index is built mainly over the attributes contained in the data set. In this layer, the specific attributes of the data are all stored in the non-leaf nodes, while every leaf node of the B+ tree stores three pieces of information, A_i, PType, and Pointer, with the following meanings:

(1) A_i is a specific attribute of the indexed data set, where n is the number of all attributes and i ∈ [1, n];

(2) PType is the pointer type, one of PType {Inverted_index, B+ tree};

(3) Pointer is the pointer to the lower-layer index. Depending on the data type, it points to a different index structure, that is, to the header of an inverted list or to the root node of a B+ tree.
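The three-field leaf entry described above can be sketched in Python as follows. This is only an illustrative sketch: the names `UpperLeafEntry` and `make_entry` are hypothetical, and a plain list and a plain dict stand in for the B+ tree root and the inverted-list header, which the source does not specify in code form.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class PType(Enum):
    """Pointer type of an upper-layer leaf entry."""
    INVERTED_INDEX = "Inverted_index"
    B_PLUS_TREE = "B+ tree"


@dataclass
class UpperLeafEntry:
    attribute: str  # A_i: the indexed attribute
    ptype: PType    # PType: which kind of lower-layer index follows
    pointer: Any    # Pointer: inverted-list header or B+ tree root


def make_entry(attribute, sample_value):
    """Pick the lower-layer index type from the type of a sample value.

    Assumption: a list stands in for a B+ tree root and a dict for an
    inverted-list header; a real system would use proper structures.
    """
    is_numeric = isinstance(sample_value, (int, float)) and not isinstance(sample_value, bool)
    if is_numeric:
        return UpperLeafEntry(attribute, PType.B_PLUS_TREE, [])
    return UpperLeafEntry(attribute, PType.INVERTED_INDEX, {})
```

For instance, `make_entry("age", 30)` yields an entry that points to a tree-style index, while `make_entry("city", "Guangzhou")` yields one that points to an inverted index.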

The second-layer index is built over the data values corresponding to the first-layer attributes and comprises the B+ tree index structures built for numeric data and the inverted list indexes built for character data. Specific data values are stored in the non-leaf nodes of the B+ tree index structure, while the leaf nodes are arranged in order and contain three pieces of index-file information, A_RV_S, Loc, and Doc, with the following meanings:

(1) A_RV_S is the S-th attribute value of the R-th attribute, R ∈ [1, n₂], S ∈ [1, p], where n₂ is the number of numeric attributes in the data set and p is the number of data values of the R-th attribute.

(2) Loc is the location of the file that contains this attribute value.

(3) Doc is the number of the document that contains the query keyword; Doc is unique.

The inverted index is divided into two parts. One is the "dictionary", an index table made up of the distinct index terms, which records the different Chinese keywords and their related information. The other is the "record table" (posting list), which records, for each index term, the set of documents that contain it and related information such as their storage addresses. The second-layer inverted index structure contains four pieces of information, A_iV_j, Doc, Loc, and F, with the following meanings:

(1) A_iV_j is the j-th attribute value of the i-th attribute, i ∈ [1, n₁], j ∈ [1, m], where n₁ is the number of character attributes and m is the number of attribute values of the i-th attribute.

(2) Doc is the number of the document that contains the query keyword; Doc is unique.

(3) Loc is the location of the file that contains the query keyword.

(4) F is the frequency with which the query keyword occurs in the data set.
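As a rough illustration of the dictionary and posting-list split with the Doc, Loc, and F fields, the following Python sketch builds an inverted index for one character attribute. The function name `build_inverted_index` and the `(doc_id, location, value)` input layout are assumptions made for this example, not part of the source.

```python
from collections import defaultdict


def build_inverted_index(records):
    """Build a dictionary -> posting-list mapping for one character attribute.

    records: iterable of (doc_id, location, value) tuples (assumed layout).
    Each dictionary entry keeps the postings as (Doc, Loc) pairs and the
    frequency F, counted here as the number of occurrences of the value.
    """
    postings = defaultdict(list)
    for doc_id, location, value in records:
        postings[value].append((doc_id, location))
    return {
        keyword: {"postings": entries, "F": len(entries)}
        for keyword, entries in postings.items()
    }


records = [(1, "/data/a", "red"), (2, "/data/b", "blue"), (3, "/data/c", "red")]
inverted = build_inverted_index(records)
```

Looking up `inverted["red"]` then returns both documents that contain the value together with its frequency, which is the single-lookup behaviour the text attributes to the inverted index.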

Index creation process:

Step 1: First analyze the data to be indexed. If this data is not present in the index built so far, create a new index node in the first layer of the hybrid index.

Step 2: Determine the attribute value type of the newly added data. For numeric data, create a B+ tree index; for a character attribute, create an inverted index structure.

Step 3: Repeat Step 1. If the current attribute already exists in the index built so far, do not add a new node to the first layer; only add the data of this attribute to the corresponding second-layer index.

Step 4: Repeat the above steps until all data has been indexed.
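Steps 1 to 4 can be sketched as below. This is a minimal sketch under stated assumptions rather than the patented implementation: the class `HybridIndex` and its method names are hypothetical, a sorted list maintained with `bisect` stands in for the B+ tree (it keeps values ordered and supports range queries, but has no pages or node splitting), and a dict stands in for the inverted index.

```python
import bisect


class HybridIndex:
    """Two-layer index: attribute -> lower-layer structure for its values."""

    def __init__(self):
        # Upper layer: attribute name -> (kind, lower-layer structure).
        self.upper = {}

    def insert(self, attribute, value, doc_id):
        numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if attribute not in self.upper:
            # Steps 1-2: new attribute, so add a first-layer node and choose
            # the lower-layer structure from the value type.
            if numeric:
                self.upper[attribute] = ("btree", {"keys": [], "docs": []})
            else:
                self.upper[attribute] = ("inverted", {})
        # Step 3: the attribute exists, so only the second layer changes.
        kind, lower = self.upper[attribute]
        if kind == "btree":
            pos = bisect.bisect_right(lower["keys"], value)
            lower["keys"].insert(pos, value)
            lower["docs"].insert(pos, doc_id)
        else:
            lower.setdefault(value, []).append(doc_id)

    def range_query(self, attribute, low, high):
        """Return doc ids whose numeric value lies in [low, high]."""
        kind, lower = self.upper[attribute]
        if kind != "btree":
            raise TypeError("range queries need a numeric attribute")
        start = bisect.bisect_left(lower["keys"], low)
        end = bisect.bisect_right(lower["keys"], high)
        return lower["docs"][start:end]

    def lookup(self, attribute, value):
        """Return doc ids whose character value equals value."""
        kind, lower = self.upper[attribute]
        if kind != "inverted":
            raise TypeError("exact lookup here expects a character attribute")
        return lower.get(value, [])


rows = [
    (1, {"age": 31, "city": "Guangzhou"}),
    (2, {"age": 25, "city": "Shenzhen"}),
    (3, {"age": 42, "city": "Guangzhou"}),
]
index = HybridIndex()
for doc_id, row in rows:  # Step 4: repeat until all data is indexed
    for attribute, value in row.items():
        index.insert(attribute, value, doc_id)
```

With this sample data, a range query on `age` is served by the ordered numeric structure and an exact match on `city` by the inverted structure, mirroring the division of labour described above.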

Index query method:

First analyze the query condition to obtain the keyword, and pass the query keyword to the index dictionary. If the index flag is False, return a null value to indicate that the data being queried does not exist in the index file. If it is True, determine the data type of the result returned for this query term and go to the corresponding index according to the type; read the number of the term and the numbers of the documents that contain it, and from these obtain the information relevant to the query condition. Then read the contents of the B+ tree index or inverted index according to the term number, consolidate the retrieved content, compare it for relevance against the search condition, rank the results, and return the final result to the user. The key value term_id in the data table is the input of the search algorithm, and the output is a Boolean value. The detailed process is as follows:

(1-1) With root, term_id, and layer as input parameters, call the lookup function treeSearch(root, term_id, layer) and assign the lookup result to the leaf page record, record.

(1-2) If record is empty, return a null value directly; otherwise, return the actual lookup result rid.

The lookup function treeSearch takes the current page currentPage, the search key key, and the initial layer number layer as input, and outputs the leaf record leafRecord that may contain the search key. The detailed process is as follows:

(2-1) If the current page is a leaf page, search for key with the binary search algorithm and return the lookup result.

(2-2) If the current page is not a leaf page, perform steps (2-3) to (2-6).

(2-3) Using currentPage and the key value, select the subtree that contains the key and obtain the page number pageNo of the child node.

(2-4) Read the child node page subTreePage from the buffer according to the page number.

(2-5) If the child node page found is a leaf page, return to (2-1).

(2-6) If the child node page is a branch page, take subTreePage, key, and layer minus 1 as the new input, call the function recursively, and return the output.
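The lookup steps (1-1) to (2-6) can be sketched as a recursive function. The `Page` layout, the `buffer` dictionary mapping page numbers to pages, and the convention that a key equal to a separator goes to the right subtree are all assumptions made for this example; the source names only `treeSearch`, `currentPage`, `key`, `layer`, `pageNo`, and `subTreePage`.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Page:
    is_leaf: bool
    keys: list                                    # sorted keys or separators
    children: list = field(default_factory=list)  # page numbers (branch page)
    rids: list = field(default_factory=list)      # record ids (leaf page)


def tree_search(current_page, key, layer, buffer):
    """Return the rid stored under key, or None if the key is absent."""
    if current_page.is_leaf:
        # (2-1) binary search inside the leaf page
        i = bisect.bisect_left(current_page.keys, key)
        if i < len(current_page.keys) and current_page.keys[i] == key:
            return current_page.rids[i]
        return None
    # (2-3) choose the subtree that may contain the key
    page_no = current_page.children[bisect.bisect_right(current_page.keys, key)]
    # (2-4) read the child page from the buffer by page number
    sub_tree_page = buffer[page_no]
    # (2-5)/(2-6) recurse one layer down; the leaf case is handled on entry
    return tree_search(sub_tree_page, key, layer - 1, buffer)


def search(root, term_id, layer, buffer):
    # (1-1) look up the record, (1-2) None means "not found"
    return tree_search(root, term_id, layer, buffer)


buffer = {
    1: Page(True, [1, 3], rids=["r1", "r3"]),
    2: Page(True, [5, 7], rids=["r5", "r7"]),
}
root = Page(False, [5], children=[1, 2])
```

In this sketch `layer` is only decremented on each descent, as in step (2-6); the recursion actually terminates on the leaf test.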

Validation of the hybrid index

The quality of index construction directly affects how well the data is organized and how efficiently queries run. The two-layer hybrid index structure proposed by the invention is validated by comparing and analyzing the time performance of index construction.

Time performance analysis and comparison

Let n₁ and n₂ be the number of numeric attributes in the data set and the average number of their attribute values, and let n₃ and n₄ be the number of character attributes and the average number of their attribute values. The total number of attribute values is then N = n₁ × n₂ + n₃ × n₄. Assume the first layer is a B+ tree index of order k and the second layer is a B+ tree index of order m.

The height of the first-layer B+ tree of the hybrid index structure is log_k(n₁ + n₃), assuming that every node of the B+ tree other than the leaf nodes has k child nodes. The number of nodes that the first-layer B+ tree index needs to split is FB_div, given by formula (3-1):

The height of the second-layer B+ tree is log_m n₂, assuming that every node of the B+ tree index other than the leaf nodes has m child nodes. The number of nodes that this B+ tree needs to split is SB_div, given by formula (3-2):

Then

The total number of split nodes is:

If the entire index of the data set used a traditional B+ tree structure, that is, if a tree index were built over all attribute values, the total number of split nodes would be:

Comparing formula (3-3) with formula (3-4) shows that the hybrid index structure of the invention has a clear advantage in index creation time over a single index structure.
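To make the quantities in this analysis concrete, the short Python sketch below evaluates only the expressions that are stated in the text: the total N = n₁ × n₂ + n₃ × n₄ and the two tree heights log_k(n₁ + n₃) and log_m n₂. The function name and the sample numbers are hypothetical, and the split-node counts of formulas (3-1) to (3-4) are not computed here.

```python
import math


def hybrid_index_quantities(n1, n2, n3, n4, k, m):
    """Evaluate N and the two B+ tree heights used in the analysis."""
    total_values = n1 * n2 + n3 * n4            # N = n1*n2 + n3*n4
    first_layer_height = math.log(n1 + n3, k)   # log_k(n1 + n3)
    second_layer_height = math.log(n2, m)       # log_m(n2)
    return total_values, first_layer_height, second_layer_height


# Hypothetical data set: 8 numeric attributes with 1000 values each on
# average, 8 character attributes with 500 values each on average.
N, h1, h2 = hybrid_index_quantities(8, 1000, 8, 500, k=4, m=10)
```

With these sample numbers the upper tree indexes only 16 attributes, so it stays very shallow compared with a single tree over all N attribute values.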

2. Data filtering

Although traditional collaborative filtering recommendation algorithms have achieved good results in practice, they still suffer from problems such as sparsity, low computational efficiency, and poor scalability. The invention proposes a data filtering algorithm based on the mean model. Starting from the excessive length of item vectors, the algorithm proposes a way of representing item vectors with the mean model, which effectively shortens the time needed to compute item similarity, improves the recommender system's efficiency in processing big data, and applies better to large-scale data sets.

Basic principle of the mean model

In essence, the mean model extracts the main rating features of an item through layered rating means, compressing the length of the item rating vector while preserving recommendation accuracy, and thereby greatly improving recommendation efficiency. The process by which the mean model compresses the user-item rating matrix is shown in Fig. 2 and Fig. 3, where m >> t.

Definition 3.1: The mean model is a vector transformation model that extracts item rating features through layered means; it takes the form of an ordered complete binary tree. When an item has no rating information, the tree is empty; otherwise, in this binary tree every left child node is less than its parent node, every right child node is greater than its parent node, and every subtree satisfies the same rule.

Definition 3.2: In the layering of the mean model, the root node of the binary tree is layer 0 of the mean model. It is the overall mean of the item rating vector, represents the overall level of users' ratings of the item, and is regarded as the item's main rating feature. By analogy, the means at the other layers of the mean model represent the finer rating features of the item.

Mean model transformation:

Assume the rating vector of item i to be transformed is I_i = {r_1i, r_2i, r_3i, ..., r_mi}. After the mean model transformation, the vector I_i is converted to the mean model representation:

I′_i = {t₀, (t₁₀, t₁₁), (t₂₀, t₂₁, t₂₂, t₂₃), (t₃₀, t₃₁, ...), ...}.

Here t₀ is the single element of layer 0 of the mean model, (t₁₀, t₁₁) are the two elements of layer 1, and (t₂₀, t₂₁, t₂₂, t₂₃) are the four elements of layer 2. By analogy, the item rating vector can be converted to a mean model with the specified number of layers.

The mean model transformation formula is:

Here F_k is the transformation formula for layer k (k ≥ 0) and card(I_i) is the number of ratings of item i. The mean model vector transformation flow is shown in Fig. 4.

Mean model conversion algorithm

Input: the original item rating vector I_i = {r_1i, r_2i, r_3i, ..., r_mi} and the number of layers k to convert.

Output: the mean model item vector I′_i = {t₀, (t₁₀, t₁₁), (t₂₀, t₂₁, t₂₂, t₂₃), (t₃₀, t₃₁, ...), ...}.

Steps:

(1) Layer 0: compute t₀, the overall mean of the rating vector I_i;

(2) first, split the vector I_i into two vectors, I¹⁰_i and I¹¹_i, according to t₀;

(3) then compute the two elements of layer 1 of the mean model, t₁₀ and t₁₁;

(4) in the same way, split the vectors I¹⁰_i and I¹¹_i according to t₁₀ and t₁₁ into the vectors I²⁰_i, I²¹_i and I²²_i, I²³_i respectively, and then compute the four elements of layer 2 of the mean model, t₂₀, t₂₁, t₂₂, t₂₃;

(5) by analogy, obtain the mean model vector

I′_i = {t₀, (t₁₀, t₁₁), (t₂₀, t₂₁, t₂₂, t₂₃), (t₃₀, t₃₁, ...), ...}.
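The conversion steps above can be sketched in Python as follows. This is a sketch under explicit assumptions, not the exact formula of the source: ratings strictly below a node's mean go to the left sub-vector and ratings greater than or equal to it go to the right, and an empty sub-vector yields `None`; the source text does not state how ties or empty sub-vectors are handled. The function name `mean_model` is hypothetical.

```python
def mean_model(ratings, layers):
    """Return [[t0], [t10, t11], [t20, t21, t22, t23], ...] for layers 0..layers.

    ratings: the ratings an item actually received (missing ratings excluded).
    Assumptions: values below the mean go left, values at or above the mean
    go right, and an empty sub-vector yields None.
    """
    def mean(values):
        return sum(values) / len(values) if values else None

    level_vectors = [list(ratings)]
    result = []
    for _ in range(layers + 1):
        means = [mean(v) for v in level_vectors]
        result.append(means)
        next_vectors = []
        for vector, t in zip(level_vectors, means):
            if t is None:
                next_vectors += [[], []]
            else:
                next_vectors.append([r for r in vector if r < t])   # left part
                next_vectors.append([r for r in vector if r >= t])  # right part
        level_vectors = next_vectors
    return result


# Ratings 1, 2, 4, 5 have overall mean 3.0; the low half averages 1.5 and the
# high half 4.5, so layers 0 and 1 are [3.0] and [1.5, 4.5].
compressed = mean_model([1, 2, 4, 5], layers=2)
```

However many users rated the item, a two-layer model keeps 3 numbers and a three-layer model keeps 7, which is the compression of the item vector that the text describes.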

To fully verify the effect of the mean model of the invention, comparative experiments on the improvement to the mean model were run on two classic data sets, MovieLens 100K and MovieLens 1M (see Table 1). In the experiments, each data set was randomly divided into 5 equal parts and tested with five-fold cross-validation.

Table 1: Experimental data sets

The embodiment of the invention uses three evaluation metrics, MAE, recall, and NDCG, to evaluate respectively the prediction accuracy, classification accuracy, and ranking accuracy of the mean model of the invention (Improved MM).
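For readers unfamiliar with the three metrics, the sketch below shows one common way to compute them. The source does not give its exact formulas, so these definitions are assumptions: MAE as the mean absolute difference between predicted and actual ratings, recall as the share of relevant items that appear in the recommendation list, and NDCG with the gain r / log2(position + 1) for 1-based positions.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def recall(recommended, relevant):
    """Share of relevant items that appear among the recommended items."""
    relevant = set(relevant)
    return len(relevant.intersection(recommended)) / len(relevant)


def ndcg(relevances):
    """NDCG of a ranked list, given the relevance of each ranked item."""
    def dcg(rels):
        # enumerate() is 0-based, so position + 2 equals (1-based rank) + 1.
        return sum(r / math.log2(pos + 2) for pos, r in enumerate(rels))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A lower MAE means more accurate rating predictions, while higher recall and NDCG mean better classification and ranking respectively.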

First, the MAE value is used to compare the recommendation accuracy of the two-layer mean model (level1, comprising layers 0 and 1) and the three-layer mean model (level2, comprising layers 0 to 2) before and after the improvement. Then, recall and NDCG are used to compare how the improved mean model (Improved MM), the cloud model (Cloud_Model), and the classic Cosine algorithm perform within the IBCF algorithm, providing a supplementary, multi-angle evaluation of Improved MM.

As shown in Fig. 5, level1_Improved MM and level2_Improved MM show a clear improvement in recommendation accuracy over the unimproved mean models with the same number of layers. On the 1M data set, however, the improvement is relatively small, and level1_Improved MM performs almost the same as level1_MM. The experimental results show that the improvement to the mean model of the invention is clear on the 100K data set but weaker on the 1M data set.

3. Data clustering

Clustering is an important problem in data mining and one of the core problems of big data analysis. The k-means clustering algorithm is a simple and effective distance-based algorithm and is therefore very widely used. Unlike hierarchical clustering, it does not need to compute the distance between every pair of points at each step, so it converges faster than hierarchical clustering. However, the k-means clustering algorithm has two defects: the number of clusters must be determined in advance, and the result is strongly affected by the initial clustering.

The invention proposes a method for determining the number of clusters based on prediction strength.

Prediction strength is defined as

$$ps(k) = \min_{1 \le j \le k} \frac{1}{n_{kj}(n_{kj}-1)} \sum_{i \ne i' \in A_{kj}} D[C(X_{tr}, k), X_{te}]_{ii'}$$

Here X_tr and X_te denote the training set and the test set obtained by randomly dividing the original data; C(X_tr, k) denotes the clustering of the training set into k classes; A_k1, A_k2, …, A_kk denote the k classes into which the test set itself is clustered; i and i′ are sample points in the same class; n_kj is the number of sample points in A_kj; D[C(X_tr, k), X_te] denotes a matrix whose element in row i and column i′ is 0 or 1, where 1 indicates that the training-set clustering places i and i′ in the same class and 0 indicates that it does not; ps(k) denotes the prediction strength of the clustering result when the number of clusters is k, and takes values in [0, 1].

The calculation process of the predicted intensity is as follows:

Intuitively, the predicted intensity measures the ability of the current clustering result to correctly predict new sample points. In practice, the predicted intensity can be taken as the objective function, with the number of clusters and the variable subset as the factors that affect it; an appropriate number of clusters and variable subset are selected so that the predicted intensity is maximized.

In calculating the predicted intensity, the training set and test set are divided randomly, so some accidental factors may considerably affect the result. To reduce the influence of accidental factors, the present invention calculates the predicted intensity with an improved method. Specifically, the data set is first randomly divided into several equal parts, and each part is used in turn as the test set; after the respective predicted intensities are obtained, their mean is taken as the predicted intensity for that number of clusters.
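The improved calculation can be sketched as follows. This is a minimal illustration and not the implementation of the invention: the function names, the plain Lloyd-style k-means with random initial centers, the default of 4 parts, and the choice to skip test classes with fewer than two points are all assumptions made for the example.

```python
import random

def _dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def _nearest(p, centers):
    return min(range(len(centers)), key=lambda j: _dist2(p, centers[j]))

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd k-means (assumed variant); returns the k cluster centers."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[_nearest(p, centers)].append(p)
        new_centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def prediction_strength(train, test, k, rng):
    """ps(k) for a single training/test division."""
    train_centers = kmeans(train, k, rng)
    test_centers = kmeans(test, k, rng)
    own = [_nearest(p, test_centers) for p in test]    # classes A_k1..A_kk
    pred = [_nearest(p, train_centers) for p in test]  # C(X_tr, k) applied to X_te
    ratios = []
    for j in range(k):
        members = [i for i, lab in enumerate(own) if lab == j]
        n = len(members)
        if n < 2:
            continue  # assumption: a class with fewer than two points has no pairs
        same = sum(1 for a in members for b in members
                   if a != b and pred[a] == pred[b])
        ratios.append(same / (n * (n - 1)))
    return min(ratios) if ratios else 0.0

def improved_prediction_strength(points, k, folds=4, seed=0):
    """Each part serves once as the test set; the results are averaged."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    parts = [shuffled[f::folds] for f in range(folds)]
    scores = []
    for f in range(folds):
        train = [p for g, part in enumerate(parts) if g != f for p in part]
        scores.append(prediction_strength(train, parts[f], k, rng))
    return sum(scores) / len(scores)
```

A candidate number of clusters could then be chosen by evaluating `improved_prediction_strength(points, k)` over a range of k and keeping the value that gives the highest score. Because this sketch uses random initial centers, a production version would likely repeat k-means with several initializations.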

The k-means clustering method based on the improved predicted intensity gives clustering results for the big data in the example that are credible and of practical significance. On the basis of the k-means clustering algorithm, the improved predicted intensity is introduced and used to determine the clustering variables and the number of clusters. Cluster analysis of the mean residence time on the columns of a big data website shows that the clustering results of this improved big data clustering method have a clearer practical meaning, and that the method of the invention is better suited to cluster analysis of big data than conventional clustering methods.

4th, information retrieval

Information retrieval here is in fact what is commonly called information extraction (Information Extraction, IE): the information to be extracted from a data source is given some structured processing and organized into a form that people can easily query and use. In real life and work, information sources are wide-ranging and the forms they take are varied and complex; particularly in the big data era, information sources often cannot be used correctly to make decisions. It is therefore necessary to carry out effective information retrieval on these complex information sources.

For the web page information sources that have been processed by cluster analysis, tags that are useless to the user, such as comment tags and "<script>" and other script content, are first removed, and erroneous or irregular tags are repaired and tidied. A large number of web pages today are laid out with TABLE or DIV tags, so when processing data the present invention builds a tag tree from one of these two kinds of tags: the HTML file is the root node of the tree, and the web page blocks corresponding to the two kinds of tags are the child nodes.
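The construction of the tag tree can be illustrated with a short sketch using Python's standard `html.parser` module. The class name `BlockTreeBuilder`, the helper `build_block_tree`, and the dictionary layout of a node are hypothetical choices made for this example. Treating `style` as a useless tag alongside `script` is likewise an assumption, and this sketch tracks DIV and TABLE together.

```python
from html.parser import HTMLParser

class BlockTreeBuilder(HTMLParser):
    """Builds a tree of DIV/TABLE blocks under an 'html' root node.

    Script and style content is skipped, and comments are dropped because
    handle_comment is left at its default no-op behaviour.
    """
    BLOCK_TAGS = {"div", "table"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.root = {"tag": "html", "text": [], "children": []}
        self._stack = [self.root]
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS and not self._skip_depth:
            node = {"tag": tag, "text": [], "children": []}
            self._stack[-1]["children"].append(node)
            self._stack.append(node)

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif (tag in self.BLOCK_TAGS and not self._skip_depth
              and len(self._stack) > 1 and self._stack[-1]["tag"] == tag):
            # A mismatched closing tag is ignored, which tolerates irregular markup.
            self._stack.pop()

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._stack[-1]["text"].append(data.strip())

def build_block_tree(html):
    builder = BlockTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Text that belongs to other tags, such as table cells, is attached to the nearest enclosing DIV or TABLE block, so each node carries the text of its own web page block.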

Next, the semantics contained in these content parts are analyzed. The first step is to collect data from the DIV or TABLE nodes contained directly under the root node of the tag tree; only the content of the nodes at this level is extracted at this stage.

The extracted tags at the same level then need to be examined further. That is, if semantic detection of an extracted child tag shows that its content has little or no relation to the content the user requested, the tag can be regarded as redundant information, and the redundant content can be discarded directly.

The next step is separator detection. Because the tags are processed layer by layer, the data unrelated to what the user expects has already been deleted in the preceding step, so the number of data blocks left to examine is relatively small, which improves working efficiency and data processing speed.

After the above steps, the web page content has been divided into semantic blocks, marked by DIV or TABLE tags, whose content is relatively non-uniform. If these semantic blocks require deeper processing, they must be converted into complete DOM tree form, and data is then extracted step by step, recursively, from the DOM trees, each of which holds different content.

When extracting the main content of a data block, all tags contained in the DOM tree can be traversed using the word-frequency co-occurrence method. If, during the traversal, the content of some block is found to have little relation to the data the user wants, that is, it is redundant information, it can be removed, and the data the user expects to obtain is retained.
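The recursive traversal and removal of redundant blocks can be sketched as below. The relevance score used here, the share of words in a block that match the user's keywords, is a deliberately simplified stand-in for the word-frequency co-occurrence method, whose details the text does not give. The node layout, the function names and the threshold of 0.1 are likewise assumptions for illustration.

```python
import re
from collections import Counter

def words(text):
    return re.findall(r"\w+", text.lower())

def subtree_text(node):
    """Concatenates the text of a block and of all blocks nested inside it."""
    parts = list(node["text"])
    for child in node["children"]:
        parts.append(subtree_text(child))
    return " ".join(parts)

def relevance(node, keywords):
    """Share of words in the block's subtree that are user keywords (assumed score)."""
    counts = Counter(words(subtree_text(node)))
    total = sum(counts.values())
    hits = sum(counts[k] for k in keywords)
    return hits / total if total else 0.0

def extract(node, keywords, threshold=0.1):
    """Recursively keeps child blocks that reach the threshold and drops the rest."""
    kept = [
        extract(child, keywords, threshold)
        for child in node["children"]
        if relevance(child, keywords) >= threshold
    ]
    return {"tag": node["tag"], "text": list(node["text"]), "children": kept}
```

Because a discarded block is never descended into, the deeper levels of the tree only see blocks that survived the level above, which mirrors the layer-by-layer processing described earlier.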

All of the above is merely the primary implementation of this intellectual property and does not limit implementation of this new product and/or new method in other forms. Those skilled in the art may use this information to modify the above content to achieve similar implementations. However, all modifications or alterations based on the new product of the present invention fall within the reserved rights.

Claims

1. A big data analysis system, characterized in that it comprises: a data retrieval module, a data filtering module, a data clustering module, and an information extraction module.

2. The big data analysis system according to claim 1, characterized in that the data retrieval module is used for data retrieval, separating the data attributes and attribute values in the data set and building a double-layer index structure.

3. The big data analysis system according to claim 1, characterized in that the data retrieval module first builds an upper-layer index on the attributes of the data in the data set;

secondly, an index is built on the data values corresponding to the upper-layer attributes: a B+ tree index structure is built for numeric data, and an inverted index is built for character data.

4. The big data analysis system according to claim 1, characterized in that the data filtering module is used to filter the data after data retrieval; the data filtering adopts the following equal model transformation: assume the rating vector of item i to be transformed is I_i = {r_1i, r_2i, r_3i, ..., r_mi}; through the equal model transformation, the vector I_i is converted into the equal model representation:

I_i' = {t₀, (t₁₀, t₁₁), (t₂₀, t₂₁, t₂₂, t₂₃), (t₃₀, t₃₁, ...), ...};

Wherein t₀ is the only element of layer 0 of the equal model, (t₁₀, t₁₁) are the two elements of layer 1, (t₂₀, t₂₁, t₂₂, t₂₃) are the four elements of layer 2, and so on; the item rating vector is thus converted into an equal model with the specified number of layers.

5. The big data analysis system according to claim 1, characterized in that the data clustering module is used for cluster analysis of the data after data filtering;

(4) within each class into which the test set itself is clustered, check whether any pair of sample points i and i' is wrongly divided into different classes by the training-set clustering, and record the proportion of pairs that are correctly divided;

6. A big data analysis method, characterized in that it comprises: a data retrieval step, a data filtering step, a data clustering step, and an information retrieval step.

7. The big data analysis method according to claim 6, characterized in that the data retrieval step is used for data retrieval, separating the data attributes and attribute values in the data set and building a double-layer index structure.

8. The big data analysis method according to claim 6, characterized in that the data retrieval step first builds an upper-layer index on the attributes of the data in the data set;

9. The big data analysis method according to claim 6, characterized in that the data filtering step is used to filter the data after data retrieval; the data filtering adopts the following equal model transformation: assume the rating vector of item i to be transformed is I_i = {r_1i, r_2i, r_3i, ..., r_mi}; through the equal model transformation, the vector I_i is converted into the equal model representation:

10. The big data analysis method according to claim 6, characterized in that the data clustering step is used for cluster analysis of the data after data filtering;