CN106484813A - Big data analysis system and method - Google Patents
- Publication number
- CN106484813A (application number CN201610848904.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- index
- cluster
- retrieval
- attribute
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
- G06F16/2237—Vectors, bitmaps or matrices
- G06F16/2272—Management thereof
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a big data analysis system and method. The big data analysis system comprises a data retrieval module, a data filtering module, a data clustering module, and an information extraction module. The data retrieval module performs data retrieval: it separates the data attributes in a data set from their attribute values and builds a two-layer index structure. The data retrieval module first builds an upper-layer index over the attributes of the data in the data set; it then builds an index over the data values corresponding to each upper-layer attribute, constructing a B+ tree index for numeric data and an inverted index for character data. The k-means clustering method of the invention, based on improved prediction strength, gives clustering results for big data that are credible in the example and practically meaningful.
Description
Technical field
The present invention relates to the field of computer science and technology, and in particular to a big data analysis system and method.
Background art
Currently, the Internet connects all networked computers and has fundamentally changed how people work and live; it is now the first choice for obtaining all kinds of data. The pattern by which a client obtains data from a server over the Internet can be summarized as "request" + "response". This is the basic model of Internet application protocols.
A mouse click sends a command and then performs an access. Each person's access record is kept in detail in the browsing log, including specifics such as the time, the requested content, and the address. Data on the Internet is composed of these access records joined together. Just as a hunter catches prey by following its tracks, access logs hold enormous value. They are therefore one of the important sources of big data.
The world's largest Internet companies, such as Google, Amazon, Facebook, and Twitter, dominate the global Internet industry. Their success has one factor in common: very strong data analysis capability. These companies analyze and process large amounts of data every day and use big data to uncover commercial opportunities; Google is the most typical representative among them. According to statistics, Google handles more than one hundred billion searches per month, and in analyzing and processing the search information it handles 600 PB of data (1 PB = 1 million GB; this amount of information is said to equal one million years of morning newspapers). All content searched through the Google search engine can be used in its analysis. For example, when a keyword is typed into the Google search box, information related to the search content is displayed: entering "big data" prompts suggestions such as "big data concept", "big data era", and "big data technology". This is the result of applying big data technology to a large volume of historical search information. In addition, if the input is misspelled or entered directly in pinyin, Google automatically corrects the search content and offers the correct suggestion; this feature applies the same search theory.
Compared with traditional enterprise operating data, big data differs in two ways.
First, the data volume is huge, but unlike traditional data such as sales volume and inventory, the click data that Internet companies such as Google and Facebook generate on their websites is analyzed and managed very differently. The core of big data processing is not structured data but the clickstream data produced on these websites, the data produced on social networks, and the data stored in sensors; such data cannot be stored in a database and is called unstructured data.
Second, in terms of the kinds of businesses doing the data processing, those that truly command massive data storage and analysis technology are not traditional brick-and-mortar industries but emerging Internet companies (Google), social networks (Facebook), and e-commerce companies (Amazon). The former can commission the latter to carry out big data analysis and processing services for them.
Facebook can produce 30 PB of data, while Wal-Mart produces only 2.5 PB; the gap lies not only in data volume but also in the variety of the data and the speed at which it is generated. It follows that, during the Internet boom, the large Internet companies valued the data that other companies tended to neglect, developed low-cost storage and processing technology in good time, extracted the valuable information, and integrated it into their business processes. They gradually built their own competitive advantage and stood out among Internet companies. As the influence of these Internet companies grows, more companies are starting to pay attention to big data analysis, using big data to provide new services, raise customer satisfaction, and in turn strengthen their competitive advantage.
In just two or three years, big data has rapidly penetrated different industries and fields, and its fast pace of development has greatly raised production efficiency; the development trend of big data is closely tied to gains in productivity.
Data volume grows exponentially. Joint findings from many research institutions show that total global data will grow exponentially over the next several years. According to estimates by the US consultancy McKinsey, enterprises worldwide stored more than 7 EB of new data in 2010, and consumers stored more than 6 EB of new data on personal computers.
The intensity and content of big data differ by industry. The volume of data stored differs across industries, and as big data grows, the types of data produced and stored also differ by industry. The fields storing the most data are financial institutions such as securities, investment consulting, and banking; communications companies, media, and government public institutions also produce data at a very large scale. These data-rich industries have great potential value in the use of big data.
Existing trends will continue to drive data growth. Across regions and industries, companies are collecting data at an accelerating pace, which also drives the growth of traditional transaction databases; the wide use of multimedia in public-welfare fields such as health care greatly increases the generation of big data; the spread of social networking and the wide use of the Internet of Things in work and daily life keep big data growing; and cross-application among these industries further stimulates the growth of big data and the rapid expansion of data pools.
Big data is the next frontier technology for driving productivity. For big data to deliver stronger competitiveness, productivity, and innovation capability, suitable policy support is needed; this is also a key element in creating consumer surplus. In health care, making full use of big data can lower operating costs, avoid unnecessary treatment, reduce the probability of treatment accidents, and improve the quality of medical services; in public administration, tax departments can use big data to advance tax work and improve the efficiency of the relevant departments; in retail, industry efficiency can be raised through big data applications across the supply chain and suppliers; in marketing, making full use of big data helps consumers find products that meet their needs at more suitable prices and raises the added value of services.
Today, data is also an asset, on a par with material assets and human capital, and it is also a factor of production. As emerging industries such as multimedia and the Internet of Things develop in social life, companies will collect more information from these media, bringing rapid growth in data. Big data has shown great potential in commercial services and in creating value for consumers.
Summary of the invention
The technical problem to be solved by the invention is to provide a big data analysis system and method. In the big data analysis method of the invention, the hybrid index combines and inherits the advantages of both the B+ tree and the inverted index while avoiding the shortcomings of each. It improves the speed and space utilization of index construction and also supports range queries over numeric data. The data filtering of the invention extracts the rating features of items by compressing item vectors, which effectively addresses the sparsity problem in recommender systems and greatly improves the efficiency of computing item similarity. Finally, experiments verify the improvement to the mean model; the results show that the improved mean model gives better recommendations for items with few ratings and better meets the needs of real systems.
To solve the above technical problem, the invention provides a big data analysis system comprising: a data retrieval module, a data filtering module, a data clustering module, and an information extraction module.
The data retrieval module performs data retrieval: it separates the data attributes in the data set from their attribute values and builds a two-layer index structure.
The data retrieval module first builds an upper-layer index over the attributes of the data in the data set;
it then builds an index over the data values corresponding to each upper-layer attribute, constructing a B+ tree index structure for numeric data and an inverted index for character data.
The data filtering module filters the data after data retrieval. The data filtering uses the following mean model transformation: suppose the rating vector of item i to be transformed is $I_i = \{r_{1i}, r_{2i}, r_{3i}, \ldots, r_{mi}\}$. After the mean model transformation, the vector $I_i$ is converted into the mean model representation:
$I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \ldots), \ldots\}$;
where $t_0$ is the only element of layer 0 of the mean model, $(t_{10}, t_{11})$ are the two elements of layer 1, and $(t_{20}, t_{21}, t_{22}, t_{23})$ are the four elements of layer 2; by analogy, the item rating vector is converted into a mean model with the specified number of layers.
The data clustering module performs cluster analysis on the data after data filtering;
the cluster analysis uses the prediction strength method, which is as follows:
(1) randomly divide the original data to be clustered into a training set and a test set;
(2) take the number of clusters to be k and cluster each of the two subsets; the results are denoted type I clusterings;
(3) classify the test set using the clustering result of the training set; the result is denoted the type II clustering;
(4) for each cluster formed by clustering the test set itself, check whether any pair of sample points i and i′ is wrongly split into different clusters in the type II clustering, and record the proportion of pairs that are correctly kept together;
(5) the smallest of these k proportions is the prediction strength for the current number of clusters k.
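Steps (4) and (5) can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `prediction_strength` and its two label lists are hypothetical, the clustering of steps (1) to (3) is assumed to have been done already by any k-means routine, and skipping singleton clusters is an assumption of the sketch.

```python
from itertools import combinations


def prediction_strength(test_labels, predicted_labels):
    """Prediction strength for one value of k.

    test_labels[i]      -- cluster of test point i when the test set is
                           clustered on its own (step 2).
    predicted_labels[i] -- cluster given to test point i by the
                           training-set clustering (step 3).
    """
    # Group test-point indices by their own-cluster label.
    clusters = {}
    for idx, label in enumerate(test_labels):
        clusters.setdefault(label, []).append(idx)

    proportions = []
    for members in clusters.values():
        pairs = list(combinations(members, 2))
        if not pairs:
            # A singleton cluster has no pairs to check (assumption: skip it).
            continue
        # Step 4: count pairs that the training clustering keeps together.
        kept = sum(1 for a, b in pairs
                   if predicted_labels[a] == predicted_labels[b])
        proportions.append(kept / len(pairs))
    # Step 5: the smallest proportion is the prediction strength.
    return min(proportions) if proportions else 0.0


test_labels = [0, 0, 0, 1, 1, 1]
predicted_labels = [0, 0, 1, 1, 1, 1]
ps = prediction_strength(test_labels, predicted_labels)
```

Only co-membership of pairs is compared, so the cluster labels of the two clusterings do not need to match numerically.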
To solve the above technical problem, the invention also provides a big data analysis method comprising: a data retrieval step, a data filtering step, a data clustering step, and an information retrieval step.
The data retrieval step performs data retrieval: it separates the data attributes in the data set from their attribute values and builds a two-layer index structure.
The data retrieval step first builds an upper-layer index over the attributes of the data in the data set;
it then builds an index over the data values corresponding to each upper-layer attribute, constructing a B+ tree index structure for numeric data and an inverted index for character data.
The data filtering step filters the data after data retrieval. The data filtering uses the following mean model transformation: suppose the rating vector of item i to be transformed is $I_i = \{r_{1i}, r_{2i}, r_{3i}, \ldots, r_{mi}\}$. After the mean model transformation, the vector $I_i$ is converted into the mean model representation:
$I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \ldots), \ldots\}$;
where $t_0$ is the only element of layer 0 of the mean model, $(t_{10}, t_{11})$ are the two elements of layer 1, and $(t_{20}, t_{21}, t_{22}, t_{23})$ are the four elements of layer 2; by analogy, the item rating vector is converted into a mean model with the specified number of layers.
The data clustering step performs cluster analysis on the data after data filtering;
the cluster analysis uses the prediction strength method, which is as follows:
(1) randomly divide the original data to be clustered into a training set and a test set;
(2) take the number of clusters to be k and cluster each of the two subsets; the results are denoted type I clusterings;
(3) classify the test set using the clustering result of the training set; the result is denoted the type II clustering;
(4) for each cluster formed by clustering the test set itself, check whether any pair of sample points i and i′ is wrongly split into different clusters in the type II clustering, and record the proportion of pairs that are correctly kept together;
(5) the smallest of these k proportions is the prediction strength for the current number of clusters k.
The beneficial technical effects of the invention are:
(1) The hybrid index of the invention combines and inherits the advantages of both the B+ tree and the inverted index while avoiding the shortcomings of each. It improves the speed and space utilization of index construction and also supports range queries over numeric data.
(2) The mean model data filtering of the invention extracts the rating features of items by compressing item vectors, which effectively addresses the sparsity problem in recommender systems and greatly improves the efficiency of computing item similarity. Experiments verify the improvement to the mean model; the results show that the improved mean model gives better recommendations for items with few ratings and better meets the needs of real systems.
(3) The k-means clustering method of the invention, based on improved prediction strength, gives clustering results for big data that are credible in the example and practically meaningful. On the basis of the k-means clustering algorithm, an improved prediction strength is introduced and used to determine the clustering variables and the number of clusters. A cluster analysis of mean dwell time across the columns of a big data website shows that the clustering results of this improved method have a clearer practical meaning, and that the clustering method of the invention is better suited than conventional clustering methods to cluster analysis of big data.
Brief description of the drawings
Fig. 1 is a structural diagram of the two-layer hybrid big data index according to an embodiment of the invention;
Fig. 2 is a schematic diagram of user-item rating matrix vector compression according to an embodiment of the invention;
Fig. 3 is a schematic diagram of dimension-reduced user-item rating matrix vector compression according to an embodiment of the invention;
Fig. 4 is a flow chart of the mean model vector transformation according to an embodiment of the invention;
Fig. 5 is an evaluation chart of the mean model algorithm (100K) according to an embodiment of the invention.
Detailed description of the embodiments
Embodiments of the invention are described in detail below with reference to examples, so that the process by which the invention applies technical means to solve technical problems and achieve technical effects can be fully understood and implemented accordingly.
It should be noted that, to keep the description concise and avoid unnecessary repetition, the embodiments of this application and the features of the embodiments may be combined with one another where they do not conflict.
1. Data retrieval
The invention proposes a hybrid index structure based on an inverted index and a B+ tree. The leaf nodes of a B+ tree are ordered, which gives it a clear advantage for range retrieval over numeric data; it can sustain heavy workloads with relatively stable I/O cost. An inverted index does not support range retrieval over numeric data well, but it is relatively simple to implement, fast to query, and can locate a result in a single lookup, so it supports index construction for character data well.
On top of traditional indexing, the idea of hierarchical indexing is introduced: the data attributes in the data set are separated from their attribute values, and a two-layer index structure is built. First, an upper-layer index is built over the attributes of the data in the data set. Second, an index is built over the data values corresponding to each upper-layer attribute: a B+ tree index structure for numeric data and an inverted index for character data. Because not all data is placed in a tree index, less storage space is wasted by node splitting; the extra storage occupied by the temporary nodes produced during splitting is also reduced, which speeds up index construction and improves storage utilization. A range query over numeric data goes directly to the lower-layer tree index, which reduces query time and cost.
The hybrid index designed in the invention combines and inherits the advantages of both the B+ tree and the inverted index while avoiding the shortcomings of each. It improves the speed and space utilization of index construction and also supports range queries over numeric data.
The two-layer hybrid big data index structure of the invention is shown in Fig. 1.
The upper-layer tree index is built over the attributes contained in the data set. In this layer the specific attributes of the data are all stored in non-leaf nodes, while every leaf node of the B+ tree stores three pieces of information, $A_i$, PType, and Pointer, with the following meanings:
(1) $A_i$ is a specific attribute of the indexed data set, where n is the number of all attributes and i ∈ [1, n];
(2) PType is the pointer type; the possible types are PType ∈ {Inverted_index, B+ tree};
(3) Pointer is the pointer to the lower-layer index; depending on the data type it points to a different index structure, that is, to the header of an inverted list or to the root node of a B+ tree.
The layer-2 index is built over the data values corresponding to the layer-1 attributes. It comprises the B+ tree index structures built for numeric data and the inverted list indexes built for character data. Specific data values are stored in the non-leaf nodes of the B+ tree index structure, while the leaf nodes are arranged in order and contain three pieces of index-file information, $A_RV_S$, Loc, and Doc, with the following meanings:
(1) $A_RV_S$ is the S-th attribute value of the R-th attribute, with R ∈ [1, n2] and S ∈ [1, p], where n2 is the number of numeric attributes in the data set and p is the number of data values of the R-th attribute.
(2) Loc is the location of the file containing this attribute value.
(3) Doc is the number of the document containing the query keyword; Doc is unique.
The inverted index has two parts. One is the "dictionary", an index table made up of distinct index terms that records the different Chinese keywords and their related information. The other is the "log", which records the set of documents associated with each index term together with related information such as their storage addresses. The layer-2 inverted index structure contains four pieces of information, $A_iV_j$, Doc, Loc, and F, with the following meanings:
(1) $A_iV_j$ is the j-th attribute value of the i-th attribute, with i ∈ [1, n1] and j ∈ [1, m], where n1 is the number of character attributes and m is the number of attribute values of the i-th attribute.
(2) Doc is the number of the document containing the query keyword; Doc is unique.
(3) Loc is the location of the file containing the query keyword.
(4) F is the frequency with which the query keyword occurs in the data set.
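The dictionary-plus-postings layout described above might look like the following Python sketch. It is illustrative only: the `Posting` class, the `build_inverted_index` helper, the sample paths, and the whitespace tokenizer are all hypothetical, and the frequency F is counted per document here, which is an assumption (the text defines it over the data set).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Posting:
    doc: int    # Doc: number of the document containing the keyword
    loc: str    # Loc: where that document is stored
    freq: int   # F: keyword occurrences (per document in this sketch)


def build_inverted_index(documents):
    """documents maps doc number -> (location, text); returns term -> postings."""
    index = defaultdict(list)
    for doc, (loc, text) in sorted(documents.items()):
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, freq in counts.items():
            index[term].append(Posting(doc, loc, freq))
    return dict(index)


docs = {
    1: ("/data/a.txt", "big data index"),
    2: ("/data/b.txt", "inverted index index"),
}
inverted = build_inverted_index(docs)
```

Looking up a keyword is then a single dictionary access, which matches the one-step positioning attributed to the inverted index.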
Index building process:
Step 1: First analyze the data to be indexed. If its attribute is not yet in the index being built, create a new index node in the first layer of the hybrid index.
Step 2: Determine the attribute value type of the newly added data. If it is numeric, create a B+ tree index for it; if it is a character attribute, build an inverted index structure for it.
Step 3: Repeat Step 1. If the current attribute already exists in the index being built, do not add a new node to the first layer; only add the data of this attribute to the corresponding second-layer index.
Step 4: Repeat the above steps until an index has been built for all data.
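The four steps can be sketched as follows. This is a simplified illustration under stated assumptions: a plain dictionary stands in for the upper-layer index, a sorted list maintained with `bisect` stands in for each lower-layer B+ tree, and the names `build_hybrid_index` and `range_query` are hypothetical.

```python
import bisect
from numbers import Number


def build_hybrid_index(records):
    """records: list of (doc_id, {attribute: value}) pairs."""
    upper = {}  # first layer: attribute -> (pointer type, lower-layer index)
    for doc_id, fields in records:
        for attr, value in fields.items():
            if attr not in upper:
                # Steps 1-2: new attribute, pick the lower index by value type.
                if isinstance(value, Number):
                    upper[attr] = ("B+tree", [])  # sorted list as a stand-in
                else:
                    upper[attr] = ("Inverted_index", {})
            ptype, lower = upper[attr]
            # Step 3: for a known attribute only the second layer grows.
            if ptype == "B+tree":
                bisect.insort(lower, (value, doc_id))
            else:
                lower.setdefault(value, []).append(doc_id)
    return upper


def range_query(upper, attr, low, high):
    """Doc ids whose numeric attribute value lies in [low, high]."""
    ptype, lower = upper[attr]
    if ptype != "B+tree":
        raise ValueError("range queries need a numeric attribute")
    result = []
    # (low,) sorts before every (low, doc_id), so this finds the first value >= low.
    for value, doc_id in lower[bisect.bisect_left(lower, (low,)):]:
        if value > high:
            break
        result.append(doc_id)
    return result


records = [
    (1, {"age": 30, "city": "Beijing"}),
    (2, {"age": 25, "city": "Shanghai"}),
    (3, {"age": 41, "city": "Beijing"}),
]
index = build_hybrid_index(records)
```

The ordered lower layer is what makes the range scan possible; the character attribute is only reachable by exact value.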
Index query method:
First the query condition is analyzed to obtain the keyword, and the query keyword is passed to the index dictionary. If the index flag is False, a null value is returned, indicating that the data being queried does not exist in the index file. If it is True, the data type of the result for this query term is determined and the matching index is located according to that type; the number of the term and the numbers of the documents containing the term are read, and from these the information relevant to the query condition is obtained. The content in the B+ tree index or inverted index is then read according to the term number, the retrieved content is consolidated, its relevance to the search condition is compared, the results are ranked, and the final result is returned to the user. The key value term_id in the data table is the input to the search algorithm and the output is a Boolean value. The detailed process is as follows:
(1-1) With root, term_id, and layer as input parameters, call the search function treeSearch(root, term_id, layer) and assign the search result to the leaf page record record.
(1-2) If record is empty, return a null value directly; otherwise, return the actual search result rid.
The search function treeSearch takes the current page currentPage, the search key key, and the initial layer number layer as input, and outputs the leaf record leafRecord that may contain the search key key. The detailed process is as follows:
(2-1) If the current page is a leaf page, search for the key key with binary search and return the search result.
(2-2) If the current page is not a leaf page, perform steps (2-3) to (2-6).
(2-3) Using currentPage and the key value, select the subtree that contains the key and obtain the page number pageNo of the child node.
(2-4) Read the child node page subTreePage from the buffer according to the page number.
(2-5) If the child node page found is a leaf page, return to (2-1).
(2-6) If the child node page is a branch page, take subTreePage, key, and layer minus 1 as the new input, call the function recursively, and return the output.
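A recursive search along the lines of steps (2-1) to (2-6) could be sketched like this. The `Page` layout, the dictionary used as the page buffer, and the convention that keys equal to a separator go to the right subtree are assumptions of the sketch, not details given in the text; `layer` is carried along only to mirror the described signature.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Page:
    keys: list                                     # sorted keys (separators on branch pages)
    is_leaf: bool = True
    rids: list = field(default_factory=list)       # leaf only: record ids, parallel to keys
    children: list = field(default_factory=list)   # branch only: child page numbers


def tree_search(buffer, current_page, key, layer):
    """Return the rid stored under key, or None if the key is absent."""
    if current_page.is_leaf:
        # (2-1) binary search inside the leaf page
        pos = bisect.bisect_left(current_page.keys, key)
        if pos < len(current_page.keys) and current_page.keys[pos] == key:
            return current_page.rids[pos]
        return None
    # (2-3) choose the subtree that can contain the key
    page_no = current_page.children[bisect.bisect_right(current_page.keys, key)]
    # (2-4) read the child page from the buffer
    sub_tree_page = buffer[page_no]
    # (2-5)/(2-6) descend one layer; a leaf child is handled by the check above
    return tree_search(buffer, sub_tree_page, key, layer - 1)


buffer = {
    1: Page(keys=[3, 7], rids=["r3", "r7"]),
    2: Page(keys=[10, 15], rids=["r10", "r15"]),
}
root = Page(keys=[10], is_leaf=False, children=[1, 2])
found = tree_search(buffer, root, 15, 1)
```

Returning `None` for a missing key corresponds to the null value of step (1-2).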
Validity verification of the hybrid index
The quality of index construction directly affects how well the data is organized and how efficiently queries return results. The validity of the two-layer hybrid index structure proposed by the invention is verified by comparing and analyzing the time performance of index construction.
Time performance analysis and comparison
Let n1 and n2 be the number of numeric attributes in the data set and the mean number of their attribute values, and let n3 and n4 be the number of character attributes and the mean number of their attribute values. The total number of attribute values is then N = n1 × n2 + n3 × n4. Assume the first layer is a B+ tree index of order k and the second layer is a B+ tree index of order m.
The height of the first-layer B+ tree of the hybrid index structure is $\log_k(n_1 + n_3)$, assuming every node of the B+ tree other than the leaf nodes has k children. The number of nodes that must split in the first-layer B+ tree index, $FB_{div}$, is given by formula (3-1).
The height of a second-layer B+ tree is $\log_m n_2$, assuming every node of the B+ tree index other than the leaf nodes has m children. The number of nodes that must split in this B+ tree, $SB_{div}$, is given by formula (3-2).
The total number of split nodes is given by formula (3-3).
If the whole data set were indexed with a traditional B+ tree structure alone, that is, if a tree index were built over all attribute values, the total number of split nodes would be given by formula (3-4).
Comparing formula (3-3) with formula (3-4) shows that the hybrid index structure of the invention has a clear advantage in index creation time over a single index structure.
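As a rough illustration of why such a comparison favors the hybrid structure, and not as a reconstruction of formulas (3-1) to (3-4), one can count the internal nodes of full trees as a proxy for split nodes. The derivation below rests on that assumption, uses the symbols defined above, and takes k = m in the last line for simplicity.

```latex
% Internal nodes of a full k-ary tree of height h (assumed proxy for split nodes):
\sum_{j=0}^{h-1} k^{j} = \frac{k^{h}-1}{k-1}

% First layer, h = \log_k(n_1+n_3):
\frac{n_1+n_3-1}{k-1}

% Second layer, one m-ary tree of height \log_m n_2 per numeric attribute:
n_1 \cdot \frac{n_2-1}{m-1}

% Single B+ tree of order m over all N = n_1 n_2 + n_3 n_4 values:
\frac{n_1 n_2 + n_3 n_4 - 1}{m-1}

% Difference (single minus hybrid) when k = m:
\frac{n_1 n_2 + n_3 n_4 - 1}{m-1} - \frac{n_1+n_3-1}{m-1} - \frac{n_1(n_2-1)}{m-1}
  = \frac{n_3\,(n_4-1)}{m-1} \;\ge\; 0
```

Under these assumptions the saving grows with the number of character attribute values, which are the values the hybrid structure keeps out of the tree.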
2. Data filtering
Although traditional collaborative filtering recommendation algorithms have achieved good results in practice, they still suffer from problems such as sparsity, low computational efficiency, and poor scalability. The invention proposes a data filtering algorithm based on the mean model. Starting from the excessive length of item vectors, the algorithm proposes a method of representing item vectors with a mean model, which effectively shortens the time needed to compute item similarity, improves the efficiency with which the recommender system handles big data, and applies better to large-scale data sets.
Basic principle of the mean model
The essence of the mean model is to extract the main rating features of an item through layered rating means, compressing the length of the item rating vector while preserving recommendation precision, and thereby greatly improving recommendation efficiency. The mean model's compression of the user-item rating matrix is shown in Fig. 2 and Fig. 3, where m >> t.
Definition 3.1: The mean model is a vector transformation model that extracts the rating features of an item through layered means, and takes the form of an ordered complete binary tree. When an item has no rating information, the tree is empty; otherwise, every left child node in the binary tree is smaller than its parent node, every right child node is larger than its parent node, and every subtree also satisfies this rule.
Definition 3.2: In the layer division of the mean model, the root node of the binary tree is layer 0 of the mean model. It is the overall mean of the item rating vector, represents the overall level of users' ratings of the item, and is regarded as the item's main rating feature; by analogy, the means at the other layers of the mean model represent the finer-grained features of the item's ratings.
Mean model transformation:
Suppose the rating vector of item i to be transformed is $I_i = \{r_{1i}, r_{2i}, r_{3i}, \ldots, r_{mi}\}$. After the mean model transformation, the vector $I_i$ is converted into the mean model representation:
$I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \ldots), \ldots\}$.
Here $t_0$ is the only element of layer 0 of the mean model, $(t_{10}, t_{11})$ are the two elements of layer 1, and $(t_{20}, t_{21}, t_{22}, t_{23})$ are the four elements of layer 2. By analogy, the item rating vector can be converted into a mean model with the specified number of layers.
In the mean model transformation formula, $F_k$ is the transformation formula of layer k (k ≥ 0) and $\mathrm{card}(I_i)$ is the number of ratings of item i. The mean model vector transformation flow is shown in Fig. 4.
Equal model transformation algorithm
Input: the original item rating vector $I_i = \{r_{1i}, r_{2i}, r_{3i}, \ldots, r_{mi}\}$ and the number of transformation layers $k$.
Output: the equal model item vector $I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \ldots), \ldots\}$.
Steps:
(1) layer 0: compute $t_0$, the overall mean of the rating vector $I_i$;
(2) first, split the vector $I_i$ by $t_0$ into two vectors, $I_i^{10}$ and $I_i^{11}$;
(3) then compute the two elements of layer 1 of the equal model, $t_{10}$ and $t_{11}$, from $I_i^{10}$ and $I_i^{11}$ respectively;
(4) in the same way, split the vectors $I_i^{10}$ and $I_i^{11}$ by $t_{10}$ and $t_{11}$ into the vectors $I_i^{20}$, $I_i^{21}$ and $I_i^{22}$, $I_i^{23}$, and then compute the four elements of layer 2 of the equal model, $t_{20}, t_{21}, t_{22}, t_{23}$;
(5) and so on, obtaining the equal model vector
$I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \ldots), \ldots\}$.
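The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `equal_model` is hypothetical, and because the text does not say on which side a rating equal to the layer mean falls, the sketch assumes such ratings go to the right-hand vector. An empty sub-vector is represented by `None`.

```python
from statistics import fmean


def equal_model(ratings, k):
    """Return the layers [[t0], [t10, t11], ...] for layers 0..k."""
    layers = []
    groups = [list(ratings)]  # layer 0 works on the whole rating vector
    for _ in range(k + 1):
        # One mean per sub-vector; None marks an empty sub-vector.
        means = [fmean(g) if g else None for g in groups]
        layers.append(means)
        next_groups = []
        for group, mean in zip(groups, means):
            if mean is None:
                next_groups += [[], []]
            else:
                # Assumption: ratings equal to the mean go to the right side.
                next_groups.append([r for r in group if r < mean])
                next_groups.append([r for r in group if r >= mean])
        groups = next_groups
    return layers


layers = equal_model([1, 2, 3, 4, 5], k=2)
```

For the rating vector `[1, 2, 3, 4, 5]` the layer 0 element is the overall mean 3.0, and the two layer 1 elements are the means of `[1, 2]` and `[3, 4, 5]`.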
To fully verify the effect of the equal model of the present invention, comparative experiments on the improvement effect of the equal model were carried out on two classic data sets, MovieLens 100K and MovieLens 1M (see Table 1). In the experiments, each data set was randomly divided into 5 equal parts and tested by five-fold cross-validation.
Table 1: Experimental data sets
The embodiment of the present invention uses three evaluation metrics, MAE, recall and NDCG, to evaluate respectively the prediction accuracy, classification accuracy and ranking accuracy of the improved equal model (Improved MM).
First, MAE is used to compare the recommendation precision of the two-layer equal model (level1, comprising layers 0 and 1) and the three-layer equal model (level2, comprising layers 0 to 2) before and after the improvement. Then, recall and NDCG are used to compare the improved equal model (Improved MM), the cloud model (Cloud_Model) and the classic Cosine algorithm as applied in the IBCF algorithm, thereby providing a supplementary, multi-angle evaluation of Improved MM.
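The text names the three metrics but gives no formulas. The sketch below uses their commonly used definitions, which is an assumption here; in particular, NDCG is shown in its linear-gain form, and the function names are illustrative only.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def recall(recommended, relevant):
    """Fraction of the relevant items that appear in the recommendation list."""
    relevant = set(relevant)
    return len(relevant.intersection(recommended)) / len(relevant)


def ndcg(relevances):
    """NDCG of a ranked list, given the graded relevance at each rank."""
    def dcg(rels):
        # Rank positions start at 1, so the discount is log2(position + 1).
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A lower MAE means more accurate rating prediction, while higher recall and NDCG mean better classification and ranking respectively.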
As shown in Fig. 5, level1_Improved MM and level2_Improved MM show a clear gain in recommendation precision over the equal models with the corresponding number of layers before the improvement. On the 1M data set, however, the improvement is relatively small, and the recommendation effect of level1_Improved MM is even almost identical to that of level1_MM. The experimental results show that the improvement of the equal model of the present invention is obvious on the 100K data set but weakens on the 1M data set.
3rd, data clustering
Clustering is an important problem in data mining and one of the core problems of big data analysis. The k-means clustering algorithm is a simple and effective distance-based algorithm and is therefore very widely used. Unlike hierarchical clustering algorithms, it does not need to compute the distance between every pair of points in each iteration, so it converges faster than hierarchical clustering. However, the k-means clustering algorithm has two defects: the number of clusters must be determined in advance, and the result is strongly affected by the initial clustering.
The present invention proposes a method for determining the number of clusters based on predicted intensity.
The predicted intensity is defined as
$$ps(k) = \min_{1 \le j \le k} \frac{1}{n_{kj}(n_{kj} - 1)} \sum_{i \ne i' \in A_{kj}} D[C(X_{tr}, k), X_{te}]_{ii'}$$
where $X_{tr}$ and $X_{te}$ denote the training set and the test set obtained by randomly dividing the original data; $C(X_{tr}, k)$ denotes the clustering of the training set into $k$ classes; $A_{k1}, A_{k2}, \ldots, A_{kk}$ denote the $k$ classes into which the test set itself is clustered; $i$ and $i'$ are sample points in the same class; $n_{kj}$ is the number of sample points in $A_{kj}$; $D[C(X_{tr}, k), X_{te}]$ denotes an $n \times n$ matrix over the $n$ test samples whose element in row $i$ and column $i'$ takes the value 0 or 1, where 1 indicates that the training-set clustering assigns $i$ and $i'$ to the same class and 0 indicates that it does not; and $ps(k)$ denotes the predicted intensity of the clustering result when the number of clusters is $k$, with values in the interval [0, 1].
The predicted intensity is calculated as follows:
(1) randomly divide the original data to be clustered into a training set and a test set;
(2) with the number of clusters set to k, cluster each of the two subsets; the results are denoted type I clusterings;
(3) classify the test set using the clustering result of the training set; the result is denoted the type II clustering;
(4) within each class into which the test set itself was clustered, check whether any pair of sample points i and i' is wrongly assigned to different classes in the type II clustering, and record the proportion of pairs that are correctly assigned together;
(5) the smallest of these k proportions is the predicted intensity for the current number of clusters k.
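Steps (4) and (5) can be sketched as follows once the two labelings of the test set are available. This is an illustrative sketch: the function name and argument names are hypothetical, and skipping classes that contain a single sample (which have no pairs to check) is an assumption not stated in the text.

```python
from collections import defaultdict
from itertools import combinations


def predicted_intensity(own_labels, predicted_labels):
    """own_labels: type I labels from clustering the test set itself.
    predicted_labels: type II labels assigned by the training-set clustering."""
    classes = defaultdict(list)
    for index, label in enumerate(own_labels):
        classes[label].append(index)

    proportions = []
    for members in classes.values():
        pairs = list(combinations(members, 2))
        if not pairs:
            continue  # assumption: single-sample classes are skipped
        # A pair is correct when the type II clustering keeps it together.
        together = sum(predicted_labels[i] == predicted_labels[j] for i, j in pairs)
        proportions.append(together / len(pairs))
    # Step (5): the smallest proportion over the classes.
    return min(proportions) if proportions else 1.0
```

For example, with type I labels `[0, 0, 0, 1, 1]` and type II labels `[0, 0, 1, 1, 1]`, only one of the three pairs in class 0 stays together, so the result is 1/3.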
Obviously, the intuitive meaning of the predicted intensity is the ability of the current clustering result to correctly predict new sample points. In practice, the predicted intensity can be taken as the objective function, with the number of clusters and the variable subset as the factors that affect it; the predicted intensity is then maximized by selecting an appropriate number of clusters and variable subset.
In the calculation of the predicted intensity, because the training set and the test set are divided at random, some accidental factors may considerably affect the result. To reduce the influence of accidental factors, the present invention calculates the predicted intensity with an improved method: the data set is first randomly divided into several equal parts, each part is used in turn as the test set, and after the respective predicted intensities are obtained, their mean is taken as the predicted intensity for that number of clusters.
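The improved calculation can be sketched end to end as below. This is a minimal sketch under stated assumptions rather than the implementation of the invention: a plain Lloyd-style k-means with random initial centers stands in for the clustering step, an empty cluster simply keeps its previous center, single-sample classes are skipped, and all function names and the defaults `folds=5` and `seed=0` are hypothetical.

```python
import random
from itertools import combinations


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])))


def kmeans(points, k, rng, iterations=20):
    """Plain k-means on tuples of coordinates; returns the k centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Assumption: an empty cluster keeps its previous centroid.
        centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g else old
                     for g, old in zip(groups, centroids)]
    return centroids


def predicted_intensity(train, test, k, rng):
    test_centroids = kmeans(test, k, rng)
    train_centroids = kmeans(train, k, rng)
    own = [nearest(p, test_centroids) for p in test]         # type I labels
    predicted = [nearest(p, train_centroids) for p in test]  # type II labels
    proportions = []
    for label in set(own):
        members = [i for i, l in enumerate(own) if l == label]
        pairs = list(combinations(members, 2))
        if pairs:  # single-sample classes have no pairs and are skipped
            together = sum(predicted[i] == predicted[j] for i, j in pairs)
            proportions.append(together / len(pairs))
    return min(proportions) if proportions else 1.0


def averaged_predicted_intensity(points, k, folds=5, seed=0):
    """Use each part in turn as the test set and average the results."""
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    parts = [shuffled[i::folds] for i in range(folds)]
    scores = []
    for i, test in enumerate(parts):
        train = [p for j, part in enumerate(parts) if j != i for p in part]
        scores.append(predicted_intensity(train, test, k, rng))
    return sum(scores) / len(scores)
```

To choose the number of clusters, the averaged value would be computed for each candidate k and the candidates compared, in line with the maximization described above.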
In the example, the k-means clustering method based on the improved predicted intensity gives clustering results for big data that are credible and practically meaningful. On the basis of the k-means clustering algorithm, the improved predicted intensity is introduced and used to determine the clustering variables and the number of clusters. A cluster analysis of the mean dwell time on the columns of a big data website shows that the clustering results of this improved big data clustering method have a clearer practical meaning, and that the method of the invention is better suited to cluster analysis of big data than conventional clustering methods.
4th, information retrieval
Information retrieval here is in fact what is commonly called information extraction (IE): the information in the data source to be processed is given some structure and organized into a form that is easy for people to query and use. In real life and work, information sources are wide-ranging and the forms they take are varied and intricate; particularly in this big data era, it is often impossible to use information sources correctly to make decisions. It is therefore necessary to carry out effective information retrieval on these complex information sources.
For the web page sources that have been through cluster analysis, tags that are useless to the user are removed first, and erroneous or irregular tags are repaired and tidied; examples are comment tags and script content such as "<script>". Most web pages today are laid out with TABLE or DIV tags, so when processing the data the present invention builds a tag tree according to one of these two kinds of tags: the HTML file is the root node of the tree, and the web page blocks corresponding to the two kinds of tags are the child nodes.
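The tag tree described above can be sketched with Python's standard `html.parser` module. This is an illustrative sketch only: the dictionary node layout (`tag`, `text`, `children`) and the names `BlockTreeBuilder` and `build_block_tree` are assumptions, comments are dropped because the parser's default comment handler does nothing, and the content of `script` and `style` elements is skipped.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}    # tags that open a web page block
SKIP_TAGS = {"script", "style"}  # content treated as useless to the user


class BlockTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = {"tag": "html", "text": [], "children": []}
        self.stack = [self.root]
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and not self.skip_depth:
            node = {"tag": tag, "text": [], "children": []}
            self.stack[-1]["children"].append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS and len(self.stack) > 1 and self.stack[-1]["tag"] == tag:
            self.stack.pop()  # unmatched closing tags are ignored

    def handle_data(self, data):
        # Text belongs to the innermost open DIV/TABLE block (or the root).
        if not self.skip_depth and data.strip():
            self.stack[-1]["text"].append(data.strip())


def build_block_tree(html):
    builder = BlockTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Each DIV or TABLE element becomes a child of the block that encloses it, so a TABLE nested inside a DIV appears one level deeper in the tree.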
Then the semantics contained in each part of the content are analyzed. The first step is to collect data from the DIV or TABLE nodes contained in the tag tree under the root node; naturally, only the content of the nodes at the current layer is extracted at this point. The tags extracted at the same level need to be checked further. That is, if semantic checking finds that the content of an extracted child tag has little or no relation to the content the user requires, it can be regarded as redundant information and directly discarded.
Next comes the separator detection step. Because the tags are processed layer by layer, the data unrelated to the user's expectation has already been deleted, so relatively few data blocks remain to be checked, which improves work efficiency and data processing speed.
After the above steps, the web page content has been divided into relatively separate semantic blocks marked by DIV or TABLE tags. If these semantic blocks need deeper processing, they must be converted into complete DOM tree form, and data is then extracted step by step, recursively, from the DOM trees containing their respective content. When extracting the main content of a data block, all tags contained in the DOM tree can be traversed with a word-frequency co-occurrence method; if during the traversal some block is found to have little relation to the data the user expects, that is, it is redundant information, it can be removed and the data the user expects is retained.
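The recursive pruning described above can be sketched as follows. The text does not specify how the word-frequency co-occurrence score is computed, so this sketch substitutes a simple stand-in: the share of a block's words that also occur among the user's query terms. The node layout (a dictionary with `text` as a list of strings and `children`), the function names and the `threshold` default are all hypothetical.

```python
import re
from collections import Counter


def relevance(text, query_terms):
    """Share of the words in text that co-occur with the query terms."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    terms = {t.lower() for t in query_terms}
    return sum(counts[t] for t in terms) / total


def prune(node, query_terms, threshold=0.1):
    """Recursively drop blocks with little relation to the query terms."""
    children = [prune(c, query_terms, threshold) for c in node["children"]]
    children = [c for c in children if c is not None]
    score = relevance(" ".join(node["text"]), query_terms)
    # A block is redundant when it is irrelevant and has no relevant child.
    if score < threshold and not children:
        return None
    return {"text": node["text"], "children": children}
```

A block that is itself irrelevant is kept when one of its descendants is relevant, so that the path from the root to the retained content survives.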
All of the above is only the primary implementation of this intellectual property and does not restrict other forms of implementing this new product and/or new method. Those skilled in the art may use this information to modify the above content so as to achieve a similar implementation. However, all modifications or adaptations based on the new product of the present invention fall within the reserved rights.
Claims (10)
1. A big data analysis system, characterized in that it comprises: a data retrieval module, a data filtering module, a data clustering module, and an information extraction module.
2. The big data analysis system according to claim 1, characterized in that said data retrieval module is used for data retrieval: it separates the data attributes in the data set from the attribute values and builds a double-layer index structure.
3. The big data analysis system according to claim 1, characterized in that said data retrieval module first builds an upper-layer index on the attributes of the data in the data set;
it then builds an index on the data values corresponding to the upper-layer attributes: a B+ tree index structure for numeric data and an inverted index for character data.
4. The big data analysis system according to claim 1, characterized in that said data filtering module is used for filtering the data after data retrieval; said data filtering uses the following equal model transformation: assume the rating vector of the item i to be transformed is $I_i = \{r_{1i}, r_{2i}, r_{3i}, \ldots, r_{mi}\}$; after the equal model transformation, the vector $I_i$ is converted to the equal model representation:
$I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \ldots), \ldots\}$;
where $t_0$ is the single element of layer 0 of the equal model, $(t_{10}, t_{11})$ are the two elements of layer 1, and $(t_{20}, t_{21}, t_{22}, t_{23})$ are the four elements of layer 2; continuing in this way, the item rating vector is converted into an equal model with the specified number of layers.
5. The big data analysis system according to claim 1, characterized in that said data clustering module is used for cluster analysis of the data after data filtering;
said cluster analysis uses the predicted intensity analysis method; said predicted intensity method is as follows:
(1) randomly divide the original data to be clustered into a training set and a test set;
(2) with the number of clusters set to k, cluster each of the two subsets; the results are denoted type I clusterings;
(3) classify the test set using the clustering result of the training set; the result is denoted the type II clustering;
(4) within each class into which the test set itself was clustered, check whether any pair of sample points i and i' is wrongly assigned to different classes in the type II clustering, and record the proportion of pairs that are correctly assigned together;
(5) the smallest of these k proportions is the predicted intensity for the current number of clusters k.
6. A big data analysis method, characterized in that it comprises: a data retrieval step, a data filtering step, a data clustering step, and an information retrieval step.
7. The big data analysis method according to claim 6, characterized in that said data retrieval step is used for data retrieval: the data attributes in the data set are separated from the attribute values and a double-layer index structure is built.
8. The big data analysis method according to claim 6, characterized in that said data retrieval step first builds an upper-layer index on the attributes of the data in the data set;
it then builds an index on the data values corresponding to the upper-layer attributes: a B+ tree index structure for numeric data and an inverted index for character data.
9. The big data analysis method according to claim 6, characterized in that said data filtering step is used for filtering the data after data retrieval; said data filtering uses the following equal model transformation: assume the rating vector of the item i to be transformed is $I_i = \{r_{1i}, r_{2i}, r_{3i}, \ldots, r_{mi}\}$; after the equal model transformation, the vector $I_i$ is converted to the equal model representation:
$I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \ldots), \ldots\}$;
where $t_0$ is the single element of layer 0 of the equal model, $(t_{10}, t_{11})$ are the two elements of layer 1, and $(t_{20}, t_{21}, t_{22}, t_{23})$ are the four elements of layer 2; continuing in this way, the item rating vector is converted into an equal model with the specified number of layers.
10. The big data analysis method according to claim 6, characterized in that said data clustering step is used for cluster analysis of the data after data filtering;
said cluster analysis uses the predicted intensity analysis method; said predicted intensity method is as follows:
(1) randomly divide the original data to be clustered into a training set and a test set;
(2) with the number of clusters set to k, cluster each of the two subsets; the results are denoted type I clusterings;
(3) classify the test set using the clustering result of the training set; the result is denoted the type II clustering;
(4) within each class into which the test set itself was clustered, check whether any pair of sample points i and i' is wrongly assigned to different classes in the type II clustering, and record the proportion of pairs that are correctly assigned together;
(5) the smallest of these k proportions is the predicted intensity for the current number of clusters k.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610848904.9A CN106484813B (en) | 2016-09-23 | 2016-09-23 | A kind of big data analysis system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106484813A (en) | 2017-03-08 |
CN106484813B (en) | 2017-10-31 |
Family
ID=58267892
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610848904.9A Active CN106484813B (en) | 2016-09-23 | 2016-09-23 | A kind of big data analysis system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106484813B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101256594A (en) * | 2008-03-25 | 2008-09-03 | 北京百问百答网络技术有限公司 | Method and system for measuring graph structure similarity |
US20130326346A1 (en) * | 2012-05-30 | 2013-12-05 | Sap Ag | Brainstorming in a cloud environment |
CN105787097A (en) * | 2016-03-16 | 2016-07-20 | 中山大学 | Distributed index establishment method and system based on text clustering |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107341221A (en) * | 2017-06-28 | 2017-11-10 | 百度在线网络技术(北京)有限公司 | Foundation, associative search method, apparatus, equipment and the storage medium of index structure |
CN107341221B (en) * | 2017-06-28 | 2020-08-11 | 百度在线网络技术(北京)有限公司 | Index structure establishing and associated retrieving method, device, equipment and storage medium |
CN107609105B (en) * | 2017-09-12 | 2020-07-28 | 电子科技大学 | Construction method of big data acceleration structure |
CN107609105A (en) * | 2017-09-12 | 2018-01-19 | 电子科技大学 | The construction method of big data accelerating structure |
CN110019400B (en) * | 2017-12-25 | 2021-01-12 | 深圳云天励飞技术有限公司 | Data storage method, electronic device and storage medium |
CN110019400A (en) * | 2017-12-25 | 2019-07-16 | 深圳云天励飞技术有限公司 | Date storage method, electronic equipment and storage medium |
CN108256083A (en) * | 2018-01-22 | 2018-07-06 | 成都博睿德科技有限公司 | Content recommendation method based on deep learning |
CN108256086A (en) * | 2018-01-22 | 2018-07-06 | 成都博睿德科技有限公司 | Data characteristics statistical analysis technique |
CN108764991A (en) * | 2018-05-22 | 2018-11-06 | 江南大学 | Information of supply chain analysis method based on K-means algorithms |
CN108764991B (en) * | 2018-05-22 | 2021-11-02 | 江南大学 | Supply chain information analysis method based on K-means algorithm |
CN109325027A (en) * | 2018-08-21 | 2019-02-12 | 朱常林 | One kind is based on the analysis of cloud data, Situation Awareness algorithm |
CN109547271B (en) * | 2019-01-06 | 2020-01-03 | 广州泳泳信息科技有限公司 | Network state real-time monitoring alarm system based on big data |
CN109547271A (en) * | 2019-01-06 | 2019-03-29 | 广州泳泳信息科技有限公司 | A kind of network state real time monitoring warning system based on big data |
CN110348021A (en) * | 2019-07-17 | 2019-10-18 | 湖北亿咖通科技有限公司 | Character string identification method, electronic equipment, storage medium based on name physical model |
CN114996360A (en) * | 2022-07-20 | 2022-09-02 | 江西现代职业技术学院 | Data analysis method, system, readable storage medium and computer equipment |
CN114996360B (en) * | 2022-07-20 | 2022-11-18 | 江西现代职业技术学院 | Data analysis method, system, readable storage medium and computer equipment |
Also Published As
Publication number | Publication date |
---|---|
CN106484813B (en) | 2017-10-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||