CN106484813B - Big data analysis system and method - Google Patents
- Publication number
- CN106484813B (CN201610848904.9A, CN201610848904A)
- Authority
- CN
- China
- Prior art keywords
- data
- index
- attribute
- layer
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
- G06F16/2237—Vectors, bitmaps or matrices
- G06F16/2272—Management thereof
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a big data analysis system and method. The big data analysis system includes a data retrieval module, a data filtering module, a data clustering module, and an information extraction module. The data retrieval module performs data retrieval: it separates the data attributes from the attribute values in the data set and builds a two-layer index structure. The module first builds an upper-layer index over the attributes of the data in the data set, then builds an index over the data values of each upper-layer attribute: a B+ tree index structure for numeric data and an inverted index for character data. The k-means clustering method of the invention, based on an improved prediction strength, yields clustering results for big data that, in the examples, are credible and practically meaningful.
Description
Technical field
The present invention relates to the field of computer science and technology, and more particularly to a big data analysis system and method.
Background technology
The internet now connects all networked computers and has fundamentally changed how people work and live; it is currently the first choice for obtaining data of all kinds. Obtaining data from a server through a client over the internet follows a "request" + "response" pattern, which is the basic model of internet application protocols.
A mouse click sends a command and then performs an access, and every user's access records are written in detail to the browsing log, including specifics such as time, requested content, and address. The data on the internet is made up of these linked access records. Just as a hunter catches prey by following its tracks, access logs hold enormous value, which makes them one of the important sources of big data.
Several of the world's largest internet companies, such as Google, Amazon, Facebook, and Twitter, dominate the global internet industry. Their success has one common factor: exceptional data analysis capability. These companies analyze and process large amounts of data every day, using big data as a means to uncover business opportunities, and Google is the most typical example. According to statistics, Google handles more than one hundred billion searches per month and analyzes and processes the search information; the data volume processed reaches 600 PB (1 PB = 1 million GB, said to be equivalent to the total information content of a million years of morning newspapers). All content and data searched through the Google search engine can be used in its analysis. For example, when a keyword is typed into the Google search box, information related to the search is shown; entering "big data" prompts suggestions such as "big data concept", "big data era", and "big data technology". This is the result of applying big data techniques to a large volume of historical search information. Likewise, if the input contains an error or is typed in pinyin, Google corrects the search automatically and offers a correct suggestion; this feature applies the same search principle.
Compared with traditional enterprise operating data, big data differs in two ways.
First, the data volume is huge, and, unlike traditional data such as sales volume or inventory, the website click data that internet companies such as Google and Facebook process calls for very different analysis and management methods. The core of big data processing is not structured data but the website clickstream data mentioned above, data produced on social networks, and sensor data. Such data cannot be stored in a database and is called unstructured data.
Second, in terms of the type of business doing the data processing, those that truly command massive data storage and analysis technology are not traditional brick-and-mortar industries but emerging internet companies (Google), social networks (Facebook), and e-commerce companies (Amazon). The former can commission the latter to provide big data analysis and processing services.
Facebook can produce 30 PB of data, whereas Wal-Mart produces only 2.5 PB; the difference lies not only in volume but also in the variety of the data and the speed at which it is generated. It follows that, during the internet boom, large internet companies recognized the value of data that other enterprises tended to overlook, developed low-cost storage and processing technology in time, extracted the valuable information, and integrated it into their business processes. This gradually built their competitive advantage and let them stand out among internet companies. As the influence of these companies grows, more enterprises are paying attention to big data analysis and using big data to provide new services, improve customer satisfaction, and thereby strengthen their competitive advantage.
In just two or three years, big data has rapidly penetrated different industries and fields, greatly increasing production efficiency; the development of big data is closely tied to gains in productivity.
Data volume is growing exponentially. Joint research by many institutions indicates that the total amount of global data will grow exponentially over the next several years. According to estimates by the U.S. consulting firm McKinsey, enterprises worldwide stored more than 7 EB of new data in 2010, and more than 6 EB of new data was stored on client personal computers.
Big data intensity and content vary by industry. The volume of data stored differs from industry to industry, and as big data grows, the types of data produced and stored differ as well. The fields with the largest storage volumes are securities, investment consulting, and banks and other financial institutions; communications carriers, media, and government agencies also produce data on a very large scale. Industries holding such data assets have great potential value in applying big data.
Existing trends will continue to drive data growth. Across regions and industries, enterprises are collecting data at an accelerating pace, which also drives the growth of traditional transaction databases. The widespread use of multimedia in public-welfare fields such as health care adds significantly to the generation of big data. The widespread use of social networking and of the Internet of Things in production and daily life keeps big data growing. The cross-application of these industries further stimulates the growth of big data and the rapid expansion of data pools.
Big data is a new frontier for driving future productivity. For big data to deliver greater competitiveness, productivity, and innovation, appropriate policies are needed to promote it; this is also a key element in creating consumer surplus. In health care, making full use of big data can reduce operating costs, avoid unnecessary treatment, lower the probability of treatment accidents, and improve the quality of medical services. In public administration, tax authorities can use big data to advance tax collection and improve the efficiency of the departments involved. In retail, efficiency can be improved through big data applications in the supply chain and in the business. In marketing, making full use of big data helps consumers find products that meet their needs at more suitable prices and raises the added value of services.
Data is now also an asset, on a par with physical assets and human capital, and it is a factor of production as well. As emerging industries such as multimedia and the Internet of Things develop, enterprises will collect more information from these media, leading to rapid data growth. Big data has great potential in commercial services and in creating value for consumers.
Summary of the invention
The technical problem to be solved by the invention is to provide a big data analysis system and method. In the big data analysis method of the invention, a hybrid index combines and retains the advantages of both the B+ tree and the inverted index while avoiding their respective shortcomings. It improves the speed of index construction and the utilization of storage space while also supporting range queries on numeric data. The data filtering of the invention extracts the rating features of each item by compressing the item vector, which effectively addresses the sparsity problem in recommender systems and greatly improves the efficiency of computing item similarity. Finally, the improvement to the mean model is verified by experiment; the results show that the improved mean model gives better recommendations for items with few ratings and better meets the needs of real systems.
To solve the above technical problem, the invention provides a big data analysis system comprising a data retrieval module, a data filtering module, a data clustering module, and an information extraction module.
The data retrieval module performs data retrieval: it separates the data attributes from the attribute values in the data set and builds a two-layer index structure.
The data retrieval module first builds an upper-layer index over the attributes of the data in the data set;
it then builds an index over the data values of each upper-layer attribute: a B+ tree index structure for numeric data and an inverted index for character data.
The data filtering module filters the data after data retrieval. The filtering uses the following mean model transformation: suppose the rating vector of the item i to be transformed is $I_i = \{r_{1i}, r_{2i}, r_{3i}, \ldots, r_{mi}\}$; the mean model transformation converts $I_i$ into the mean model representation
$I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \ldots), \ldots\}$
where $t_0$ is the single element of layer 0 of the mean model, $(t_{10}, t_{11})$ are the two elements of layer 1, and $(t_{20}, t_{21}, t_{22}, t_{23})$ are the four elements of layer 2; continuing in this way, the item rating vector is converted into a mean model with the specified number of layers.
The data clustering module performs cluster analysis on the filtered data.
The cluster analysis uses the prediction strength method, as follows:
(1) randomly divide the original data to be clustered into a training set and a test set;
(2) take the number of clusters as k and cluster each of the two subsets; the result is recorded as the type I clustering;
(3) classify the test set using the clustering result of the training set; the result is recorded as the type II clustering;
(4) within each cluster that the test set itself forms, check whether any pair of sample points i and i' is wrongly split into different classes in the type II clustering, and record the proportion of pairs that are correctly kept together;
(5) of these k proportions, the smallest is the prediction strength for the current number of clusters k.
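Steps (1) to (5) can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the function names, the 50/50 random split, the plain Lloyd's k-means, the use of pairwise co-membership as the "correctly divided" ratio, and the skipping of test clusters with fewer than two points are all assumptions made here for the example.

```python
import random
from itertools import combinations


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on tuples; returns the k centroids."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty group keeps its previous centroid.
        updated = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
                   for i, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def prediction_strength(points, k, seed=0):
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]            # step (1)
    train_centroids = kmeans(train, k, seed=seed)             # step (2)
    test_centroids = kmeans(test, k, seed=seed)
    own = [nearest(p, test_centroids) for p in test]          # type I labels
    predicted = [nearest(p, train_centroids) for p in test]   # step (3), type II
    ratios = []
    for cluster in range(k):                                  # step (4)
        members = [i for i, label in enumerate(own) if label == cluster]
        pairs = list(combinations(members, 2))
        if not pairs:
            continue  # assumption: clusters with fewer than 2 points are skipped
        together = sum(predicted[i] == predicted[j] for i, j in pairs)
        ratios.append(together / len(pairs))
    return min(ratios) if ratios else 0.0                     # step (5)
```

Running `prediction_strength` for several candidate values of k and keeping the largest k whose score stays high is one common way to use such a measure; the threshold to apply is not specified in the text above.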
To solve the above technical problem, the invention also provides a big data analysis method comprising a data retrieval step, a data filtering step, a data clustering step, and an information extraction step.
The data retrieval step performs data retrieval: it separates the data attributes from the attribute values in the data set and builds a two-layer index structure.
The data retrieval step first builds an upper-layer index over the attributes of the data in the data set;
it then builds an index over the data values of each upper-layer attribute: a B+ tree index structure for numeric data and an inverted index for character data.
The data filtering step filters the data after data retrieval. The filtering uses the following mean model transformation: suppose the rating vector of the item i to be transformed is $I_i = \{r_{1i}, r_{2i}, r_{3i}, \ldots, r_{mi}\}$; the mean model transformation converts $I_i$ into the mean model representation
$I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \ldots), \ldots\}$
where $t_0$ is the single element of layer 0 of the mean model, $(t_{10}, t_{11})$ are the two elements of layer 1, and $(t_{20}, t_{21}, t_{22}, t_{23})$ are the four elements of layer 2; continuing in this way, the item rating vector is converted into a mean model with the specified number of layers.
The data clustering step performs cluster analysis on the filtered data.
The cluster analysis uses the prediction strength method, as follows:
(1) randomly divide the original data to be clustered into a training set and a test set;
(2) take the number of clusters as k and cluster each of the two subsets; the result is recorded as the type I clustering;
(3) classify the test set using the clustering result of the training set; the result is recorded as the type II clustering;
(4) within each cluster that the test set itself forms, check whether any pair of sample points i and i' is wrongly split into different classes in the type II clustering, and record the proportion of pairs that are correctly kept together;
(5) of these k proportions, the smallest is the prediction strength for the current number of clusters k.
The beneficial technical effects of the invention are:
(1) The hybrid index of the invention combines and retains the advantages of both the B+ tree and the inverted index while avoiding their respective shortcomings. It improves the speed of index construction and the utilization of storage space while also supporting range queries on numeric data.
(2) The mean model data filtering of the invention extracts the rating features of each item by compressing the item vector, which effectively addresses the sparsity problem in recommender systems and greatly improves the efficiency of computing item similarity. The improvement to the mean model is verified by experiment; the results show that the improved mean model gives better recommendations for items with few ratings and better meets the needs of real systems.
(3) The k-means clustering method of the invention, based on an improved prediction strength, yields clustering results for big data that, in the examples, are credible and practically meaningful. An improved prediction strength is introduced on top of the k-means clustering algorithm and used to determine the clustering variables and the number of clusters. Cluster analysis of the mean dwell time on the columns of a big data website shows that the results of this improved clustering method have clearer practical meaning, and that the clustering method of the invention is better suited to cluster analysis of big data than conventional clustering methods.
Brief description of the drawings
Fig. 1 is a structure diagram of the two-layer hybrid big data index according to an embodiment of the invention;
Fig. 2 is a schematic diagram of vector compression of the user-item rating matrix according to an embodiment of the invention;
Fig. 3 is a schematic diagram of vector compression of the user-item rating matrix after dimensionality reduction according to an embodiment of the invention;
Fig. 4 is a flow chart of mean model vector conversion according to an embodiment of the invention;
Fig. 5 is an evaluation chart of the mean model algorithm (100K) according to an embodiment of the invention.
Detailed description of embodiments
Embodiments of the invention are described in detail below with reference to examples, so that how the invention applies technical means to solve technical problems and achieve technical effects can be fully understood and implemented accordingly.
It should be noted that, to keep the specification concise and avoid unnecessary repetition, the embodiments in this application and the features in the embodiments may be combined with one another where they do not conflict.
1. Data retrieval
The invention proposes a hybrid index structure based on the inverted index and the B+ tree. The leaf nodes of a B+ tree are ordered, which gives it a clear advantage for range retrieval over numeric data; it can sustain heavy workloads and has relatively stable I/O overhead. The inverted index does not support range retrieval over numeric data well, but it is relatively simple to implement, is fast to query, and can locate a result in a single step, so it supports index construction over character data well.
On top of traditional indexing, the idea of a hierarchical index is introduced: the data attributes and attribute values in the data set are separated and a two-layer index structure is built. First, an upper-layer index is built over the attributes of the data in the data set. Second, an index is built over the data values of each upper-layer attribute: a B+ tree index structure for numeric data and an inverted index for character data. Because not all data is placed in a tree index, the waste of storage space caused by node splitting is reduced, and so is the extra storage occupied by the temporary nodes produced during splitting; this speeds up index construction and improves storage utilization. A range query on numeric data goes directly to the lower-layer tree index, which reduces query time and cost.
The hybrid index designed by the invention combines and retains the advantages of both the B+ tree and the inverted index while avoiding their respective shortcomings. It improves the speed of index construction and the utilization of storage space while also supporting range queries on numeric data.
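The two-layer idea can be sketched as a small Python class. This is a hedged illustration only: `HybridIndex`, its method names, and the integer document ids are hypothetical, and a sorted list searched with `bisect` stands in for the B+ tree leaf level rather than implementing a real B+ tree.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict


class HybridIndex:
    """Upper layer: attribute name -> (pointer type, lower-layer index)."""

    def __init__(self):
        self.upper = {}

    def add(self, attribute, value, doc_id):
        numeric = isinstance(value, (int, float))
        if attribute not in self.upper:
            # Assumption: a sorted list stands in for the ordered B+ tree leaves.
            lower = [] if numeric else defaultdict(set)
            self.upper[attribute] = ("B+tree" if numeric else "Inverted_index", lower)
        ptype, lower = self.upper[attribute]
        if ptype == "B+tree":
            insort(lower, (value, doc_id))   # keep (value, doc) pairs ordered
        else:
            lower[value].add(doc_id)         # posting set for a character value

    def exact(self, attribute, value):
        ptype, lower = self.upper[attribute]
        if ptype == "Inverted_index":
            return sorted(lower.get(value, ()))
        return self.range(attribute, value, value)

    def range(self, attribute, low, high):
        """Documents whose numeric value lies in [low, high]."""
        ptype, lower = self.upper[attribute]
        if ptype != "B+tree":
            raise TypeError("range queries need a numeric attribute")
        start = bisect_left(lower, (low,))
        end = bisect_right(lower, (high, float("inf")))
        return [doc for _, doc in lower[start:end]]
```

A range query touches only the ordered lower-layer structure of one numeric attribute, while character attributes never pay the cost of tree maintenance, which is the trade-off the paragraph above describes.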
The two-layer hybrid big data index structure of the invention is shown in Fig. 1.
The upper-layer tree index is built over the attributes contained in the data set. In this layer, the specific attributes of the data are all stored in non-leaf nodes, and every leaf node of the B+ tree stores three pieces of information, $A_i$, PType, and Pointer, with the following meanings:
(1) $A_i$ is a specific attribute of the indexed data set, where n is the number of attributes and $i \in [1, n]$;
(2) PType is the pointer type, one of {Inverted_index, B+ tree};
(3) Pointer is the pointer to the lower-layer index; depending on the data type it points to a different index structure, namely the head of an inverted list or the root node of a B+ tree.
The layer-2 index is built over the data values of the layer-1 attributes and consists of B+ tree index structures for numeric data and inverted list indexes for character data. In the B+ tree index structure the specific data values are stored in non-leaf nodes, while the leaf nodes are kept in order and hold three pieces of index-file information, $A_R V_S$, Loc, and Doc, with the following meanings:
(1) $A_R V_S$ is the S-th attribute value of the R-th attribute, $R \in [1, n_2]$, $S \in [1, p]$, where $n_2$ is the number of numeric attributes in the data set and p is the number of data values of the R-th attribute.
(2) Loc is the location of the file containing this attribute value.
(3) Doc is the number of the document containing the query keyword; Doc is unique.
The inverted index has two parts. The first is the "dictionary", an index table made up of the distinct index terms, which records the distinct Chinese keywords and their related information. The second is the "record table", which records the set of documents in which each index term occurs together with related information such as their storage addresses. The layer-2 inverted index structure holds four pieces of information, $A_i V_j$, Doc, Loc, and F, with the following meanings:
(1) $A_i V_j$ is the j-th attribute value of the i-th attribute, $i \in [1, n_1]$, $j \in [1, m]$, where $n_1$ is the number of character attributes and m is the number of attribute values of the i-th attribute.
(2) Doc is the number of the document containing the query keyword; Doc is unique.
(3) Loc is the location of the file containing the query keyword.
(4) F is the frequency with which the query keyword occurs in the data set.
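The dictionary and record table can be pictured with two small record types. The sketch below is an assumption-laden illustration: `Posting`, `DictionaryEntry`, `add_occurrence`, and the example path are hypothetical names chosen here, and F is counted as the number of occurrences registered for the attribute value.

```python
from dataclasses import dataclass, field


@dataclass
class Posting:
    doc: int   # Doc: unique document number
    loc: str   # Loc: where the file is stored


@dataclass
class DictionaryEntry:
    postings: list = field(default_factory=list)  # the "record table" rows
    freq: int = 0                                 # F: occurrences in the data set


def add_occurrence(dictionary, attribute, value, doc, loc):
    """Register one occurrence of an attribute value (the AiVj key)."""
    entry = dictionary.setdefault((attribute, value), DictionaryEntry())
    entry.freq += 1
    # Each document appears at most once in the record table.
    if all(p.doc != doc for p in entry.postings):
        entry.postings.append(Posting(doc, loc))
    return entry
```

Keying the dictionary on the (attribute, value) pair mirrors the $A_i V_j$ notation: the same string under two different attributes gets two separate posting lists.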
Index creation process:
Step 1: Analyze the data to be indexed. If the index built so far does not contain this data, create a new index node in the first layer of the hybrid index.
Step 2: Determine the attribute value type of the newly added data. For numeric data, create a B+ tree index; for a character attribute, create an inverted index structure.
Step 3: Repeat Step 1. If the current attribute already exists in the index built earlier, do not add a new node to the first layer; only add the data of that attribute to the corresponding second-layer index.
Step 4: Repeat the steps above until all data has been indexed.
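The four steps amount to a single pass over the records. The function below is a rough sketch under stated assumptions: `build_index` and the record layout `(doc_id, {attribute: value})` are hypothetical, and a plain dictionary stands in for both the B+ tree and the inverted list so that only the control flow of Steps 1 to 4 is shown.

```python
def build_index(records):
    """records: iterable of (doc_id, {attribute: value}) pairs."""
    upper = {}  # first layer: attribute -> {"ptype": ..., "lower": ...}
    for doc_id, fields in records:                 # Step 4: loop over all data
        for attribute, value in fields.items():
            node = upper.get(attribute)
            if node is None:
                # Steps 1 and 2: unseen attribute, so create a first-layer
                # node and choose the lower-layer type from the value type.
                numeric = isinstance(value, (int, float))
                node = {"ptype": "B+tree" if numeric else "Inverted_index",
                        "lower": {}}
                upper[attribute] = node
            # Step 3: the value itself always goes into the second layer;
            # a known attribute adds nothing to the first layer.
            node["lower"].setdefault(value, []).append(doc_id)
    return upper
```

The first layer grows only with the number of distinct attributes, which is typically far smaller than the number of attribute values.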
Index query method:
First the query condition is analyzed to obtain the keyword, and the keyword is passed to the index dictionary. If the index flag is False, a null value is returned, indicating that the requested data is not in the index file. If it is True, the data type of the result for the query term is determined and the corresponding index is located according to that type; the term number and the numbers of the documents containing the term are read, and the information relevant to the query condition is obtained from them. The content of the B+ tree index or inverted index is then read by term number, the retrieved content is merged and compared for relevance against the search condition, the results are ranked, and the final result is returned to the user. The search algorithm takes the key value term_id in the data table as input and outputs a Boolean value; the procedure is as follows:
(1-1) Call the lookup function treeSearch(root, term_id, layer) with root, term_id, and layer as input parameters, and assign the result to the leaf page record record.
(1-2) If record is empty, return a null value directly; otherwise return the actual lookup result rid.
The lookup function treeSearch takes the current page currentPage, the search key key, and the initial layer number layer as input, and outputs the leaf record leafRecord that may contain the search key. The procedure is as follows:
(2-1) If the current page is a leaf page, search for key with binary search and return the result.
(2-2) If the current page is not a leaf page, perform steps (2-3) to (2-6).
(2-3) Using currentPage and key, select the subtree containing the key value and obtain the page number pageNo of the child node.
(2-4) Read the child node page subTreePage from the buffer by page number.
(2-5) If the child node page found is a leaf page, return to (2-1).
(2-6) If the child node page is a branch page, call the function recursively with subTreePage, key, and layer minus 1 as the new input, and return its output.
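The recursive lookup can be sketched as below. This is an illustration under assumptions rather than the patent's code: the `Page` layout (separator keys, child pages, record ids) is hypothetical, child pages are held directly in memory instead of being fetched from a buffer by page number, and steps (2-5) and (2-6) are folded into one recursive call.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field


@dataclass
class Page:
    keys: list
    children: list = field(default_factory=list)  # child pages of a branch page
    rids: list = field(default_factory=list)      # record ids of a leaf page

    @property
    def is_leaf(self):
        return not self.children


def tree_search(current_page, key, layer):
    """Return the leaf record (key, rid) that holds key, or None."""
    if current_page.is_leaf:
        # (2-1) binary search inside the leaf page
        pos = bisect_left(current_page.keys, key)
        if pos < len(current_page.keys) and current_page.keys[pos] == key:
            return key, current_page.rids[pos]
        return None
    # (2-3) choose the subtree: child i + 1 holds keys >= keys[i]
    child = current_page.children[bisect_right(current_page.keys, key)]
    # (2-4) to (2-6) descend one layer and repeat
    return tree_search(child, key, layer - 1)


def search(root, term_id, layer):
    """(1-1) and (1-2): return the rid for term_id, or None when absent."""
    record = tree_search(root, term_id, layer)
    return None if record is None else record[1]
```

In this sketch `layer` is only carried along and decremented, as in the description; a real implementation might use it to detect a malformed tree.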
Validity verification of the hybrid index
The quality of index construction directly affects how well the data is organized and how efficiently queries run. The two-layer hybrid index structure proposed by the invention is validated by comparing and analyzing the time performance of index construction.
Time performance analysis and comparison
Let $n_1$ and $n_2$ be the number of numeric attributes in the data set and the mean number of their attribute values, and $n_3$ and $n_4$ the number of character attributes and the mean number of their attribute values. The total number of attribute values is then $N = n_1 \times n_2 + n_3 \times n_4$. Assume the first layer is a B+ tree index of order k and the second layer a B+ tree index of order m.
The height of the first-layer B+ tree of the hybrid index structure is $\log_k(n_1 + n_3)$, assuming every node of the B+ tree other than the leaf nodes has k children. The number of nodes that must split in the first-layer B+ tree index, $FB_{div}$, is given by formula (3-1):
The height of the second-layer B+ tree is $\log_m n_2$, assuming every node of the B+ tree index other than the leaf nodes has m children. The number of nodes that must split in this B+ tree, $SB_{div}$, is given by formula (3-2):
The total number of split nodes is then given by formula (3-3):
If the entire data set is indexed with a traditional B+ tree structure, that is, a tree index is built over all attribute values, the total number of split nodes is given by formula (3-4):
Comparing formula (3-3) with formula (3-4) shows that the hybrid index structure of the invention has a clear advantage in index creation time over a single index structure.
2. Data filtering
Although traditional collaborative filtering recommendation algorithms have achieved good results in practice, they still suffer from problems such as sparsity and low computational efficiency. The invention proposes a data filtering algorithm based on the mean model. Starting from the observation that Item vectors are too long, it represents each Item vector with a mean model, which effectively shortens the time needed to compute item similarity, improves the recommender system's efficiency in handling big data, and applies better to large-scale data sets.
Basic principle of the mean model
In essence, the mean model extracts the main rating features of an item through layered rating means; it compresses the length of the item rating vector while preserving recommendation accuracy, and thereby greatly improves recommendation efficiency. The mean model's compression of the user-item rating matrix is shown in Fig. 2 and Fig. 3, where m >> t.
Definition 3.1: The mean model is a vector transformation model that extracts item rating features through layered means; it takes the form of an ordered complete binary tree. When an item has no rating information the tree is empty; otherwise every left child node in the binary tree is less than its parent node, every right child node is greater than its parent node, and every subtree satisfies the same rule.
Definition 3.2: In the layering of the mean model, the root node of the binary tree is layer 0 of the mean model; it is the overall mean of the item rating vector, represents the overall level of users' ratings of the item, and is regarded as the item's main rating feature. Likewise, the means at the other layers of the mean model represent the finer rating features of the item.
Mean model transformation:
Suppose the rating vector of the item i to be transformed is $I_i = \{r_{1i}, r_{2i}, r_{3i}, \ldots, r_{mi}\}$; the mean model transformation converts $I_i$ into the mean model representation
$I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \ldots), \ldots\}$
where $t_0$ is the single element of layer 0 of the mean model, $(t_{10}, t_{11})$ are the two elements of layer 1, and $(t_{20}, t_{21}, t_{22}, t_{23})$ are the four elements of layer 2. Continuing in this way, the item rating vector is converted into a mean model with the specified number of layers.
The mean model transformation formula is:
where $F_k$ is the transformation formula for layer k (k ≥ 0) and $\mathrm{card}(I_i)$ is the number of ratings of item i. The mean model vector conversion flow is shown in Fig. 4.
Equal model conversion algorithm
Input: the original project scoring vector $I_i=\{r_{1i}, r_{2i}, r_{3i}, \dots, r_{mi}\}$ and the number of conversion layers k.
Output: the equal model project vector $I'_i=\{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \dots), \dots\}$.
Steps:
(1) Compute layer 0: $t_0$ is the overall mean of the vector $I_i$.
(2) According to $t_0$, divide the vector $I_i$ into two vectors, $I^{10}_i$ and $I^{11}_i$.
(3) Compute the two elements of layer 1 of the equal model, $t_{10}$ and $t_{11}$, as the means of these two vectors.
(4) Similarly, according to $t_{10}$ and $t_{11}$, divide the vectors $I^{10}_i$ and $I^{11}_i$ into the vectors $I^{20}_i$, $I^{21}_i$ and $I^{22}_i$, $I^{23}_i$ respectively, then compute the four elements of layer 2 of the equal model, $t_{20}, t_{21}, t_{22}, t_{23}$.
(5) By analogy, the equal model vector is obtained:
$I'_i=\{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \dots), \dots\}$.
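The conversion steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented formula: the function name `equal_model` is hypothetical, scores equal to the parent mean are assumed to go to the left sub-vector, and an empty sub-vector is assumed to yield `None`.

```python
from statistics import fmean


def equal_model(scores, layers):
    """Return [[t0], [t10, t11], [t20, ..., t23], ...] for layers 0..layers.

    Assumptions (not stated in the text): scores equal to the parent mean
    are placed in the left sub-vector, and an empty sub-vector gives None.
    """
    level = [list(scores)]          # sub-vectors of the current layer
    model = []
    for _ in range(layers + 1):
        # The element of each node is the mean of its sub-vector.
        means = [fmean(v) if v else None for v in level]
        model.append(means)
        next_level = []
        for vector, mean in zip(level, means):
            if mean is None:
                next_level += [[], []]
            else:
                next_level.append([r for r in vector if r <= mean])  # left child
                next_level.append([r for r in vector if r > mean])   # right child
        level = next_level
    return model


# Five scores compressed into a three-layer model (layers 0 to 2).
print(equal_model([1, 2, 3, 4, 5], 2))
```

With ties sent to the left, a left child can equal its parent when all scores in a sub-vector are identical, so the strict ordering of Definition 3.1 holds only for sub-vectors with at least two distinct scores.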
To fully verify the effect of the model of the present invention, two classic data sets, MovieLens 100K and MovieLens 1M (see Table 1), are used in comparative experiments on the improvement achieved by the equal model. During the experiments each data set is randomly divided into 5 equal parts and tested by five-fold cross-validation.
Table 1: Experimental data sets
The embodiment of the present invention uses three evaluation metrics, the MAE value, recall and NDCG, to evaluate respectively the prediction accuracy, the classification accuracy and the ranking accuracy of the model of the present invention (Improved MM).
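The text names the three metrics without giving their formulas. The sketch below uses their common textbook definitions, which is an assumption about what the experiments computed: MAE as the mean absolute rating error, recall over the top N recommendations, and NDCG with linear gain and a log2 position discount. The function names are hypothetical.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def recall_at_n(recommended, relevant, n):
    """Share of the relevant items that appear in the top-n recommendations."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:n]) & set(relevant))
    return hits / len(relevant)


def ndcg_at_n(ranked_gains, n):
    """NDCG of a ranked list of gains (linear gain, log2 discount)."""
    def dcg(gains):
        return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:n]))

    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


print(mae([3, 4], [4, 4]))
print(recall_at_n(["a", "b", "c"], {"a", "c", "d"}, 2))
print(ndcg_at_n([3, 2, 1], 3))
```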
First, the MAE value is used to compare the recommendation precision of the two-layer equal model (level1, comprising layers 0 and 1) and the three-layer equal model (level2, comprising layers 0 to 2) before and after the improvement. Then recall and NDCG are used to compare the application effect of the improved model (Improved MM), the cloud model (Cloud_Model) and the classic Cosine algorithm in the IBCF algorithm, giving a supplementary evaluation of Improved MM from several angles.
As shown in Fig. 5, level1_Improved MM and level2_Improved MM show an obvious gain in recommendation precision over the unimproved equal models with the same number of layers. On the 1M data set, however, the improvement of the equal model is relatively small, and the recommendation effect of level1_Improved MM is almost identical to that of level1_MM. The experimental results show that the improvement of the equal model of the present invention is obvious on the 100K data set but weakens on the 1M data set.
3rd, data clustering
Clustering is an important problem in data mining and one of the core problems of big data analysis. The k-means clustering algorithm is a simple and effective distance-based algorithm, so it is very widely used. Unlike hierarchical clustering algorithms, it does not need to compute the distance between every pair of points at each step, so it converges faster than hierarchical clustering. The k-means clustering algorithm has two defects, however: the number of clusters must be determined in advance, and the result is strongly influenced by the initial cluster centers.
The present invention proposes a method for determining the number of clusters based on predicted intensity.
The predicted intensity (prediction strength) is defined as
$$ps(k)=\min_{1\le j\le k}\frac{1}{n_{kj}(n_{kj}-1)}\sum_{i\neq i'\in A_{kj}} D\left[C(X_{tr},k),X_{te}\right]_{ii'}$$
where $X_{tr}$ and $X_{te}$ denote the training set and the test set obtained by randomly dividing the original data; $C(X_{tr},k)$ denotes the clustering of the training set into k classes; $A_{k1}, A_{k2}, \dots, A_{kk}$ denote the k classes into which the test set itself is clustered; i and i′ are sample points in the same class, and $n_{kj}$ is the number of sample points in $A_{kj}$; $D[C(X_{tr},k),X_{te}]$ denotes a co-membership matrix over the test-set sample points whose element in row i and column i′ takes the value 0 or 1, where 0 means that the training-set clustering does not place i and i′ in the same class and 1 means that it does; $ps(k)$ denotes the predicted intensity of the clustering result when the number of clusters is k, and its range is [0, 1].
The predicted intensity is calculated as follows:
(1) randomly divide the original data to be clustered into a training set and a test set;
(2) take the number of clusters as k and cluster each of the two subsets; the clustering result is denoted the type I clustering;
(3) classify the test set using the clustering result of the training set; the result is denoted the type II clustering;
(4) within each class into which the test set itself is clustered, check whether any pair of sample points i and i′ is wrongly assigned to different classes in the type II clustering, and record the proportion of pairs that are correctly assigned;
(5) among these k proportions, the smallest is the predicted intensity under the current number of clusters k.
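Steps (2) to (5) can be sketched in plain Python as below, with the random split of step (1) left to the caller. This is a minimal illustration rather than the implementation of the invention. The function names are hypothetical, k-means is a plain Lloyd iteration, the deterministic farthest-first initialisation is an assumption made so that the example is reproducible, and classes with fewer than two test points are assumed to be skipped.

```python
from itertools import combinations


def _dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def _nearest(point, centers):
    return min(range(len(centers)), key=lambda j: _dist2(point, centers[j]))


def kmeans(points, k, iters=50):
    """Lloyd's k-means on tuples; returns k centers.

    Assumption: farthest-first initialisation starting from points[0].
    """
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(_dist2(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[_nearest(p, centers)].append(p)
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
               for j, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    return centers


def predicted_intensity(train, test, k):
    train_centers = kmeans(train, k)                       # step (2)
    test_centers = kmeans(test, k)                         # step (2)
    own = [_nearest(p, test_centers) for p in test]        # type I labels of the test set
    by_train = [_nearest(p, train_centers) for p in test]  # step (3): type II labels
    ratios = []
    for j in range(k):                                     # step (4)
        members = [i for i, label in enumerate(own) if label == j]
        pairs = list(combinations(members, 2))
        if not pairs:
            continue  # assumption: classes with fewer than two points are skipped
        together = sum(by_train[a] == by_train[b] for a, b in pairs)
        ratios.append(together / len(pairs))
    return min(ratios) if ratios else 0.0                  # step (5)


train = [(0.0, 0.0), (0.2, 0.0), (10.0, 10.0), (10.2, 10.0)]
test = [(0.1, 0.1), (0.0, 0.2), (10.1, 10.1), (10.0, 10.2)]
print(predicted_intensity(train, test, 2))
```

For two well-separated groups and k = 2, every pair of test points that shares a test-set class is also kept together by the training-set centers, so the value is 1.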
Intuitively, the predicted intensity measures the ability of the current clustering result to correctly predict new sample points. In practice, the predicted intensity can be taken as the objective function, with the number of clusters and the variable subset as the factors that influence it; the predicted intensity is maximized by selecting an appropriate number of clusters and variable subset.
In the calculation of the predicted intensity, the training set and the test set are divided at random, so accidental factors may considerably influence the result. To reduce this influence, the present invention calculates the predicted intensity with an improved method: the data set is first randomly divided into several equal parts, each part is used in turn as the test set, and after the respective predicted intensities are obtained, their average is taken as the predicted intensity under this number of clusters.
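The averaging procedure can be sketched as follows. The function name `averaged_intensity` and the callable parameter `intensity_fn(train, test, k)` are hypothetical, and using the remaining parts as the training set for each test part is an assumption, since the text only states that each part serves in turn as the test set.

```python
import random


def averaged_intensity(points, k, folds, intensity_fn, seed=0):
    """Average a single-split predicted intensity over `folds` test parts.

    intensity_fn(train, test, k) is any routine returning the predicted
    intensity of one train/test split (hypothetical interface).
    """
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    # Divide the shuffled data into `folds` parts of (nearly) equal size.
    parts = [shuffled[i::folds] for i in range(folds)]
    values = []
    for i, test in enumerate(parts):
        # Assumption: all other parts together form the training set.
        train = [p for j, part in enumerate(parts) if j != i for p in part]
        values.append(intensity_fn(train, test, k))
    return sum(values) / len(values)


# Usage with a stand-in routine that always reports perfect agreement.
print(averaged_intensity(list(range(10)), 2, 5, lambda train, test, k: 1.0))
```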
In the example, the k-means clustering method based on the improved predicted intensity gives clustering results for big data that are credible and practically meaningful. On the basis of the k-means clustering algorithm, the improved predicted intensity is introduced and used to determine the clustering variables and the number of clusters. A cluster analysis of the mean residence time on the columns of a big data website shows that the clustering results of this improved method have a clearer practical meaning, and that the method of the invention is better suited than conventional clustering methods to the cluster analysis of big data.
4th, information extraction
Information extraction (IE) means applying structuring to the information in the data source to be extracted, so that it can be organized into a form that is easy for people to query and use. In real life and work, information sources are wide-ranging and the forms they take are varied and complicated; in the big data era in particular, it is often impossible to use information sources correctly and make decisions from them. It is therefore necessary to carry out effective information extraction on these complex information sources.
For web page information sources that have already been clustered, the tags that are useless to the user are removed first, and erroneous or irregular tags are repaired and tidied, for example comment tags and script content such as "<script>". Nowadays a large number of web pages are laid out with TABLE or DIV tags, so when processing data the present invention builds a tag tree according to one of these two kinds of tag: the html file is the root node of the tree, and the web page blocks corresponding to the two kinds of tag are the child nodes.
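The tag tree described above can be sketched with the `HTMLParser` class from the Python standard library. The class and function names and the dictionary layout of a node are hypothetical; the sketch drops comments and script or style content, keeps DIV and TABLE blocks as nodes, and simply ignores mis-nested closing tags instead of repairing them.

```python
from html.parser import HTMLParser


class BlockTreeBuilder(HTMLParser):
    """Builds a tree whose root is the html file and whose nodes are DIV/TABLE blocks."""

    BLOCK_TAGS = {"div", "table"}
    SKIPPED = {"script", "style"}   # content treated as useless to the user

    def __init__(self):
        super().__init__()
        self.root = {"tag": "html", "text": "", "children": []}
        self._stack = [self.root]
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS and not self._skip_depth:
            node = {"tag": tag, "text": "", "children": []}
            self._stack[-1]["children"].append(node)
            self._stack.append(node)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif (tag in self.BLOCK_TAGS and len(self._stack) > 1
              and self._stack[-1]["tag"] == tag):
            self._stack.pop()

    def handle_data(self, data):
        # Text is attached to the innermost open block; comments never reach here.
        if not self._skip_depth and data.strip():
            node = self._stack[-1]
            node["text"] = (node["text"] + " " + data.strip()).strip()


def build_block_tree(html):
    builder = BlockTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


page = ("<html><body><!-- note --><script>var x=1;</script>"
        "<div>News <table><tr><td>Cell</td></tr></table></div>"
        "<div>Ads</div></body></html>")
tree = build_block_tree(page)
print([child["tag"] for child in tree["children"]])
```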
The semantics contained in each part of the content are then analysed. The step is as follows: first, data is collected from the DIV or TABLE nodes contained in the tag tree under the root node; when information is extracted, only the content of the nodes at this level is extracted. The tags extracted at the same level then need to be checked further. That is, if semantic detection shows that the content of an extracted child tag has little or no relation to the content the user requests, it can be regarded as redundant information, and the redundant content can be discarded directly.
Next comes the detection step of the block divider. Because tags are processed layer by layer, the data unrelated to the user's expectation has already been deleted at this point, so relatively few data blocks remain to be checked, which improves working efficiency and data processing speed.
After the above steps, the web page content has been divided by the DIV or TABLE tags into semantic blocks of differing content. If these semantic blocks need deeper processing, they are converted into complete DOM tree form, and data is extracted step by step, by recursion, from the DOM trees containing their respective content.
When the main content of a data block is extracted, all the tags contained in the DOM tree are traversed using the word-frequency co-occurrence method. If the traversal finds that the content of some block has little relation to the data the user expects, that block is also redundant information and can be removed, so that the data the user expects to obtain is retained.
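The recursive pruning of weakly related blocks can be sketched as below. The text does not specify the word-frequency co-occurrence measure, so the sketch substitutes a simple stand-in: the share of a block's words that match the user's query terms. The function names, the node layout (`tag`, `text`, `children`) and the threshold value are all hypothetical.

```python
import re
from collections import Counter


def relevance(text, query_terms):
    """Share of the words in `text` that match a query term (stand-in measure)."""
    words = Counter(re.findall(r"\w+", text.lower()))
    total = sum(words.values())
    if not total:
        return 0.0
    return sum(words[term.lower()] for term in query_terms) / total


def prune(node, query_terms, threshold=0.1):
    """Recursively drop blocks that are unrelated to the query.

    A block is removed when its own text scores below the threshold and
    none of its children survive; the html root is always kept.
    """
    kept = []
    for child in node.get("children", []):
        pruned = prune(child, query_terms, threshold)
        if pruned is not None:
            kept.append(pruned)
    own = relevance(node.get("text", ""), query_terms)
    if own < threshold and not kept and node.get("tag") != "html":
        return None
    return {"tag": node.get("tag"), "text": node.get("text", ""), "children": kept}


tree = {"tag": "html", "text": "", "children": [
    {"tag": "div", "text": "big data clustering results", "children": []},
    {"tag": "div", "text": "buy cheap watches now", "children": []},
]}
print(prune(tree, ["data", "clustering"]))
```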
All of the above is only the primary implementation of this intellectual property and does not limit the implementation of this new product and/or new method in other forms. Those skilled in the art may use this information and modify the above content to achieve similar implementations. However, all modifications or transformations based on the new product of the present invention belong to the reserved rights.
Claims (6)
1. A big data analysis system, characterized by comprising: a data retrieval module, a data filtering module, a data clustering module, and an information extraction module; the data retrieval module is used for data retrieval, separating the data attributes from the attribute values in the data set and building a double-layer index structure; the data retrieval module first builds an upper-layer index on the attributes of the data in the data set;
secondly, an index is built on the data values corresponding to each upper-layer attribute: a B+ tree index structure is built for numeric data, and an inverted index is built for character data;
the index creation process specifically comprises:
Step1: first analyse the data for which an index is to be built; if the data is not present in the index already built, build a new index node in the first layer of the hybrid index;
Step2: determine the attribute value type of the newly added data; if it is numeric data, create a B+ tree index for it; if it is a character attribute, build an inverted index structure for it;
Step3: repeat Step1; if the current attribute already exists in the index built before, no new node is added to the first layer of the index, and the data of the attribute is only added to the corresponding index of the second layer;
Step4: repeat the above steps until the index has been built for all data.
2. The big data analysis system according to claim 1, characterized in that the data filtering module is used to filter the data after data retrieval; the data filtering adopts the following equal model transformation: assume that the scoring vector of the project i to be transformed is $I_i=\{r_{1i}, r_{2i}, r_{3i}, \dots, r_{mi}\}$; after the equal model transformation, the vector $I_i$ is converted to the equal model representation:
$I'_i=\{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \dots), \dots\}$;
where $t_0$ is the only element of layer 0 of the equal model, $(t_{10}, t_{11})$ are the two elements of layer 1, and $(t_{20}, t_{21}, t_{22}, t_{23})$ are the four elements of layer 2; by analogy, the project scoring vector is converted into an equal model with the specified number of layers.
3. The big data analysis system according to claim 1, characterized in that the data clustering module is used for cluster analysis of the data after data filtering;
the data cluster analysis uses the predicted intensity analysis method; the predicted intensity method is as follows:
(1) randomly divide the original data to be clustered into a training set and a test set;
(2) take the number of clusters as k and cluster each of the two subsets; the clustering result is denoted the type I clustering;
(3) classify the test set using the clustering result of the training set; the result is denoted the type II clustering;
(4) within each class into which the test set itself is clustered, check whether any pair of sample points i and i′ is wrongly assigned to different classes in the type II clustering, and record the proportion of pairs that are correctly assigned;
(5) among these k proportions, the smallest is the predicted intensity under the current number of clusters k.
4. A big data analysis method, characterized by comprising: a data retrieval step, a data filtering step, a data clustering step, and an information extraction step; the data retrieval step is used for data retrieval, separating the data attributes from the attribute values in the data set and building a double-layer index structure; the data retrieval step first builds an upper-layer index on the attributes of the data in the data set;
secondly, an index is built on the data values corresponding to each upper-layer attribute: a B+ tree index structure is built for numeric data, and an inverted index is built for character data;
the index creation process specifically comprises:
Step1: first analyse the data for which an index is to be built; if the data is not present in the index already built, build a new index node in the first layer of the hybrid index;
Step2: determine the attribute value type of the newly added data; if it is numeric data, create a B+ tree index for it; if it is a character attribute, build an inverted index structure for it;
Step3: repeat Step1; if the current attribute already exists in the index built before, no new node is added to the first layer of the index, and the data of the attribute is only added to the corresponding index of the second layer;
Step4: repeat the above steps until the index has been built for all data.
5. The big data analysis method according to claim 4, characterized in that the data filtering step is used to filter the data after data retrieval; the data filtering adopts the following equal model transformation: assume that the scoring vector of the project i to be transformed is $I_i=\{r_{1i}, r_{2i}, r_{3i}, \dots, r_{mi}\}$; after the equal model transformation, the vector $I_i$ is converted to the equal model representation:
$I'_i=\{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \dots), \dots\}$;
where $t_0$ is the only element of layer 0 of the equal model, $(t_{10}, t_{11})$ are the two elements of layer 1, and $(t_{20}, t_{21}, t_{22}, t_{23})$ are the four elements of layer 2; by analogy, the project scoring vector is converted into an equal model with the specified number of layers.
6. The big data analysis method according to claim 4, characterized in that the data clustering step is used for cluster analysis of the data after data filtering;
the data cluster analysis uses the predicted intensity analysis method; the predicted intensity method is as follows:
(1) randomly divide the original data to be clustered into a training set and a test set;
(2) take the number of clusters as k and cluster each of the two subsets; the clustering result is denoted the type I clustering;
(3) classify the test set using the clustering result of the training set; the result is denoted the type II clustering;
(4) within each class into which the test set itself is clustered, check whether any pair of sample points i and i′ is wrongly assigned to different classes in the type II clustering, and record the proportion of pairs that are correctly assigned;
(5) among these k proportions, the smallest is the predicted intensity under the current number of clusters k.
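The double-layer index construction of Step1 to Step4 in claims 1 and 4 can be illustrated with the Python sketch below. It is not the claimed implementation. All class and method names are hypothetical, a sorted list maintained with `bisect` stands in for the B+ tree, the inverted index tokenises on whitespace, and each attribute is assumed to keep one value type throughout.

```python
import bisect
from collections import defaultdict


class NumericIndex:
    """Sorted values with parallel record ids; a stand-in for a B+ tree."""

    def __init__(self):
        self._values, self._ids = [], []

    def add(self, value, record_id):
        pos = bisect.bisect_right(self._values, value)
        self._values.insert(pos, value)
        self._ids.insert(pos, record_id)

    def range(self, low, high):
        lo = bisect.bisect_left(self._values, low)
        hi = bisect.bisect_right(self._values, high)
        return self._ids[lo:hi]


class InvertedIndex:
    """Maps each lower-cased token to the set of records containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, value, record_id):
        for token in value.lower().split():
            self._postings[token].add(record_id)

    def lookup(self, token):
        return sorted(self._postings.get(token.lower(), set()))


class HybridIndex:
    def __init__(self):
        self._attributes = {}   # first layer: attribute -> second-layer index

    def add(self, record_id, attribute, value):
        index = self._attributes.get(attribute)
        if index is None:
            # Step1: unseen attribute, so create a new first-layer node.
            # Step2: choose the second-layer structure by value type.
            numeric = isinstance(value, (int, float))
            index = NumericIndex() if numeric else InvertedIndex()
            self._attributes[attribute] = index
        # Step3: a known attribute only receives the new data in its second layer.
        index.add(value, record_id)

    def index_for(self, attribute):
        return self._attributes[attribute]


records = [(1, "age", 30), (2, "age", 25), (3, "age", 41),
           (1, "city", "New York"), (2, "city", "York")]
index = HybridIndex()
for record_id, attribute, value in records:   # Step4: repeat for all data
    index.add(record_id, attribute, value)
print(index.index_for("age").range(26, 45))
print(index.index_for("city").lookup("york"))
```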
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610848904.9A CN106484813B (en) | 2016-09-23 | 2016-09-23 | A kind of big data analysis system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610848904.9A CN106484813B (en) | 2016-09-23 | 2016-09-23 | A kind of big data analysis system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106484813A CN106484813A (en) | 2017-03-08 |
CN106484813B true CN106484813B (en) | 2017-10-31 |
Family
ID=58267892
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610848904.9A Active CN106484813B (en) | 2016-09-23 | 2016-09-23 | A kind of big data analysis system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106484813B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107341221B (en) * | 2017-06-28 | 2020-08-11 | 百度在线网络技术(北京)有限公司 | Index structure establishing and associated retrieving method, device, equipment and storage medium |
CN107609105B (en) * | 2017-09-12 | 2020-07-28 | 电子科技大学 | Construction method of big data acceleration structure |
CN110019400B (en) * | 2017-12-25 | 2021-01-12 | 深圳云天励飞技术有限公司 | Data storage method, electronic device and storage medium |
CN108256083A (en) * | 2018-01-22 | 2018-07-06 | 成都博睿德科技有限公司 | Content recommendation method based on deep learning |
CN108256086A (en) * | 2018-01-22 | 2018-07-06 | 成都博睿德科技有限公司 | Data characteristics statistical analysis technique |
CN108764991B (en) * | 2018-05-22 | 2021-11-02 | 江南大学 | Supply chain information analysis method based on K-means algorithm |
CN109325027A (en) * | 2018-08-21 | 2019-02-12 | 朱常林 | One kind is based on the analysis of cloud data, Situation Awareness algorithm |
CN109547271B (en) * | 2019-01-06 | 2020-01-03 | 广州泳泳信息科技有限公司 | Network state real-time monitoring alarm system based on big data |
CN110348021B (en) * | 2019-07-17 | 2021-05-18 | 湖北亿咖通科技有限公司 | Character string recognition method based on named entity model, electronic device and storage medium |
CN114996360B (en) * | 2022-07-20 | 2022-11-18 | 江西现代职业技术学院 | Data analysis method, system, readable storage medium and computer equipment |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101256594A (en) * | 2008-03-25 | 2008-09-03 | 北京百问百答网络技术有限公司 | Method and system for measuring graph structure similarity |
CN103455908A (en) * | 2012-05-30 | 2013-12-18 | Sap股份公司 | Brainstorming service in cloud environment |
CN105787097A (en) * | 2016-03-16 | 2016-07-20 | 中山大学 | Distributed index establishment method and system based on text clustering |
-
2016
- 2016-09-23 CN CN201610848904.9A patent/CN106484813B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN106484813A (en) | 2017-03-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106484813B (en) | A kind of big data analysis system and method | |
CN106649455B (en) | Standardized system classification and command set system for big data development | |
WO2021109464A1 (en) | Personalized teaching resource recommendation method for large-scale users | |
CN107066599A (en) | A kind of similar enterprise of the listed company searching classification method and system of knowledge based storehouse reasoning | |
Wei et al. | Research on construction of a cloud platform for tourism information intelligent service based on blockchain technology | |
CN106844407B (en) | Tag network generation method and system based on data set correlation | |
Saarela et al. | Expert-based versus citation-based ranking of scholarly and scientific publication channels | |
CN111737421A (en) | Intellectual property big data information retrieval system and storage medium | |
Jalali et al. | Research trends on big data domain using text mining algorithms | |
CN103425740A (en) | IOT (Internet Of Things) faced material information retrieval method based on semantic clustering | |
CN103218400A (en) | Method for dividing network community user groups based on link and text contents | |
Ristoski | Exploiting semantic web knowledge graphs in data mining | |
Pujari et al. | Link prediction in complex networks by supervised rank aggregation | |
Ishfaq et al. | Identifying the influential bloggers: a modular approach based on sentiment analysis | |
Hu et al. | EGC: A novel event-oriented graph clustering framework for social media text | |
Liu et al. | Detecting industry clusters from the bottom up based on co-location patterns mining: A case study in Dongguan, China | |
CN111210307A (en) | Scientific and technological service chain intelligent recommendation system and method with response user preference as core | |
CN106909626A (en) | Improved Decision Tree Algorithm realizes search engine optimization technology | |
Wang | Collaborative filtering recommendation of music MOOC resources based on spark architecture | |
CN113821718A (en) | Article information pushing method and device | |
CN103123641A (en) | Social contact search method and device | |
CN112668836B (en) | Risk spectrum-oriented associated risk evidence efficient mining and monitoring method and apparatus | |
CN115114519A (en) | Artificial intelligence based recommendation method and device, electronic equipment and storage medium | |
Yun et al. | Tourist attraction recommendation method based on megadata and artificial intelligence algorithm | |
Khamis | The Use of Machine Learning in Libraries: How to Build a Book Recommender System |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||