CN106484813B - A kind of big data analysis system and method - Google Patents

A kind of big data analysis system and method

Info

Publication number
CN106484813B
Authority
CN
China
Prior art keywords
data
index
attribute
layer
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610848904.9A
Other languages
Chinese (zh)
Other versions
CN106484813A (en
Inventor
韦天瀚
刘国庆
李海威
黄震廷
吴华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGDONG GANG XIN SCIENCE AND TECHNOLOGY Co Ltd
Original Assignee
GUANGDONG GANG XIN SCIENCE AND TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGDONG GANG XIN SCIENCE AND TECHNOLOGY Co Ltd
Priority to CN201610848904.9A
Publication of CN106484813A
Application granted
Publication of CN106484813B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2237 Vectors, bitmaps or matrices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2272 Management thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a big data analysis system and method. The big data analysis system includes a data retrieval module, a data filtering module, a data clustering module, and an information extraction module. The data retrieval module is used for data retrieval: it separates the data attributes from the attribute values in the data set and builds a two-layer index structure. The data retrieval module first builds an upper-layer index over the attributes of the data in the data set; it then builds an index over the data values corresponding to each upper-layer attribute, using a B+ tree index structure for numeric data and an inverted index for character data. The k-means clustering method of the invention, based on an improved prediction strength, produces clustering results on the example big data that are credible and practically meaningful.

Description

A kind of big data analysis system and method
Technical field
The present invention relates to the field of computer science and technology, and more particularly to a big data analysis system and method.
Background art
At present, the Internet connects all networked computers and has fundamentally changed how people work and live; it is currently the first choice for obtaining all kinds of data. The way a client obtains data from a server over the Internet can be summarized as a "request" + "response" pattern, which is the basic model of Internet application protocols.
Clicking the mouse sends a command and then performs an access, and each person's access record is logged in detail in the browsing log, including specific data such as the time, the requested content, and the address. The data on the Internet is made up of these access records linked together. Just as a hunter catches prey by following its tracks, access logs contain enormous value. They are therefore one of the important sources of big data.
The world's largest Internet companies, such as Google, Amazon, Facebook, and Twitter, dominate the global Internet industry. Their success has one factor in common: superior data analysis capability. These companies analyze and process large volumes of data every day, using big data as a means to uncover business opportunities, and Google is the most typical representative among them. According to statistics, Google handles upwards of one hundred billion searches per month and analyzes and processes the search information; the volume of data processed reaches 600 PB (1 PB = 1 million GB, said to be equivalent to the total information content of 1,000,000 years of morning newspapers). Everything searched through the Google search engine, together with the associated data, can be used in its analysis. For example, when a keyword is typed into the Google search box, information related to the search is displayed; entering "big data" prompts suggestions such as "big data concept", "big data era", and "big data technology". This is the result of applying big data technology to a large volume of historical search information. In addition, if the input contains an error, or is entered directly in pinyin, Google automatically corrects the search content and offers the correct suggestion; this feature relies on the same search principle.
Compared with traditional enterprise operation data, big data differs in two respects.
First, the data volume is huge. Unlike traditional data such as sales volume and inventory, the data generated by website clicks is analyzed and managed very differently by Internet companies such as Google and Facebook. The core of big data processing is not structured data but the website clickstream data mentioned above, data produced on social networks, and data stored from sensors; such data cannot be stored in a database and is called unstructured data.
Second, in terms of the types of business that process data, those that truly master massive data storage and analysis technology are not traditional brick-and-mortar industries but emerging Internet companies (Google), social networks (Facebook), and e-commerce companies (Amazon). The former can entrust the latter to provide big data analysis and processing services.
Facebook can produce 30 PB of data, while Wal-Mart produces only 2.5 PB; the two differ greatly not only in data volume but also in the diversity of the data and the speed at which it is generated. It follows that, during the Internet boom, large Internet companies were able to develop low-cost storage and processing technology in time for data whose value other companies tended to overlook, extract the valuable information in it, and integrate it into their business processes, gradually building their own competitive advantage and standing out among Internet companies. Now, as the influence of these Internet companies grows, more companies are starting to pay attention to big data analysis, using big data to provide new services, improve customer satisfaction, and in turn strengthen their competitive advantage.
In just two or three years, big data has rapidly penetrated different industries and fields, greatly raising production efficiency; the development trend of big data is closely tied to the growth of productivity.
Data volume grows exponentially. The joint research results of many research institutions show that the total amount of global data will grow exponentially over the next several years. According to estimates by the U.S. consulting firm McKinsey, in 2010 enterprises worldwide stored more than 7 EB of new data, and more than 6 EB of new data was stored on client personal computers.
The intensity and content of big data vary by industry. The amount of data stored differs from industry to industry, and as big data grows, the types of data produced and stored also differ by industry. The fields with the largest data storage are securities, investment consulting, and financial institutions such as banks; the data produced by communications operators, media, and government and public institutions is also very large. Industries that hold such data assets have great potential value in applying big data.
Existing trends will continue to drive data growth. Across regions and industries, companies are accelerating data collection, which also drives the growth of traditional transaction databases; the widespread use of multimedia in areas affecting people's livelihood, such as health care, greatly increases the generation of big data; the widespread use of social networking, and of the Internet of Things in production and daily life, drives the continued growth of big data. These cross-industry applications further stimulate the growth of big data and the rapid expansion of data pools.
Big data is a new frontier technology for driving future productivity. For big data to deliver stronger competitiveness, productivity, and innovation capacity, appropriate policies are needed to promote it; this is also a key element in creating consumer surplus. In the health care industry, making full use of big data can reduce operating costs, avoid unnecessary treatment, lower the probability of treatment accidents, and improve the quality of medical services. In public administration, tax authorities can use big data to advance tax work and improve the efficiency of tax-related departments. In retail, industry efficiency can be improved through big data applications in the supply chain and in operations. In marketing, making full use of big data helps consumers find products that meet their needs at more suitable prices and increases the added value of services.
Today, data is also an asset, on a par with material assets and human capital, and it is also a factor of production. As emerging industries such as multimedia and the Internet of Things develop in social life, companies will collect more information from these media, bringing rapid growth in data. Big data can unlock enormous potential in commercial services and in creating value for consumers.
Summary of the invention
The technical problem to be solved by the present invention is to provide a big data analysis system and method. In the big data analysis method of the invention, the hybrid index combines and inherits the advantages of both the B+ tree and the inverted index while avoiding their respective shortcomings. It improves the speed of index construction and the utilization of space while still supporting range queries on numeric data. The data filtering of the invention extracts the rating features of each item by compressing the item vectors, which effectively addresses the sparsity problem in recommender systems and greatly improves the efficiency of computing item similarity. Finally, the improvement brought by the equal model is verified experimentally; the results show that the improved equal model gives better recommendations for items with few ratings and better meets the needs of real systems.
To solve the above technical problem, the invention provides a big data analysis system, including: a data retrieval module, a data filtering module, a data clustering module, and an information extraction module.
The data retrieval module is used for data retrieval; it separates the data attributes from the attribute values in the data set and builds a two-layer index structure.
The data retrieval module first builds an upper-layer index over the attributes of the data in the data set;
it then builds an index over the data values corresponding to each upper-layer attribute, using a B+ tree index structure for numeric data and an inverted index for character data.
The data filtering module is used to filter the data after data retrieval. The data filtering applies the following equal model transformation: suppose the rating vector of the item i to be transformed is $I_i = \{r_{1i}, r_{2i}, r_{3i}, \ldots, r_{mi}\}$; through the equal model transformation, the vector $I_i$ is converted to the equal model representation:
$I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \ldots), \ldots\}$;
where $t_0$ is the only element of layer 0 of the equal model, $(t_{10}, t_{11})$ are the two elements of layer 1, and $(t_{20}, t_{21}, t_{22}, t_{23})$ are the four elements of layer 2; by analogy, the item rating vector is converted to an equal model with the specified number of layers.
The data clustering module is used for cluster analysis of the data after data filtering;
The cluster analysis uses the prediction strength method, which proceeds as follows:
(1) randomly divide the original data to be clustered into a training set and a test set;
(2) take the number of clusters as k, cluster each of the two subsets, and denote the results as type I clusterings;
(3) classify the test set using the clustering result of the training set, and denote the result as the type II clustering;
(4) within each cluster that the test set itself forms, check for every pair of sample points i and i' whether the type II clustering wrongly assigns them to different clusters, and record the proportion of pairs that are correctly assigned;
(5) among these k proportions, the smallest is the prediction strength for the current number of clusters k.
To solve the above technical problem, the invention also provides a big data analysis method, including: a data retrieval step, a data filtering step, a data clustering step, and an information extraction step.
The data retrieval step is used for data retrieval; it separates the data attributes from the attribute values in the data set and builds a two-layer index structure.
The data retrieval step first builds an upper-layer index over the attributes of the data in the data set;
it then builds an index over the data values corresponding to each upper-layer attribute, using a B+ tree index structure for numeric data and an inverted index for character data.
The data filtering step is used to filter the data after data retrieval. The data filtering applies the following equal model transformation: suppose the rating vector of the item i to be transformed is $I_i = \{r_{1i}, r_{2i}, r_{3i}, \ldots, r_{mi}\}$; through the equal model transformation, the vector $I_i$ is converted to the equal model representation:
$I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \ldots), \ldots\}$;
where $t_0$ is the only element of layer 0 of the equal model, $(t_{10}, t_{11})$ are the two elements of layer 1, and $(t_{20}, t_{21}, t_{22}, t_{23})$ are the four elements of layer 2; by analogy, the item rating vector is converted to an equal model with the specified number of layers.
The data clustering step is used for cluster analysis of the data after data filtering;
The cluster analysis uses the prediction strength method, which proceeds as follows:
(1) randomly divide the original data to be clustered into a training set and a test set;
(2) take the number of clusters as k, cluster each of the two subsets, and denote the results as type I clusterings;
(3) classify the test set using the clustering result of the training set, and denote the result as the type II clustering;
(4) within each cluster that the test set itself forms, check for every pair of sample points i and i' whether the type II clustering wrongly assigns them to different clusters, and record the proportion of pairs that are correctly assigned;
(5) among these k proportions, the smallest is the prediction strength for the current number of clusters k.
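Steps (1) to (5) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the helper names (`kmeans`, `nearest`, `prediction_strength`), the half-and-half split, the fixed seed, and the rule that a test cluster with fewer than two points scores 0 are all assumptions, and the sketch shows the basic prediction strength rather than the "improved" variant, whose details the text does not give.

```python
import random
from itertools import combinations


def dist2(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda c: dist2(point, centers[c]))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, [nearest(p, centers) for p in points]


def prediction_strength(points, k, seed=0):
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]        # step (1)
    train_centers, _ = kmeans(train, k, seed=seed)        # step (2), training set
    _, test_labels = kmeans(test, k, seed=seed)           # step (2), test set
    type2 = [nearest(p, train_centers) for p in test]     # step (3)
    ratios = []
    for c in range(k):                                    # step (4)
        members = [i for i, lab in enumerate(test_labels) if lab == c]
        pairs = list(combinations(members, 2))
        if not pairs:
            ratios.append(0.0)  # assumption: too few points to form a pair
            continue
        together = sum(type2[i] == type2[j] for i, j in pairs)
        ratios.append(together / len(pairs))
    return min(ratios)                                    # step (5)
```

A value close to 1 for a given k suggests that the clusters found on the test set are reproduced by the training-set clustering; comparing the value across several k is one way to choose the number of clusters.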
The beneficial technical effects of the invention are:
(1) The hybrid index of the invention combines and inherits the advantages of both the B+ tree and the inverted index while avoiding their respective shortcomings. It improves the speed of index construction and space utilization while still supporting range queries on numeric data.
(2) The data filtering of the invention extracts the rating features of each item by compressing the item vectors, effectively addressing the sparsity problem in recommender systems and greatly improving the efficiency of computing item similarity. The improvement brought by the equal model is verified experimentally; the results show that the improved equal model gives better recommendations for items with few ratings and better meets the needs of real systems.
(3) The k-means clustering method of the invention, based on an improved prediction strength, produces clustering results on the example big data that are credible and practically meaningful. On the basis of the k-means clustering algorithm, an improved prediction strength is introduced and used to determine the clustering variables and the number of clusters. A cluster analysis of the average dwell time on the columns of a big data website shows that the clustering results of this improved method have a clearer practical meaning, and that the clustering method of the invention is better suited to cluster analysis of big data than conventional clustering methods.
Brief description of the drawings
Fig. 1 is a diagram of the two-layer hybrid big data index structure according to an embodiment of the invention;
Fig. 2 is a schematic diagram of vector compression of the user-item rating matrix according to an embodiment of the invention;
Fig. 3 is a schematic diagram of the dimension-reduced user-item rating matrix after vector compression according to an embodiment of the invention;
Fig. 4 is a flow chart of the equal model vector conversion according to an embodiment of the invention;
Fig. 5 is an evaluation chart of the model algorithm (100K) according to an embodiment of the invention.
Detailed description of the embodiments
Embodiments of the invention are described in detail below with reference to examples, so that the way the invention applies technical means to solve technical problems and achieve technical effects can be fully understood and implemented accordingly.
It should be noted that, to keep the specification concise and avoid unnecessary repetition, the embodiments in this application and the features in them may be combined with one another provided they do not conflict.
1. Data retrieval
The invention proposes a hybrid index structure based on the inverted index and the B+ tree. The leaf nodes of a B+ tree are ordered, which gives it a clear advantage for range retrieval over numeric data; it can sustain a heavy workload and has relatively stable I/O overhead. The inverted index cannot support range retrieval over numeric data well, but because it is relatively simple to implement, is fast to query, and can locate results in a single step, it supports index construction for character data well.
On top of traditional indexing, the idea of hierarchical indexing is introduced: the data attributes and attribute values in the data set are separated and a two-layer index structure is built. First, an upper-layer index is built over the attributes of the data in the data set. Second, an index is built over the data values corresponding to each upper-layer attribute: a B+ tree index structure for numeric data and an inverted index for character data. Because not all data is placed in a tree index, less storage space is wasted through node splitting; in addition, fewer temporary nodes are created during node splitting, so less extra storage is used, index construction is faster, and storage utilization improves. A range query on numeric data is completed by locating the lower-layer tree index directly, which reduces query time and cost.
The hybrid index designed by the invention combines and inherits the advantages of both the B+ tree and the inverted index while avoiding their respective shortcomings. It improves the speed of index construction and space utilization while still supporting range queries on numeric data.
The two-layer hybrid big data index structure of the invention is shown in Fig. 1.
The upper-layer tree index is built mainly over the attributes contained in the data set. In this layer, the specific attributes of the data are all stored in the non-leaf nodes, and every leaf node of the B+ tree stores three pieces of information, $A_i$, PType, and Pointer, with the following meanings:
(1) $A_i$ is a specific attribute of the indexed data set, where n is the total number of attributes and i ∈ [1, n];
(2) PType denotes the pointer type; the specific types are PType {Inverted_index, B+ tree};
(3) Pointer is the pointer to the lower-layer index; depending on the data type, it points to a different index structure, namely the header of an inverted list or the root node of a B+ tree.
The second-layer index is built over the data values corresponding to the first-layer attributes, and includes the B+ tree index structure built for numeric data and the inverted-list index built for character data. Specific data values are stored in the non-leaf nodes of the B+ tree index structure; the leaf nodes are arranged in order and contain three pieces of index-file information, $A_R V_S$, Loc, and Doc, with the following meanings:
(1) $A_R V_S$ is the S-th attribute value of the R-th attribute, R ∈ [1, $n_2$], S ∈ [1, p], where $n_2$ is the number of numeric attributes contained in the data set and p is the number of data values of the R-th attribute.
(2) Loc is the location of the file that contains this attribute value.
(3) Doc is the number of the document that contains the query keyword; Doc is unique.
The inverted index is divided into two parts. One is the "dictionary", an index table made up of the distinct index terms, which records the different Chinese keywords and their related information. The other is the "record table", which records the set of documents in which each index term occurs, together with related information such as their storage addresses. The second-layer inverted index structure contains four pieces of information, $A_i V_j$, Doc, Loc, and F, with the following meanings:
(1) $A_i V_j$ is the j-th attribute value of the i-th attribute, i ∈ [1, $n_1$], j ∈ [1, m], where $n_1$ is the number of character attributes and m is the number of attribute values of the i-th attribute.
(2) Doc is the number of the document that contains the query keyword; Doc is unique.
(3) Loc is the location of the file that contains the query keyword.
(4) F is the frequency with which the query keyword occurs in the data set.
Index creation process:
Step 1: First analyze the data to be indexed. If the index built so far does not yet contain its attribute, build a new index node in the first layer of the hybrid index.
Step 2: Determine the attribute value type of the newly added data. If it is numeric data, create a B+ tree index for it; if it is a character attribute, build an inverted index structure for it.
Step 3: Repeat Step 1. If the current attribute already exists in the index built so far, do not add a new node to the first layer; only add the data of that attribute to the corresponding second-layer index.
Step 4: Repeat the above steps until the index has been built for all data.
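The creation steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the structure of Fig. 1: the class name `HybridIndex` and its methods are hypothetical, a plain dictionary stands in for the upper-layer tree, and a pair of sorted lists stands in for each lower-layer B+ tree (it preserves key order and range lookups but not the paged node layout).

```python
import bisect
from collections import defaultdict


class HybridIndex:
    """Two-layer index: the upper layer maps attribute -> (PType, lower-layer index)."""

    def __init__(self):
        self.upper = {}

    def add(self, attribute, value, doc_id):
        if attribute not in self.upper:                  # Step 1: new first-layer node
            if isinstance(value, (int, float)):          # Step 2: choose by value type
                # Stand-in for a B+ tree: (sorted keys, matching document ids).
                self.upper[attribute] = ("B+ tree", ([], []))
            else:
                self.upper[attribute] = ("Inverted_index", defaultdict(list))
        ptype, lower = self.upper[attribute]             # Step 3: reuse the existing node
        if ptype == "B+ tree":
            keys, docs = lower
            pos = bisect.bisect_right(keys, value)
            keys.insert(pos, value)
            docs.insert(pos, doc_id)
        else:
            lower[value].append(doc_id)

    def build(self, records):
        for doc_id, record in records:                   # Step 4: repeat for all data
            for attribute, value in record.items():
                self.add(attribute, value, doc_id)

    def range_query(self, attribute, low, high):
        """Document ids whose numeric value lies in [low, high]."""
        ptype, lower = self.upper[attribute]
        if ptype != "B+ tree":
            raise TypeError("range queries need a numeric attribute")
        keys, docs = lower
        return docs[bisect.bisect_left(keys, low):bisect.bisect_right(keys, high)]

    def lookup(self, attribute, value):
        """Document ids whose attribute equals the given value."""
        if attribute not in self.upper:
            return []
        ptype, lower = self.upper[attribute]
        if ptype == "B+ tree":
            return self.range_query(attribute, value, value)
        return list(lower.get(value, []))
```

For example, after `build` is called on records carrying a numeric `age` and a character `city` attribute, `range_query("age", 26, 45)` goes straight to the ordered lower-layer structure, while `lookup("city", "Guangzhou")` is a single dictionary access into the inverted index.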
Index query method:
First analyze the query condition to obtain the keyword, and pass the query keyword to the index dictionary. If the index flag is False, return a null value, indicating that the data being queried does not exist in the index file. If it is True, determine the data type of the result returned for the query term, locate the corresponding index according to that type, and read the term's number and the numbers of the documents containing it to obtain the information related to the query condition. Then read the content in the B+ tree index or the inverted index according to the term number, merge the retrieved content, compare its relevance against the search condition, rank the results, and return the final result to the user. Taking the key value term_id in the data table as the input to the search algorithm and a Boolean value as the output, the specific process is as follows:
(1-1) Using root, term_id, and layer as input parameters, call the lookup function treeSearch(root, term_id, layer) and assign the result to the leaf page record, record.
(1-2) If record is empty, return a null value directly; otherwise, return the actual lookup result rid.
The lookup function treeSearch takes the current page currentPage, the search key key, and the initial layer number layer as input, and outputs the leaf record leafRecord that may contain the search key. The specific process is as follows:
(2-1) If the current page is a leaf page, search for key using the binary search algorithm and return the result.
(2-2) If the current page is not a leaf page, perform steps (2-3) to (2-6).
(2-3) Based on currentPage and key, select the subtree that contains the key value and obtain the page number pageNo of the child node.
(2-4) Read the child node page subTreePage from the buffer according to the page number.
(2-5) If the child node page found is a leaf page, return to (2-1).
(2-6) If the child node page is a branch page, take subTreePage, key, and layer minus 1 as the new input, call the function recursively, and return the output.
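The lookup procedure in steps (1-1) to (2-6) can be sketched as a short recursive function. The `Page` class, its fields, and the convention that child i+1 holds keys greater than or equal to separator i are assumptions made for illustration; the text does not specify the page layout or the buffer manager, so child pages are referenced directly here instead of being read by page number.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Page:
    keys: list                                      # sorted keys (leaf) or separators (branch)
    children: list = field(default_factory=list)    # child pages; empty for a leaf page
    rids: list = field(default_factory=list)        # record ids, parallel to keys on a leaf

    @property
    def is_leaf(self):
        return not self.children


def tree_search(current_page, key, layer):
    """Return the record id stored under key, or None if the key is absent."""
    if current_page.is_leaf:                                   # (2-1) binary search in the leaf
        pos = bisect.bisect_left(current_page.keys, key)
        if pos < len(current_page.keys) and current_page.keys[pos] == key:
            return current_page.rids[pos]
        return None
    page_no = bisect.bisect_right(current_page.keys, key)      # (2-3) pick the subtree
    sub_tree_page = current_page.children[page_no]             # (2-4) stand-in for a buffer read
    # (2-5)/(2-6): recurse; a leaf child is handled by the branch at the top.
    # layer is carried along only to mirror the text; the sketch does not need it.
    return tree_search(sub_tree_page, key, layer - 1)


def search(root, term_id, layer):
    record = tree_search(root, term_id, layer)                 # (1-1)
    return record                                              # (1-2) None when nothing is found
```

Because each level is resolved with one binary search, the cost of a lookup is proportional to the height of the tree times the logarithm of the page size.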
Validation of the hybrid index
The quality of index construction directly affects how well the data is organized and how efficiently queries return results. To validate the two-layer hybrid index structure proposed by the invention, it is compared and analyzed in terms of the time performance of index construction.
Time performance analysis and comparison
Let $n_1$ and $n_2$ be, respectively, the number of numeric attributes in the data set and the average number of their attribute values, and let $n_3$ and $n_4$ be, respectively, the number of character attributes and the average number of their attribute values. The total number of attribute values is then $N = n_1 \times n_2 + n_3 \times n_4$. Assume the first layer is a B+ tree index of order k and the second layer is a B+ tree index of order m.
The height of the first-layer B+ tree of the hybrid index structure is $\log_k(n_1 + n_3)$, assuming every node of the B+ tree other than the leaf nodes has k children. The number of nodes that must split in the first-layer B+ tree index, $FB_{div}$, is then calculated by formula (3-1):
The height of the second-layer B+ tree is $\log_m n_2$, assuming every node of the B+ tree index other than the leaf nodes has m children. The number of nodes that must split in this B+ tree, $SB_{div}$, is calculated by formula (3-2):
Then
the total number of split nodes is:
If the whole index of the data set uses the traditional B+ tree structure, that is, a tree index is built over all attribute values, the total number of split nodes is:
Comparing formula (3-3) with formula (3-4) shows that the hybrid index structure of the invention has a clear advantage in index creation time over a single index structure.
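The kind of comparison described here can be illustrated numerically. The sketch below does not reproduce formulas (3-1) to (3-4), which are not available in this text; it rests on its own simplifying assumptions: every non-leaf node is full, each non-leaf node is counted as one split, and only numeric attributes receive a second-layer tree. The function names and the sample values are hypothetical.

```python
def internal_nodes(leaves, order):
    """Number of non-leaf nodes in a tree whose non-leaf nodes each have `order` children."""
    total = 0
    level = leaves
    while level > 1:
        level = -(-level // order)  # ceiling division: nodes one level up
        total += level
    return total


def hybrid_splits(n1, n2, n3, k, m):
    """First-layer tree over n1 + n3 attributes plus one order-m tree per numeric attribute."""
    return internal_nodes(n1 + n3, k) + n1 * internal_nodes(n2, m)


def single_tree_splits(n1, n2, n3, n4, m):
    """One order-m tree over all N = n1*n2 + n3*n4 attribute values."""
    return internal_nodes(n1 * n2 + n3 * n4, m)


# Hypothetical sizes: 4 numeric and 4 character attributes, 1024 values each.
hybrid = hybrid_splits(n1=4, n2=1024, n3=4, k=2, m=4)
single = single_tree_splits(n1=4, n2=1024, n3=4, n4=1024, m=4)
```

With these sample values the hybrid structure has 1371 non-leaf nodes against 2731 for a single tree, largely because the character values are kept out of the trees altogether.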
2. Data filtering
Although traditional collaborative filtering recommendation algorithms have achieved good results in practice, they still suffer from problems such as sparsity and low computational efficiency. The invention proposes a data filtering algorithm based on the equal model. Starting from the problem that item vectors are too long, the algorithm proposes a way of representing item vectors with the equal model, which effectively shortens the time needed to compute item similarity, improves the efficiency with which the recommender system processes big data, and applies better to large-scale data sets.
Basic principle of the equal model
The essence of the equal model is to extract the main rating features of an item through layered rating means, compressing the length of the item rating vector while preserving recommendation accuracy, and thereby greatly improving recommendation efficiency. The compression of the user-item rating matrix by the equal model is shown in Fig. 2 and Fig. 3, where m >> t.
Definition 3.1: The equal model is a vector transformation model that extracts item rating features through layered means; it takes the form of an ordered complete binary tree. When an item has no rating information, the tree is empty; otherwise, in the binary tree every left child node is less than its parent node and every right child node is greater than its parent node, and every subtree satisfies the same rule.
Definition 3.2: In the layering of the equal model, the root node of the binary tree is layer 0 of the equal model; it is the overall mean of the item rating vector, represents the overall level of users' ratings of the item, and is regarded as the item's main rating feature. By analogy, the means at the other layers of the equal model represent the finer-grained rating features of the item.
Mean model transformation:
Suppose the rating vector of the item i to be transformed is $I_i = \{r_{1i}, r_{2i}, r_{3i}, \dots, r_{mi}\}$. After the mean model transformation, the vector $I_i$ is converted to the mean model representation:
$I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \dots), \dots\}$
where $t_0$ is the single element of layer 0 of the mean model, $(t_{10}, t_{11})$ are the two elements of layer 1, and $(t_{20}, t_{21}, t_{22}, t_{23})$ are the four elements of layer 2. By analogy, the item rating vector can be converted into a mean model with the specified number of layers.
The mean model transformation formula is:
where $F_k$ is the transformation formula for layer k (k ≥ 0) and $\mathrm{card}(I_i)$ is the number of ratings of item i. The mean model vector transformation process is shown in Fig. 4.
Mean model transformation algorithm
Input: the original item rating vector $I_i = \{r_{1i}, r_{2i}, r_{3i}, \dots, r_{mi}\}$ and the number of layers k.
Output: the mean model item vector $I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \dots), \dots\}$.
Steps:
(1) Layer 0:
(2) First, split the vector $I_i$ into two vectors according to $t_0$:
(3) Then compute the two elements of layer 1 of the mean model:
(4) Similarly, according to $t_{10}$ and $t_{11}$, split the vectors $I_i^{10}$ and $I_i^{11}$ into the vectors $I_i^{20}$, $I_i^{21}$ and $I_i^{22}$, $I_i^{23}$ respectively, then compute the four elements $t_{20}, t_{21}, t_{22}, t_{23}$ of layer 2 of the mean model.
(5) Continuing in this way yields the mean model vector
$I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \dots), \dots\}$.
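The layer-by-layer procedure in steps (1) to (5) can be sketched in a few lines. The sketch rests on assumptions that the text does not settle: ratings equal to a node's mean are sent to the right-hand sub-vector, an empty sub-vector yields `None`, and the function name `mean_model` is hypothetical.

```python
def mean(values):
    """Arithmetic mean, or None for an empty sub-vector."""
    return sum(values) / len(values) if values else None


def mean_model(ratings, layers):
    """Return [[t0], [t10, t11], [t20, t21, t22, t23], ...] for layers 0..layers."""
    if not ratings:
        return []  # an item without ratings gives an empty tree
    result = []
    groups = [list(ratings)]  # sub-vectors belonging to the current layer
    for _ in range(layers + 1):
        means = [mean(g) for g in groups]
        result.append(means)
        next_groups = []
        for g, t in zip(groups, means):
            # Split each sub-vector around its own mean.
            # Assumption: ties (r == t) go to the right-hand sub-vector.
            next_groups.append([r for r in g if r < t])
            next_groups.append([r for r in g if r >= t])
        groups = next_groups
    return result
```

For example, `mean_model([1, 2, 3, 4, 5, 5], 2)` has layer 1 equal to the means of `[1, 2, 3]` and `[4, 5, 5]`, so a rating vector of any length is reduced to at most 1 + 2 + 4 = 7 values for a three-layer model.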
To fully verify the effect of the mean model of the present invention, comparison experiments on its improvement were carried out on two classic data sets, MovieLens 100K and MovieLens 1M (see Table 1). During the experiments each data set was randomly divided into 5 equal parts, and five-fold cross-validation was used.
Table 1: Experimental data sets
The embodiment of the present invention uses three evaluation metrics, MAE, recall, and NDCG, to evaluate respectively the prediction accuracy, the classification accuracy, and the ranking accuracy of the model of the present invention (Improved MM).
First, MAE is used to compare the recommendation accuracy of the two-layer mean model (level1, comprising layers 0 and 1) and the three-layer mean model (level2, comprising layers 0 to 2) before and after the improvement. Then, recall and NDCG are used to compare the improved model (Improved MM), the cloud model (Cloud_Model), and the classic Cosine algorithm as applied within the IBCF algorithm, giving a supplementary evaluation of Improved MM from several angles.
As shown in Fig. 5, level1_Improved MM and level2_Improved MM show a clear gain in recommendation accuracy over the unimproved mean models with the corresponding number of layers. On the 1M data set, however, the improvement is relatively small, and level1_Improved MM even performs almost identically to level1_MM. The experimental results show that the improvement of the mean model of the present invention is pronounced on the 100K data set but weaker on the 1M data set.
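For readers unfamiliar with the three metrics named above, the following sketch shows commonly used textbook forms of MAE, recall at N, and NDCG at N. The text does not state which exact variants were used in the experiments, so the function names, the cut-off parameter `n`, and the linear-gain form of DCG are all assumptions.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and true ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def recall_at_n(recommended, relevant, n):
    """Share of the relevant items that appear in the top-n recommendations."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:n]) & set(relevant))
    return hits / len(set(relevant))


def ndcg_at_n(recommended, relevance, n):
    """NDCG with linear gain; `relevance` maps item -> graded relevance."""
    dcg = sum(relevance.get(item, 0) / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:n]))
    ideal = sorted(relevance.values(), reverse=True)[:n]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

Lower MAE is better, while higher recall and NDCG are better, which is why the three are read as prediction, classification, and ranking accuracy respectively.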
3. Data clustering
Clustering is an important problem in data mining and one of the core problems of big data analysis. The k-means clustering algorithm is a simple and effective distance-based algorithm and is therefore very widely used. Unlike hierarchical clustering algorithms, this algorithm does not need to compute the distance between every pair of points at each step, so it converges faster than hierarchical clustering. However, the k-means clustering algorithm has two defects: first, the number of clusters must be determined in advance; second, it is strongly affected by the initial clustering.
The present invention proposes a method for determining the number of clusters based on prediction strength.
Prediction strength is defined as
$$ps(k) = \min_{1 \le j \le k} \frac{1}{n_{kj}(n_{kj}-1)} \sum_{i \ne i' \in A_{kj}} D[C(X_{tr}, k), X_{te}]_{ii'}$$
where $X_{tr}$ and $X_{te}$ denote the training set and the test set obtained by randomly splitting the original data; $C(X_{tr}, k)$ denotes clustering the training set into k classes; $A_{k1}, A_{k2}, \dots, A_{kk}$ denote the k classes into which the test set itself is clustered; i and i′ are sample points in the same class; $n_{kj}$ is the number of sample points in $A_{kj}$; $D[C(X_{tr}, k), X_{te}]$ denotes a 0-1 matrix indexed by the test-set sample points, whose element in row i and column i′ is 1 if the training-set clustering assigns i and i′ to the same class and 0 if it does not; and $ps(k)$ denotes the prediction strength of the clustering result when the number of clusters is k, with values in the interval [0, 1].
Prediction strength is computed as follows:
(1) Randomly divide the original data to be clustered into a training set and a test set;
(2) Take the number of clusters as k and cluster each of the two subsets; the clustering results are recorded as type I clusterings;
(3) Classify the test set using the clustering result of the training set; the result is recorded as the type II clustering;
(4) Within each class into which the test set itself was clustered, check whether any pair of sample points i and i′ is wrongly assigned to different classes in the type II clustering, and record the proportion of pairs that are correctly assigned;
(5) Among these k proportions, the smallest is the prediction strength for the current number of clusters k.
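Steps (2) to (5) can be sketched as below for a single, already chosen split into training and test sets. This is a minimal illustration under assumptions: points are tuples of numbers, the small k-means uses a deterministic farthest-point initialisation so that results are reproducible, test-set classes with fewer than two points are skipped because they contain no pairs, and all function names are hypothetical.

```python
from itertools import combinations


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda j: dist2(point, centers[j]))


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with farthest-point initialisation (assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        new_centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
                       for j, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def prediction_strength(train, test, k):
    train_centers = kmeans(train, k)                      # step (2), training set
    test_centers = kmeans(test, k)                        # step (2), test set
    own = [nearest(p, test_centers) for p in test]        # type I labels
    by_train = [nearest(p, train_centers) for p in test]  # step (3), type II labels
    ratios = []
    for j in range(k):                                    # step (4)
        members = [i for i, label in enumerate(own) if label == j]
        pairs = list(combinations(members, 2))
        if not pairs:
            continue  # assumption: classes with fewer than 2 points are skipped
        same = sum(by_train[a] == by_train[b] for a, b in pairs)
        ratios.append(same / len(pairs))
    return min(ratios) if ratios else 0.0                 # step (5)
```

On two well-separated groups of points, `prediction_strength(train, test, 2)` reaches its maximum of 1.0, because every pair that the test set groups together is also grouped together by the training-set centers.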
Clearly, the intuitive meaning of prediction strength is the ability of the current clustering result to correctly predict new sample points. In practice, prediction strength can be taken as the objective function, with the number of clusters and the variable subset as the factors that influence it; prediction strength is then maximized by choosing an appropriate number of clusters and variable subset.
In computing prediction strength, the division into training set and test set is random, so chance factors may strongly affect the computed value. To reduce the influence of chance, the present invention computes prediction strength with an improved method: the data set is first randomly divided into several equal parts; each part is used in turn as the test set to obtain its own prediction strength; and the average of these values is taken as the prediction strength for that number of clusters.
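The fold-averaging improvement described above might look like the following sketch. The single-split computation is passed in as a callable `strength_fn(train, test, k)`, which is a hypothetical interface chosen here so that the example stays self-contained; the seed and the round-robin way of forming the folds are likewise assumptions.

```python
import random


def averaged_strength(data, k, n_folds, strength_fn, seed=0):
    """Average strength_fn(train, test, k) with each fold used once as the test set."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    # Round-robin slicing gives n_folds parts of (nearly) equal size.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(strength_fn(train, test, k))
    return sum(scores) / len(scores)


def best_cluster_number(data, candidates, n_folds, strength_fn):
    """Pick the candidate k with the largest averaged prediction strength."""
    return max(candidates, key=lambda k: averaged_strength(data, k, n_folds, strength_fn))
```

Averaging over the folds means that a single unlucky split can no longer dominate the value used to choose the number of clusters.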
In the example, the clustering result that the k-means clustering method based on improved prediction strength produces for big data is credible and practically meaningful. On the basis of the k-means clustering algorithm, the improved prediction strength is introduced and used to determine the clustering variables and the number of clusters. A cluster analysis of the mean dwell time on the columns of a big data website shows that the clustering results of this improved big data clustering method have a clearer practical meaning, and that the method of the invention is better suited to cluster analysis of big data than conventional clustering methods.
4. Information extraction
Information extraction here is what is commonly called Information Extraction (IE): the information in the data source to be extracted is given some structure and organized into a form that is easy for people to query and use. In real life and work, information sources are wide-ranging and their forms are varied and complex; especially in the big data era, it is often impossible to use information sources correctly and make decisions from them. It is therefore necessary to extract information effectively from these complex sources.
For the web page information sources that have been clustered, the first step is to remove tags that are useless to the user and to repair and tidy erroneous or irregular tags, for example comment tags and script content such as "<script>". Most web pages today are laid out with TABLE or DIV tags, so when processing data the present invention builds a tag tree from one of these two tags: the html document is the root node of the tree, and the web page blocks corresponding to the two tags are the child nodes.
Next, the semantics of these content blocks are analyzed. The first step is to collect data from the DIV or TABLE nodes contained in the tag tree under the root node; when extracting information, only the content of nodes at this level is extracted.
The tags extracted at the same level need further checking. That is, if semantic checking of an extracted child tag, or inspection of the content it contains, shows that the content has little or no relation to the user's request, it can be regarded as redundant information and discarded.
Next is the separator detection step. Because tags are processed layer by layer, the data blocks unrelated to the user's expectations have already been deleted, so relatively few blocks remain to be checked, which improves efficiency and data processing speed.
After the above steps, the web page content has been divided by the DIV or TABLE tags into relatively independent semantic blocks. If these semantic blocks need deeper processing, they are converted into complete DOM tree form, and data is extracted recursively, level by level, from the DOM trees containing their respective content.
When extracting the main content of a data block, a word-frequency co-occurrence method can be used to traverse all tags contained in the DOM tree. If during the traversal the content of some block is found to have little relation to the data the user expects, it is likewise redundant information and can be removed, retaining the data the user expects to obtain.
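The block-level part of this pipeline (strip scripts and comments, cut the page into top-level DIV or TABLE blocks, discard blocks unrelated to the user's request) can be sketched with Python's standard `html.parser` module. The relevance test here is a plain keyword count, used as a simple stand-in for the word-frequency co-occurrence method mentioned in the text; the names `BlockCollector`, `relevant_blocks`, and `min_hits` are hypothetical.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collect the text of each top-level DIV/TABLE block, skipping scripts."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # text of every finished top-level block
        self.depth = 0     # nesting depth inside DIV/TABLE
        self.skip = 0      # nesting depth inside SCRIPT/STYLE
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip += 1
        elif tag in BLOCK_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip:
            self.skip -= 1
        elif tag in BLOCK_TAGS and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append(" ".join(self.buffer))
                self.buffer = []

    def handle_data(self, data):
        # Comments never reach handle_data, so they are dropped implicitly.
        if self.depth and not self.skip and data.strip():
            self.buffer.append(data.strip())


def relevant_blocks(html, keywords, min_hits=1):
    """Keep blocks containing at least `min_hits` of the user's keywords."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    wanted = {k.lower() for k in keywords}
    kept = []
    for text in parser.blocks:
        hits = sum(1 for word in text.lower().split() if word in wanted)
        if hits >= min_hits:
            kept.append(text)
    return kept
```

Filtering at block level first keeps the number of blocks that need the more expensive DOM traversal small, which is the efficiency argument made above.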
The above is the primary embodiment of this intellectual property and does not limit the implementation of this new product and/or new method in other forms. Those skilled in the art may use this information to modify the above content in order to achieve a similar implementation. However, all modifications or alterations based on the new product of the present invention fall within the reserved rights.

Claims (6)

1. A big data analysis system, characterized by comprising: a data retrieval module, a data filtering module, a data clustering module, and an information extraction module; the data retrieval module is used for data retrieval, separates the data attributes from the attribute values in the data set, and builds a double-layer index structure; the data retrieval module first builds an upper-layer index over the attributes of the data in the data set;
next, an index is built over the data values corresponding to each upper-layer attribute: a B+ tree index structure for numeric data, and an inverted index for character data;
the index creation process specifically comprises:
Step 1: first analyze the data for which an index is to be built; if the index already built does not contain this data, build a new index node in the first layer of the hybrid index;
Step 2: determine the attribute value type of the newly added data; if it is numeric data, create a B+ tree index for it; if it is a character attribute, build an inverted index structure for it;
Step 3: repeat Step 1; if the current attribute already exists in the index built previously, no new node is added to the first layer of the index, and the data of that attribute is simply added to the corresponding second-layer index;
Step 4: repeat the above steps until an index has been built for all data.
2. The big data analysis system according to claim 1, characterized in that the data filtering module is used to filter the data after data retrieval; the data filtering adopts the following mean model transformation: suppose the rating vector of the item i to be transformed is $I_i = \{r_{1i}, r_{2i}, r_{3i}, \dots, r_{mi}\}$; after the mean model transformation, the vector $I_i$ is converted to the mean model representation:
$I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \dots), \dots\}$;
where $t_0$ is the single element of layer 0 of the mean model, $(t_{10}, t_{11})$ are the two elements of layer 1, and $(t_{20}, t_{21}, t_{22}, t_{23})$ are the four elements of layer 2; by analogy, the item rating vector is converted into a mean model with the specified number of layers.
3. The big data analysis system according to claim 1, characterized in that the data clustering module is used for cluster analysis of the data after data filtering;
the data cluster analysis uses the prediction strength analysis method; the prediction strength method is as follows:
(1) randomly divide the original data to be clustered into a training set and a test set;
(2) take the number of clusters as k and cluster each of the two subsets; the clustering results are recorded as type I clusterings;
(3) classify the test set using the clustering result of the training set; the result is recorded as the type II clustering;
(4) within each class into which the test set itself was clustered, check whether any pair of sample points i and i′ is wrongly assigned to different classes in the type II clustering, and record the proportion of pairs that are correctly assigned;
(5) among these k proportions, the smallest is the prediction strength for the current number of clusters k.
4. A big data analysis method, characterized by comprising: a data retrieval step, a data filtering step, a data clustering step, and an information extraction step; the data retrieval step is used for data retrieval, separates the data attributes from the attribute values in the data set, and builds a double-layer index structure; the data retrieval step first builds an upper-layer index over the attributes of the data in the data set;
next, an index is built over the data values corresponding to each upper-layer attribute: a B+ tree index structure for numeric data, and an inverted index for character data;
the index creation process specifically comprises:
Step 1: first analyze the data for which an index is to be built; if the index already built does not contain this data, build a new index node in the first layer of the hybrid index;
Step 2: determine the attribute value type of the newly added data; if it is numeric data, create a B+ tree index for it; if it is a character attribute, build an inverted index structure for it;
Step 3: repeat Step 1; if the current attribute already exists in the index built previously, no new node is added to the first layer of the index, and the data of that attribute is simply added to the corresponding second-layer index;
Step 4: repeat the above steps until an index has been built for all data.
5. The big data analysis method according to claim 4, characterized in that the data filtering step is used to filter the data after data retrieval; the data filtering adopts the following mean model transformation: suppose the rating vector of the item i to be transformed is $I_i = \{r_{1i}, r_{2i}, r_{3i}, \dots, r_{mi}\}$; after the mean model transformation, the vector $I_i$ is converted to the mean model representation:
$I'_i = \{t_0, (t_{10}, t_{11}), (t_{20}, t_{21}, t_{22}, t_{23}), (t_{30}, t_{31}, \dots), \dots\}$;
where $t_0$ is the single element of layer 0 of the mean model, $(t_{10}, t_{11})$ are the two elements of layer 1, and $(t_{20}, t_{21}, t_{22}, t_{23})$ are the four elements of layer 2; by analogy, the item rating vector is converted into a mean model with the specified number of layers.
6. The big data analysis method according to claim 4, characterized in that the data clustering step is used for cluster analysis of the data after data filtering;
the data cluster analysis uses the prediction strength analysis method; the prediction strength method is as follows:
(1) randomly divide the original data to be clustered into a training set and a test set;
(2) take the number of clusters as k and cluster each of the two subsets; the clustering results are recorded as type I clusterings;
(3) classify the test set using the clustering result of the training set; the result is recorded as the type II clustering;
(4) within each class into which the test set itself was clustered, check whether any pair of sample points i and i′ is wrongly assigned to different classes in the type II clustering, and record the proportion of pairs that are correctly assigned;
(5) among these k proportions, the smallest is the prediction strength for the current number of clusters k.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610848904.9A CN106484813B (en) 2016-09-23 2016-09-23 A kind of big data analysis system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610848904.9A CN106484813B (en) 2016-09-23 2016-09-23 A kind of big data analysis system and method

Publications (2)

Publication Number Publication Date
CN106484813A CN106484813A (en) 2017-03-08
CN106484813B true CN106484813B (en) 2017-10-31

Family

ID=58267892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610848904.9A Active CN106484813B (en) 2016-09-23 2016-09-23 A kind of big data analysis system and method

Country Status (1)

Country Link
CN (1) CN106484813B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341221B (en) * 2017-06-28 2020-08-11 百度在线网络技术(北京)有限公司 Index structure establishing and associated retrieving method, device, equipment and storage medium
CN107609105B (en) * 2017-09-12 2020-07-28 电子科技大学 Construction method of big data acceleration structure
CN110019400B (en) * 2017-12-25 2021-01-12 深圳云天励飞技术有限公司 Data storage method, electronic device and storage medium
CN108256083A (en) * 2018-01-22 2018-07-06 成都博睿德科技有限公司 Content recommendation method based on deep learning
CN108256086A (en) * 2018-01-22 2018-07-06 成都博睿德科技有限公司 Data characteristics statistical analysis technique
CN108764991B (en) * 2018-05-22 2021-11-02 江南大学 Supply chain information analysis method based on K-means algorithm
CN109325027A (en) * 2018-08-21 2019-02-12 朱常林 One kind is based on the analysis of cloud data, Situation Awareness algorithm
CN109547271B (en) * 2019-01-06 2020-01-03 广州泳泳信息科技有限公司 Network state real-time monitoring alarm system based on big data
CN110348021B (en) * 2019-07-17 2021-05-18 湖北亿咖通科技有限公司 Character string recognition method based on named entity model, electronic device and storage medium
CN114996360B (en) * 2022-07-20 2022-11-18 江西现代职业技术学院 Data analysis method, system, readable storage medium and computer equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101256594A (en) * 2008-03-25 2008-09-03 北京百问百答网络技术有限公司 Method and system for measuring graph structure similarity
CN103455908A (en) * 2012-05-30 2013-12-18 Sap股份公司 Brainstorming service in cloud environment
CN105787097A (en) * 2016-03-16 2016-07-20 中山大学 Distributed index establishment method and system based on text clustering

Also Published As

Publication number Publication date
CN106484813A (en) 2017-03-08


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant