CN108256086A - Data characteristics statistical analysis technique - Google Patents


Info

Publication number: CN108256086A
Application number: CN201810060531.8A
Authority: CN (China)
Original language: Chinese (zh)
Inventor: 李垚霖
Assignee (original and current): Chengdu Boruide Science & Technology Co Ltd
Legal status: Pending
Prior art keywords: data, index, attribute, feature, value

Classifications

    • G Physics
    • G06 Computing; calculating or counting
    • G06F Electric digital data processing
    • G06F16/00 Information retrieval; database structures therefor; file system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/22 Indexing; data structures therefor; storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+ trees
    • G06F16/24 Querying
    • G06F16/245 Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a data feature statistical analysis method. The method partitions the data attribute keys and attribute values in a data feature set, and builds a two-layer feature index from the partitioned attribute keys and attribute values. The present invention proposes a data feature statistical analysis method for the data nodes of a distributed environment that improves the efficiency of data mining.

Description

Data characteristics statistical analysis technique
Technical field
The present invention relates to data processing, and more particularly to a data feature statistical analysis method.
Background technology
Performing data aggregation and analysis on large-scale distributed data nodes requires efficient data mining methods. In the current related art, traditional centralized data management and retrieval methods suffer from single points of failure and poor scalability, and cannot meet the demand for flexible, scalable, and robust data mining in a distributed environment. How to use decentralized data node management and data mining methods to build scalable data services that satisfy data aggregation and analysis requirements therefore remains a challenging problem. In addition, the data query time and cost of existing big data parallel computing frameworks in the indexing stage leave much room for improvement; with traditional parallel merge sorting, an uneven distribution of data feature fields causes efficiency in the probe phase to drop markedly.
Invention content
To solve the above problems of the prior art, the present invention proposes a data feature statistical analysis method, comprising:
partitioning the data attribute keys and attribute values in a data feature set; and
building a two-layer feature index from the partitioned data attribute keys and attribute values.
Preferably, a feature index is constructed with a tree structure for the feature attributes contained in the data feature set.
Preferably, for the data attribute values contained in the data feature set, the index structure is selected according to the data type of the attribute when the feature index is constructed.
Preferably, if a data attribute is numeric, an R-tree feature index is built; the concrete feature attributes of the indexed data are stored entirely in non-leaf objects, and a range query over numeric data is completed by jumping directly to the low-level tree feature index.
Preferably, every leaf object of the tree structure stores three pieces of information, Ai, Pcat, and Psi, whose meanings are: (1) Ai is a concrete feature attribute of the indexed data feature set, where n is the number of feature attributes and i ∈ [1, n]; (2) Pcat denotes the pointer type; (3) Psi is a pointer to the low-level feature index.
Preferably, if a data attribute is text, an inverted feature index is built. The inverted feature index has two parts: the first part is a feature index table composed of the different index terms, recording the distinct text keywords and their related information; the second part records, for each index term that occurs, the set of documents containing it and their storage addresses.
Preferably, during index retrieval, the query condition is first analyzed to obtain feature words, and each query feature word is submitted to the index dictionary. If the index flag is false, a null value is returned, indicating that the feature data sought does not exist in the index file; if it is true, the data type of the query word's result is determined, the corresponding feature index is located according to that type, and the ID of the feature word and the number of documents containing it are read to obtain the information related to the query condition. The contents of the R-tree feature index or the inverted index are then read by feature-word ID, the retrieved contents are integrated, a relevance comparison against the search condition is performed, and the ranked results are returned to the user.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes a data feature statistical analysis method for the data nodes of a distributed environment. It helps users consume data by matching service description information, improves the efficiency of data mining, and provides a feasible scheme for building data services on the computing and storage resources supplied by a cloud service.
Description of the drawings
Fig. 1 is a flow chart of the data feature statistical analysis method according to an embodiment of the present invention.
Specific embodiment
A detailed description of one or more embodiments of the invention is provided below, together with the drawings that illustrate its principles. The invention is described in connection with these embodiments, but it is not limited to any embodiment; its scope is limited only by the claims, and it covers many alternatives, modifications, and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for exemplary purposes; the invention may be practiced according to the claims without some or all of them.
One aspect of the present invention provides a data feature statistical analysis method. Fig. 1 is a flow chart of the data feature statistical analysis method according to an embodiment of the present invention.
The data feature mining system of the present invention includes a storage subsystem, a feature classification subsystem, a trusted key subsystem, a feature mining subsystem, and a task scheduling subsystem.
The trusted key subsystem ensures that data is obtained only on the strength of an identity verification result, and covers key generation, identity verification, and decryption; the key generation algorithm is as follows:
1) the data is divided into blocks the length of the key string;
2) each character of the plaintext and the key is replaced by an integer in the range 0-26, with space = 00, A = 01, ..., Z = 26;
3) in each plaintext block, every character is replaced by its corresponding calculated value, obtained by adding the character's integer code to the integer code of the character at the corresponding position in the key and taking the remainder modulo 27;
4) each character replaced by its calculated value is then substituted by its equivalent character;
Identity verification is realized by user login and voiceprint verification; a user who passes verification obtains the key through the decryption module and completes decryption.
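The key-generation steps 1)-4) describe a Vigenère-style substitution over a 27-symbol alphabet (space = 00, A = 01, ..., Z = 26). A minimal Python sketch, assuming uppercase plaintext restricted to that alphabet; the block splitting of step 1) is implicit in the repeating key:

```python
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"   # space = 00, A = 01, ..., Z = 26

def encrypt(plaintext, key):
    # steps 2)-4): encode, add the key code position-wise mod 27, re-encode
    out = []
    for i, ch in enumerate(plaintext):
        p = ALPHABET.index(ch)
        k = ALPHABET.index(key[i % len(key)])
        out.append(ALPHABET[(p + k) % 27])
    return "".join(out)

def decrypt(ciphertext, key):
    # inverse of step 3): subtract the key code mod 27
    out = []
    for i, ch in enumerate(ciphertext):
        c = ALPHABET.index(ch)
        k = ALPHABET.index(key[i % len(key)])
        out.append(ALPHABET[(c - k) % 27])
    return "".join(out)
```

Decryption, performed after successful identity verification, simply subtracts the key codes modulo 27, reversing step 3).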
The storage subsystem includes a storage module and a disaster recovery module. The storage module authenticates the nodes in the network where information is to be stored and establishes a trust relationship for the stored information. Based on the data distributed in the distributed environment, it packages and stores feature data, and uses a composite feature index structure to obtain faster query speeds for both text and numeric data. The disaster recovery module restores data when it is lost or destroyed.
On the basis of traditional indexing, the storage module partitions the data attribute keys and attribute values in the data feature set and builds a two-layer feature index structure. First, a high-level index is built for the attributes of the data in the feature set. Then a feature index is constructed for the key values under each high-level feature attribute: an R-tree feature index structure for numeric data, and an inverted feature index for text data. A range query over numeric data jumps directly to the low-level tree feature index, reducing data query time and cost.
The high-level tree feature index is built for the feature attributes contained in the data feature set. The concrete feature attributes of the data in this index layer are stored entirely in non-leaf objects, while every leaf object of the R-tree stores three pieces of information, Ai, Pcat, and Psi: (1) Ai is a concrete feature attribute of the indexed data feature set, where n is the number of feature attributes and i ∈ [1, n]; (2) Pcat denotes the pointer type; (3) Psi is a pointer to the low-level feature index, which, depending on the data type, points to a different feature index structure, i.e. to the header of an inverted document table or to the root node of an R-tree.
The low-level feature index is constructed for the key values under the high-level feature attributes, and comprises R-tree feature index structures built for numeric data and inverted document table feature indexes built for text data. The actual key values are stored in the non-leaf objects of the R-tree feature index structure, while the leaf objects are arranged in order and contain three pieces of feature-index-file information, RS, Pos, and Fileid: (1) RS is the S-th attribute key value of the R-th feature attribute key, with R ∈ [1, n2] and S ∈ [1, P], where n2 is the number of numeric feature attributes in the data feature set and P is the number of feature values of the R-th attribute key; (2) Pos is the location of the file containing this attribute value; (3) Fileid is the ID of the file containing the query feature word.
The inverted feature index has two parts: the first part is the feature index table composed of the different index terms, recording the distinct text keywords and their related information; the second part records, for each index term that occurs, the set of documents containing it and their storage addresses. The inverted feature index structure holds four pieces of information, Aij, Fileid, Pos, and Freq: (1) Aij is the j-th feature attribute key value of the i-th feature attribute key, with i ∈ [1, n1] and j ∈ [1, m], where n1 is the number of text attributes and m is the number of attribute values of the i-th attribute key; (2) Fileid is the unique ID of the file containing the query feature word; (3) Pos is the position of the file containing the query feature word; (4) Freq is the frequency with which the query feature word occurs in the data feature set. The feature index is created as follows:
Step 1: analyze the data for which a feature index is to be built; if the current data is not present in the indexes already built, create a new feature index object in the high level of the composite feature index;
Step 2: determine the type of the new data's feature attribute value; if it is numeric, create an R-tree feature index for it; if it is a text attribute, build an inverted feature index structure for it;
Step 3: repeat step 1; if the current attribute already exists in a previously built feature index, add no new object to the high level of the feature index and only add the data of that attribute to the corresponding low-level feature index;
Step 4: repeat the above steps until feature indexes have been built for all the data.
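The two-layer construction in steps 1-4 can be sketched in Python. The stand-ins are deliberate simplifications: a sorted list replaces the R-tree for numeric attributes, a plain term-to-document map replaces the inverted document table, and the class, method, and attribute names are illustrative, not from the patent:

```python
import bisect

class FeatureIndex:
    """Two-layer composite index: a high-level map from attribute key to a
    low-level index chosen by value type. A sorted list stands in for the
    R-tree (numeric); a term -> doc-id-set map stands in for the inverted
    document table (text). Doc ids are assumed numeric."""
    def __init__(self):
        self.high = {}  # attribute key -> ("numeric" | "text", low-level index)

    def add(self, doc_id, attr, value):
        if isinstance(value, (int, float)):
            _, idx = self.high.setdefault(attr, ("numeric", []))
            bisect.insort(idx, (value, doc_id))
        else:
            _, idx = self.high.setdefault(attr, ("text", {}))
            idx.setdefault(value, set()).add(doc_id)

    def range_query(self, attr, lo, hi):
        # a numeric range query jumps straight to the low-level index
        _, idx = self.high[attr]
        left = bisect.bisect_left(idx, (lo,))
        right = bisect.bisect_right(idx, (hi, float("inf")))
        return [doc for _, doc in idx[left:right]]

    def term_query(self, attr, term):
        _, idx = self.high[attr]
        return idx.get(term, set())
```

Adding a value for an attribute already present only touches the low-level index, matching step 3.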
During index retrieval, the query condition is first analyzed to obtain feature words, and each query feature word is submitted to the index dictionary. If the index flag is false, a null value is returned, indicating that the feature data sought does not exist in the index file; if it is true, the data type of the query word's result is determined, the corresponding feature index is located according to that type, and the ID of the feature word and the number of documents containing it are read to obtain the information related to the query condition. The contents of the R-tree feature index or the inverted index are then read by feature-word ID, the retrieved contents are integrated, a relevance comparison against the search condition is performed, and the ranked results are returned to the user. The key value key_id in the feature table is the input of the search algorithm and a Boolean is its output; the detailed procedure is as follows:
(1) with root, key_id, and the level number level as input parameters, call the search function lookup(root, key_id, level) and assign the search result to the leaf node record;
(2) if the leaf node record is empty, return a null value directly; otherwise, return the real search result rid;
The search function lookup takes the current block as its input, key as the search keyword, and level as the initial level number, and outputs the leaf record that may contain the search keyword key; the detailed procedure is as follows:
(3.1) if the current node is a leaf node, search for the key with the binary search algorithm and return the search result.
(3.2) if the current block is not a leaf node, perform steps (3.3) to (3.6).
(3.3) from the current block and the key value, select the subtree containing the key value and obtain the block number of the child node.
(3.4) read the child node block it contains into the buffer according to the block number.
(3.5) if the child node block found is a leaf node, return to (3.1).
(3.6) if the child node block is a branch block, take the child node block, key, and level minus 1 as the new input and call the function recursively to return the output result.
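Steps (1)-(3.6) amount to a recursive descent over a B-tree-like structure with binary search at the leaves. A minimal Python sketch, assuming nodes are tuples and omitting the buffer reads and the level counter; the node layout is illustrative:

```python
import bisect

def lookup(node, key):
    """Recursive search: branch nodes route by key; leaf nodes use binary
    search over sorted (key, rid) entries, returning rid or None."""
    if node[0] == "leaf":
        entries = node[1]                       # sorted [(key, rid), ...]
        i = bisect.bisect_left(entries, (key,))  # step (3.1): binary search
        if i < len(entries) and entries[i][0] == key:
            return entries[i][1]
        return None
    _, keys, children = node
    # steps (3.3)-(3.6): select the subtree that can contain the key, recurse
    return lookup(children[bisect.bisect_right(keys, key)], key)
```

A branch node `("branch", keys, children)` with k router keys has k+1 children; `bisect_right` picks the child whose key range covers the search key.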
The feature classification subsystem uses a clustering method to manage the classification of feature data. The present invention determines the number of classes by the following method. First, the estimation density is defined as:
ps(k) = min over 1 <= j <= k of [ 1 / (n_kj (n_kj - 1)) ] Σ over i ≠ i′ ∈ A_kj of D[C(X_tr, k), X_te]_ii′
where X_tr and X_te denote the feature training set and feature test set obtained by randomly dividing the original data; C(X_tr, k) denotes clustering of the feature training set into k classes; A_k1, A_k2, ..., A_kk are the k classes into which the feature test set itself is clustered; i and i′ are sample points in the same class; n_kj is the number of sample points in A_kj; the element of D[C(X_tr, k), X_te] at row i and column i′ takes the value 0 or 1, the value 1 meaning that the pair i, i′ is placed in the same class by the training-set clustering and 0 that it is not; and ps(k) is the estimation density of the clustering result when the number of classes is k.
The estimation density is calculated as follows:
(1) randomly divide the original data to be clustered into a feature training set and a feature test set;
(2) cluster both subsets with the number of classes set to k; the result is recorded as the type-I clustering;
(3) assign the feature test set according to the clustering result of the feature training set; the result is recorded as the type-II clustering;
(4) for the k-th class formed by the test set itself, examine whether any pair of sample points i and i′ is wrongly split into different classes by the type-II clustering, and record the proportion of correctly grouped pairs;
(5) the smallest of the k proportions is the estimation density for the current number of classes k.
With the estimation density as the objective function, the number of classes and the variable subset are the factors that influence it; the estimation density is maximized by selecting a suitable number of classes and variable subset.
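Steps (1)-(5) can be sketched as follows, assuming the training-set clustering is already available as an assignment function and the test set's own clustering as a dict of clusters; the names (`assign_train`, `test_clusters`) are illustrative, not from the patent:

```python
def prediction_strength(assign_train, test_clusters):
    """Estimation density: the worst per-cluster proportion of ordered
    test-point pairs that the training-set clustering also keeps together.

    assign_train: maps a test point to its training-cluster id
    test_clusters: dict cluster_id -> list of test points
    """
    worst = 1.0
    for points in test_clusters.values():
        n = len(points)
        if n < 2:
            continue  # a singleton cluster contributes no pairs
        same = sum(
            1
            for a in range(n)
            for b in range(n)
            if a != b and assign_train(points[a]) == assign_train(points[b])
        )
        worst = min(worst, same / (n * (n - 1)))
    return worst
```

A test cluster whose members scatter across several training clusters drags the minimum down, signaling that k is too large (or the clustering unstable).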
To address the drop in probe-phase efficiency caused by unevenly distributed feature fields, the present invention uses a hash join algorithm for MapReduce data queries based on feature-field storage, so that the fields are evenly distributed across the nodes of the MapReduce distributed environment, improving data processing efficiency. Query projection operations are turned into operations on the feature fields of each node, reducing the I/O waste that repeated accesses bring with them. In feature-field storage, the target object is pushed down to a specific feature field, and each feature field is equivalent to a small table composed of (lineid, value) pairs.
To solve the data imbalance problem, a new file format, HMF, is first designed at the bottom layer of the MapReduce distributed computing framework, so that for a user's HMF file set O(F) = {f1, f2, ..., fn}, the current node set is P = {γ1, γ2, ..., γx} and the corresponding virtual machine node set is Λ = {v(γ1), v(γ2), ..., v(γx)}, where v(γi) denotes the mapping between a virtual machine node and a real compute node.
The parallel computation based on the hash algorithm is realized in the following steps:
Step 1: organize the entire hash value space into a virtual ring joined end to end;
Step 2: hash each compute node with its network address as the keyword; each node thereby determines its position in the hash space;
Step 3: map each HMF file to a value in the hash space with the hash function and, moving along the ring from that value, take the first node encountered as its processing node;
Step 4: in the Map stage, when a node is looked up for an HMF file, what is found is a virtual machine node; the corresponding real compute node is then found from the virtual compute node, and each cluster is mapped onto one node;
Step 5: in the probe phase, collect the load data of each node; once an imbalance is found, the clusters mapped to that node are reassigned to new nodes, the number of new nodes being determined by the load; the original node's resources are reclaimed after replacement for redistribution;
Step 6: after the hash join of each node completes, the data of the new nodes and the original nodes are handed together to the Reduce stage for merging, and the query result is finally output.
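Steps 1-4 describe consistent hashing. A small Python sketch, assuming MD5 as the ring hash and one ring position per node; the virtual-node indirection and the load rebalancing of steps 4-5 are omitted:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring: nodes and items share one hash space;
    an item is served by the first node encountered after its hash."""
    def __init__(self, nodes):
        self.ring = sorted((self._h(n), n) for n in nodes)
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, item):
        # first node at or after the item's hash, wrapping around the ring
        i = bisect.bisect(self.keys, self._h(item)) % len(self.ring)
        return self.ring[i][1]
```

The key property is the one step 5 relies on: removing (or adding) a node remaps only the items that hashed to that node, leaving every other assignment untouched.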
In the hash join there are two relations R and S, whose numbers of tuples are T_R and T_S respectively, with T_R > T_S. A hash function maps the initial partition of S into B clusters, numbered 1, 2, ..., B; the common attributes of R and S are A1, A2, ..., Ak. For the join attribute Ai component of the relation R distributed over m nodes, the hash operation determines in which of the B clusters it matches.
The optimization of the above hash join is divided into two stages: build and probe.
In the build stage, the Map task of each node selects one table as the hash-join base table, takes the join attribute participating in the join operation as the hash key in order to build the hash table, reads the join attribute fields of the base table from the HMF file system into the node memory of the MapReduce distributed system, and then applies the hash function to all the key values of the join field.
The hash-processed join columns of the base table are stored, together with the data, in an area of memory set aside specifically for such data. Then, according to the different hash function values, the base table is partitioned into clusters. Each cluster contains all the base table data with the same hash function value.
In the probe stage, the base table on each node serves as the fact table of the hash join. The data fields to be joined are read in batches from HMF, the hash operation is applied to the join attribute field to determine which cluster it falls in, and the hash search algorithm is used to locate the appropriate cluster.
Within the located cluster, the qualifying line numbers are obtained by exact matching. The qualifying line numbers of each node are merged by Reduce, the query columns involved in the SQL statement are read from the HMF file system, and the query result is finally output.
Because the search range of the algorithm has been reduced, the success rate and accuracy of matching are high. The matching operations are carried out in memory, which is much faster than merge sorting, achieving the purpose of the optimization.
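The build and probe stages reduce, per node, to an in-memory hash join. A minimal Python sketch over lists of dict rows; the HMF batch reads and the Reduce-side merge are omitted, and the row format is illustrative:

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    # build stage: hash the base (smaller) table on its join attribute
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    # probe stage: stream the fact table; exact matching happens only
    # inside the one bucket the hash value selects
    out = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})
    return out
```

This is why the text can claim a reduced search range: each probe row is compared only against the bucket sharing its hash value, never against the whole base table.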
To reduce the influence of chance, it is preferable first to divide the data feature set randomly into several equal parts, take each part in turn as the feature test set, obtain the respective estimation densities, and then take their average as the estimation density for that number of classes. The hierarchical clustering method based on the improved estimation density yields cluster results that are credible and practically meaningful, and is better suited to cluster analysis of data than conventional clustering methods.
The feature mining subsystem searches, in the verified secure cloud environment, the data providers scattered throughout the cloud platform at the data layer for the feature data that meets the application demand, and forms the feature data to be processed through aggregation, analysis, and arrangement. In the modeling phase it models the compute nodes of the storage cluster used in the distributed environment; feature data is shared between the non-local nodes, and the feature data meeting the application demand is searched for and matched. Let xi be a node in the storage cluster, {xi1, xi2, ..., xim} the non-local node set of xi, PLi the local resource pool, and PNi the data pool of the non-local nodes, with i ∈ [1, n], where n is the total number of nodes in the storage cluster and m is the number of non-local nodes, m < n.
Data sharing is based on the following protocol between non-local nodes: when xi joins the P2P network, it establishes connections between xi and {xi1, xi2, ..., xim}; xi then creates shared feature data according to the service information in PLi and forwards the shared feature data to all non-local nodes xim for sharing. When any node in the storage cluster receives a piece of shared feature data, it judges from the ID information of the shared feature data whether it has already received it; if it has, it discards the shared feature data; if it is receiving it for the first time, it updates the contents of PNi according to the data and node location information of the shared feature data, and decides to forward or discard the shared feature data according to its validity flag. The data between non-local nodes needs to be synchronized periodically.
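The receive-side handling (duplicate suppression by message ID, pool update, forward-or-discard by validity flag) can be sketched as follows, with the validity flag modeled as a decrementing `ttl` field; all field and variable names are illustrative, not from the patent:

```python
def handle_share(state, msg):
    """Process one shared-feature message at a receiving node: drop
    duplicates by id, update the non-local data pool, then forward
    while the validity (ttl) flag lasts."""
    if msg["id"] in state["seen"]:
        return "drop"                               # already received: discard
    state["seen"].add(msg["id"])
    state["pool"][msg["origin"]] = msg["data"]      # update PNi contents
    if msg["ttl"] <= 0:
        return "drop"                               # validity exhausted
    msg["ttl"] -= 1                                 # decrement before forwarding
    return "forward"
```

The `seen` set plays the role of the ID check in the protocol; the periodic synchronization between non-local nodes is outside this sketch.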
During resource search the operations performed are: let the node initiating sharing request Mj be xj; from the non-local node set of xj, a set of nodes is picked at random with probability pj, namely pj × {xj1, xj2, ..., xjm}, j ∈ [1, n]. When node xi receives the sharing request Mj sent by xj, it checks whether PNi and PLi contain feature data that satisfies Mj. If so, it creates a query response message from the feature data and the location information of the node holding the data, returns the response message to xj according to xj's location information, and then decrements xj's validity flag by 1; if xj's validity flag is 0, it discards Mj; if it is not 0, it computes the expected value of each node in pj × {xj1, xj2, ..., xjm} with the EM algorithm and forwards Mj to the node in pj × {xj1, xj2, ..., xjm} with the largest expected value. The expected value is calculated as:
E_new = E_old + α E_learn + β × I[N_xjμ(t) (T_xjμ − T′_xjμ) / (T_xjμ × T′_xjμ)] × N_xjμ(t) / T_xjμ
where E_new is the new value of E, E_old the old value of E, E_learn the learned value, α the learning rate, and β the congestion factor; N_xjμ(t) is the number of pending sharing request messages in the buffer queue of node x_jμ at time t; T′_xjμ is the agreed time for a node x_jμ in pj × {xj1, xj2, ..., xjm} to process one sharing request message, and T_xjμ is the time actually required by node x_jμ in pj × {xj1, xj2, ..., xjm} to process one sharing request message; the function I[x] takes the value 1 when x > 0 and 0 when x ≤ 0.
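A direct transcription of the update rule, assuming scalar inputs; `t_nominal` stands for the agreed processing time T′ and `t_actual` for the measured time T (parameter names are illustrative):

```python
def expected_value(e_old, e_learn, alpha, beta, n_pending, t_actual, t_nominal):
    """E_new = E_old + alpha*E_learn
             + beta * I[N(t)(T - T') / (T*T')] * N(t) / T
    where I[x] = 1 if x > 0 else 0 (congestion indicator)."""
    term = n_pending * (t_actual - t_nominal) / (t_actual * t_nominal)
    indicator = 1 if term > 0 else 0
    return e_old + alpha * e_learn + beta * indicator * n_pending / t_actual
```

The indicator fires only when the node is processing requests slower than agreed (T > T′), so the congestion penalty term contributes nothing for nodes meeting their deadline.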
The task scheduling subsystem schedules the data processing computation: it splits a complex data processing computation task into a group of functionally single and independent subtasks, matches each subtask with a cloud service resource pool that meets its demand, and forms a service composition scheme, thereby obtaining the storage or computing resources needed by the data processing. The service composition schemes are estimated according to the task schedule of the generated data services:
(1) according to the cloud service resource pool SPv and the corresponding service quality history, model the efficiency function X of CSγ and initialize each parameter of the efficiency function in the model from the application instance. Let the constraints corresponding to the task schedule be as specified, with the corresponding QoS constraints C = {C1, C2, ..., Cd}; each subtask Gv corresponds to a resource pool SPv holding mv services; for each service SPvω in the cloud service resource pool SPv, the number of its history records is L; the γ-th feasible service composition scheme formed from SPv is CSγ, ω ∈ [1, mv]. The service model is defined in terms of the following quantities:
where QoS_max(k) is the maximum service quality value of the k-th dimension, QoS_min(k) the minimum service quality value of the k-th dimension, d the largest dimension, q_d() the objective function, SPRvω-h the h-th history record belonging to SPvω, and x_vω-h the parameter of the efficiency function in the model;
(2) sort the feasible service composition schemes by efficiency function value from small to large and select the first Z as the preferred service composition schemes, the value of Z being set according to the application instance;
(3) for each group of preferred service composition schemes, calculate the average of their efficiency function values;
(4) select the group of preferred service composition schemes with the largest average efficiency function value as the optimal service composition scheme;
The efficiency function values of the preferred service composition schemes and the optimal service composition scheme are recorded and learned as samples, so that if a previously seen preferred service composition scheme occurs again, its function value is invoked directly.
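Steps (3)-(4) plus the sample-reuse rule can be sketched as follows, assuming an efficiency function supplied by the caller and schemes represented as hashable tuples; the memoizing wrapper stands in for "invoking directly" a previously learned function value, and all names are illustrative:

```python
def make_scorer(efficiency):
    """Cache efficiency-function values so a previously seen scheme's
    value is returned directly instead of being recomputed."""
    cache = {}
    def score(scheme):
        if scheme not in cache:
            cache[scheme] = efficiency(scheme)
        return cache[scheme]
    return score

def select_optimal(groups, score):
    """Pick, among groups of preferred composition schemes, the group
    whose average efficiency value is largest (steps (3)-(4))."""
    return max(groups, key=lambda g: sum(score(s) for s in g) / len(g))
```

Because `score` is shared across calls, re-estimating an overlapping set of schemes costs one lookup per repeated scheme rather than one model evaluation.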
Taking network topic data mining as an example, on the basis of the constructed index structure, the present invention uses the difference in the contribution of each training feature set to the construction of the sample space when the training feature set expresses the test feature set, and constructs a new sparse representation dictionary from the matrix σ(Ic)Vc, where σ(Ic) is each class of training feature set and Vc is the dictionary contribution value matrix. A category-set bound term is added to the constraints of the sparse representation so that samples of the same category cluster together in a space of smaller total size, effectively mining the hidden features of complex data. The network topic mining method of the present invention based on big data comprises the following steps:
Step 1:Using backward neural network to topic Text Feature Extraction topic feature.
Step 2:Training characteristics collection is inputted, using the article sample training classified lexicon for including C type, training characteristics Collection space is represented with I, is expressed as I=[I1, I2..., Ic..., IC]∈RD×N, D represents the characteristic dimension of training characteristics collection, and N is Training characteristics collection total number, IiIt represents the i-th class sample, defines NiIt represents per class training characteristics collection quantity, then N=N1+N2+ ...+Nc +…+NC
Step 3:Regularization is carried out to training characteristics collection, obtains the training characteristics collection collection I of regularization;
Step 4:Its dictionary is respectively trained to every a kind of training characteristics collection, the process of training dictionary is:
1st, c class samples I is taken outc, by IcIt is mapped to kernel space σ (Ic);
2. Training the sparse coding dictionary σ(Ic)Vc must satisfy a constraint condition, whose optimization function is:

f(Vc, Sc) = ||σ(Ic) - σ(Ic)VcSc||² + 2αΣn||sn||₁ + δΣm||s̃m||²

In the formula, α is the constraint factor of the sparsity constraint in the sparse coding, δ is the constraint factor of the grouped-accumulation constraint on the coding dictionary for Ic, and Sc is the feature matrix of the c-th class of kernel-space training features, whose m-th row s̃m represents the contribution of the kernel-space samples to each dictionary entry. The dictionary is Bc = σ(Ic)Vc, and σ denotes the mapping of samples into kernel space.
3. Solve the optimization function of the constraint condition in step 2. First initialize Vc and Sc by randomly generating two matrices, where Vc is an Nc × K matrix, Sc is a K × Nc matrix, and K is the dictionary size. Then alternately iterate updates of Vc and Sc, seeking the optimal contribution matrix Vc and feature matrix Sc that minimize the optimization function value. The contribution matrices Vc of all classes are assembled into one matrix, giving the contribution matrix V, which is the classified dictionary. The specific solution procedure is:
(1) Fix Vc and update Sc. Substituting the fixed Vc into the optimization function of the constraint condition, the optimization function becomes:

f(Sc) = ||σ(Ic) - σ(Ic)VcSc||² + 2αΣn||sn||₁ + δΣm||s̃m||²

Update each element of the matrix Sc so as to make the optimization function optimal; that is, for the element in row k, column n of Sc, solve for the optimal feature matrix Sc.
(2) Fix the feature matrix Sc obtained above and update the contribution matrix Vc; the optimization function becomes:

f(Vc) = ||σ(Ic) - σ(Ic)VcSc||²
Update each column of the contribution matrix Vc in turn; while one column is being updated, the remaining columns are held at fixed values. Traverse every column of Vc to update its contribution values.
(3) Iterate steps (1) and (2) to update Sc and Vc; when the value of the optimization function f(Vc, Sc) stabilizes, the update is finished.
(4) Train the feature matrix Sc and contribution matrix Vc of each class of the training feature set in turn.
(5) Combine the contribution matrices Vc of all classes to obtain the contribution matrix V of dimension N × (C × K), which serves as the classified dictionary.
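A minimal sketch of the alternating optimization in sub-steps (1)-(3), under simplifying assumptions the patent leaves open: a linear kernel (so σ(Ic) = Ic), an ISTA proximal step for the Sc update, a least-squares step for the Vc update, and the group term δ omitted; all parameter values are arbitrary:

```python
import numpy as np

def soft(X, t):
    # Elementwise soft-thresholding: the proximal operator of the l1 penalty.
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def objective(Ic, Vc, Sc, alpha):
    # ||Ic - Ic Vc Sc||^2 + 2*alpha*|Sc|_1  (linear-kernel form, delta = 0)
    return (np.linalg.norm(Ic - Ic @ Vc @ Sc) ** 2
            + 2 * alpha * np.abs(Sc).sum())

def train_class_dictionary(Ic, K=4, alpha=0.05, outer=20, inner=5, seed=0):
    rng = np.random.default_rng(seed)
    Nc = Ic.shape[1]
    Vc = rng.normal(size=(Nc, K))    # contribution matrix, Nc x K
    Sc = rng.normal(size=(K, Nc))    # feature (code) matrix, K x Nc
    for _ in range(outer):
        B = Ic @ Vc                  # current dictionary B_c = Ic Vc
        # (1) fix Vc, update Sc by ISTA steps on ||Ic - B Sc||^2 + 2a|Sc|_1
        t = 1.0 / (2 * np.linalg.norm(B, 2) ** 2 + 1e-12)
        for _ in range(inner):
            Sc = soft(Sc - t * 2 * B.T @ (B @ Sc - Ic), 2 * alpha * t)
        # (2) fix Sc, update Vc by least squares so that Ic Vc ~ Ic pinv(Sc)
        Vc = np.linalg.pinv(Ic) @ Ic @ np.linalg.pinv(Sc)
    # (3) iterate until the objective stabilizes (fixed iteration count here)
    return Vc, Sc
```

Per sub-step (5), the per-class Vc blocks would then be assembled into the overall contribution matrix V.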
Step 5: Recognize the text. The steps are:
(1) Extract the text features of the test feature set to be recognized using the back-propagation neural network, and define y as the topic features of the test sample.
(2) Using the obtained contribution matrix V, predict the test feature set's text feature σ(y). The obtained prediction function is:

F(s) = ||σ(y) - σ(I)Vs||² + 2α||s||₁

In the formula, s represents the sparse code of the test feature σ(y), and σ(I) represents the mapping of the training feature set I in kernel space.
(3) Compute the prediction error of the kernel-space feature σ(y) against the sample space formed by each class of samples, denoted r(c):

r(c) = ||σ(y) - σ(Ic)Vcsc||²

where sc is the part of the sparse code s associated with class c.
(4) Compare the prediction errors of σ(y) across all classes; the text to be recognized belongs to the class with the minimum prediction error.
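Under a linear-kernel simplification (σ(Ic) ≈ Ic, so the class dictionary is Bc = IcVc), the decision rule above assigns y to the class whose dictionary reconstructs it with the smallest residual. A toy example with two hypothetical one-atom dictionaries:

```python
import numpy as np

def class_residuals(y, dictionaries):
    # r(c) = ||y - B_c s_c||^2, with s_c the least-squares code of y
    # in class c's dictionary B_c (linear-kernel stand-in for sigma).
    r = []
    for Bc in dictionaries:
        sc = np.linalg.lstsq(Bc, y, rcond=None)[0]
        r.append(float(np.linalg.norm(y - Bc @ sc) ** 2))
    return r

def classify(y, dictionaries):
    # The text belongs to the class with the minimum prediction error.
    return int(np.argmin(class_residuals(y, dictionaries)))

# Toy dictionaries: class 0 spans the x-axis, class 1 the y-axis.
B0 = np.array([[1.0], [0.0], [0.0]])
B1 = np.array([[0.0], [1.0], [0.0]])
y = np.array([0.1, 2.0, 0.0])       # lies mostly along the y-axis
label = classify(y, [B0, B1])       # -> 1
```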
In conclusion, the present invention proposes a data feature statistical analysis method which, for data nodes in a distributed environment, lets users access data conveniently by matching service description information, improves the efficiency of data mining, and provides a feasible scheme for developing and building data services by renting computing resources or storage resources provided by the cloud.
Obviously, those skilled in the art should appreciate that each module or each step of the present invention described above may be implemented with a general-purpose computing system; they may be concentrated on a single computing system or distributed over a network formed by multiple computing systems; optionally, they may be implemented with program code executable by a computing system, and may thus be stored in a storage system and executed by a computing system. The present invention is therefore not limited to any specific combination of hardware and software.
It should be understood that the above specific embodiments of the present invention are used only to exemplify or explain the principle of the present invention and are not to be construed as limiting the present invention. Therefore, any modification, equivalent replacement, improvement, and the like made without departing from the spirit and scope of the present invention shall be included within the protection scope of the present invention. Furthermore, the appended claims of the present invention are intended to cover all variations and modifications falling within the scope and boundary of the claims, or equivalent forms of such scope and boundary.

Claims (7)

1. A data feature statistical analysis method, characterized by comprising:
dividing the data attribute keys and attribute values in a data feature set; and
constructing a two-layer feature index from the divided data attribute keys and attribute values.
2. The method according to claim 1, characterized in that, for the feature attributes contained in the data feature set, a tree structure is used to construct the feature index.
3. The method according to claim 1, characterized in that, when constructing a feature index for the data attribute values contained in the data feature set, a corresponding index structure is selected according to the data type of the data attribute.
4. The method according to claim 3, characterized in that:
if the data attribute is numeric data, an R-tree feature index is built; the specific feature attributes of the data are all stored in non-leaf objects of the index, and a range query over numeric data is completed by directly locating the low-level tree feature index.
5. The method according to claim 4, characterized in that three pieces of information, Ai, Pcat, and Psi, are stored in all leaf objects of the tree structure, with the following meanings: (1) Ai is a specific feature attribute of the indexed data feature set, where n is the number of all feature attributes and i ∈ [1, n]; (2) Pcat indicates the pointer type; (3) Psi is a pointer to the low-level feature index.
6. The method according to claim 2, characterized in that:
if the data attribute is text data, an inverted feature index is built; the inverted feature index is divided into two parts: the first part is a feature index table composed of different index words, which records the different text keywords and their related information; the second part records, for each index word that occurs, the set of documents containing it and their storage addresses.
7. The method according to claim 6, characterized in that, during index retrieval, the query condition is first analyzed to obtain feature words; a query feature word is handed to the index dictionary, and if the index flag bit is false, a null value is returned to indicate that the feature data to be queried do not exist in the index file; if it is true, the data type of the result returned for the query word is judged, the corresponding feature index is located according to that type, and the ID of the feature word and the number of documents containing it are read, thereby obtaining the related information of the query condition; the content in the R-tree feature index or the inverted index is then read according to the feature word ID, the retrieved content is integrated and compared for relevance with the search condition, and the ranked final result is returned to the user.
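Claim 7's retrieval flow can be sketched with ordinary Python containers standing in for the index structures; all names and data below are hypothetical: a dict for the index dictionary, a sorted list of (value, doc_id) pairs in place of the R-tree feature index, and a dict of posting lists in place of the inverted feature index.

```python
import bisect

# Index dictionary: word -> (data type, feature-word ID); a missing entry
# plays the role of the "flag bit false" case in claim 7.
index_dict = {"price": ("numeric", 1), "apple": ("text", 2)}
numeric_index = {1: [(5, "d1"), (9, "d2"), (12, "d3")]}   # R-tree stand-in
inverted_index = {2: ["d1", "d3"]}                        # posting lists

def query(word, lo=None, hi=None):
    entry = index_dict.get(word)
    if entry is None:            # flag bit false: not in the index file
        return None
    kind, fid = entry            # data type and feature-word ID
    if kind == "numeric":        # numeric type: range query on the tree index
        pairs = numeric_index[fid]
        i = bisect.bisect_left(pairs, (lo, ""))
        j = bisect.bisect_right(pairs, (hi, "\uffff"))
        return [doc for _, doc in pairs[i:j]]
    return inverted_index[fid]   # text type: read the inverted index
```

For example, `query("price", lo=6, hi=12)` locates the numeric index and returns the documents whose value lies in [6, 12].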
CN201810060531.8A 2018-01-22 2018-01-22 Data characteristics statistical analysis technique Pending CN108256086A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810060531.8A CN108256086A (en) 2018-01-22 2018-01-22 Data characteristics statistical analysis technique

Publications (1)

Publication Number Publication Date
CN108256086A true CN108256086A (en) 2018-07-06

Family

ID=62742075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810060531.8A Pending CN108256086A (en) 2018-01-22 2018-01-22 Data characteristics statistical analysis technique

Country Status (1)

Country Link
CN (1) CN108256086A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033954A (en) * 2010-12-24 2011-04-27 东北大学 Full text retrieval inquiry index method for extensible markup language document in relational database
CN106484813A (en) * 2016-09-23 2017-03-08 广东港鑫科技有限公司 A kind of big data analysis system and method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021213127A1 (en) * 2020-04-21 2021-10-28 International Business Machines Corporation Cached updatable top-k index
US11327980B2 (en) 2020-04-21 2022-05-10 International Business Machines Corporation Cached updatable top-k index
GB2610108A (en) * 2020-04-21 2023-02-22 Ibm Cached updatable top-k index
CN112905591A (en) * 2021-02-04 2021-06-04 成都信息工程大学 Data table connection sequence selection method based on machine learning

Similar Documents

Publication Publication Date Title
CN106649455B (en) Standardized system classification and command set system for big data development
KR100816934B1 (en) Clustering system and method using search result document
US6738759B1 (en) System and method for performing similarity searching using pointer optimization
US20080270374A1 (en) Method and system for combining ranking and clustering in a database management system
JP2017037648A (en) Hybrid data storage system, method, and program for storing hybrid data
CN110209808A (en) A kind of event generation method and relevant apparatus based on text information
CN106815310A (en) A kind of hierarchy clustering method and system to magnanimity document sets
CN106095951B (en) Data space multi-dimensional indexing method based on load balancing and inquiry log
CN104112005B (en) Distributed mass fingerprint identification method
CN110390352A (en) A kind of dark data value appraisal procedure of image based on similitude Hash
KR102046692B1 (en) Method and System for Entity summarization based on multilingual projected entity space
Motwani et al. A study on initial centroids selection for partitional clustering algorithms
CN109241278A (en) Scientific research knowledge management method and system
CN113222181A (en) Federated learning method facing k-means clustering algorithm
CN108228787A (en) According to the method and apparatus of multistage classification processing information
Rahman et al. Seed-Detective: A Novel Clustering Technique Using High Quality Seed for K-Means on Categorical and Numerical Attributes.
Abdelli et al. A novel and efficient index based web service discovery approach
CN108256086A (en) Data characteristics statistical analysis technique
CN108256083A (en) Content recommendation method based on deep learning
CN108280176A (en) Data mining optimization method based on MapReduce
WO2022156086A1 (en) Human computer interaction method, apparatus and device, and storage medium
Prasanth et al. Effective big data retrieval using deep learning modified neural networks
Al Aghbari et al. Geosimmr: A mapreduce algorithm for detecting communities based on distance and interest in social networks
CN107844536A (en) The methods, devices and systems of application program selection
CN109446408A (en) Retrieve method, apparatus, equipment and the computer readable storage medium of set of metadata of similar data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180706