CN108256086A - Data characteristics statistical analysis technique - Google Patents


Info

Publication number: CN108256086A
Application number: CN201810060531.8A
Authority: CN (China)
Original language: Chinese (zh)
Inventor: 李垚霖
Assignee (original and current): Chengdu Boruide Science & Technology Co Ltd
Legal status: Pending
Prior art keywords: data, index, attribute, feature, value

Classifications

    • G Physics
    • G06 Computing; calculating or counting
    • G06F Electric digital data processing
    • G06F16/00 Information retrieval; database structures therefor; file system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/22 Indexing; data structures therefor; storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+ trees
    • G06F16/24 Querying
    • G06F16/245 Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a data feature statistical analysis method. The method partitions the data attribute keys and attribute values in a data feature set, and builds a two-layer feature index from the partitioned attribute keys and attribute values. The present invention proposes a data feature statistical analysis method for the data nodes of a distributed environment that improves the efficiency of data mining.

Description

Data characteristics statistical analysis technique
Technical field
The present invention relates to data processing, and more particularly to a data feature statistical analysis method.
Background technology
Performing data aggregation and analysis on large-scale distributed data nodes requires efficient data mining methods. In the current related art, traditional centralized data management and retrieval methods suffer from single points of failure and poor scalability, and cannot meet the demand for flexible, scalable, and robust data mining in a distributed environment. How to use decentralized data node management and data mining methods to build scalable data services that satisfy data aggregation and analysis requirements therefore remains a challenging problem. In addition, the data query time and cost of existing big data parallel computing frameworks in the indexing stage leave much room for improvement; with traditional parallel merge sorting, an uneven distribution of data feature fields causes efficiency in the probe phase to drop markedly.
Invention content
To solve the above problems of the prior art, the present invention proposes a data feature statistical analysis method, comprising:
partitioning the data attribute keys and attribute values in a data feature set; and
building a two-layer feature index from the partitioned data attribute keys and attribute values.
Preferably, a feature index is constructed with a tree structure for the feature attributes contained in the data feature set.
Preferably, for the data attribute values contained in the data feature set, the index structure is selected according to the data type of the attribute when the feature index is constructed.
Preferably, if a data attribute is numeric, an R-tree feature index is built; the concrete feature attributes of the indexed data are stored entirely in non-leaf objects, and a range query over numeric data is completed by jumping directly to the low-level tree feature index.
Preferably, every leaf object of the tree structure stores three pieces of information, Ai, Pcat, and Psi, whose meanings are: (1) Ai is a concrete feature attribute of the indexed data feature set, where n is the number of feature attributes and i ∈ [1, n]; (2) Pcat denotes the pointer type; (3) Psi is a pointer to the low-level feature index.
Preferably, if a data attribute is text, an inverted feature index is built. The inverted feature index has two parts: the first part is a feature index table composed of the different index terms, recording the distinct text keywords and their related information; the second part records, for each index term that occurs, the set of documents containing it and their storage addresses.
Preferably, during index retrieval, the query condition is first analyzed to obtain feature words, and each query feature word is submitted to the index dictionary. If the index flag is false, a null value is returned, indicating that the feature data sought does not exist in the index file; if it is true, the data type of the query word's result is determined, the corresponding feature index is located according to that type, and the ID of the feature word and the number of documents containing it are read to obtain the information related to the query condition. The contents of the R-tree feature index or the inverted index are then read by feature-word ID, the retrieved contents are integrated, a relevance comparison against the search condition is performed, and the ranked results are returned to the user.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes a data feature statistical analysis method for the data nodes of a distributed environment. It helps users consume data by matching service description information, improves the efficiency of data mining, and provides a feasible scheme for building data services on the computing and storage resources supplied by a cloud service.
Description of the drawings
Fig. 1 is a flow chart of the data feature statistical analysis method according to an embodiment of the present invention.
Specific embodiment
A detailed description of one or more embodiments of the invention is provided below, together with the drawings that illustrate its principles. The invention is described in connection with these embodiments, but it is not limited to any embodiment; its scope is limited only by the claims, and it covers many alternatives, modifications, and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for exemplary purposes; the invention may be practiced according to the claims without some or all of them.
One aspect of the present invention provides a data feature statistical analysis method. Fig. 1 is a flow chart of the data feature statistical analysis method according to an embodiment of the present invention.
The data feature mining system of the present invention includes a storage subsystem, a feature classification subsystem, a trusted key subsystem, a feature mining subsystem, and a task scheduling subsystem.
The trusted key subsystem ensures that data is obtained only on the strength of an identity verification result, and covers key generation, identity verification, and decryption; the key generation algorithm is as follows:
1) the data is divided into blocks the length of the key string;
2) each character of the plaintext and the key is replaced by an integer in the range 0-26, with space = 00, A = 01, ..., Z = 26;
3) in each plaintext block, every character is replaced by its corresponding calculated value, obtained by adding the character's integer code to the integer code of the character at the corresponding position in the key and taking the remainder modulo 27;
4) each character replaced by its calculated value is then substituted by its equivalent character;
Identity verification is realized by user login and voiceprint verification; a user who passes verification obtains the key through the decryption module and completes decryption.
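The key-generation steps 1)-4) describe a Vigenère-style substitution over a 27-symbol alphabet (space = 00, A = 01, ..., Z = 26). A minimal Python sketch, assuming uppercase plaintext restricted to that alphabet; the block splitting of step 1) is implicit in the repeating key:

```python
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"   # space = 00, A = 01, ..., Z = 26

def encrypt(plaintext, key):
    # steps 2)-4): encode, add the key code position-wise mod 27, re-encode
    out = []
    for i, ch in enumerate(plaintext):
        p = ALPHABET.index(ch)
        k = ALPHABET.index(key[i % len(key)])
        out.append(ALPHABET[(p + k) % 27])
    return "".join(out)

def decrypt(ciphertext, key):
    # inverse of step 3): subtract the key code mod 27
    out = []
    for i, ch in enumerate(ciphertext):
        c = ALPHABET.index(ch)
        k = ALPHABET.index(key[i % len(key)])
        out.append(ALPHABET[(c - k) % 27])
    return "".join(out)
```

Decryption, performed after successful identity verification, simply subtracts the key codes modulo 27, reversing step 3).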
The storage subsystem includes a storage module and a disaster recovery module. The storage module authenticates the nodes in the network where information is to be stored and establishes a trust relationship for the stored information. Based on the data distributed in the distributed environment, it packages and stores feature data, and uses a composite feature index structure to obtain faster query speeds for both text and numeric data. The disaster recovery module restores data when it is lost or destroyed.
On the basis of traditional indexing, the storage module partitions the data attribute keys and attribute values in the data feature set and builds a two-layer feature index structure. First, a high-level index is built for the attributes of the data in the feature set. Then a feature index is constructed for the key values under each high-level feature attribute: an R-tree feature index structure for numeric data, and an inverted feature index for text data. A range query over numeric data jumps directly to the low-level tree feature index, reducing data query time and cost.
The high-level tree feature index is built for the feature attributes contained in the data feature set. The concrete feature attributes of the data in this index layer are stored entirely in non-leaf objects, while every leaf object of the R-tree stores three pieces of information, Ai, Pcat, and Psi: (1) Ai is a concrete feature attribute of the indexed data feature set, where n is the number of feature attributes and i ∈ [1, n]; (2) Pcat denotes the pointer type; (3) Psi is a pointer to the low-level feature index, which, depending on the data type, points to a different feature index structure, i.e. to the header of an inverted document table or to the root node of an R-tree.
The low-level feature index is constructed for the key values under the high-level feature attributes, and comprises R-tree feature index structures built for numeric data and inverted document table feature indexes built for text data. The actual key values are stored in the non-leaf objects of the R-tree feature index structure, while the leaf objects are arranged in order and contain three pieces of feature-index-file information, RS, Pos, and Fileid: (1) RS is the S-th attribute key value of the R-th feature attribute key, with R ∈ [1, n2] and S ∈ [1, P], where n2 is the number of numeric feature attributes in the data feature set and P is the number of feature values of the R-th attribute key; (2) Pos is the location of the file containing this attribute value; (3) Fileid is the ID of the file containing the query feature word.
The inverted feature index has two parts: the first part is the feature index table composed of the different index terms, recording the distinct text keywords and their related information; the second part records, for each index term that occurs, the set of documents containing it and their storage addresses. The inverted feature index structure holds four pieces of information, Aij, Fileid, Pos, and Freq: (1) Aij is the j-th feature attribute key value of the i-th feature attribute key, with i ∈ [1, n1] and j ∈ [1, m], where n1 is the number of text attributes and m is the number of attribute values of the i-th attribute key; (2) Fileid is the unique ID of the file containing the query feature word; (3) Pos is the position of the file containing the query feature word; (4) Freq is the frequency with which the query feature word occurs in the data feature set. The feature index is created as follows:
Step 1: analyze the data for which a feature index is to be built; if the current data is not present in the indexes already built, create a new feature index object in the high level of the composite feature index;
Step 2: determine the type of the new data's feature attribute value; if it is numeric, create an R-tree feature index for it; if it is a text attribute, build an inverted feature index structure for it;
Step 3: repeat step 1; if the current attribute already exists in a previously built feature index, add no new object to the high level of the feature index and only add the data of that attribute to the corresponding low-level feature index;
Step 4: repeat the above steps until feature indexes have been built for all the data.
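The two-layer construction in steps 1-4 can be sketched in Python. The stand-ins are deliberate simplifications: a sorted list replaces the R-tree for numeric attributes, a plain term-to-document map replaces the inverted document table, and the class, method, and attribute names are illustrative, not from the patent:

```python
import bisect

class FeatureIndex:
    """Two-layer composite index: a high-level map from attribute key to a
    low-level index chosen by value type. A sorted list stands in for the
    R-tree (numeric); a term -> doc-id-set map stands in for the inverted
    document table (text). Doc ids are assumed numeric."""
    def __init__(self):
        self.high = {}  # attribute key -> ("numeric" | "text", low-level index)

    def add(self, doc_id, attr, value):
        if isinstance(value, (int, float)):
            _, idx = self.high.setdefault(attr, ("numeric", []))
            bisect.insort(idx, (value, doc_id))
        else:
            _, idx = self.high.setdefault(attr, ("text", {}))
            idx.setdefault(value, set()).add(doc_id)

    def range_query(self, attr, lo, hi):
        # a numeric range query jumps straight to the low-level index
        _, idx = self.high[attr]
        left = bisect.bisect_left(idx, (lo,))
        right = bisect.bisect_right(idx, (hi, float("inf")))
        return [doc for _, doc in idx[left:right]]

    def term_query(self, attr, term):
        _, idx = self.high[attr]
        return idx.get(term, set())
```

Adding a value for an attribute already present only touches the low-level index, matching step 3.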
During index retrieval, the query condition is first analyzed to obtain feature words, and each query feature word is submitted to the index dictionary. If the index flag is false, a null value is returned, indicating that the feature data sought does not exist in the index file; if it is true, the data type of the query word's result is determined, the corresponding feature index is located according to that type, and the ID of the feature word and the number of documents containing it are read to obtain the information related to the query condition. The contents of the R-tree feature index or the inverted index are then read by feature-word ID, the retrieved contents are integrated, a relevance comparison against the search condition is performed, and the ranked results are returned to the user. The key value key_id in the feature table is the input of the search algorithm and a Boolean is its output; the detailed procedure is as follows:
(1) with root, key_id, and the level number level as input parameters, call the search function lookup(root, key_id, level) and assign the search result to the leaf node record;
(2) if the leaf node record is empty, return a null value directly; otherwise, return the real search result rid;
The search function lookup takes the current block as its input, key as the search keyword, and level as the initial level number, and outputs the leaf record that may contain the search keyword key; the detailed procedure is as follows:
(3.1) if the current node is a leaf node, search for the key with the binary search algorithm and return the search result.
(3.2) if the current block is not a leaf node, perform steps (3.3) to (3.6).
(3.3) from the current block and the key value, select the subtree containing the key value and obtain the block number of the child node.
(3.4) read the child node block it contains into the buffer according to the block number.
(3.5) if the child node block found is a leaf node, return to (3.1).
(3.6) if the child node block is a branch block, take the child node block, key, and level minus 1 as the new input and call the function recursively to return the output result.
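Steps (1)-(3.6) amount to a recursive descent over a B-tree-like structure with binary search at the leaves. A minimal Python sketch, assuming nodes are tuples and omitting the buffer reads and the level counter; the node layout is illustrative:

```python
import bisect

def lookup(node, key):
    """Recursive search: branch nodes route by key; leaf nodes use binary
    search over sorted (key, rid) entries, returning rid or None."""
    if node[0] == "leaf":
        entries = node[1]                       # sorted [(key, rid), ...]
        i = bisect.bisect_left(entries, (key,))  # step (3.1): binary search
        if i < len(entries) and entries[i][0] == key:
            return entries[i][1]
        return None
    _, keys, children = node
    # steps (3.3)-(3.6): select the subtree that can contain the key, recurse
    return lookup(children[bisect.bisect_right(keys, key)], key)
```

A branch node `("branch", keys, children)` with k router keys has k+1 children; `bisect_right` picks the child whose key range covers the search key.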
The feature classification subsystem uses a clustering method to manage the classification of feature data. The present invention determines the number of classes by the following method. First, the estimation density is defined as:
ps(k) = min over 1 <= j <= k of [ 1 / (n_kj (n_kj - 1)) ] Σ over i ≠ i′ ∈ A_kj of D[C(X_tr, k), X_te]_ii′
where X_tr and X_te denote the feature training set and feature test set obtained by randomly dividing the original data; C(X_tr, k) denotes clustering of the feature training set into k classes; A_k1, A_k2, ..., A_kk are the k classes into which the feature test set itself is clustered; i and i′ are sample points in the same class; n_kj is the number of sample points in A_kj; the element of D[C(X_tr, k), X_te] at row i and column i′ takes the value 0 or 1, the value 1 meaning that the pair i, i′ is placed in the same class by the training-set clustering and 0 that it is not; and ps(k) is the estimation density of the clustering result when the number of classes is k.
The estimation density is calculated as follows:
(1) randomly divide the original data to be clustered into a feature training set and a feature test set;
(2) cluster both subsets with the number of classes set to k; the result is recorded as the type-I clustering;
(3) assign the feature test set according to the clustering result of the feature training set; the result is recorded as the type-II clustering;
(4) for the k-th class formed by the test set itself, examine whether any pair of sample points i and i′ is wrongly split into different classes by the type-II clustering, and record the proportion of correctly grouped pairs;
(5) the smallest of the k proportions is the estimation density for the current number of classes k.
With the estimation density as the objective function, the number of classes and the variable subset are the factors that influence it; the estimation density is maximized by selecting a suitable number of classes and variable subset.
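Steps (1)-(5) can be sketched as follows, assuming the training-set clustering is already available as an assignment function and the test set's own clustering as a dict of clusters; the names (`assign_train`, `test_clusters`) are illustrative, not from the patent:

```python
def prediction_strength(assign_train, test_clusters):
    """Estimation density: the worst per-cluster proportion of ordered
    test-point pairs that the training-set clustering also keeps together.

    assign_train: maps a test point to its training-cluster id
    test_clusters: dict cluster_id -> list of test points
    """
    worst = 1.0
    for points in test_clusters.values():
        n = len(points)
        if n < 2:
            continue  # a singleton cluster contributes no pairs
        same = sum(
            1
            for a in range(n)
            for b in range(n)
            if a != b and assign_train(points[a]) == assign_train(points[b])
        )
        worst = min(worst, same / (n * (n - 1)))
    return worst
```

A test cluster whose members scatter across several training clusters drags the minimum down, signaling that k is too large (or the clustering unstable).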
To address the drop in probe-phase efficiency caused by unevenly distributed feature fields, the present invention uses a hash join algorithm for MapReduce data queries based on feature-field storage, so that the fields are evenly distributed across the nodes of the MapReduce distributed environment, improving data processing efficiency. Query projection operations are turned into operations on the feature fields of each node, reducing the I/O waste that repeated accesses bring with them. In feature-field storage, the target object is pushed down to a specific feature field, and each feature field is equivalent to a small table composed of (lineid, value) pairs.
To solve the data imbalance problem, a new file format, HMF, is first designed at the bottom layer of the MapReduce distributed computing framework, so that for a user's HMF file set O(F) = {f1, f2, ..., fn}, the current node set is P = {γ1, γ2, ..., γx} and the corresponding virtual machine node set is Λ = {v(γ1), v(γ2), ..., v(γx)}, where v(γi) denotes the mapping between a virtual machine node and a real compute node.
The parallel computation based on the hash algorithm is realized in the following steps:
Step 1: organize the entire hash value space into a virtual ring joined end to end;
Step 2: hash each compute node with its network address as the keyword; each node thereby determines its position in the hash space;
Step 3: map each HMF file to a value in the hash space with the hash function and, moving along the ring from that value, take the first node encountered as its processing node;
Step 4: in the Map stage, when a node is looked up for an HMF file, what is found is a virtual machine node; the corresponding real compute node is then found from the virtual compute node, and each cluster is mapped onto one node;
Step 5: in the probe phase, collect the load data of each node; once an imbalance is found, the clusters mapped to that node are reassigned to new nodes, the number of new nodes being determined by the load; the original node's resources are reclaimed after replacement for redistribution;
Step 6: after the hash join of each node completes, the data of the new nodes and the original nodes are handed together to the Reduce stage for merging, and the query result is finally output.
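Steps 1-4 describe consistent hashing. A small Python sketch, assuming MD5 as the ring hash and one ring position per node; the virtual-node indirection and the load rebalancing of steps 4-5 are omitted:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring: nodes and items share one hash space;
    an item is served by the first node encountered after its hash."""
    def __init__(self, nodes):
        self.ring = sorted((self._h(n), n) for n in nodes)
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, item):
        # first node at or after the item's hash, wrapping around the ring
        i = bisect.bisect(self.keys, self._h(item)) % len(self.ring)
        return self.ring[i][1]
```

The key property is the one step 5 relies on: removing (or adding) a node remaps only the items that hashed to that node, leaving every other assignment untouched.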
In the hash join there are two relations R and S, whose numbers of tuples are T_R and T_S respectively, with T_R > T_S. A hash function maps the initial partition of S into B clusters, numbered 1, 2, ..., B; the common attributes of R and S are A1, A2, ..., Ak. For the join attribute Ai component of the relation R distributed over m nodes, the hash operation determines in which of the B clusters it matches.
The optimization of the above hash join is divided into two stages: build and probe.
In the build stage, the Map task of each node selects one table as the hash-join base table, takes the join attribute participating in the join operation as the hash key in order to build the hash table, reads the join attribute fields of the base table from the HMF file system into the node memory of the MapReduce distributed system, and then applies the hash function to all the key values of the join field.
The hash-processed join columns of the base table are stored, together with the data, in an area of memory set aside specifically for such data. Then, according to the different hash function values, the base table is partitioned into clusters. Each cluster contains all the base table data with the same hash function value.
In the probe stage, the base table on each node serves as the fact table of the hash join. The data fields to be joined are read in batches from HMF, the hash operation is applied to the join attribute field to determine which cluster it falls in, and the hash search algorithm is used to locate the appropriate cluster.
Within the located cluster, the qualifying line numbers are obtained by exact matching. The qualifying line numbers of each node are merged by Reduce, the query columns involved in the SQL statement are read from the HMF file system, and the query result is finally output.
Because the search range of the algorithm has been reduced, the success rate and accuracy of matching are high. The matching operations are carried out in memory, which is much faster than merge sorting, achieving the purpose of the optimization.
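The build and probe stages reduce, per node, to an in-memory hash join. A minimal Python sketch over lists of dict rows; the HMF batch reads and the Reduce-side merge are omitted, and the row format is illustrative:

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    # build stage: hash the base (smaller) table on its join attribute
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    # probe stage: stream the fact table; exact matching happens only
    # inside the one bucket the hash value selects
    out = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})
    return out
```

This is why the text can claim a reduced search range: each probe row is compared only against the bucket sharing its hash value, never against the whole base table.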
To reduce the influence of chance, it is preferable first to divide the data feature set randomly into several equal parts, take each part in turn as the feature test set, obtain the respective estimation densities, and then take their average as the estimation density for that number of classes. The hierarchical clustering method based on the improved estimation density yields cluster results that are credible and practically meaningful, and is better suited to cluster analysis of data than conventional clustering methods.
The feature mining subsystem searches, in the verified secure cloud environment, the data providers scattered throughout the cloud platform at the data layer for the feature data that meets the application demand, and forms the feature data to be processed through aggregation, analysis, and arrangement. In the modeling phase it models the compute nodes of the storage cluster used in the distributed environment; feature data is shared between the non-local nodes, and the feature data meeting the application demand is searched for and matched. Let xi be a node in the storage cluster, {xi1, xi2, ..., xim} the non-local node set of xi, PLi the local resource pool, and PNi the data pool of the non-local nodes, with i ∈ [1, n], where n is the total number of nodes in the storage cluster and m is the number of non-local nodes, m < n.
Data sharing is based on the following protocol between non-local nodes: when xi joins the P2P network, it establishes connections between xi and {xi1, xi2, ..., xim}; xi then creates shared feature data according to the service information in PLi and forwards the shared feature data to all non-local nodes xim for sharing. When any node in the storage cluster receives a piece of shared feature data, it judges from the ID information of the shared feature data whether it has already received it; if it has, it discards the shared feature data; if it is receiving it for the first time, it updates the contents of PNi according to the data and node location information of the shared feature data, and decides to forward or discard the shared feature data according to its validity flag. The data between non-local nodes needs to be synchronized periodically.
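The receive-side handling (duplicate suppression by message ID, pool update, forward-or-discard by validity flag) can be sketched as follows, with the validity flag modeled as a decrementing `ttl` field; all field and variable names are illustrative, not from the patent:

```python
def handle_share(state, msg):
    """Process one shared-feature message at a receiving node: drop
    duplicates by id, update the non-local data pool, then forward
    while the validity (ttl) flag lasts."""
    if msg["id"] in state["seen"]:
        return "drop"                               # already received: discard
    state["seen"].add(msg["id"])
    state["pool"][msg["origin"]] = msg["data"]      # update PNi contents
    if msg["ttl"] <= 0:
        return "drop"                               # validity exhausted
    msg["ttl"] -= 1                                 # decrement before forwarding
    return "forward"
```

The `seen` set plays the role of the ID check in the protocol; the periodic synchronization between non-local nodes is outside this sketch.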
During resource search the operations performed are: let the node initiating sharing request Mj be xj; from the non-local node set of xj, a set of nodes is picked at random with probability pj, namely pj × {xj1, xj2, ..., xjm}, j ∈ [1, n]. When node xi receives the sharing request Mj sent by xj, it checks whether PNi and PLi contain feature data that satisfies Mj. If so, it creates a query response message from the feature data and the location information of the node holding the data, returns the response message to xj according to xj's location information, and then decrements xj's validity flag by 1; if xj's validity flag is 0, it discards Mj; if it is not 0, it computes the expected value of each node in pj × {xj1, xj2, ..., xjm} with the EM algorithm and forwards Mj to the node in pj × {xj1, xj2, ..., xjm} with the largest expected value. The expected value is calculated as:
E_new = E_old + α E_learn + β × I[N_xjμ(t) (T_xjμ − T′_xjμ) / (T_xjμ × T′_xjμ)] × N_xjμ(t) / T_xjμ
where E_new is the new value of E, E_old the old value of E, E_learn the learned value, α the learning rate, and β the congestion factor; N_xjμ(t) is the number of pending sharing request messages in the buffer queue of node x_jμ at time t; T′_xjμ is the agreed time for a node x_jμ in pj × {xj1, xj2, ..., xjm} to process one sharing request message, and T_xjμ is the time actually required by node x_jμ in pj × {xj1, xj2, ..., xjm} to process one sharing request message; the function I[x] takes the value 1 when x > 0 and 0 when x ≤ 0.
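A direct transcription of the update rule, assuming scalar inputs; `t_nominal` stands for the agreed processing time T′ and `t_actual` for the measured time T (parameter names are illustrative):

```python
def expected_value(e_old, e_learn, alpha, beta, n_pending, t_actual, t_nominal):
    """E_new = E_old + alpha*E_learn
             + beta * I[N(t)(T - T') / (T*T')] * N(t) / T
    where I[x] = 1 if x > 0 else 0 (congestion indicator)."""
    term = n_pending * (t_actual - t_nominal) / (t_actual * t_nominal)
    indicator = 1 if term > 0 else 0
    return e_old + alpha * e_learn + beta * indicator * n_pending / t_actual
```

The indicator fires only when the node is processing requests slower than agreed (T > T′), so the congestion penalty term contributes nothing for nodes meeting their deadline.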
The task scheduling subsystem schedules the data processing computation: it splits a complex data processing computation task into a group of functionally single and independent subtasks, matches each subtask with a cloud service resource pool that meets its demand, and forms a service composition scheme, thereby obtaining the storage or computing resources needed by the data processing. The service composition schemes are estimated according to the task schedule of the generated data services:
(1) according to the cloud service resource pool SPv and the corresponding service quality history, model the efficiency function X of CSγ and initialize each parameter of the efficiency function in the model from the application instance. Let the constraints corresponding to the task schedule be as specified, with the corresponding QoS constraints C = {C1, C2, ..., Cd}; each subtask Gv corresponds to a resource pool SPv holding mv services; for each service SPvω in the cloud service resource pool SPv, the number of its history records is L; the γ-th feasible service composition scheme formed from SPv is CSγ, ω ∈ [1, mv]. The service model is defined in terms of the following quantities:
where QoS_max(k) is the maximum service quality value of the k-th dimension, QoS_min(k) the minimum service quality value of the k-th dimension, d the largest dimension, q_d() the objective function, SPRvω-h the h-th history record belonging to SPvω, and x_vω-h the parameter of the efficiency function in the model;
(2) sort the feasible service composition schemes by efficiency function value from small to large and select the first Z as the preferred service composition schemes, the value of Z being set according to the application instance;
(3) for each group of preferred service composition schemes, calculate the average of their efficiency function values;
(4) select the group of preferred service composition schemes with the largest average efficiency function value as the optimal service composition scheme;
The efficiency function values of the preferred service composition schemes and the optimal service composition scheme are recorded and learned as samples, so that if a previously seen preferred service composition scheme occurs again, its function value is invoked directly.
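Steps (3)-(4) plus the sample-reuse rule can be sketched as follows, assuming an efficiency function supplied by the caller and schemes represented as hashable tuples; the memoizing wrapper stands in for "invoking directly" a previously learned function value, and all names are illustrative:

```python
def make_scorer(efficiency):
    """Cache efficiency-function values so a previously seen scheme's
    value is returned directly instead of being recomputed."""
    cache = {}
    def score(scheme):
        if scheme not in cache:
            cache[scheme] = efficiency(scheme)
        return cache[scheme]
    return score

def select_optimal(groups, score):
    """Pick, among groups of preferred composition schemes, the group
    whose average efficiency value is largest (steps (3)-(4))."""
    return max(groups, key=lambda g: sum(score(s) for s in g) / len(g))
```

Because `score` is shared across calls, re-estimating an overlapping set of schemes costs one lookup per repeated scheme rather than one model evaluation.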
Taking network topic data mining as an example, on the basis of the constructed index structure, the present invention uses the difference in the contribution of each training feature set to the construction of the sample space when the training feature set expresses the test feature set, and constructs a new sparse representation dictionary from the matrix σ(Ic)Vc, where σ(Ic) is each class of training feature set and Vc is the dictionary contribution value matrix. A category-set bound term is added to the constraints of the sparse representation so that samples of the same category cluster together in a space of smaller total size, effectively mining the hidden features of complex data. The network topic mining method of the present invention based on big data comprises the following steps:
Step 1:Using backward neural network to topic Text Feature Extraction topic feature.
Step 2:Training characteristics collection is inputted, using the article sample training classified lexicon for including C type, training characteristics Collection space is represented with I, is expressed as I=[I1, I2..., Ic..., IC]∈RD×N, D represents the characteristic dimension of training characteristics collection, and N is Training characteristics collection total number, IiIt represents the i-th class sample, defines NiIt represents per class training characteristics collection quantity, then N=N1+N2+ ...+Nc +…+NC
Step 3:Regularization is carried out to training characteristics collection, obtains the training characteristics collection collection I of regularization;
Step 4:Its dictionary is respectively trained to every a kind of training characteristics collection, the process of training dictionary is:
1st, c class samples I is taken outc, by IcIt is mapped to kernel space σ (Ic);
2. Training the sparse coding dictionary σ(Ic)Vc must satisfy a constraint condition, whose optimization function is:

f(Vc, Sc) = ||σ(Ic) - σ(Ic)VcSc||² + 2αΣn||sn||₁ + δΣm||s̃m||²

In the formula, α is the constraint factor of the sparsity constraint in the sparse coding, δ is the constraint factor of the grouped-accumulation constraint on the coding dictionary for Ic, and Sc is the feature matrix of the c-th class of kernel-space training features, whose m-th row s̃m represents the contribution of the kernel-space samples to each dictionary entry. The dictionary is Bc = σ(Ic)Vc, and σ denotes the mapping of samples into kernel space.
3. Solve the optimization function of the constraint condition in step 2. First initialize Vc and Sc by randomly generating two matrices, where Vc is an Nc × K matrix, Sc is a K × Nc matrix, and K is the dictionary size. Then alternately iterate updates of Vc and Sc, seeking the optimal contribution matrix Vc and feature matrix Sc that minimize the optimization function value. The contribution matrices Vc of all classes are assembled into one matrix, giving the contribution matrix V, which is the classified dictionary. The specific solution procedure is:
(1) Fix Vc and update Sc. Substituting the fixed Vc into the optimization function of the constraint condition, the optimization function becomes:

f(Sc) = ||σ(Ic) - σ(Ic)VcSc||² + 2αΣn||sn||₁ + δΣm||s̃m||²

Update each element of the matrix Sc so as to make the optimization function optimal; that is, for the element in row k, column n of Sc, solve for the optimal feature matrix Sc.
(2) Fix the feature matrix Sc obtained above and update the contribution matrix Vc; the optimization function becomes:

f(Vc) = ||σ(Ic) - σ(Ic)VcSc||²
Update each column of the contribution matrix Vc in turn; while one column is being updated, the remaining columns are held at fixed values. Traverse every column of Vc to update its contribution values.
(3) Iterate steps (1) and (2) to update Sc and Vc; when the value of the optimization function f(Vc, Sc) stabilizes, the update is finished.
(4) Train the feature matrix Sc and contribution matrix Vc of each class of the training feature set in turn.
(5) Combine the contribution matrices Vc of all classes to obtain the contribution matrix V of dimension N × (C × K), which serves as the classified dictionary.
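A minimal sketch of the alternating optimization in sub-steps (1)-(3), under simplifying assumptions the patent leaves open: a linear kernel (so σ(Ic) = Ic), an ISTA proximal step for the Sc update, a least-squares step for the Vc update, and the group term δ omitted; all parameter values are arbitrary:

```python
import numpy as np

def soft(X, t):
    # Elementwise soft-thresholding: the proximal operator of the l1 penalty.
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def objective(Ic, Vc, Sc, alpha):
    # ||Ic - Ic Vc Sc||^2 + 2*alpha*|Sc|_1  (linear-kernel form, delta = 0)
    return (np.linalg.norm(Ic - Ic @ Vc @ Sc) ** 2
            + 2 * alpha * np.abs(Sc).sum())

def train_class_dictionary(Ic, K=4, alpha=0.05, outer=20, inner=5, seed=0):
    rng = np.random.default_rng(seed)
    Nc = Ic.shape[1]
    Vc = rng.normal(size=(Nc, K))    # contribution matrix, Nc x K
    Sc = rng.normal(size=(K, Nc))    # feature (code) matrix, K x Nc
    for _ in range(outer):
        B = Ic @ Vc                  # current dictionary B_c = Ic Vc
        # (1) fix Vc, update Sc by ISTA steps on ||Ic - B Sc||^2 + 2a|Sc|_1
        t = 1.0 / (2 * np.linalg.norm(B, 2) ** 2 + 1e-12)
        for _ in range(inner):
            Sc = soft(Sc - t * 2 * B.T @ (B @ Sc - Ic), 2 * alpha * t)
        # (2) fix Sc, update Vc by least squares so that Ic Vc ~ Ic pinv(Sc)
        Vc = np.linalg.pinv(Ic) @ Ic @ np.linalg.pinv(Sc)
    # (3) iterate until the objective stabilizes (fixed iteration count here)
    return Vc, Sc
```

Per sub-step (5), the per-class Vc blocks would then be assembled into the overall contribution matrix V.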
Step 5: Recognize the text. The steps are:
(1) Extract the text features of the test feature set to be recognized using the back-propagation neural network, and define y as the topic features of the test sample.
(2) Using the obtained contribution matrix V, predict the test feature set's text feature σ(y). The obtained prediction function is:

F(s) = ||σ(y) - σ(I)Vs||² + 2α||s||₁

In the formula, s represents the sparse code of the test feature σ(y), and σ(I) represents the mapping of the training feature set I in kernel space.
(3) Compute the prediction error of the kernel-space feature σ(y) against the sample space formed by each class of samples, denoted r(c):

r(c) = ||σ(y) - σ(Ic)Vcsc||²

where sc is the part of the sparse code s associated with class c.
(4) Compare the prediction errors of σ(y) across all classes; the text to be recognized belongs to the class with the minimum prediction error.
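Under a linear-kernel simplification (σ(Ic) ≈ Ic, so the class dictionary is Bc = IcVc), the decision rule above assigns y to the class whose dictionary reconstructs it with the smallest residual. A toy example with two hypothetical one-atom dictionaries:

```python
import numpy as np

def class_residuals(y, dictionaries):
    # r(c) = ||y - B_c s_c||^2, with s_c the least-squares code of y
    # in class c's dictionary B_c (linear-kernel stand-in for sigma).
    r = []
    for Bc in dictionaries:
        sc = np.linalg.lstsq(Bc, y, rcond=None)[0]
        r.append(float(np.linalg.norm(y - Bc @ sc) ** 2))
    return r

def classify(y, dictionaries):
    # The text belongs to the class with the minimum prediction error.
    return int(np.argmin(class_residuals(y, dictionaries)))

# Toy dictionaries: class 0 spans the x-axis, class 1 the y-axis.
B0 = np.array([[1.0], [0.0], [0.0]])
B1 = np.array([[0.0], [1.0], [0.0]])
y = np.array([0.1, 2.0, 0.0])       # lies mostly along the y-axis
label = classify(y, [B0, B1])       # -> 1
```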
In conclusion, the present invention proposes a data feature statistical analysis method which, for data nodes in a distributed environment, lets users access data conveniently by matching service description information, improves the efficiency of data mining, and provides a feasible scheme for developing and building data services by renting computing resources or storage resources provided by the cloud.
Obviously, those skilled in the art should appreciate that each module or each step of the present invention described above may be implemented with a general-purpose computing system; they may be concentrated on a single computing system or distributed over a network formed by multiple computing systems; optionally, they may be implemented with program code executable by a computing system, and may thus be stored in a storage system and executed by a computing system. The present invention is therefore not limited to any specific combination of hardware and software.
It should be understood that the above specific embodiments of the present invention are used only to exemplify or explain the principle of the present invention and are not to be construed as limiting the present invention. Therefore, any modification, equivalent replacement, improvement, and the like made without departing from the spirit and scope of the present invention shall be included within the protection scope of the present invention. Furthermore, the appended claims of the present invention are intended to cover all variations and modifications falling within the scope and boundary of the claims, or equivalent forms of such scope and boundary.

Claims (7)

1. A data feature statistical analysis method, characterized by comprising:
dividing the data attribute keys and attribute values in a data feature set; and
constructing a two-layer feature index from the divided data attribute keys and attribute values.
2. The method according to claim 1, characterized in that, for the feature attributes contained in the data feature set, a tree structure is used to construct the feature index.
3. The method according to claim 1, characterized in that, when constructing a feature index for the data attribute values contained in the data feature set, a corresponding index structure is selected according to the data type of the data attribute.
4. The method according to claim 3, characterized in that:
if the data attribute is numeric data, an R-tree feature index is built; the specific feature attributes of the data are all stored in non-leaf objects of the index, and a range query over numeric data is completed by directly locating the low-level tree feature index.
5. The method according to claim 4, characterized in that three pieces of information, Ai, Pcat, and Psi, are stored in all leaf objects of the tree structure, with the following meanings: (1) Ai is a specific feature attribute of the indexed data feature set, where n is the number of all feature attributes and i ∈ [1, n]; (2) Pcat indicates the pointer type; (3) Psi is a pointer to the low-level feature index.
6. The method according to claim 2, characterized in that:
if the data attribute is text data, an inverted feature index is built; the inverted feature index is divided into two parts: the first part is a feature index table composed of different index words, which records the different text keywords and their related information; the second part records, for each index word that occurs, the set of documents containing it and their storage addresses.
7. The method according to claim 6, characterized in that, during index retrieval, the query condition is first analyzed to obtain feature words; a query feature word is handed to the index dictionary, and if the index flag bit is false, a null value is returned to indicate that the feature data to be queried do not exist in the index file; if it is true, the data type of the result returned for the query word is judged, the corresponding feature index is located according to that type, and the ID of the feature word and the number of documents containing it are read, thereby obtaining the related information of the query condition; the content in the R-tree feature index or the inverted index is then read according to the feature word ID, the retrieved content is integrated and compared for relevance with the search condition, and the ranked final result is returned to the user.
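Claim 7's retrieval flow can be sketched with ordinary Python containers standing in for the index structures; all names and data below are hypothetical: a dict for the index dictionary, a sorted list of (value, doc_id) pairs in place of the R-tree feature index, and a dict of posting lists in place of the inverted feature index.

```python
import bisect

# Index dictionary: word -> (data type, feature-word ID); a missing entry
# plays the role of the "flag bit false" case in claim 7.
index_dict = {"price": ("numeric", 1), "apple": ("text", 2)}
numeric_index = {1: [(5, "d1"), (9, "d2"), (12, "d3")]}   # R-tree stand-in
inverted_index = {2: ["d1", "d3"]}                        # posting lists

def query(word, lo=None, hi=None):
    entry = index_dict.get(word)
    if entry is None:            # flag bit false: not in the index file
        return None
    kind, fid = entry            # data type and feature-word ID
    if kind == "numeric":        # numeric type: range query on the tree index
        pairs = numeric_index[fid]
        i = bisect.bisect_left(pairs, (lo, ""))
        j = bisect.bisect_right(pairs, (hi, "\uffff"))
        return [doc for _, doc in pairs[i:j]]
    return inverted_index[fid]   # text type: read the inverted index
```

For example, `query("price", lo=6, hi=12)` locates the numeric index and returns the documents whose value lies in [6, 12].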
CN201810060531.8A 2018-01-22 2018-01-22 Data characteristics statistical analysis technique Pending CN108256086A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810060531.8A CN108256086A (en) 2018-01-22 2018-01-22 Data characteristics statistical analysis technique

Publications (1)

Publication Number Publication Date
CN108256086A true CN108256086A (en) 2018-07-06

Family

ID=62742075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810060531.8A Pending CN108256086A (en) 2018-01-22 2018-01-22 Data characteristics statistical analysis technique

Country Status (1)

Country Link
CN (1) CN108256086A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033954A (en) * 2010-12-24 2011-04-27 东北大学 Full text retrieval inquiry index method for extensible markup language document in relational database
CN106484813A (en) * 2016-09-23 2017-03-08 广东港鑫科技有限公司 A kind of big data analysis system and method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021213127A1 (en) * 2020-04-21 2021-10-28 International Business Machines Corporation Cached updatable top-k index
US11327980B2 (en) 2020-04-21 2022-05-10 International Business Machines Corporation Cached updatable top-k index
GB2610108A (en) * 2020-04-21 2023-02-22 Ibm Cached updatable top-k index
CN112905591A (en) * 2021-02-04 2021-06-04 成都信息工程大学 Data table connection sequence selection method based on machine learning

Similar Documents

Publication Publication Date Title
CN106649455B (en) Standardized system classification and command set system for big data development
KR100816934B1 (en) Clustering system and method using search result document
US6738759B1 (en) System and method for performing similarity searching using pointer optimization
US20080270374A1 (en) Method and system for combining ranking and clustering in a database management system
JP2017037648A (en) Hybrid data storage system, method, and program for storing hybrid data
CN110209808A (en) A kind of event generation method and relevant apparatus based on text information
CN106815310A (en) A kind of hierarchy clustering method and system to magnanimity document sets
CN106095951B (en) Data space multi-dimensional indexing method based on load balancing and inquiry log
CN104112005B (en) Distributed mass fingerprint identification method
CN110390352A (en) A kind of dark data value appraisal procedure of image based on similitude Hash
KR102046692B1 (en) Method and System for Entity summarization based on multilingual projected entity space
Motwani et al. A study on initial centroids selection for partitional clustering algorithms
CN109241278A (en) Scientific research knowledge management method and system
CN113222181A (en) Federated learning method facing k-means clustering algorithm
CN108228787A (en) According to the method and apparatus of multistage classification processing information
Rahman et al. Seed-Detective: A Novel Clustering Technique Using High Quality Seed for K-Means on Categorical and Numerical Attributes.
Abdelli et al. A novel and efficient index based web service discovery approach
CN108256086A (en) Data characteristics statistical analysis technique
CN108256083A (en) Content recommendation method based on deep learning
CN108280176A (en) Data mining optimization method based on MapReduce
WO2022156086A1 (en) Human computer interaction method, apparatus and device, and storage medium
Prasanth et al. Effective big data retrieval using deep learning modified neural networks
Al Aghbari et al. Geosimmr: A mapreduce algorithm for detecting communities based on distance and interest in social networks
CN107844536A (en) The methods, devices and systems of application program selection
CN109446408A (en) Retrieve method, apparatus, equipment and the computer readable storage medium of set of metadata of similar data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180706