CN105843896A - Redundant source synergistic reducing method of multi-source heterogeneous big data - Google Patents


Info

Publication number
CN105843896A
CN105843896A (application CN201610166631.XA)
Authority
CN
China
Prior art keywords
source
data
matrix
redundancy
redundant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610166631.XA
Other languages
Chinese (zh)
Inventor
张磊
王树鹏
云晓春
Current Assignee
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201610166631.XA priority Critical patent/CN105843896A/en
Publication of CN105843896A publication Critical patent/CN105843896A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a collaborative redundancy-source reduction method for multi-source heterogeneous big data. The method comprises two models: an HMSL (Heterogeneous Manifold Smoothness Learning) model and a CMRR (Correlation-based Multi-view Redundancy Reduction) model. The HMSL model linearly projects multi-source heterogeneous data into a low-dimensional homogeneous feature space in which information-correlated descriptions are closer in manifold distance and semantically complementary samples are closer in Euclidean distance. In the homogeneous feature space learned by the HMSL model, the CMRR model then eliminates the three-way redundancies and double-level heterogeneities of multi-source redundant data by means of a generalized elementary transformation constraint based on a gradient-energy competition strategy. The method thereby eliminates the three-way redundancies and double-level heterogeneities of multi-source redundant data and reduces the redundant sources of multi-source heterogeneous data.

Description

A collaborative redundancy-source reduction method for multi-source heterogeneous big data
Technical field
The invention belongs to the field of information technology. Aimed at the three-way redundancies and double-level heterogeneities that arise in massive multi-source redundant data environments, it proposes a collaborative redundancy-source reduction method for multi-source heterogeneous big data.
Background technology
In recent years, with the emergence of large numbers of high-tech digital products, the multi-source heterogeneous data (Multi-source Heterogeneous Data) produced by these heterogeneous electronic devices have spread into every corner of daily life. So-called multi-source heterogeneous data are data that come from different sources or channels yet express similar content, appearing in multiple patterns such as different forms, different modalities, different viewing angles, and different backgrounds. For example, Sina Weibo, Tencent WeChat, and Sohu publish differently formatted reports about the same news; the brain of an Alzheimer's patient can be imaged from multiple viewing angles by magnetic resonance imaging (MRI), positron emission tomography (PET), and X-ray; on Wikipedia, a leopard is described in media of different modalities such as pictures, text, and audio; and the same building, the White House, can appear against different backgrounds.
However, owing to inappropriate feature extraction, incorrect data storage, random events, and similar causes, not every pattern description is a concise and efficient reflection of the underlying object, which inevitably leads to the existence of multi-source heterogeneous redundant data. Unlike duplicate data, multi-source redundant data are data that can severely degrade learner performance. Researchers at home and abroad have therefore proposed several de-redundancy methods for multi-source redundant data. These methods fall broadly into two classes: dimension reduction (Dimension Reduction) methods and sample selection (Sample Selection) methods.
In recent years, researchers at home and abroad have devised a variety of multi-source dimension reduction methods, which perform feature selection based on the correlations between low-level features, reduce the dimensionality of multi-source data, remove the redundant features in the data, and save storage space and computation time.
Christoudias et al. proposed an unsupervised multi-view image feature selection algorithm based on distributed coding (Joint Feature Histogram Model, JFHM). The method uses Gaussian process models over the statistics of the different sources to filter the redundancy in each source's data, and obtains a joint encoding of the multi-source data at the receiving end, reducing data dimensionality and improving object recognition accuracy. However, the JFHM algorithm can only handle redundancy in multi-source images. (Reference: C. Mario Christoudias, Raquel Urtasun, Trevor Darrell. Unsupervised feature selection via distributed coding for multi-view object recognition. IEEE International Conference on Computer Vision and Pattern Recognition 2008: 1-8.)
Zhu et al. proposed a multi-modality canonical feature selection (Multi-modality Canonical Feature Selection, MCFS) method, which exploits the correlations between different sources and uses CCA to project the features of each source into an induced canonical space, thereby realizing multi-source canonical feature selection. MCFS incorporates the correlated information between sources into sparse multi-task learning (Sparse Multi-Task Learning). It first uses CCA to obtain the canonical basis vectors of the canonical feature space, then embeds the heterogeneous descriptions of the different sources into this space, and screens canonical features using sparse multi-task learning with canonical regularization. However, the MCFS method cannot itself handle heterogeneous data; it must rely on CCA to perform correlation analysis and eliminate redundancy. (Reference: Xiaofeng Zhu, Heung-Il Suk, Dinggang Shen. Multi-modality Canonical Feature Selection for Alzheimer's Disease Diagnosis. Springer Medical Image Computing and Computer-Assisted Intervention (2) 2014: 162-169.)
Lan and Huan proposed a method for reducing the unlabeled sample complexity of semi-supervised multi-view learning (Reducing the Unlabeled Sample Complexity of Semi-Supervised Multi-view Learning, RUSCSSML). Semi-supervised learning trains a classifier using labeled and unlabeled samples together, and sample complexity (Sample Complexity) is the common measure of training-sample effectiveness. Lan and Huan divided it into labeled sample complexity (Labeled Sample Complexity, LSC) and unlabeled sample complexity (Unlabeled Sample Complexity, USC). Under a relaxed condition, the RUSCSSML method reduces the USC from O(1/ε) to O(log(1/ε)), where ε is the error rate. Lan and Huan proved a theoretical connection between a classifier's generalization error rate and its incompatibility, and showed that in semi-supervised multi-view learning, given a large number of unlabeled samples, a classifier of low incompatibility can be learned. Combining these two proofs, they further established a probably approximately correct (Probably Approximately Correct, PAC) learning bound for semi-supervised multi-view learning. However, the above method cannot obtain the shared description between different sources. (Reference: Chao Lan and Jun Huan. Reducing the Unlabeled Sample Complexity of Semi-Supervised Multi-View Learning. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015: 627-634.)
Guo et al. proposed a low-dimensional-subspace co-regularized multi-view learning (Subspace Co-regularized Multi-view Learning, SCML) method for cross-language text classification. Based on a machine-translated parallel corpus, the method jointly minimizes the training error of the classifier on each source while minimizing the distance between the descriptions in the low-dimensional subspace. Guo et al. regard a document (in the source language (Source Language)) and its translation (in the target language (Target Language)) as two different sources describing the same object. For the same classification task, the descriptions of a given object from different sources should then be similar in the latent shared subspace. Accordingly, the SCML method learns a separate classifier for each language (source) simultaneously and, through a semi-supervised optimization model, minimizes the training loss of the labeled data in all sources while penalizing the subspace distance between the heterogeneous descriptions of each object. However, during reduction the SCML method considers only the correlations between sources and does not exploit the distribution similarity between them, so some important information is inevitably lost. (Reference: Yuhong Guo, Min Xiao. Cross Language Text Classification via Subspace Co-regularized Multi-view Learning. ACM International Conference on Machine Learning 2012.)
Table 1 summarizes the deficiencies of the above multi-source data redundancy reduction methods.
Table 1. Deficiencies of existing multi-source data redundancy reduction methods
To further save storage space and improve learning efficiency and performance, researchers at home and abroad have successively proposed several multi-source sample selection methods, which exploit the relational structure between sources to select a condensed subset from the original data set and thereby improve learner performance.
A multi-view picture set is a group of pictures of the same scene shot by multiple cameras, so complementary information necessarily exists between the pictures from different sources. In multi-view stereo (Multi-View Stereo, MVS) reconstruction, not every picture improves the quality of the reconstructed model, and large numbers of multi-view pictures consume excessive processing time. For this problem, Hornung et al. designed a multi-view image selection (Image Selection for Improved Multi-View Stereo, ISIMVS) scheme that uses predefined quality criteria and the complementary information in the multi-view pictures to screen the most relevant pictures for reconstruction. The ISIMVS method completes MVS reconstruction under the following three predefined quality criteria (Criteria): 1) initial surface proxy (Initial Surface Proxy), i.e., select a condensed subset of the input pictures that both adequately represents the original data set and closely approximates the surface of the unknown object; 2) surface visibility (Surface Visibility), i.e., within the minimal visual range set, the multi-view samples in the selected subset are clearly visible; 3) adaptivity (Adaptivity), i.e., for inconsistent regions in the multi-view pictures, select additional relevant pictures to improve the reconstruction in those regions and increase the reliability of the selected subset. Following these three quality criteria, the sample selection process of ISIMVS proceeds in three steps: first, from the available views, select views that favor fast convergence; second, select pictures from at least two views for each object, to reach sufficient coverage; third, select extra relevant pictures for prominent inconsistent regions, thereby achieving a good MVS reconstruction. However, the ISIMVS method has the defect that it can only reduce multi-view pictures. (Reference: Alexander Hornung, Boyi Zeng, Leif Kobbelt. Image selection for improved multi-view stereo. IEEE International Conference on Computer Vision and Pattern Recognition 2008: 1-8.)
Kitahara et al. proposed a multi-view video coding method using view interpolation and reference picture selection (Multi-view Video Coding using View Interpolation and Reference Picture Selection, MVCVIRPS). The method performs multi-view video coding with motion/disparity compensation based on H.264/AVC. Exploiting the correlations between pictures from different sources, it selects relevant inter-view pictures as reference pictures and compensates disparity by interpolation. MVCVIRPS assumes indices c = 1, 2, ..., C for C different cameras (each coded with H.264/AVC), with a further index denoting the reference camera used by camera c for disparity compensation. During coding, the method simultaneously exploits the temporal (temporal), spatial (spatial), and inter-view (inter-view) correlations between sources, selecting the appropriate index from the reference cameras of camera c to complete inter-view motion/disparity compensation. (Reference: Masaki Kitahara, Hideaki Kimata, Shinya Shimizu, Kazuto Kamikura, Yoshiyuki Yashima, Kenji Yamamoto, Tomohiro Yendo, Toshiaki Fujii, Masayuki Tanimoto. Multi-View Video Coding using View Interpolation and Reference Picture Selection. IEEE International Conference on Multimedia and Expo 2006: 97-100.)
In image matting (Image Matting), color-sampling-based matting methods use color information to pick, for each unknown pixel, matching foreground (Foreground (F)) and background (Background (B)) color samples. However, if the color distributions of the foreground and background regions overlap, color information alone can hardly distinguish these regions, and the selected samples cannot estimate the matte (Matte). Shahrian et al. proposed a content-based sample selection method (Weighted Color and Texture Sample Selection for Image Matting, WCTSSIM), which adds the texture (source) information of the picture to a sample-based color matching process and fills in the unknown pixels' foreground and background colors from the best-matching samples. The WCTSSIM method uses the color (Color) and texture (Texture) features of the picture to select a set of (F, B) candidates and, based on the picture content, automatically determines the weight between the two kinds of features for F and B, then picks the optimal samples from the candidate set. (Reference: Ehsan Shahrian, Deepu Rajan. Weighted Color and Texture Sample Selection for Image Matting. IEEE Transactions on Image Processing 22(11): 4260-4270 (2013).)
However, although the above methods all achieve good results in reducing multi-view image data sets, they are of little help against the redundancy in multi-source data of other forms (e.g., text).
Table 2 summarizes the deficiencies of the above multi-source data sample selection methods.
Table 2. Deficiencies of existing multi-source data sample selection methods
Summary of the invention
With the rapid development of information and storage technology, the scale of data keeps expanding. In practice, however, owing to inappropriate feature extraction, incorrect data storage, random events, and similar causes, not every pattern description is a concise and efficient reflection of the underlying object, which inevitably leads to the existence of multi-source heterogeneous redundant data. Unlike duplicate data, multi-source redundant data are data that can severely degrade learner performance. Moreover, as shown in Fig. 1, the redundant-source problem of multi-source redundant data differs completely from the single-source redundant data problem, because multi-source redundant data contain the following three-way redundancies (Three-way Redundancies):
1) Data representation excessiveness (Data Representations Excessiveness, DRE). This type of redundancy refers to multiple non-duplicate descriptions of the same object within a single source, which occupy a large amount of storage space.
2) Sample feature superabundance (Sample Features Superabundance, SFS). This redundancy, caused by the curse of dimensionality (Curse of Dimensionality), refers to the many correlated or random dimensions embedded in the high-dimensional space, which waste excessive computation time.
3) Complementary relationship overplus (Complementary Relationships Overplus, CRO). This type of redundancy refers to complementary relationships between one pattern description within a source and multiple heterogeneous descriptions in another source. Because such redundancy breaks the one-to-one correspondence between sources, it degrades the performance of learning on multi-source heterogeneous data.
Because of the three-way redundancies, the redundant-source problem also exhibits double-level heterogeneities (Double-level Heterogeneities): feature dimension dissimilarity (Feature Dimension Dissimilarity, FDD) and sample size difference (Sample Size Difference, SSD). First, different sources describe the same object with different dimensionalities and different attributes; second, the number of samples differs from source to source. The three-way redundancies and double-level heterogeneities of multi-source redundant data seriously impair the usefulness of the data, slowing the learning process, wasting storage space, and reducing the generalization ability of the model. A collaborative reduction algorithm for multi-source redundant data can therefore not only save valuable storage space and avoid high computational complexity, but also significantly improve the generalization performance of the learner.
As shown in Fig. 2, multi-source heterogeneous data are subject to complementarity, correlation, and distribution constraints. The complementarity constraint means that the semantics (class labels) conveyed by the heterogeneous descriptions from different sources are consistent; the correlation constraint means that correlated heterogeneous descriptions lie close to one another along the manifold (Manifold), so that the complementary information between sources is fully contained in the multi-source data; unlike the first two, the distribution constraint reflects a high distribution similarity, which draws similar samples from the same source together. The specific purpose of the present invention is to address the redundant-source problem of multi-source heterogeneous data by providing a collaborative redundancy-source reduction method for multi-source heterogeneous big data that exploits the semantic complementarity, information correlation, and distribution similarity between multi-source heterogeneous data. Based on subspace learning and on mining the associations within the existing non-redundant multi-source heterogeneous data, the method collaboratively removes the three-way redundancies and double-level heterogeneities from multiple sources, reduces the data dimensionality, refines the data subset, repairs the one-to-one correspondence between heterogeneous descriptions, and reduces the redundant sources of the multi-source heterogeneous data.
As shown in Fig. 3, the invention provides the basic framework for redundant-source reduction of multi-source heterogeneous data. The framework consists of two mathematical models: a heterogeneous manifold smoothness learning (Heterogeneous Manifold Smoothness Learning, HMSL) model and a correlation-based multi-view redundancy reduction (Correlation-based Multi-view Redundancy Reduction, CMRR) model. The HMSL model linearly projects the multi-source heterogeneous data into a low-dimensional homogeneous feature space in which information-correlated descriptions are closer in manifold distance (Manifold Distance) and semantically complementary samples are closer in Euclidean distance (Euclidean Distance). The CMRR model then uses a generalized elementary transformation constraint based on a gradient energy competition (Gradient Energy Competition, GEC) strategy to eliminate, in the homogeneous feature space learned by the HMSL model, the three-way redundancies and double-level heterogeneities of the multi-source redundant data, thereby reducing the redundant sources of the multi-source heterogeneous data and helping to obtain accurate and robust multi-source data analysis and evaluation results.
In Fig. 3, the multi-source heterogeneous data consist of sources X and Y. X_N and Y_N are the existing non-redundant multi-source data, but some of the multi-source heterogeneous data, X_R and Y_R, exhibit the three-way redundancies and double-level heterogeneities. For example, CRO redundancy makes the description x7 in source X relevant to multiple descriptions y7, y8, and y9 in source Y; in addition, source Y contains many redundant samples y11, y12, and y13 similar to the description y10; and, owing to SFS, each pair of heterogeneous descriptions contains some random or correlated feature dimensions. As a result, these multi-source heterogeneous data X_R and Y_R exhibit the double-level heterogeneities, namely feature dimension dissimilarity (FDD) and sample size difference (SSD). To address the redundant-source problem of multi-source heterogeneous data, i.e., to eliminate the three-way redundancies and double-level heterogeneities so as to accelerate learning, save storage space, and improve the generalization ability of the model, the present invention studies a redundant-source reduction method with multi-source collaborative de-redundancy capability.
The concrete technical scheme of the present invention is as follows:
1) The HMSL model uses the existing non-redundant multi-source data X_N and Y_N to learn two heterogeneous linear transformations A and B, a decision matrix W, and a manifold smoothness measure M, eliminating the heterogeneity of the low-level feature spaces. This yields a low-dimensional homogeneous feature space in which information-correlated descriptions are closer in manifold distance and semantically complementary samples are closer in Euclidean distance.
As shown in Fig. 3, the manifold smoothness measure M matches the heterogeneous descriptions x2 and y2 together to extract the correlated information between the heterogeneous sources; to capture the semantic complementarity between sources, the decision matrix W couples the heterogeneous descriptions x6 and y6; and the co-occurring heterogeneous descriptions (x1, y1), (x2, y2), and (x3, y3) in class 1 are assigned to different clusters to mine the distribution similarity between sources.
2) Meanwhile, in the low-dimensional homogeneous feature space learned by the HMSL model, the CMRR model effectively removes the three-way redundancies and double-level heterogeneities of the multi-source redundant data X_R and Y_R, based on the semantic complementarity, information correlation, and distribution similarity between sources. CMRR first uses the generalized elementary transformation constraint based on gradient energy competition, together with the manifold smoothness measure M and decision matrix W learned by the HMSL model, to recover the one-to-one correspondence between the heterogeneous descriptions of the same object. This constraint adjusts the positions of the corresponding rows in the redundant matrices X_R and Y_R to eliminate superfluous complementary relationships.
As shown in Fig. 3, the superfluous complementary relationships between the description x7 in source X and the descriptions y8 and y9 in source Y are eliminated, recovering the one-to-one correspondence between x7 and y7 and removing CRO; to remove DRE, the redundant descriptions y11, y12, and y13 in source Y are deleted, saving storage space; and at the same time all descriptions in the redundant sources are linearly projected into the low-dimensional homogeneous feature space learned by the HMSL model, eliminating SFS and the double-level heterogeneities. As can be seen from Fig. 5, after the three-way redundancies are eliminated, the heterogeneous descriptions of the multi-source redundant data are all correctly matched and classified in the homogeneous feature space. Therefore, the framework formed by HMSL + CMRR can effectively reduce the redundant sources of multi-source heterogeneous data.
The concrete steps of the present invention are further described below:
1) Heterogeneous manifold smoothness learning model
As shown in Fig. 4, the heterogeneous manifold smoothness learning (HMSL) model provided by the invention uses the existing non-redundant multi-source data $X_N \in \mathbb{R}^{n_1 \times d_x}$ and $Y_N \in \mathbb{R}^{n_1 \times d_y}$ ($d_x$ is the dimension of source $V_x$, $d_y$ the dimension of source $V_y$, and $n_1$ the number of non-redundant samples) to learn the heterogeneous linear transformations A and B, a decision matrix W, and a manifold smoothness measure (manifold smoothness measure) M. It couples the correlated heterogeneous descriptions from the different sources so as to capture the semantic complementarity, information correlation, and distribution similarity between sources, eliminating the heterogeneity between them and constructing a low-dimensional homogeneous feature space. In this space, information-correlated descriptions are closer in manifold distance, and semantically complementary samples are closer in Euclidean distance.
The method first defines a pair of Mahalanobis distance metrics:

$$d_{M_X}^2(x_i, x_j) = (x_i - x_j)^\top M_X (x_i - x_j) \qquad (1)$$

$$d_{M_Y}^2(y_i, y_j) = (y_i - y_j)^\top M_Y (y_i - y_j) \qquad (2)$$

where $x_i$ is the i-th sample of source $V_x$, $y_i$ is the i-th sample of source $V_y$, and $M_X = A^\top A$ and $M_Y = B^\top B$ are two positive semidefinite metric matrices. Next, the HMSL method defines the probability $p_{ij}$ (or $q_{ij}$) that a sample $x_i$ (or $y_i$) in one source has the heterogeneous sample $y_j$ (or $x_j$) in the other source as its neighbour in the homogeneous feature space:
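The relation between the learned metrics and the projections can be illustrated with a minimal numpy sketch; the dimensions, the random data, and the convention that A and B map each source into the shared space are illustrative assumptions, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dy, d = 5, 4, 3                     # hypothetical source dimensions and subspace dimension
A = rng.standard_normal((d, dx))        # heterogeneous linear transforms into the shared space
B = rng.standard_normal((d, dy))

M_X = A.T @ A                           # induced metric matrices, positive semidefinite by construction
M_Y = B.T @ B

x_i = rng.standard_normal(dx)
x_j = rng.standard_normal(dx)

# the squared Mahalanobis distance under M_X equals the Euclidean distance after projection by A
diff = x_i - x_j
d_mahal = diff @ M_X @ diff
d_proj = np.sum((A @ x_i - A @ x_j) ** 2)
print(np.isclose(d_mahal, d_proj))      # True: (x_i-x_j)^T A^T A (x_i-x_j) = ||A x_i - A x_j||^2
```

Because M_X = A^T A, the within-source Mahalanobis distance coincides with the Euclidean distance of the projected samples, which is why learning A and B simultaneously determines the metrics.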
$$p_{ij} = \frac{\exp(-\|Ax_i - By_j\|^2)}{\sum_k \exp(-\|Ax_i - By_k\|^2)} \qquad (3)$$

$$q_{ij} = \frac{\exp(-\|By_i - Ax_j\|^2)}{\sum_k \exp(-\|By_i - Ax_k\|^2)} \qquad (4)$$
from which the probability $p_i$ (or $q_i$) that the i-th sample is correctly classified is obtained:

$$p_i = \sum_{j:\, y_j \in C_t^y} p_{ij} \qquad (5)$$

$$q_i = \sum_{j:\, x_j \in C_t^x} q_{ij} \qquad (6)$$

where $C_t^x$ and $C_t^y$ denote the sets of class-$t$ samples in sources $V_x$ and $V_y$, respectively, and $t$ is the class of the i-th sample.
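Equations (3)–(6) amount to a softmax over negative squared distances, as in neighbourhood component analysis; the following sketch, with made-up projected samples and labels, shows how p_ij and p_i would be computed (the class-sum form of p_i follows the reconstruction above and is an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 6
AX = rng.standard_normal((n, d))            # projected samples A x_i (illustrative)
BY = rng.standard_normal((n, d))            # projected samples B y_j (illustrative)
labels = np.array([0, 0, 1, 1, 2, 2])       # hypothetical class labels shared across sources

# eq. (3): p_ij is a softmax over negative squared distances to heterogeneous neighbours
sq = ((AX[:, None, :] - BY[None, :, :]) ** 2).sum(axis=-1)
P = np.exp(-sq)
P /= P.sum(axis=1, keepdims=True)

# eq. (5): p_i sums p_ij over the heterogeneous samples of sample i's own class
p_i = np.array([P[i, labels == labels[i]].sum() for i in range(n)])

print(np.allclose(P.sum(axis=1), 1.0))      # True: each row of p_ij is a probability distribution
```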
Based on the above definitions, the optimization model of the HMSL method is as follows:
$$\Psi_1: \min_{A,B,W,M}\; f_S(A,B,W) + \alpha\, g_M(A,B,M) - \beta\, h_D(A,B) \quad \text{s.t.}\; A^\top A = I,\; B^\top B = I,\; M \succeq 0 \qquad (7)$$
where $d$ is the dimension of the homogeneous feature subspace, and α and β are balance parameters. The HMSL method uses the orthogonality constraints $A^\top A = I$ and $B^\top B = I$ to eliminate the correlations between different features within the same source, while the positive semidefinite constraint $M \succeq 0$ guarantees that the model $\Psi_1$ learns a well-defined pseudo-metric. The objective function in formula (7) is a compound function (complex function) comprising three subfunctions: a semantic, a correlation, and a distribution subfunction. The first term $f_S(A,B,W)$ in the objective:
$$f_S(A,B,W) = \left\| \begin{bmatrix} X_N A \\ Y_N B \end{bmatrix} W - \begin{bmatrix} L_N \\ L_N \end{bmatrix} \right\|_F^2 \qquad (8)$$
is a semantic subfunction based on multivariate linear regression (multivariate linear regression), used to capture the semantic complementarity between the different sources, where $L_N \in \mathbb{R}^{n_1 \times m}$ is the label matrix of the non-redundant multi-source data X_N and Y_N (m is the number of labels). In addition, the objective introduces the correlation subfunction $g_M(A,B,M)$:
$$g_M(A,B,M) = \left\| X_N A\, M\, B^\top Y_N^\top \right\|_F^2 \qquad (9)$$
whose purpose is to measure the smoothness between the two heterogeneous linear transformations A and B, so as to extract the correlated information between heterogeneous descriptions; and the third term in the objective is the distribution subfunction $h_D(A,B)$:

$$h_D(A,B) = \sum_i p_i + \sum_i q_i \qquad (10)$$

a leave-one-out validation (leave-one-out validation) term composed of the classification accuracies of the different sources, which mines the distribution similarity between sources on the basis of the Mahalanobis distance metrics.
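The three subfunctions can be evaluated directly for any given (A, B, W, M); the sketch below uses random matrices and a hypothetical one-hot label matrix, and omits the leave-one-out term h_D for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, dx, dy, d, m = 8, 5, 4, 3, 2            # illustrative sizes, not values from the patent
XN = rng.standard_normal((n1, dx))          # non-redundant data of source V_x
YN = rng.standard_normal((n1, dy))          # non-redundant data of source V_y
LN = np.eye(m)[rng.integers(0, m, n1)]      # hypothetical one-hot label matrix L_N
A = rng.standard_normal((dx, d))            # heterogeneous linear transforms (column convention)
B = rng.standard_normal((dy, d))
W = rng.standard_normal((d, m))             # decision matrix
M = np.eye(d)                               # manifold smoothness measure (PSD)
alpha = 0.5

# eq. (8): semantic subfunction -- stacked multivariate linear regression onto the shared labels
f_S = np.linalg.norm(np.vstack([XN @ A, YN @ B]) @ W - np.vstack([LN, LN]), 'fro') ** 2

# eq. (9): correlation subfunction measuring the smoothness between A and B under M
g_M = np.linalg.norm(XN @ A @ M @ B.T @ YN.T, 'fro') ** 2

# eq. (10) would subtract beta * h_D (the leave-one-out accuracies); omitted here
objective = f_S + alpha * g_M
print(objective >= 0.0)                     # True: both terms are squared Frobenius norms
```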
It should be noted that the HMSL method projects the multi-source heterogeneous data linearly into a low-dimensional space. This sharply distinguishes it from the well-known nonlinear projection methods kernel canonical correlation analysis (Reference: David R. Hardoon, Sándor Szedmák, John Shawe-Taylor. Canonical Correlation Analysis: An Overview with Application to Learning Methods. Neural Computation 16(12): 2639-2664 (2004)) and deep canonical correlation analysis (Reference: Galen Andrew, Raman Arora, Jeff A. Bilmes, Karen Livescu. Deep Canonical Correlation Analysis. ACM International Conference on Machine Learning (3) 2013: 1247-1255.).
2) Gradient energy competition strategy
In the gradient matrix G obtained by gradient descent, each interior element $G_{i,j}$ is associated with its four neighbours $G_{i-1,j}$, $G_{i+1,j}$, $G_{i,j-1}$, and $G_{i,j+1}$. The gradient energy competition strategy provided by the invention is based on the $l_1$-norm gradient magnitude energy and defines, for each interior element $G_{ij}$, the between-sample energy (between-sample energy) $E_{bs}$ in the vertical direction as:
$$E_{bs} = \frac{\partial}{\partial x} G = |G(i{+}1,j) - G(i,j)| + |G(i,j) - G(i{-}1,j)| \qquad (11)$$

and the within-sample energy (within-sample energy) $E_{ws}$ as:

$$E_{ws} = \frac{\partial}{\partial y} G = |G(i,j{+}1) - G(i,j)| + |G(i,j) - G(i,j{-}1)| \qquad (12)$$
From $E_{bs}$ and $E_{ws}$, the global energy (global energy) $E_{globe}$ of each interior element $G_{ij}$ is obtained:

$$E_{globe} = \delta\, E_{bs} + (1-\delta)\, E_{ws} \qquad (13)$$
where δ is a balance parameter. Using formula (13), the global energy of every element of G can be computed, yielding an energy matrix E. As shown in Fig. 5, the gradient energy competition strategy provided by the invention compares the energies of the elements of E: the value of the winner (the element of maximal energy) is set to 1, while the values of the elements in the winner's row and column are set to 0; this repeats until a generalized elementary matrix Q is constructed.
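A greedy reading of this competition can be sketched as follows; the border handling (replicated edges for the one-sided differences) and the winner-takes-row-and-column loop are assumptions about details the text leaves open:

```python
import numpy as np

def gec_matrix(G, delta=0.5):
    """Greedy sketch of the gradient energy competition (GEC) strategy:
    repeatedly pick the highest-energy entry of E, set it to 1 in Q, and
    knock its row and column out of the competition, until a generalized
    elementary 0/1 matrix Q is built."""
    n, m = G.shape
    Gp = np.pad(G, 1, mode='edge')                                # replicate borders for one-sided diffs
    E_bs = np.abs(Gp[2:, 1:-1] - G) + np.abs(G - Gp[:-2, 1:-1])   # eq. (11), vertical energy
    E_ws = np.abs(Gp[1:-1, 2:] - G) + np.abs(G - Gp[1:-1, :-2])   # eq. (12), horizontal energy
    E = delta * E_bs + (1.0 - delta) * E_ws                       # eq. (13), global energy
    Q = np.zeros_like(G)
    for _ in range(min(n, m)):
        i, j = np.unravel_index(np.argmax(E), E.shape)
        Q[i, j] = 1.0                           # the winner (maximal energy) is set to 1
        E[i, :] = -np.inf                       # its row and column drop out of the competition
        E[:, j] = -np.inf
    return Q

rng = np.random.default_rng(3)
G = rng.standard_normal((4, 5))
Q = gec_matrix(G)
print(int(Q.sum()))                             # 4: one winner per competed round
```

By construction Q has at most one 1 per row and per column, i.e., it acts as a row-selecting/permuting (generalized elementary) matrix.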
3) Correlation-based multi-source redundancy reduction model
As shown in Fig. 6, the correlation-based multi-source redundancy reduction model CMRR provided by the present invention first uses the optimal solution (A*, B*, W*, M*) obtained by the HMSL method to build, from the multi-source heterogeneous redundant data X_R and Y_R (n2 being the number of redundant samples in source V_x, n3 the number of redundant samples in source V_y, with n2 ≠ n3), the feature-homogeneous redundant matrices H = X_R A* and R = Y_R B*. The decision matrix W* then predicts the classes of the redundant samples in H and R, so as to repair the complementary relationships between the heterogeneous descriptions of the same object. Meanwhile, the learned generalized elementary row transformation matrices P and Q (P being an n2×n4 generalized elementary row transformation matrix for the redundant data of source V_x, and Q an n3×n4 one for the redundant data of source V_y) exchange the positions of the redundant samples in the matrices H and R, so that information-correlated heterogeneous descriptions are matched together; CRO and DRE are then eliminated according to M*, re-establishing the one-to-one correspondence between heterogeneous descriptions.
The optimization model of the CMRR method is as follows:
Ω1: min_{P,Q} ||P^T H W* - Q^T R W*||_F^2 + γ||P^T H M* R^T Q||_F^2 + τ||(P^T H + Q^T R)/2||_F^2
s.t. P ∈ Σ^{n2×n4}, Q ∈ Σ^{n3×n4}, and ||P||_{2,1} = ||Q||_{2,1} = n4    (14)
where P and Q are generalized elementary row transformation matrices, Σ^{n2×n4} and Σ^{n3×n4} are the two sets of generalized elementary row transformation matrices, n4 = min(n2, n3), and γ and τ are balance parameters.
The first term of the objective function uses the heterogeneous linear transformations A* and B* and the decision matrix W* learned by the HMSL method to remodel the one-to-one correspondence between the heterogeneous descriptions of the same object while eliminating CRO and SFS. The second term of the objective function uses the manifold smoothing metric M* learned by the HMSL method to eliminate the DRE within the same source, so as to extract the correlated information between matched heterogeneous descriptions. The third term of the objective function is a low-rank regularization term based on the trace norm, which keeps the composite descriptions as linearly separable as possible. The purpose of imposing the generalized elementary transformation constraint on the matrices P and Q is precisely to exchange the positions of the redundant samples in the matrices H and R, eliminating CRO and re-establishing the one-to-one correspondence between heterogeneous descriptions. By further introducing the l2,1-norm equality constraint, all-zero rows can be created in P and Q so as to remove DRE. It should be noted that, with only the l2,1-norm equality constraint and without the generalized elementary transformation constraint, the matrices P and Q could degenerate into matrices with a single non-zero row. Imposing the generalized elementary transformation constraint on P and Q in the CMRR model is therefore necessary for screening complementary heterogeneous descriptions. In this way, the three-dimensional redundancy and the two-layer heterogeneity of the multi-source redundant data are eliminated, and the redundant sources of the multi-source heterogeneous data are reduced.
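For concreteness, the value of the objective in formula (14) can be evaluated numerically as below. This is an illustrative sketch only: the function name and the identity/zero stand-ins for H, R, W*, M*, P and Q are hypothetical, and the constraint sets of formula (14) are not enforced here.

```python
import numpy as np

def cmrr_objective(P, Q, H, R, W, M, gamma=1.0, tau=1.0):
    """Value of the CMRR objective in formula (14): a class-alignment
    term, a manifold-smoothness coupling term, and a regularization
    term on the averaged composite description."""
    fro2 = lambda X: np.linalg.norm(X, 'fro') ** 2
    term1 = fro2(P.T @ H @ W - Q.T @ R @ W)       # aligns predicted classes of H and R
    term2 = gamma * fro2(P.T @ H @ M @ R.T @ Q)   # smoothness coupling via metric M*
    term3 = tau * fro2((P.T @ H + Q.T @ R) / 2)   # keeps the composite description compact
    return term1 + term2 + term3
```

With P = Q = I, H = R and M = 0, the first two terms vanish and only the regularization term remains, which matches the role each term plays in the description above.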
Compared with the prior art, the beneficial effects of the present invention are as follows:
Aiming at the three-dimensional redundancy and the two-layer heterogeneity present in multi-source redundant data, the present invention provides a framework for collaborative reduction of the redundant sources of multi-source heterogeneous big data. The framework first comprises an HMSL model with a manifold regularization term and a pseudo-metric constraint, which linearly projects multi-source heterogeneous data into a low-dimensional feature-homogeneous space; in this space the manifold distances of information-correlated descriptions are drawn closer together, as are the Euclidean distances of semantically complementary samples, so that the semantic complementarity, information correlation and distribution similarity between different sources can be captured effectively. The invention further proposes a CMRR model with a generalized elementary transformation constraint, which uses the gradient energy competitive strategy and the l2,1-norm equality constraint to recover, within the feature-homogeneous space learned by the HMSL model, the one-to-one correspondence between heterologous heterogeneous redundant descriptions, thereby eliminating the three-dimensional redundancy and the two-layer heterogeneity of the multi-source redundant data and reducing the redundant sources of the multi-source heterogeneous data.
Brief description of the drawings
Fig. 1 illustrates multi-source redundancy and single-source redundancy.
Fig. 2 illustrates the complementarity, correlation and distribution constraints among multi-source heterogeneous data.
Fig. 3 illustrates the redundant-source reduction framework for multi-source heterogeneous data.
Fig. 4 illustrates the heterogeneous manifold smoothing learning model.
Fig. 5 illustrates the gradient energy competitive strategy.
Fig. 6 illustrates the correlation-based multi-source redundancy reduction model.
Detailed description of the invention
The present invention is further described below through specific embodiments.
The collaborative redundant-source reduction method for multi-source heterogeneous big data provided by the present invention consists of the heterogeneous manifold smoothing learning (HMSL) algorithm and the correlation-based multi-source redundancy reduction (CMRR) algorithm, and achieves successive optimization of the models through an iterative loop.
The HMSL model in formula (7) can be simplified to:
where F(·) = f_S(·) + α g_M(·) - β h_D(·) is the smooth objective function, Z = [A_Z B_Z W_Z M_Z] denotes the optimization variable, and the feasible set is a closed convex set of the single variable:
Since F(·) is continuously differentiable with Lipschitz continuous gradient L (reference: Y. Nesterov. Introductory Lectures on Convex Optimization, volume 87. Springer Science & Business Media, 2004):
it is therefore appropriate to use the Accelerated Projected Gradient (APG) algorithm (reference: Y. Nesterov. Introductory Lectures on Convex Optimization, volume 87. Springer Science & Business Media, 2004) to solve the problem in formula (15).
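As a generic illustration of the accelerated projected gradient scheme (a sketch only, not the patented Algorithm 3), the following minimizes a smooth function over a convex set using Nesterov-style extrapolated search points and a projection step; the quadratic objective and the box projection are hypothetical stand-ins:

```python
import numpy as np

def apg(grad, project, z0, L, iters=100):
    """Accelerated projected gradient: maintain a solution sequence {Z_i}
    and a search sequence {S_i}; each Z_i is the Euclidean projection of a
    gradient step of size 1/L taken from the search point S_i."""
    z_prev = z0.copy()
    s = z0.copy()
    t_prev = 1.0
    for _ in range(iters):
        z = project(s - grad(s) / L)                # projected gradient step at S_i
        t = (1 + np.sqrt(1 + 4 * t_prev**2)) / 2
        s = z + ((t_prev - 1) / t) * (z - z_prev)   # extrapolated search point S_{i+1}
        z_prev, t_prev = z, t
    return z_prev

# hypothetical example: minimize ||z - c||^2 over the box [0, 1]^3
c = np.array([2.0, -1.0, 0.5])
z_star = apg(grad=lambda z: 2 * (z - c),
             project=lambda z: np.clip(z, 0.0, 1.0),
             z0=np.zeros(3), L=2.0)
# converges to the clipped target [1.0, 0.0, 0.5]
```

The projection operator here is the box projection of the toy example; in the HMSL setting the projection onto the feasible set is carried out by the PSP and GDMCS steps described below.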
APG is a first-order gradient algorithm which, while minimizing the objective function, accelerates each gradient step over the feasible solutions so as to obtain the optimal solution. During the solution process, the APG method builds a sequence of solution points {Z_i} and a sequence of search points {S_i}, and uses S_i to update Z_i at each iteration. The Euclidean projection of a given point S onto the convex set is:
The Positive Semi-definite Projection (PSP) method proposed by Weinberger et al. (reference: Kilian Q. Weinberger, Lawrence K. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research 10:207-244 (2009)) minimizes the objective function while maintaining the positive semidefinite constraint. PSP can therefore be used to solve the problem in formula (18). Algorithm 1 gives the details of the PSP algorithm.
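A positive semidefinite projection of the kind PSP relies on can be sketched with standard eigenvalue clamping (illustrative only, not the patented Algorithm 1):

```python
import numpy as np

def psd_project(M):
    """Project a symmetric matrix onto the positive semidefinite cone by
    zeroing its negative eigenvalues; this is the projection step used to
    keep a learned metric positive semidefinite after a gradient update."""
    M = (M + M.T) / 2                    # symmetrize against numerical drift
    w, V = np.linalg.eigh(M)             # real eigendecomposition
    return (V * np.maximum(w, 0.0)) @ V.T
```

For example, projecting diag(1, -2) yields diag(1, 0): the negative eigendirection is removed and the nonnegative one is kept.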
Meanwhile, the Gradient Descent Method with Curvilinear Search (GDMCS) algorithm proposed by Wen et al. (reference: Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Math. Program. 142(1-2):397-434 (2013)) can be used to maintain the orthogonality constraint in formula (18) while minimizing the objective function. Algorithm 2 gives the details of the GDMCS algorithm.
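One feasible step of the Wen-Yin type can be sketched via the Cayley transform, which keeps the iterate exactly on the set X^T X = I; this is a single illustrative update, not the patented Algorithm 2 (the step size tau would be chosen by the curvilinear search):

```python
import numpy as np

def cayley_step(X, G, tau):
    """One feasible step for min f(X) s.t. X^T X = I (Wen & Yin 2013):
    with the skew-symmetric A = G X^T - X G^T, the update
    Y(tau) = (I + tau/2 * A)^{-1} (I - tau/2 * A) X
    preserves orthonormality of the columns of X exactly."""
    n = X.shape[0]
    A = G @ X.T - X @ G.T                 # skew-symmetric by construction
    I = np.eye(n)
    return np.linalg.solve(I + (tau / 2) * A, (I - (tau / 2) * A) @ X)
```

Because A is skew-symmetric, I + (tau/2)A is always invertible and the Cayley factor is orthogonal, so Y(tau)^T Y(tau) = X^T X = I for any tau.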
When the APG algorithm is used to solve the problem in formula (15), the Euclidean projection Z = [A_Z B_Z] of a given point S = [A_S B_S] onto the convex set is:
By combining the APG, PSP and GDMCS algorithms, the problem in formula (19) can be solved. Algorithm 3 gives the details of the HMSL algorithm provided by the present invention, in which the function Schmidt() denotes Gram-Schmidt orthogonalization.
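The Schmidt() function mentioned above performs Gram-Schmidt orthogonalization; a minimal sketch (illustrative, assuming the input columns are linearly independent) is:

```python
import numpy as np

def schmidt(A):
    """Gram-Schmidt orthonormalization of the columns of A: subtract the
    projections onto the columns already processed, then normalize."""
    Q = np.zeros(A.shape, dtype=float)
    for j in range(A.shape[1]):
        v = A[:, j].astype(float)
        for i in range(j):
            v = v - (Q[:, i] @ A[:, j]) * Q[:, i]   # remove component along q_i
        Q[:, j] = v / np.linalg.norm(v)
    return Q
```

The result satisfies Q^T Q = I, i.e. exactly the orthogonality constraint A^T A = I imposed on the transformations in the HMSL model.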
The CMRR model in formula (14) can be simplified to:
where w(·) is the smooth objective function, t(·) = ||·||_* is the non-differentiable function, Θ = [P_Θ Q_Θ] denotes the optimization variable, and the feasible set is a closed convex set of the single variable:
Since w(·) is continuously differentiable with the Lipschitz continuous gradient L in formula (17), the APG algorithm can likewise be used to solve the problem in formula (20). The Euclidean projection of a given point S onto the convex set is:
To solve the problem in formula (22), the Energy() and Competition() functions provided by the present invention, following the GEC strategy described above, minimize the objective function while maintaining the elementary transformation constraint. Algorithm 4 gives the details of the Energy() function, which computes, according to formulas (11), (12) and (13), the global energy of each element of the gradient matrix G obtained by the gradient descent algorithm, and thereby obtains the energy matrix E. The Competition() function provided by the present invention then creates a standard elementary transformation matrix from the energy matrix E produced by Algorithm 4. Algorithm 5 gives the details of the Competition() function. By combining the APG, Energy and Competition algorithms, the problem in formula (22) can be solved. Algorithm 6 gives the details of the CMRR method.
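The winner-take-all construction performed by the Competition() function can be sketched as follows. This is an illustrative reimplementation, not the patented Algorithm 5; the function name competition and the input energy matrix E are stand-ins:

```python
import numpy as np

def competition(E):
    """Build a 0/1 generalized elementary matrix from energy matrix E:
    repeatedly select the largest remaining energy, set that position to 1,
    and eliminate the winner's entire row and column from the contest."""
    E = E.astype(float)
    n, m = E.shape
    Q = np.zeros((n, m))
    mask = np.ones((n, m), dtype=bool)      # positions still in competition
    for _ in range(min(n, m)):
        i, j = np.unravel_index(np.where(mask, E, -np.inf).argmax(), E.shape)
        Q[i, j] = 1                         # winner's value becomes 1
        mask[i, :] = False                  # winner's row eliminated (stays 0)
        mask[:, j] = False                  # winner's column eliminated (stays 0)
    return Q
```

Each row and each column of the result contains at most a single 1, with exactly min(n, m) ones in total, matching the description of the generalized elementary matrix constructed by the GEC strategy.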
The HMSL+CMRR framework for collaborative reduction of the redundant sources of multi-source heterogeneous big data provided by the present invention addresses the redundant-source problem of multi-source heterogeneous data. Using the semantic complementarity, information correlation and distribution similarity between multi-source heterogeneous data, and based on subspace learning, it mines the correlations among the existing non-redundant multi-source heterogeneous data so as to collaboratively remove the three-dimensional redundancy and two-layer heterogeneity across multiple sources, reduce the data dimensionality, refine the data subsets, repair the one-to-one correspondence between heterogeneous descriptions, and reduce the redundant sources of the multi-source heterogeneous data.
The above embodiments are intended only to illustrate, and not to limit, the technical solution of the present invention. A person of ordinary skill in the art may modify the technical solution of the present invention or substitute equivalents for it without departing from the spirit and scope of the present invention; the scope of protection of the present invention shall be determined by the claims.

Claims (5)

1. A collaborative redundant-source reduction method for multi-source heterogeneous big data, the steps of which comprise:
1) using existing non-redundant multi-source data to learn a plurality of heterogeneous linear transformations, a decision matrix and a manifold smoothing metric, coupling together the correlated heterogeneous descriptions from different sources so as to capture the semantic complementarity, information correlation and distribution similarity between the sources, and eliminating the heterogeneity between the sources, thereby constructing a low-dimensional feature-homogeneous space; in said low-dimensional feature-homogeneous space, the manifold distances of information-correlated descriptions are drawn closer together, as are the Euclidean distances of semantically complementary samples;
2) in said low-dimensional feature-homogeneous space, using a generalized elementary transformation constraint based on a gradient energy competitive strategy, and relying on the semantic complementarity, information correlation and distribution similarity between the sources, eliminating the three-dimensional redundancy and the two-layer heterogeneity of the multi-source redundant data.
2. The method of claim 1, wherein step 1) establishes the following optimization model for the semantic complementarity, information correlation and distribution similarity between the sources:
Ψ1: min_{A,B,W,M} f_S(A,B,W) + α g_M(A,B,M) - β h_D(A,B), s.t. A^T A = I, B^T B = I, and M ≥ 0,
where k ≤ min(d_x, d_y) is the dimension of the feature-homogeneous subspace, α and β are balance parameters, W is the decision matrix, and M is the manifold smoothing metric; the orthogonality constraints A^T A = I and B^T B = I eliminate the correlation between different features within the same source, while the positive semidefinite constraint M ≥ 0 guarantees that model Ψ1 learns a well-defined pseudo-metric; the objective function in the above formula is a composite function comprising a semantic subfunction, a correlation subfunction and a distribution subfunction, in which the first term f_S(A,B,W) is a semantic subfunction based on multivariate linear regression, used to capture the semantic complementarity between different sources; the correlation subfunction g_M(A,B,M) is introduced to measure the smoothness between the different linear transformations A and B, so as to extract the correlated information between heterogeneous descriptions; and the third term, the distribution subfunction h_D(A,B), is a cross-validation term composed of the classification accuracies of the different sources, which mines the distribution similarity between the sources based on the Mahalanobis distance metric.
3. The method of claim 2, wherein the gradient energy competitive strategy of step 2) first obtains a gradient matrix G by gradient descent, then computes for each interior element G_{i,j} of the gradient matrix the between-sample energy E_bs in the vertical direction and the within-sample energy E_ws in the horizontal direction, thereby obtaining the global energy E_globe of each interior element G_{i,j}; the energy matrix E is obtained by computing the global energy of every element of the matrix G; the gradient energy competitive strategy compares the energies of the elements in the matrix E, sets the value of the winner, namely the element with the largest energy, to 1, sets the values of the elements in the same row and the same column as the winner to 0, and repeats this process until a generalized elementary matrix Q has been constructed.
4. The method of claim 2 or claim 3, wherein step 2) uses the optimal solution (A*, B*, W*, M*) obtained in step 1) to build, from the multi-source heterogeneous redundant data X_R and Y_R, the feature-homogeneous redundant matrices H = X_R A* and R = Y_R B*, where n2 is the number of redundant samples in source V_x, n3 is the number of redundant samples in source V_y, and n2 ≠ n3; the decision matrix W* predicts the classes of the redundant samples in H and R so as to repair the complementary relationships between the heterogeneous descriptions of the same object; meanwhile, the learned generalized elementary row transformation matrices P and Q exchange the positions of the redundant samples in the matrices H and R, so that information-correlated heterogeneous descriptions are matched together, and according to M* the excess redundancy of complementary relationships and the superfluous redundancy of data descriptions are eliminated, re-establishing the one-to-one correspondence between heterogeneous descriptions.
5. The method of claim 4, wherein step 2) establishes the following optimization model:
Ω1: min_{P,Q} ||P^T H W* - Q^T R W*||_F^2 + γ||P^T H M* R^T Q||_F^2 + τ||(P^T H + Q^T R)/2||_F^2, s.t. P ∈ Σ^{n2×n4}, Q ∈ Σ^{n3×n4}, and ||P||_{2,1} = ||Q||_{2,1} = n4,
where P and Q are generalized elementary row transformation matrices, Σ^{n2×n4} and Σ^{n3×n4} are the two sets of generalized elementary row transformation matrices, n4 = min(n2, n3), and γ and τ are balance parameters; the first term of the objective function uses the heterogeneous linear transformations A* and B* and the decision matrix W* learned in step 1) to remodel the one-to-one correspondence between the heterogeneous descriptions of the same object while eliminating the excess redundancy of complementary relationships and the diverse redundancy of sample features; the second term of the objective function uses the manifold smoothing metric M* learned in step 1) to eliminate the superfluous redundancy of data descriptions within the same source, so as to extract the correlated information between matched heterogeneous descriptions; and the third term of the objective function is a low-rank regularization term based on the trace norm, which keeps the composite descriptions as linearly separable as possible.
CN201610166631.XA 2016-03-22 2016-03-22 Redundant source synergistic reducing method of multi-source heterogeneous big data Pending CN105843896A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610166631.XA CN105843896A (en) 2016-03-22 2016-03-22 Redundant source synergistic reducing method of multi-source heterogeneous big data


Publications (1)

Publication Number Publication Date
CN105843896A true CN105843896A (en) 2016-08-10

Family

ID=56582904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610166631.XA Pending CN105843896A (en) 2016-03-22 2016-03-22 Redundant source synergistic reducing method of multi-source heterogeneous big data

Country Status (1)

Country Link
CN (1) CN105843896A (en)


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280120A (en) * 2017-07-11 2018-07-13 厦门君沣信息科技有限公司 Mental health early warning system and method based on association rule
US11062792B2 (en) 2017-07-18 2021-07-13 Analytics For Life Inc. Discovering genomes to use in machine learning techniques
US11139048B2 (en) 2017-07-18 2021-10-05 Analytics For Life Inc. Discovering novel features to use in machine learning techniques, such as machine learning techniques for diagnosing medical conditions
CN108596344A (en) * 2018-04-17 2018-09-28 惠州学院 A kind of complicated panel data learning method based on big data
CN108596344B (en) * 2018-04-17 2022-03-25 惠州学院 Complex panel data learning method based on big data
CN110991470A (en) * 2019-07-03 2020-04-10 北京市安全生产科学技术研究院 Data dimension reduction method, portrait construction method and system and readable storage medium
CN110991470B (en) * 2019-07-03 2022-04-15 北京市应急管理科学技术研究院 Data dimension reduction method, portrait construction method and system and readable storage medium
CN117971356A (en) * 2024-03-29 2024-05-03 苏州元脑智能科技有限公司 Heterogeneous acceleration method, device, equipment and storage medium based on semi-supervised learning
CN117971356B (en) * 2024-03-29 2024-06-14 苏州元脑智能科技有限公司 Heterogeneous acceleration method, device, equipment and storage medium based on semi-supervised learning


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160810