CN105160000B - Big data mining method based on dimensionality reduction - Google Patents

Big data mining method based on dimensionality reduction

Info

Publication number
CN105160000B
Authority
CN
China
Prior art keywords
data
array
indicate
pretreated
arrays
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510566756.7A
Other languages
Chinese (zh)
Other versions
CN105160000A (en)
Inventor
杨立波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Immeasurable Technology Co., Ltd
Original Assignee
Chengdu Bo Yuan Epoch Softcom Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Bo Yuan Epoch Softcom Ltd
Priority to CN201510566756.7A
Publication of CN105160000A
Application granted
Publication of CN105160000B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A big data mining method and device based on dimensionality reduction. The method comprises the steps of: data selection; data preprocessing, in which data that are difficult to identify are converted into readily identifiable standardized data; object analysis of the preprocessed data; transformation of the preprocessed data into low-dimensional data in another space, the low-dimensional data replacing the preprocessed data in subsequent processing; numerical estimation; and data mining, followed by evaluation, feedback, and correction of the results. With this method and the device that executes it, the complexity caused by high dimensionality can be effectively reduced while the completeness and accuracy of the analysis are improved, so that as much valuable information as possible is mined.

Description

Big data mining method based on dimensionality reduction
Technical field
The present invention relates to the field of electronic data processing and, more specifically, to a big data mining method and system based on dimensionality reduction.
Background art
With the continuing industrialization of society and the rising level of informatization, data have replaced computation as the center of information processing, and cloud computing and big data are becoming a trend. This involves many aspects, including storage capacity, availability, I/O performance, data security, and scalability. Big data are very large and complex data sets. Big data are characterized by the four Vs: Volume, meaning that the amount of data grows continuously and rapidly; Velocity, meaning that data I/O is faster; Variety, meaning that data types and sources are diverse; and Value, meaning that the data have usable value in many respects.
Big data contain a large number of data units, and these data units carry rich, valuable information, which benefits processing operations such as in-depth mining of the value of the data. However, while the large number of data units increases the value of the data, it also increases the number of dimensions, so that the data volume grows rapidly; this is a very practical problem in this field. High dimensionality and the correlations among the data make subsequent computation more complex and thus greatly reduce processing speed; moreover, with limited sampling, they may reduce the completeness and accuracy of the data. Reducing the dimensionality of the data before the large number of data units is processed and analyzed is therefore very important work.
However, with traditional dimensionality reduction methods and the big data mining methods built on them, it is difficult to effectively reduce the complexity of the dimensions while also improving the completeness and accuracy of the analysis and mining as much valuable information as possible. A big data mining method that effectively solves this technical problem is urgently needed in this field.
Summary of the invention
An object of the present invention is to provide a big data mining method and system based on dimensionality reduction. With this method and the devices in the system that executes it, the complexity of the dimensions can be effectively reduced while the completeness and accuracy of the analysis are improved, so that as much valuable information as possible is mined.
The technical solution adopted by the present invention to solve the above technical problem is a big data mining method based on dimensionality reduction, comprising the steps of: S1: performing data selection; S2: performing data preprocessing, an information processing procedure in which data that are difficult to identify are converted into readily identifiable standardized data; S3: performing object analysis on the preprocessed data; S4: transforming the preprocessed data into another space, the transformation concentrating most of the information of the preprocessed data in a low-dimensional part of that space, and using the low-dimensional data in place of the preprocessed data for subsequent processing; S5: performing numerical estimation; and S6: as needed, performing data mining on the basis of the above steps, and performing evaluation, feedback, and correction of the results.
According to another aspect of the present invention, data selection determines the operational data objects involved in the data mining task: according to the requirements of the data mining task, a data set relevant to the mining task is extracted from the related data sources.
According to another aspect of the present invention, in the data preprocessing step, on the basis of data reduction and attribute reduction, rough sets are used to perform the reduction, which facilitates further processing of the subsequent data, improves performance, and achieves a better mining effect; noise elimination, missing-data handling, and duplicate-data elimination are also performed.
According to another aspect of the present invention, the data types identified when analyzing the preprocessed data include at least data blocks, data segments, or individual data items.
According to another aspect of the present invention, step S4 includes step S41: the preprocessed and analyzed data are denoted D, an a × b data array to be processed in this step, where a and b are positive integers; the elements of the data array of data object D are $d_{ij}$, where i and j denote the row and column indices in the array, i being a positive integer not greater than a and j a positive integer not greater than b; step S4 transforms the a × b array into an a × c array, from which the original data object D can be reconstructed in another form, where c is less than b.
According to another aspect of the present invention, if data object D is information-rich data, b is at least 2 times c, or c is 1, or c is at least one order of magnitude smaller than b.
According to another aspect of the present invention, step S4 further includes step S42: each element $d_{ij}$ is constructed by computation as $d_{ij} = [n, i1, i2, i3, i4, m1, m2]_{ij}$, and data object D is [N, I1, I2, I3, I4, M1, M2], where N denotes a parameter for the type of the object, whose data value can be assigned and specified according to the data length and array size and is positively correlated with the data length and/or array size; I1, I2, I3, I4 denote parameters for the direction of the object; and M1, M2 denote parameters of the forward and reverse transformation models of the object; $N = \Theta_N \Phi^{T} + \Xi_N$; $I1 = \Theta_{I1} \Phi^{T} + \Xi_{I1}$; $I2 = \Theta_{I2} \Phi^{T} + \Xi_{I2}$; $I3 = \Theta_{I3} \Phi^{T} + \Xi_{I3}$; $I4 = \Theta_{I4} \Phi^{T} + \Xi_{I4}$; $M1 = \Theta_{M1} \Phi^{T} + \Xi_{M1}$; $M2 = \Theta_{M2} \Phi^{T} + \Xi_{M2}$; where $\Theta_N$, $\Theta_{I1}$, $\Theta_{I2}$, $\Theta_{I3}$, $\Theta_{I4}$, $\Theta_{M1}$, $\Theta_{M2}$ denote a × c arrays of score elements, which are the low-dimensional data in the other space; $\Phi$ denotes the b × c array of loading elements; $\Xi_N$, $\Xi_{I1}$, $\Xi_{I2}$, $\Xi_{I3}$, $\Xi_{I4}$, $\Xi_{M1}$, $\Xi_{M2}$ denote the residuals in the a × b arrays; the superscript T denotes the transpose of an array; and each $\Theta$ contains coordinates, that is, it is an array formed of coordinates.
According to another aspect of the present invention, in the numerical estimation operation, each item is regarded as an object to be operated on that is represented by b-dimensional data in the subspace. The estimation function is $F_b = N_c + I_c + M_c$, where $N' = \Theta_N \Phi^{T}$, $I1' = \Theta_{I1} \Phi^{T}$, $I2' = \Theta_{I2} \Phi^{T}$, $I3' = \Theta_{I3} \Phi^{T}$, $I4' = \Theta_{I4} \Phi^{T}$, $M1' = \Theta_{M1} \Phi^{T}$, $M2' = \Theta_{M2} \Phi^{T}$; and $\Omega_1$ and $\Omega_2$ are c × c diagonal arrays, so that the vertices of the b-dimensional space can be considered individually.
According to a further aspect of the present invention, a device in a system that executes the steps of the above method is provided.
Brief description of the drawings
Embodiments of the present invention are shown in the accompanying drawings by way of example and not by way of limitation, in which:
Fig. 1 illustrates a flowchart of a big data mining method based on dimensionality reduction according to an exemplary embodiment of the present invention.
Detailed description
In the following description, reference is made to the accompanying drawings, and several specific embodiments are shown by way of illustration. It should be understood that other embodiments are contemplated and can be made without departing from the scope or spirit of the present disclosure. The following detailed description is therefore not to be taken in a limiting sense.
Fig. 1 illustrates a flowchart of a big data mining method based on dimensionality reduction according to an exemplary embodiment of the present invention.
First, in step S1, data selection is performed. Data selection determines the operational data objects involved in the data mining task: according to the requirements of the data mining task, a data set relevant to the mining task is extracted from the related data sources.
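By way of illustration only, the selection in step S1 might be sketched in Python as follows. The record layout, the function name `select_data`, and the parameters `required_fields` and `is_relevant` are assumptions introduced for this sketch; the specification does not prescribe how the task requirements are expressed.

```python
from typing import Callable, Iterable

Record = dict[str, object]


def select_data(
    sources: Iterable[Iterable[Record]],
    required_fields: set[str],
    is_relevant: Callable[[Record], bool],
) -> list[Record]:
    """Extract the task-relevant data set from several data sources.

    Hypothetical helper: a record is kept when it carries every field the
    mining task requires and satisfies the task's relevance predicate.
    """
    selected: list[Record] = []
    for source in sources:
        for record in source:
            if required_fields.issubset(record) and is_relevant(record):
                # Keep only the fields the mining task asks for.
                selected.append({name: record[name] for name in required_fields})
    return selected
```

Expressing the task requirement as a field set plus a predicate keeps the selection independent of any particular data source.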
Second, in step S2, data preprocessing is performed: an information processing procedure in which data that are difficult to identify are converted into readily identifiable standardized data. Data reduction and attribute reduction are the core of this procedure; rough sets can be used to perform the reduction, which facilitates further processing of the subsequent data, improves performance, and achieves a better mining effect. Noise elimination, missing-data handling, and duplicate-data elimination can also be performed.
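The three cleaning operations named in step S2 might be sketched as below. This is a minimal sketch under stated assumptions: the data are a numeric NumPy array with missing values encoded as NaN and at least one observed value per column; missing values are filled with the column mean; and noise is handled by clipping to a band of `clip_sigma` standard deviations. None of these choices comes from the specification, and the rough-set reduction is not shown.

```python
import numpy as np


def preprocess(raw: np.ndarray, clip_sigma: float = 3.0) -> np.ndarray:
    """Hypothetical cleaning pass: impute, clip outliers, drop duplicate rows."""
    data = np.asarray(raw, dtype=float).copy()

    # Missing-data handling: replace each NaN with the mean of the observed
    # values in its column (assumes every column has at least one value).
    col_mean = np.nanmean(data, axis=0)
    missing = np.isnan(data)
    data[missing] = np.take(col_mean, np.nonzero(missing)[1])

    # Noise elimination (assumed strategy): clip every column to the band
    # mean +/- clip_sigma * standard deviation.
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    data = np.clip(data, mean - clip_sigma * std, mean + clip_sigma * std)

    # Duplicate elimination: keep one copy of each row. Note that
    # np.unique also sorts the rows lexicographically.
    return np.unique(data, axis=0)
```

Because `np.unique` reorders the rows, a caller that needs the original order would have to deduplicate differently.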
Next, in step S3, object analysis is performed on the preprocessed data. Preferably, the data type of the preprocessed data is analyzed; the types include, but are not limited to, data blocks, data segments, and individual data items. In particular, a data unit as described herein can refer to the data awaiting the next processing step, which may be one or more data blocks or data segments, an individual data item, or any combination thereof. Its scope includes, but is not limited to, the items listed above.
Next, in step S4, the preprocessed data are transformed into another space; the transformation concentrates most of the information of the preprocessed data in a low-dimensional part of that space, and the low-dimensional data are used in place of the preprocessed data for subsequent processing. Specifically, step S4 includes the following steps. S41: the preprocessed and analyzed data are denoted D, an a × b data array to be processed in this step, where a and b are positive integers. The elements of the data array of data object D are $d_{ij}$, where i and j denote the row and column indices in the array, i being a positive integer not greater than a and j a positive integer not greater than b. Step S4 transforms the a × b array into an a × c array, from which the original data object D can be reconstructed in another form, where c is less than b. Preferably, if data object D is information-rich data, b is at least 2 times c; preferably, c is 1; preferably, c is at least one order of magnitude smaller than b. S42: each element $d_{ij}$ is constructed by computation as $d_{ij} = [n, i1, i2, i3, i4, m1, m2]_{ij}$, and data object D is [N, I1, I2, I3, I4, M1, M2], where N denotes a parameter for the type of the object, whose data value can be assigned and specified according to the data length and array size and is, in general, positively correlated with the data length and/or array size; I1, I2, I3, I4 denote parameters for the direction of the object; and M1, M2 denote parameters of the forward and reverse transformation models of the object. Specifically, $N = \Theta_N \Phi^{T} + \Xi_N$; $I1 = \Theta_{I1} \Phi^{T} + \Xi_{I1}$; $I2 = \Theta_{I2} \Phi^{T} + \Xi_{I2}$; $I3 = \Theta_{I3} \Phi^{T} + \Xi_{I3}$; $I4 = \Theta_{I4} \Phi^{T} + \Xi_{I4}$; $M1 = \Theta_{M1} \Phi^{T} + \Xi_{M1}$; $M2 = \Theta_{M2} \Phi^{T} + \Xi_{M2}$; where $\Theta_N$, $\Theta_{I1}$, $\Theta_{I2}$, $\Theta_{I3}$, $\Theta_{I4}$, $\Theta_{M1}$, $\Theta_{M2}$ denote a × c arrays of score elements, which are precisely the low-dimensional data in the other space; $\Phi$ denotes the b × c array of loading elements; $\Xi_N$, $\Xi_{I1}$, $\Xi_{I2}$, $\Xi_{I3}$, $\Xi_{I4}$, $\Xi_{M1}$, $\Xi_{M2}$ denote the residuals in the a × b arrays; and the superscript T denotes the transpose of an array. Each $\Theta$ contains coordinates, that is, it is an array formed of coordinates.
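A decomposition with the shapes stated in step S4 (an a × c score array Θ, a b × c loading array Φ, and an a × b residual Ξ) can be obtained in several ways. The sketch below uses a truncated singular value decomposition, which is an assumption made here for illustration; the specification does not name the algorithm. The function name `decompose` is likewise hypothetical, and the sketch treats a single array rather than the seven components N, I1 to I4, M1, M2.

```python
import numpy as np


def decompose(D: np.ndarray, c: int):
    """Split an a x b array as D = Theta @ Phi.T + Xi with c < b.

    Assumed technique: truncated SVD. Phi holds the first c right singular
    vectors (b x c), Theta the coordinates of the rows of D in that basis
    (a x c), and Xi whatever the c-dimensional representation misses.
    """
    a, b = D.shape
    if not 0 < c < b or c > a:
        raise ValueError("c must satisfy 0 < c < b and c <= a")
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    Phi = Vt[:c].T          # b x c array of loading elements
    Theta = D @ Phi         # a x c array of score elements (low-dimensional data)
    Xi = D - Theta @ Phi.T  # a x b residual
    return Theta, Phi, Xi
```

With this choice the columns of Φ are orthonormal, so the identity D = Θ Φ^T + Ξ holds exactly for any admissible c, and Ξ vanishes when the rank of D does not exceed c.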
Next, in step S5, numerical estimation is performed. In the numerical estimation operation, each item is regarded as an object to be operated on that is represented by b-dimensional data in the subspace. The estimation function is $F_b = N_c + I_c + M_c$, where $N' = \Theta_N \Phi^{T}$, $I1' = \Theta_{I1} \Phi^{T}$, $I2' = \Theta_{I2} \Phi^{T}$, $I3' = \Theta_{I3} \Phi^{T}$, $I4' = \Theta_{I4} \Phi^{T}$, $M1' = \Theta_{M1} \Phi^{T}$, $M2' = \Theta_{M2} \Phi^{T}$; and $\Omega_1$ and $\Omega_2$ are c × c diagonal arrays, so that the vertices of the b-dimensional space can be considered individually. Through this estimation, the attribute constraints can be satisfied.
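The primed quantities above are the parts of each component that the low-dimensional representation reproduces, for example N' = Θ_N Φ^T. The sketch below computes such a reconstruction and, as an assumed quality measure that is not part of the specification, the fraction of the squared magnitude of the original array that the reconstruction retains. It does not compute the estimation function F_b, whose terms are not fully defined in the text; the names `reconstruct` and `retained_fraction` are hypothetical.

```python
import numpy as np


def reconstruct(Theta: np.ndarray, Phi: np.ndarray) -> np.ndarray:
    """Return the a x b reconstruction Theta @ Phi.T (for example N')."""
    return Theta @ Phi.T


def retained_fraction(D: np.ndarray, Theta: np.ndarray, Phi: np.ndarray) -> float:
    """Assumed measure: 1 - ||D - Theta Phi^T||^2 / ||D||^2 (Frobenius norm)."""
    residual = D - reconstruct(Theta, Phi)
    return 1.0 - float(np.sum(residual**2) / np.sum(D**2))
```

A value close to 1 would indicate that the a × c array carries most of the information of the a × b array, which is the property step S4 relies on.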
Finally, in step S6, data mining is performed as needed on the basis of the above steps, and evaluation, feedback, and correction of the results are performed. The data mining in this step can use methods and steps well known in the art.
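As one example of a well-known mining method that could be applied to the low-dimensional data, the sketch below clusters the rows of an a × c score array with k-means. The choice of k-means, the farthest-point initialization, and the name `kmeans` are assumptions made for this illustration; the specification leaves the mining method open.

```python
import numpy as np


def kmeans(points: np.ndarray, k: int, iters: int = 20):
    """Cluster the rows of `points` with Lloyd's algorithm (iters >= 1).

    Initialization (assumed): the first row, then repeatedly the row
    farthest from all centers chosen so far, which keeps runs deterministic.
    """
    points = np.asarray(points, dtype=float)
    chosen = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in chosen], axis=0)
        chosen.append(points[np.argmax(dist)])
    centers = np.array(chosen)

    for _ in range(iters):
        # Assign every row to its nearest center.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        # Move each center to the mean of its members; keep empty clusters fixed.
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers
```

Running the clustering on the a × c array instead of the a × b array reduces the cost of every distance computation from b to c coordinates.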
In particular, a data unit as described herein can refer to the data to be processed, which may be one or more data blocks or data segments, an individual data item, or any combination thereof. Its scope includes, but is not limited to, the items listed above.
Through the above procedure, the big data mining method based on dimensionality reduction of the present invention can effectively reduce the complexity of the dimensions while also improving the completeness and accuracy of the analysis, mining as much valuable information as possible.
It should be understood that examples and embodiments of the present invention can be realized in hardware, software, or a combination of hardware and software. As described above, any software that carries out this method can be stored in volatile or non-volatile storage, for example a storage device such as a ROM, whether erasable or rewritable or not; or in memory, such as RAM, memory chips, devices, or integrated circuits; or on an optically or magnetically readable medium, such as a CD, DVD, magnetic disk, or magnetic tape. It should be understood that the storage devices and storage media are examples of machine-readable storage suitable for storing one or more programs that, when executed, implement examples of the present invention. Examples of the present invention can be conveyed electronically via any medium, such as a communication signal carried over a wired or wireless connection, and the examples suitably encompass the same.
It should be noted that, because the present invention solves the technical problem discussed above using technical means that a person skilled in the computer and communication fields can understand from its teaching after reading this specification, and achieves the stated technical effect, the solutions claimed in the appended claims are technical solutions within the meaning of patent law. In addition, because the technical solutions claimed in the appended claims can be made or used in industry, they have practical applicability.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Unless expressly stated otherwise, each feature disclosed is only one example of a generic series of equivalent or similar features. The protection scope of the present invention shall therefore be defined by the protection scope of the claims.

Claims (5)

1. A big data mining method based on dimensionality reduction, characterized by comprising the following steps:
S1: performing data selection;
S2: performing data preprocessing, an information processing procedure in which data that are difficult to identify are converted into readily identifiable standardized data;
S3: performing object analysis on the preprocessed data;
S4: transforming the preprocessed data into another space, the transformation concentrating the information of the preprocessed data in a low-dimensional part of that space, and using the low-dimensional data in place of the preprocessed data for subsequent processing;
S5: performing numerical estimation; and
S6: as needed, performing data mining on the basis of steps S1 to S5, and performing evaluation, feedback, and correction of the results;
wherein data selection determines the operational data objects involved in the data mining task, a data set relevant to the mining task being extracted from the related data sources according to the requirements of the data mining task;
wherein, in the data preprocessing step, on the basis of data reduction and attribute reduction, rough sets are used to perform the reduction, and noise elimination, missing-data handling, and duplicate-data elimination are performed;
wherein the data types identified when analyzing the preprocessed data include at least data blocks, data segments, or individual data items;
wherein step S4 includes step S41: the preprocessed and analyzed data are denoted D, an a × b data array to be processed in this step, where a and b are positive integers; the elements of the data array of data object D are $d_{ij}$, where i and j denote the row and column indices in the array, i being a positive integer not greater than a and j a positive integer not greater than b; step S4 transforms the a × b array into an a × c array, from which the original data object D can be reconstructed in another form, where c is less than b;
wherein step S4 further includes step S42: each element $d_{ij}$ is constructed by computation as $d_{ij} = [n, i1, i2, i3, i4, m1, m2]_{ij}$, and data object D is [N, I1, I2, I3, I4, M1, M2], where N denotes a parameter for the type of the object, whose data value is assigned and specified according to the data length and array size and is positively correlated with the data length and/or array size; I1, I2, I3, I4 denote parameters for the direction of the object; and M1, M2 denote parameters of the forward and reverse transformation models of the object; $N = \Theta_N \Phi^{T} + \Xi_N$; $I1 = \Theta_{I1} \Phi^{T} + \Xi_{I1}$; $I2 = \Theta_{I2} \Phi^{T} + \Xi_{I2}$; $I3 = \Theta_{I3} \Phi^{T} + \Xi_{I3}$; $I4 = \Theta_{I4} \Phi^{T} + \Xi_{I4}$; $M1 = \Theta_{M1} \Phi^{T} + \Xi_{M1}$; $M2 = \Theta_{M2} \Phi^{T} + \Xi_{M2}$; where $\Theta_N$, $\Theta_{I1}$, $\Theta_{I2}$, $\Theta_{I3}$, $\Theta_{I4}$, $\Theta_{M1}$, $\Theta_{M2}$ denote a × c arrays of score elements, which are the low-dimensional data in the other space; $\Phi$ denotes the b × c array of loading elements; $\Xi_N$, $\Xi_{I1}$, $\Xi_{I2}$, $\Xi_{I3}$, $\Xi_{I4}$, $\Xi_{M1}$, $\Xi_{M2}$ denote the residuals in the a × b arrays; the superscript T denotes the transpose of an array; and each $\Theta$ contains coordinates, that is, it is an array formed of coordinates.
2. The method of claim 1, wherein c is at least one order of magnitude smaller than b.
3. The method of any preceding claim, wherein, in the numerical estimation operation, each item is regarded as an object to be operated on that is represented by b-dimensional data in the subspace, and the estimation function $F_b$ depends on N, I1, I2, I3, I4, M1, and M2.
4. The method of claim 3, wherein:
the estimation function is $F_b = N_c + I_c + M_c$,
where $N' = \Theta_N \Phi^{T}$, $I1' = \Theta_{I1} \Phi^{T}$, $I2' = \Theta_{I2} \Phi^{T}$, $I3' = \Theta_{I3} \Phi^{T}$, $I4' = \Theta_{I4} \Phi^{T}$, $M1' = \Theta_{M1} \Phi^{T}$, $M2' = \Theta_{M2} \Phi^{T}$; and $\Omega_1$ and $\Omega_2$ are c × c diagonal arrays, the vertices of the b-dimensional space being considered individually.
5. A system for implementing the method of any one of claims 1 to 4, comprising respective devices for implementing each step.
CN201510566756.7A 2015-09-08 2015-09-08 Big data mining method based on dimensionality reduction Active CN105160000B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510566756.7A CN105160000B (en) 2015-09-08 2015-09-08 Big data mining method based on dimensionality reduction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510566756.7A CN105160000B (en) 2015-09-08 2015-09-08 Big data mining method based on dimensionality reduction

Publications (2)

Publication Number Publication Date
CN105160000A CN105160000A (en) 2015-12-16
CN105160000B true CN105160000B (en) 2018-11-02

Family

ID=54800856

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510566756.7A Active CN105160000B (en) 2015-09-08 2015-09-08 Big data mining method based on dimensionality reduction

Country Status (1)

Country Link
CN (1) CN105160000B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136520A (en) * 2013-03-25 2013-06-05 苏州大学 Shape matching and target recognition method based on PCA-SC algorithm
CN104077408A (en) * 2014-07-11 2014-10-01 浙江大学 Distributed semi-supervised content identification and classification method and device for large-scale cross-media data
CN104699772A (en) * 2015-03-05 2015-06-10 孟海东 Big data text classifying method based on cloud computing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7711734B2 (en) * 2006-04-06 2010-05-04 Sas Institute Inc. Systems and methods for mining transactional and time series data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Web文本分类研究及应用;柯慧燕;《中国优秀硕士学位论文全文数据库 信息科技辑》;20060815;第16页第17行、第21页第10-13行,第36页第1行-第37页第9行 *
一种基于PCA 的组合特征提取文本分类方法;李建林;《计算机应用研究》;20130831;第30卷(第8期);2400页第9-28行、第2401页左栏第10-15行 *

Also Published As

Publication number Publication date
CN105160000A (en) 2015-12-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20190529

Address after: 101200 Information Building, No. 13 Linyin North Street, Pinggu District, Beijing, 18th Floor 1808-183

Patentee after: Wuliang Technology Co., Ltd.

Address before: 610000 Tianjiubei Lane 139, Chengdu High-tech Zone, Sichuan Province

Patentee before: Chengdu Bo Yuan epoch softcom limited

CP01 Change in the name or title of a patent holder
CP01 Change in the name or title of a patent holder

Address after: 101200 Information Building, No. 13 Linyin North Street, Pinggu District, Beijing, 18th Floor 1808-183

Patentee after: Immeasurable Technology Co., Ltd

Address before: 101200 Information Building, No. 13 Linyin North Street, Pinggu District, Beijing, 18th Floor 1808-183

Patentee before: Wuliang Technology Co., Ltd.