CN105160000B

CN105160000B - Big data mining method based on dimensionality reduction

Info

Publication number: CN105160000B
Application number: CN201510566756.7A
Authority: CN
Inventors: 杨立波
Original assignee: Chengdu Bo Yuan Epoch Softcom Ltd
Current assignee: Immeasurable Technology Co., Ltd
Priority date: 2015-09-08
Filing date: 2015-09-08
Publication date: 2018-11-02
Anticipated expiration: 2035-09-08
Also published as: CN105160000A

Abstract

A big data mining method and device based on dimensionality reduction. The method comprises the steps of: selecting data; preprocessing the data, an information-processing step that converts data that is difficult to recognize into easily recognized, normalized data; performing object analysis on the preprocessed data; transforming the preprocessed data into another space to obtain low-dimensional data, which replaces the preprocessed data in subsequent processing; performing numerical estimation; and performing data mining, followed by evaluation of the results together with feedback and correction. With this method and a device that executes it, the complexity arising from dimensionality can be effectively reduced while the completeness and accuracy of the analysis are improved, so that as much valuable information as possible is mined.

Description

Big data mining method based on dimensionality reduction

Technical field

The present invention relates to the field of electronic data processing and, more specifically, to a big data mining method and system based on dimensionality reduction.

Background art

With the continuing industrialization of society and the rising level of informatization, data has replaced computation as the center of information processing, and cloud computing and big data are becoming a trend. This shift touches many aspects, including storage capacity, availability, I/O performance, data security, and scalability. Big data refers to data sets that are very large in scale and complex. Big data is characterized by the 4 Vs: Volume (large amount), the volume of data keeps growing rapidly; Velocity (high speed), data I/O is ever faster; Variety (diversity), data types and sources are diverse; and Value, the data has usable value in many respects.

Big data contains data units with large data volumes. These data units certainly carry rich and valuable information, which favors in-depth processing such as mining the value of the data. However, while such data units add to the value of the data, they also increase its number of dimensions, so that the data volume grows sharply; this is a very practical problem in the field. High dimensionality, together with the correlations among the data, makes subsequent computation more complex and thereby greatly reduces processing speed; with limited sampling, it may also reduce the completeness and accuracy of the data. Reducing the dimensionality of large-volume data units before they are processed and analyzed is therefore very important work.

However, traditional dimensionality reduction methods, and the big data mining methods that rely on them, struggle to reduce the complexity arising from dimensionality effectively while also improving the completeness and accuracy of the analysis and mining as much valuable information as possible. The field urgently needs a big data mining method that can effectively solve the above technical problem.

Summary of the invention

An object of the present invention is to provide a big data mining method and system based on dimensionality reduction. With this method, and with the devices in a system that executes it, the complexity arising from dimensionality can be effectively reduced while the completeness and accuracy of the analysis are improved, so that as much valuable information as possible is mined.

The technical solution adopted by the present invention to solve the above technical problem is a big data mining method based on dimensionality reduction, comprising the steps of: S1: performing data selection; S2: performing data preprocessing, an information-processing step that converts data that is difficult to recognize into easily recognized, normalized data; S3: performing object analysis on the preprocessed data; S4: transforming the preprocessed data into another space, the transformation concentrating most of the information of the preprocessed data in a low-dimensional part of that space, and using the low-dimensional data in place of the preprocessed data for subsequent processing; S5: performing numerical estimation; and S6: as needed, performing data mining based on the above steps, and evaluating the results with feedback and correction.

According to another aspect of the present invention, data selection determines the operational data objects involved in the data mining task and, according to the requirements of that task, extracts the data set relevant to the mining task from the related data sources.

According to another aspect of the present invention, the data preprocessing step is based on normalization and attribute reduction, with rough sets used to perform the reduction, so as to facilitate further processing of the subsequent data, improve performance, and achieve a better mining result; noise elimination, missing-data handling, and duplicate-data elimination are also performed.

According to another aspect of the present invention, the data types analyzed for the preprocessed data include at least a data block, a data segment, or an individual datum.

According to another aspect of the present invention, step S4 includes step S41: the data after preprocessing and analysis is D, which denotes an a × b data array to be processed in this step, where a and b are positive integers. The elements of the data array of data object D are d_ij, where i and j denote the corresponding row and column indices in the array, and i and j are positive integers no greater than a and b, respectively. Step S4 transforms the a × b array into an a × c array, and the a × c array can reconstruct the original data object D in another form, where c is less than b.

According to another aspect of the present invention, if data object D is data rich in information content, b is at least 2 times c, or c is 1, or c is at least one order of magnitude smaller than b.
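The size preferences stated above can be checked mechanically. The following is a minimal sketch and not part of the described method: the function name `dimension_preferences` is hypothetical, and reading "at least one order of magnitude smaller" as b ≥ 10·c is an assumption.

```python
def dimension_preferences(b: int, c: int) -> dict:
    """Report which of the stated size preferences a candidate (b, c) pair meets.

    b is the original number of columns and c the reduced number of columns.
    """
    if not (isinstance(b, int) and isinstance(c, int) and 0 < c < b):
        raise ValueError("b and c must be positive integers with c < b")
    return {
        "b_at_least_twice_c": b >= 2 * c,
        "c_is_one": c == 1,
        # Assumption: "one order of magnitude smaller" is read as a factor of 10.
        "c_order_of_magnitude_smaller": b >= 10 * c,
    }
```

A caller could use the returned flags to reject a candidate c before running the transformation of step S4.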

According to another aspect of the present invention, step S4 further comprises step S42: each element d_ij is computed and constructed as d_ij = [n, i1, i2, i3, i4, m1, m2]_ij, so that data object D is [N, I1, I2, I3, I4, M1, M2]. Here N denotes the parameter for the type of the object; its data value can be assigned and specified according to the data length and array size, and the data value is positively correlated with the data length and/or array size. I1, I2, I3, I4 denote the parameters for the direction of the object, and M1, M2 denote the parameters of the forward and reverse transformation models of the object. N = Θ_N Φ^T + Ξ_N; I1 = Θ_I1 Φ^T + Ξ_I1; I2 = Θ_I2 Φ^T + Ξ_I2; I3 = Θ_I3 Φ^T + Ξ_I3; I4 = Θ_I4 Φ^T + Ξ_I4; M1 = Θ_M1 Φ^T + Ξ_M1; M2 = Θ_M2 Φ^T + Ξ_M2. Here Θ_N, Θ_I1, Θ_I2, Θ_I3, Θ_I4, Θ_M1, Θ_M2 denote a × c arrays of score element components, which are the low-dimensional data in the other space; Φ denotes the b × c array of loading elements; and Ξ_N, Ξ_I1, Ξ_I2, Ξ_I3, Ξ_I4, Ξ_M1, Ξ_M2 denote the residuals in the a × b arrays. The superscript "^T" denotes the transpose of an array. Each Θ contains coordinates and is an array formed from those coordinates.

According to another aspect of the present invention, in the numerical estimation operation each item is regarded as an object to be operated on, represented by b-dimensional data in the subspace. The estimation function is F^b = N_c + I_c + M_c,

where N' = Θ_N Φ^T, I1' = Θ_I1 Φ^T, I2' = Θ_I2 Φ^T, I3' = Θ_I3 Φ^T, I4' = Θ_I4 Φ^T, M1' = Θ_M1 Φ^T, M2' = Θ_M2 Φ^T; and Ω¹ and Ω² are c × c diagonal arrays, so that the vertices of the b-dimensional space can be considered individually.

According to a further aspect of the invention, a device is provided, within a system, for executing the steps of the above method.

Description of the drawings

The accompanying drawings show embodiments of the present invention by way of example and not by way of limitation, in which:

Fig. 1 illustrates a flowchart of a big data mining method based on dimensionality reduction according to an exemplary embodiment of the present invention.

Detailed description

In the following description, reference is made to the accompanying drawing, and several specific embodiments are shown by way of illustration. It will be appreciated that other embodiments are contemplated and may be made without departing from the scope or spirit of the present disclosure. The following detailed description is therefore not to be taken in a limiting sense.

First, in step S1, data selection is performed. Data selection determines the operational data objects involved in the data mining task and, according to the requirements of that task, extracts the data set relevant to the mining task from the related data sources.

Second, in step S2, data preprocessing is performed, an information-processing step that converts data that is difficult to recognize into easily recognized, normalized data. Normalization and attribute reduction are the core of this step; rough sets can be used to perform the reduction, so as to facilitate further processing of the subsequent data, improve performance, and achieve a better mining result. Noise elimination, missing-data handling, and duplicate-data elimination can also be performed.
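The three cleaning operations named in step S2 can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how noise or missing data are handled, so mean imputation and clamping to mean ± k standard deviations are stand-in choices, the function name `preprocess` and parameter `k` are hypothetical, and the rough-set attribute reduction is not shown.

```python
from statistics import mean, pstdev


def preprocess(rows, k=3.0):
    """Deduplicate records, fill missing values, and clamp outliers per column.

    rows is a list of equal-length numeric records; None marks a missing value.
    """
    # 1. Duplicate elimination, keeping the first occurrence of each record.
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(list(row))
    if not unique:
        return []

    for j in range(len(unique[0])):
        observed = [r[j] for r in unique if r[j] is not None]
        if not observed:
            raise ValueError(f"column {j} has no observed values")
        mu = mean(observed)
        sigma = pstdev(observed)
        low, high = mu - k * sigma, mu + k * sigma
        for r in unique:
            if r[j] is None:
                # 2. Missing-data handling (assumption: impute the column mean).
                r[j] = mu
            else:
                # 3. Noise elimination (assumption: clamp to mean +/- k * sigma).
                r[j] = min(max(r[j], low), high)
    return unique
```

The order of operations matters in this sketch: duplicates are removed first so that repeated records do not bias the column statistics used for imputation and clamping.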

Third, in step S3, object analysis is performed on the preprocessed data. Preferably, the data types of the preprocessed data are analyzed; these include, but are not limited to, data blocks, data segments, and individual data. In particular, a data unit as described herein may refer to the data awaiting the next processing step, and may be one or more data blocks, data segments, or individual data, or any combination thereof. Its scope includes but is not limited to the content listed above.

Next, in step S4, the preprocessed data is transformed into another space; the transformation concentrates most of the information of the preprocessed data in a low-dimensional part of that space, and the low-dimensional data replaces the preprocessed data in subsequent processing. Specifically, step S4 includes the following steps.

S41: the data after preprocessing and analysis is D, which denotes an a × b data array to be processed in this step, where a and b are both positive integers. The elements of the data array of data object D are d_ij, where i and j denote the corresponding row and column indices in the array, and i and j are positive integers no greater than a and b, respectively. Step S4 transforms the a × b array into an a × c array, and the a × c array can reconstruct the original data object D in another form, where c is less than b. Preferably, if data object D is data rich in information content, b is at least 2 times c; preferably, c is 1; preferably, c is at least one order of magnitude smaller than b.

S42: each element d_ij is computed and constructed as d_ij = [n, i1, i2, i3, i4, m1, m2]_ij, so that data object D is [N, I1, I2, I3, I4, M1, M2]. Here N denotes the parameter for the type of the object; its data value can be assigned and specified according to the data length and array size, and in general the data value is positively correlated with the data length and/or array size. I1, I2, I3, I4 denote the parameters for the direction of the object, and M1, M2 denote the parameters of the forward and reverse transformation models of the object. Specifically, N = Θ_N Φ^T + Ξ_N; I1 = Θ_I1 Φ^T + Ξ_I1; I2 = Θ_I2 Φ^T + Ξ_I2; I3 = Θ_I3 Φ^T + Ξ_I3; I4 = Θ_I4 Φ^T + Ξ_I4; M1 = Θ_M1 Φ^T + Ξ_M1; M2 = Θ_M2 Φ^T + Ξ_M2. Here Θ_N, Θ_I1, Θ_I2, Θ_I3, Θ_I4, Θ_M1, Θ_M2 denote a × c arrays of score element components, which are precisely the low-dimensional data in the other space; Φ denotes the b × c array of loading elements; and Ξ_N, Ξ_I1, Ξ_I2, Ξ_I3, Ξ_I4, Ξ_M1, Ξ_M2 denote the residuals in the a × b arrays. The superscript "^T" denotes the transpose of an array. Each Θ contains coordinates and is an array formed from those coordinates.
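The decomposition of one a × b array into a score array Θ, a loading array Φ, and a residual Ξ can be sketched as follows. The text does not state how Θ and Φ are computed, so power iteration with deflation is used here purely as a stand-in; the names `decompose` and `matmul_t` are hypothetical, the array is not mean-centered, and the sketch handles a single array rather than the seven arrays that share one Φ.

```python
import math


def matmul_t(theta, phi):
    """Return theta @ phi^T for theta (a x c) and phi (b x c)."""
    return [[sum(tk * pk for tk, pk in zip(t_row, p_row)) for p_row in phi]
            for t_row in theta]


def decompose(D, c, iters=500):
    """Split D (a x b) into scores (a x c), loadings (b x c), and residual (a x b).

    By construction D == scores @ loadings^T + residual, up to rounding.
    """
    a, b = len(D), len(D[0])
    R = [row[:] for row in D]              # working copy, deflated per component
    theta = [[0.0] * c for _ in range(a)]
    phi = [[0.0] * c for _ in range(b)]
    for k in range(c):
        # Fixed starting vector; assumes it is not orthogonal to the
        # dominant direction of the data.
        v = [1.0 / math.sqrt(b)] * b
        for _ in range(iters):
            t = [sum(R[i][j] * v[j] for j in range(b)) for i in range(a)]  # R v
            w = [sum(R[i][j] * t[i] for i in range(a)) for j in range(b)]  # R^T t
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break                       # nothing left to extract
            v = [x / norm for x in w]
        t = [sum(R[i][j] * v[j] for j in range(b)) for i in range(a)]
        for i in range(a):
            theta[i][k] = t[i]              # score column k
            for j in range(b):
                R[i][j] -= t[i] * v[j]      # remove the extracted component
        for j in range(b):
            phi[j][k] = v[j]                # loading column k
    return theta, phi, R
```

For an array whose rows are all multiples of one another, `decompose(D, 1)` leaves a residual that is zero up to rounding, which corresponds to the preferred case c = 1 in which a single score column carries all of the information.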

Next, in step S5, numerical estimation is performed. In the numerical estimation operation, each item is regarded as an object to be operated on, represented by b-dimensional data in the subspace. The estimation function is F^b = N_c + I_c + M_c,

where N' = Θ_N Φ^T, I1' = Θ_I1 Φ^T, I2' = Θ_I2 Φ^T, I3' = Θ_I3 Φ^T, I4' = Θ_I4 Φ^T, M1' = Θ_M1 Φ^T, M2' = Θ_M2 Φ^T; and Ω¹ and Ω² are c × c diagonal arrays, so that the vertices of the b-dimensional space can be considered individually. This estimation allows the attribute constraint conditions to be satisfied.
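The primed arrays are plain products of a score array and the transposed loading array, which the following sketch computes. It does not implement the estimation function F^b, because the terms N_c, I_c, and M_c are not defined in the text as given. The function names are hypothetical, and `retained_fraction` is an assumed diagnostic (one minus the squared residual over the squared magnitude of the original) rather than a quantity named in the description.

```python
def reconstruct(theta, phi):
    """Return the primed array theta @ phi^T (a x b).

    theta is an a x c score array and phi a b x c loading array.
    """
    return [[sum(tk * pk for tk, pk in zip(t_row, p_row)) for p_row in phi]
            for t_row in theta]


def retained_fraction(original, primed):
    """Share of the squared magnitude of `original` captured by `primed`.

    A value of 1.0 means the residual (original minus primed) is zero.
    """
    total = sum(x * x for row in original for x in row)
    if total == 0.0:
        raise ValueError("original array is all zeros")
    resid = sum((x - y) ** 2
                for row_o, row_p in zip(original, primed)
                for x, y in zip(row_o, row_p))
    return 1.0 - resid / total


# Example with c = 1: N' = Theta_N Phi^T for a 2 x 3 array N.
theta_n = [[2.0], [1.0]]
phi = [[1.0], [0.0], [2.0]]
n_primed = reconstruct(theta_n, phi)
```

A retained fraction close to 1.0 would indicate that most of the information has been concentrated in the c score columns, which is the stated aim of step S4.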

Finally, in step S6, data mining is performed as needed on the basis of the above steps, and the results are evaluated with feedback and correction. The data mining in this step may use methods well known in the art.
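The evaluate, feed back, and correct cycle of step S6 can be sketched as a loop around the earlier stages. Everything in this sketch is an assumption made for illustration: the description does not specify the evaluation criterion or the form of the correction, so both are passed in as caller-supplied functions, and the names `run_pipeline`, `evaluate`, `revise`, and `max_rounds` are hypothetical.

```python
from typing import Any, Callable, Sequence


def run_pipeline(source: Any,
                 stages: Sequence[Callable[[Any], Any]],
                 evaluate: Callable[[Any], bool],
                 revise: Callable[[Any], Any],
                 max_rounds: int = 3):
    """Run the stages in order; on a failed evaluation, revise the input and retry.

    Returns the accepted result and the round in which it was accepted.
    """
    data = source
    for round_no in range(1, max_rounds + 1):
        result = data
        for stage in stages:          # e.g. preprocessing, reduction, mining
            result = stage(result)
        if evaluate(result):          # evaluation of the mining result
            return result, round_no
        data = revise(data)           # feedback and correction before retrying
    raise RuntimeError("evaluation did not pass within max_rounds")
```

Passing the stages as callables keeps the loop independent of any particular mining method, in line with the statement that well-known methods may be used in this step.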

In particular, a data unit as described herein may refer to data to be processed, and may be one or more data blocks, data segments, or individual data, or any combination thereof. Its scope includes but is not limited to the content listed above.

Through the above process, the big data mining method based on dimensionality reduction of the present invention can effectively reduce the complexity arising from dimensionality while also improving the completeness and accuracy of the analysis, so that as much valuable information as possible is mined.

It will be appreciated that examples and embodiments of the present invention can be realized in the form of hardware, software, or a combination of hardware and software. As described above, any software for executing the method can be stored in volatile or non-volatile storage, for example a storage device such as a ROM, whether erasable or rewritable or not, or in the form of memory such as RAM, memory chips, devices, or integrated circuits, or on an optically or magnetically readable medium such as a CD, DVD, magnetic disk, or magnetic tape. It will be appreciated that storage devices and storage media are examples of machine-readable storage suitable for storing one or more programs that, when executed, implement examples of the present invention. Examples of the present invention may be conveyed electronically via any medium, such as a communication signal carried over a wired or wireless connection, and the examples suitably encompass the same.

It should be noted that, because the present invention solves the technical problem discussed above, employs technical means that a person skilled in the computer and communication arts can understand from its teaching after reading this specification, and achieves the stated technical effect, the solutions claimed in the appended claims are technical solutions within the meaning of patent law. In addition, because the technical solutions claimed in the appended claims can be made or used in industry, they have practical applicability.

The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Unless expressly stated otherwise, each feature disclosed is only one example of a generic series of equivalent or similar features. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims

1. A big data mining method based on dimensionality reduction, characterized by comprising the following steps:

S1: performing data selection;

S2: performing data preprocessing, an information-processing step that converts data that is difficult to recognize into easily recognized, normalized data;

S3: performing object analysis on the preprocessed data;

S4: transforming the preprocessed data into another space, the transformation concentrating the information of the preprocessed data in a low-dimensional part of that space, and using the low-dimensional data in place of the preprocessed data for subsequent processing;

S5: performing numerical estimation; and

S6: as needed, performing data mining based on steps S1 to S5, and evaluating the results with feedback and correction;

wherein data selection determines the operational data objects involved in the data mining task and, according to the requirements of that task, extracts the data set relevant to the mining task from the related data sources;

wherein the data preprocessing step is based on normalization and attribute reduction, uses rough sets to perform the reduction, and performs noise elimination, missing-data handling, and duplicate-data elimination;

wherein the data types analyzed for the preprocessed data include at least a data block, a data segment, or an individual datum;

wherein step S4 includes step S41: the data after preprocessing and analysis is D, which denotes an a × b data array to be processed in this step, where a and b are both positive integers; the elements of the data array of data object D are d_ij, where i and j denote the corresponding row and column indices in the array, and i and j are positive integers no greater than a and b, respectively; step S4 transforms the a × b array into an a × c array, and the a × c array can reconstruct the original data object D in another form, where c is less than b;

wherein step S4 further comprises step S42: each element d_ij is computed and constructed as d_ij = [n, i1, i2, i3, i4, m1, m2]_ij, so that data object D is [N, I1, I2, I3, I4, M1, M2], where N denotes the parameter for the type of the object, its data value being assigned and specified according to the data length and array size and being positively correlated with the data length and/or array size; I1, I2, I3, I4 denote the parameters for the direction of the object, and M1, M2 denote the parameters of the forward and reverse transformation models of the object; N = Θ_N Φ^T + Ξ_N; I1 = Θ_I1 Φ^T + Ξ_I1; I2 = Θ_I2 Φ^T + Ξ_I2; I3 = Θ_I3 Φ^T + Ξ_I3; I4 = Θ_I4 Φ^T + Ξ_I4; M1 = Θ_M1 Φ^T + Ξ_M1; M2 = Θ_M2 Φ^T + Ξ_M2; where Θ_N, Θ_I1, Θ_I2, Θ_I3, Θ_I4, Θ_M1, Θ_M2 denote a × c arrays of score element components, which are the low-dimensional data in the other space; Φ denotes the b × c array of loading elements; and Ξ_N, Ξ_I1, Ξ_I2, Ξ_I3, Ξ_I4, Ξ_M1, Ξ_M2 denote the residuals in the a × b arrays; the superscript "^T" denotes the transpose of an array; and each Θ contains coordinates and is an array formed from those coordinates.

2. The method of claim 1, wherein c is at least one order of magnitude smaller than b.

3. The method of any preceding claim, wherein, in the numerical estimation operation, each item is regarded as an object to be operated on, represented by b-dimensional data in the subspace, and the estimation function F^b is related to N, I1, I2, I3, I4, M1, and M2.

4. The method of claim 3, wherein:

the estimation function is F^b = N_c + I_c + M_c,

where N' = Θ_N Φ^T, I1' = Θ_I1 Φ^T, I2' = Θ_I2 Φ^T, I3' = Θ_I3 Φ^T, I4' = Θ_I4 Φ^T, M1' = Θ_M1 Φ^T, M2' = Θ_M2 Φ^T; and Ω¹ and Ω² are c × c diagonal arrays, the vertices of the b-dimensional space being considered individually.

5. A system for implementing the method of any one of claims 1-4, comprising a respective device for implementing each step.