Background Art
With the industrialization of society and the continuing rise in the level of IT adoption, data have replaced computation as the center of information computing, and cloud computing and big data are becoming a trend. This shift touches every aspect of a system, including storage capacity, availability, I/O performance, data security, and scalability. Big data is a data set of very large scale and complexity. Big data has the "4 Vs": Volume, the data volume grows continuously and rapidly; Velocity, data I/O is faster; Variety, data types and sources are diverse; Value, the data have usable value in many respects.
Big data contains data units with large data volumes. These data units undoubtedly carry rich and valuable information, which favors in-depth processing such as mining the value of the data. However, while data units with large data volumes add data value, they also increase the number of dimensions of the data, so that the data volume grows sharply; this is a very real problem in the field. High dimensionality and correlation among the data make subsequent operations more complex and thereby substantially reduce processing speed; moreover, under limited sampling, they may reduce the completeness and accuracy of the data. It is therefore very important to perform dimensionality reduction before data units with large data volumes are processed and analyzed.
However, conventional dimensionality reduction methods, and the big data mining methods built on them, struggle to reduce dimensional complexity effectively while also improving the completeness and accuracy of the analysis and mining as much valuable information as possible. The field urgently needs a big data mining method that can effectively solve the above technical problems.
Summary of the Invention
An object of the present invention is to provide a dimensionality-reduction-based big data mining method and system. With the method, and with a device in a system that executes the method, dimensional complexity can be reduced effectively, while the completeness and accuracy of the analysis are improved and as much valuable information as possible is mined.
The technical solution adopted by the present invention to solve the above technical problems is a dimensionality-reduction-based big data mining method comprising the steps of: S1: performing data selection; S2: performing data preprocessing, an information process that converts data that are inconvenient to recognize into normalized data that are easy to recognize; S3: performing object analysis on the preprocessed data; S4: transforming the preprocessed data into another space, the transformation concentrating most of the information of the preprocessed data in a low number of dimensions of that space, and using the low-dimensional data in place of the preprocessed data for subsequent processing; S5: performing numerical estimation; and S6: as needed, performing data mining on the basis of the preceding steps, and evaluating, feeding back, and correcting the results.
According to another aspect of the present invention, data selection determines the operation data objects involved in the data mining task: a data set relevant to the mining task is extracted from the relevant data sources according to the requirements of the data mining task.
According to another aspect of the present invention, in the data preprocessing step, rough sets are used to perform reduction on the basis of data reduction and attribute reduction, so as to facilitate further processing of the subsequent data, improve performance, and achieve a better mining effect; noise elimination, missing-data handling, and duplicate-data elimination operations are also performed.
According to another aspect of the present invention, the data types analyzed for the preprocessed data include at least data blocks, data segments, or individual data items.
According to another aspect of the present invention, step S4 includes step S41: the data after preprocessing and analysis are denoted D, which represents the a × b data array to be processed in this step, where a and b are positive integers; the elements of the data array of the data object D are $d_{ij}$, where i and j denote the row and column indices in the array and are positive integers not greater than a and not greater than b, respectively; through step S4 the a × b array is transformed into an a × c array, which can reconstruct the original data object D in another form, where c is less than b.
According to another aspect of the present invention, if the data object D is information-rich data, then b is at least 2 times c, or c is 1, or c is at least one order of magnitude smaller than b.
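The three preferences on the target dimension c can be checked mechanically. The following is a minimal sketch, not part of the claimed method: the function name `preferred_reduction` is hypothetical, and "one order of magnitude smaller" is assumed to mean a factor of at least 10.

```python
def preferred_reduction(b: int, c: int) -> dict:
    """Report which of the stated preferences the pair (b, c) satisfies.

    b is the original number of columns, c the reduced number of columns.
    """
    if not (0 < c < b):
        raise ValueError("c must be a positive integer smaller than b")
    return {
        "b_at_least_twice_c": b >= 2 * c,
        # Assumption: "one order of magnitude" is read as a factor of 10.
        "c_order_of_magnitude_smaller": 10 * c <= b,
        "c_is_one": c == 1,
    }
```

For example, reducing b = 100 columns to c = 5 satisfies the first two preferences but not the third.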
According to another aspect of the present invention, step S4 further includes step S42: the element $d_{ij}$ is computed and constructed as $d_{ij} = [n, i1, i2, i3, i4, m1, m2]_{ij}$, and the data object D is $[N, I1, I2, I3, I4, M1, M2]$, where N denotes a parameter of the type of the object, whose data value can be assigned and specified according to the data length and array size, the data value being positively correlated with the data length and/or array size; I1, I2, I3, I4 denote parameters of the direction of the object; and M1, M2 denote parameters of the forward and reverse transformation models of the object; $N = \Theta_N \Phi^T + \Xi_N$; $I1 = \Theta_{I1} \Phi^T + \Xi_{I1}$; $I2 = \Theta_{I2} \Phi^T + \Xi_{I2}$; $I3 = \Theta_{I3} \Phi^T + \Xi_{I3}$; $I4 = \Theta_{I4} \Phi^T + \Xi_{I4}$; $M1 = \Theta_{M1} \Phi^T + \Xi_{M1}$; $M2 = \Theta_{M2} \Phi^T + \Xi_{M2}$; where $\Theta_N$, $\Theta_{I1}$, $\Theta_{I2}$, $\Theta_{I3}$, $\Theta_{I4}$, $\Theta_{M1}$, $\Theta_{M2}$ denote the a × c arrays of score elements, which are the low-dimensional data in the other space; $\Phi$ denotes the b × c array of loading elements; $\Xi_N$, $\Xi_{I1}$, $\Xi_{I2}$, $\Xi_{I3}$, $\Xi_{I4}$, $\Xi_{M1}$, $\Xi_{M2}$ denote the residual values in the a × b arrays; the superscript $T$ denotes array transposition; and each $\Theta$ contains coordinates and is an array formed from coordinates.
According to another aspect of the present invention, in the numerical estimation operation, each item is regarded as an object to be operated on that is represented by b-dimensional data in a subspace. The estimation function is $F_b = N_c + I_c + M_c$, where $N' = \Theta_N \Phi^T$, $I1' = \Theta_{I1} \Phi^T$, $I2' = \Theta_{I2} \Phi^T$, $I3' = \Theta_{I3} \Phi^T$, $I4' = \Theta_{I4} \Phi^T$, $M1' = \Theta_{M1} \Phi^T$, $M2' = \Theta_{M2} \Phi^T$; and $\Omega_1$ and $\Omega_2$ are c × c diagonal arrays, so that the vertices of the b-dimensional space can be considered individually.
According to a further aspect of the invention, a device in a system that executes the steps of the above method is provided.
Detailed Description
In the following description, reference is made to the accompanying drawings, and several specific embodiments are shown by way of illustration. It should be understood that other embodiments are contemplated and may be made without departing from the scope or spirit of the present disclosure. The following detailed description is therefore not to be taken in a limiting sense.
According to an exemplary embodiment of the present invention, Fig. 1 illustrates a flowchart of a dimensionality-reduction-based big data mining method.
First, in step S1, data selection is performed. Data selection determines the operation data objects involved in the data mining task: a data set relevant to the mining task is extracted from the relevant data sources according to the requirements of the data mining task.
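As an illustration of step S1, the following minimal sketch extracts a task-relevant data set from a source of records. The function `select_data`, the record layout (dictionaries), and the relevance predicate are assumptions made for the example; the specification does not prescribe a data model.

```python
def select_data(source, required_fields, is_relevant):
    """Extract the task-relevant data set from an iterable of dict records.

    required_fields: fields the mining task needs (hypothetical names).
    is_relevant: predicate expressing the task's selection requirement.
    """
    selected = []
    for record in source:
        # Skip records that lack a required field before testing relevance.
        if all(f in record for f in required_fields) and is_relevant(record):
            # Keep only the fields the mining task operates on.
            selected.append({f: record[f] for f in required_fields})
    return selected
```

Records that lack a required field are skipped rather than padded, so that missing-data handling stays in the preprocessing step.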
Second, in step S2, data preprocessing is performed: an information process that converts data that are inconvenient to recognize into normalized data that are easy to recognize. Data reduction and attribute reduction form the core of this process, and rough sets may be used to perform the reduction, so as to facilitate further processing of the subsequent data, improve performance, and achieve a better mining effect. Noise elimination, missing-data handling, and duplicate-data elimination operations may also be performed.
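The three cleaning operations named in step S2 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method of the specification: the choice of mean imputation for missing values, and of clipping to the median plus or minus k times the median absolute deviation (MAD) for noise, are assumptions, as is the function name `preprocess`. Rough-set attribute reduction is not shown.

```python
from statistics import mean, median

def preprocess(rows, k=3.0):
    """Clean a table given as a list of rows; None marks a missing value."""
    # 1. Eliminate duplicate rows, keeping the first occurrence.
    seen, table = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            table.append(list(row))  # copy, so the input is not modified
    if not table:
        return table
    for j in range(len(table[0])):
        # 2. Missing-data handling: fill gaps with the mean of observed values.
        observed = [r[j] for r in table if r[j] is not None]
        fill = mean(observed) if observed else 0.0
        for r in table:
            if r[j] is None:
                r[j] = fill
        # 3. Noise elimination: clip each value to median +/- k * MAD.
        column = [r[j] for r in table]
        centre = median(column)
        mad = median([abs(v - centre) for v in column])
        if mad > 0:
            low, high = centre - k * mad, centre + k * mad
            for r in table:
                r[j] = min(max(r[j], low), high)
    return table
```

Clipping rather than deleting noisy values keeps the row count unchanged, so the table remains a rectangular array for the later transformation step.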
Third, in step S3, object analysis is performed on the preprocessed data. Preferably, the data type of the preprocessed data is analyzed; the types include, but are not limited to, data blocks, data segments, and individual data items. In particular, a data unit as described herein may refer to the data awaiting the next processing step, which may be one or more data blocks, data segments, or individual data items, or any combination thereof. Its scope includes, but is not limited to, the items listed above.
Fourth, in step S4, the preprocessed data are transformed into another space; the transformation concentrates most of the information of the preprocessed data in a low number of dimensions of that space, and the low-dimensional data are used in place of the preprocessed data for subsequent processing. Specifically, step S4 includes the following steps.
S41: the data after preprocessing and analysis are denoted D, which represents the a × b data array to be processed in this step, where a and b are both positive integers. The elements of the data array of the data object D are $d_{ij}$, where i and j denote the row and column indices in the array and are positive integers not greater than a and not greater than b, respectively. Through step S4 the a × b array is transformed into an a × c array, which can reconstruct the original data object D in another form, where c is less than b. Preferably, if the data object D is information-rich data, b is at least 2 times c; preferably, c is 1; preferably, c is at least one order of magnitude smaller than b.
S42: the element $d_{ij}$ is computed and constructed as $d_{ij} = [n, i1, i2, i3, i4, m1, m2]_{ij}$, and the data object D is $[N, I1, I2, I3, I4, M1, M2]$, where N denotes a parameter of the type of the object, whose data value can be assigned and specified according to the data length and array size; in general, the data value is positively correlated with the data length and/or array size. I1, I2, I3, I4 denote parameters of the direction of the object, and M1, M2 denote parameters of the forward and reverse transformation models of the object. Specifically, $N = \Theta_N \Phi^T + \Xi_N$; $I1 = \Theta_{I1} \Phi^T + \Xi_{I1}$; $I2 = \Theta_{I2} \Phi^T + \Xi_{I2}$; $I3 = \Theta_{I3} \Phi^T + \Xi_{I3}$; $I4 = \Theta_{I4} \Phi^T + \Xi_{I4}$; $M1 = \Theta_{M1} \Phi^T + \Xi_{M1}$; $M2 = \Theta_{M2} \Phi^T + \Xi_{M2}$; where $\Theta_N$, $\Theta_{I1}$, $\Theta_{I2}$, $\Theta_{I3}$, $\Theta_{I4}$, $\Theta_{M1}$, $\Theta_{M2}$ denote the a × c arrays of score elements, which are the low-dimensional data in the other space; $\Phi$ denotes the b × c array of loading elements; $\Xi_N$, $\Xi_{I1}$, $\Xi_{I2}$, $\Xi_{I3}$, $\Xi_{I4}$, $\Xi_{M1}$, $\Xi_{M2}$ denote the residual values in the a × b arrays; and the superscript $T$ denotes array transposition. Each $\Theta$ contains coordinates and is an array formed from coordinates.
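The decomposition of an a × b array into scores, loadings, and a residual can be illustrated for the preferred case c = 1. The specification does not state how the score and loading arrays are computed; the sketch below assumes power iteration toward the dominant singular direction of the uncentered array, which is one common way to obtain such a factorization. The function name `rank_one_decomposition` and the iteration count are hypothetical.

```python
import math

def rank_one_decomposition(D, iterations=200):
    """Return (theta, phi, xi) with D = theta * phi^T + xi, for c = 1.

    D: a x b array as a list of rows.
    theta: a scores (the low-dimensional data), phi: b unit-norm loadings,
    xi: a x b residual array.
    """
    a, b = len(D), len(D[0])
    phi = [1.0 / math.sqrt(b)] * b  # unit-norm starting loading vector
    for _ in range(iterations):
        # Scores for the current loadings: theta = D phi.
        theta = [sum(D[i][j] * phi[j] for j in range(b)) for i in range(a)]
        # Updated loadings: phi = D^T theta, renormalised to unit length.
        phi_new = [sum(D[i][j] * theta[i] for i in range(a)) for j in range(b)]
        norm = math.sqrt(sum(v * v for v in phi_new))
        if norm == 0.0:
            break  # D has no component along the current direction
        phi = [v / norm for v in phi_new]
    theta = [sum(D[i][j] * phi[j] for j in range(b)) for i in range(a)]
    # Residual: what the rank-one product does not capture.
    xi = [[D[i][j] - theta[i] * phi[j] for j in range(b)] for i in range(a)]
    return theta, phi, xi
```

Under this sketch the same routine would be applied to each of the arrays N, I1 to I4, M1, and M2; whether they share one loading array, as the single symbol for the loadings suggests, is left open here.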
Fifth, in step S5, numerical estimation is performed. In the numerical estimation operation, each item is regarded as an object to be operated on that is represented by b-dimensional data in a subspace. The estimation function is $F_b = N_c + I_c + M_c$, where $N' = \Theta_N \Phi^T$, $I1' = \Theta_{I1} \Phi^T$, $I2' = \Theta_{I2} \Phi^T$, $I3' = \Theta_{I3} \Phi^T$, $I4' = \Theta_{I4} \Phi^T$, $M1' = \Theta_{M1} \Phi^T$, $M2' = \Theta_{M2} \Phi^T$; and $\Omega_1$ and $\Omega_2$ are c × c diagonal arrays, so that the vertices of the b-dimensional space can be considered individually. Through this estimation, the attribute constraint conditions can be satisfied.
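The primed quantities are the products of the a × c score array and the transposed b × c loading array, that is, b-dimensional estimates rebuilt from the low-dimensional data. The sketch below computes such an estimate and, as an assumed quality measure that the specification does not define, the relative size of what the estimate leaves out. It does not implement the estimation function itself, whose terms are not fully given; the function names are hypothetical.

```python
import math

def reconstruct(theta, phi):
    """Return the a x b estimate theta @ phi^T (theta: a x c, phi: b x c)."""
    return [
        [sum(t * p for t, p in zip(theta_row, phi_row)) for phi_row in phi]
        for theta_row in theta
    ]

def relative_residual(original, estimate):
    """Frobenius norm of (original - estimate) relative to that of original."""
    num = sum(
        (x - y) ** 2
        for row_o, row_e in zip(original, estimate)
        for x, y in zip(row_o, row_e)
    )
    den = sum(x * x for row in original for x in row)
    if den == 0.0:
        raise ValueError("original array is all zeros")
    return math.sqrt(num / den)
```

A relative residual near zero indicates that the low-dimensional data carry most of the information of the original array.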
Sixth, in step S6, as needed, data mining is performed on the basis of the preceding steps, and the results are evaluated, fed back, and corrected. The data mining in this step may use methods and steps well known in the art.
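Because step S6 leaves the mining method open, any well-known technique can operate on the low-dimensional data. As a stand-in chosen only for illustration, the sketch below clusters one-dimensional scores (the c = 1 case) with a basic k-means loop; the function name `kmeans_1d` and the quantile-based initialization are assumptions.

```python
def kmeans_1d(values, k=2, iterations=50):
    """Cluster scalar values into k groups; return (centres, labels)."""
    ordered = sorted(values)
    # Deterministic start: centres at evenly spaced positions in sorted data.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centres[c]))

    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v)].append(v)
        # An empty group keeps its previous centre.
        updated = [sum(g) / len(g) if g else centres[i] for i, g in enumerate(groups)]
        if updated == centres:
            break  # assignments are stable
        centres = updated
    return centres, [nearest(v) for v in values]
```

The evaluation and feedback called for in this step could, for instance, rerun the clustering with a different k or a different reduced dimension when the groups are not well separated.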
In particular, a data unit as described herein may refer to data to be processed, which may be one or more data blocks, data segments, or individual data items, or any combination thereof. Its scope includes, but is not limited to, the items listed above.
Through the above process, the dimensionality-reduction-based big data mining method of the present invention can effectively reduce dimensional complexity, while also improving the completeness and accuracy of the analysis and mining as much valuable information as possible.
It should be understood that examples and embodiments of the present invention may be realized in hardware, software, or a combination of hardware and software. As described above, any software that executes the method may be stored in volatile or non-volatile storage, for example a storage device such as a ROM, whether erasable or rewritable or not; or in the form of memory, such as RAM, memory chips, devices, or integrated circuits; or on an optically or magnetically readable medium, such as a CD, DVD, magnetic disk, or magnetic tape. It should be understood that the storage devices and storage media are examples of machine-readable storage suitable for storing one or more programs that, when executed, implement examples of the present invention. Examples of the present invention may be conveyed electronically via any medium, such as a communication signal carried over a wired or wireless connection, and the examples suitably encompass the same.
It should be noted that, because the present invention solves the technical problems discussed above by technical means that a person skilled in the computer and communications fields can understand from its teaching after reading this specification, and obtains the stated technical effects, the solutions claimed in the appended claims are technical solutions within the meaning of patent law. In addition, because the technical solutions claimed in the appended claims can be made or used in industry, they have practical applicability.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Unless expressly stated otherwise, each feature disclosed is only one example of a generic series of equivalent or similar features. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.