CN103559642A

CN103559642A - Financial data mining method based on cloud computing

Info

Publication number: CN103559642A
Application number: CN201310536760.XA
Authority: CN
Inventors: 向阳; 罗成; 张依杨; 张波; 袁书寒
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2013-11-04
Filing date: 2013-11-04
Publication date: 2014-02-05

Abstract

The invention discloses a financial data mining method based on cloud computing. The method comprises the following steps: preprocessing the acquired financial data, including error correction and format conversion; establishing the required neuron grid, which is arranged as a rectangle and whose number of nodes is 1% of the number of samples; performing adaptive training using the established grid and the preprocessed data; performing convergence training using the established grid and the preprocessed data; discretizing the data using the weights of the trained neurons so that each sample corresponds to one neuron; and labeling and visualizing each discrete point. The method exploits distributed storage and computing, reduces the dimensionality of the data and clusters it using the properties of a self-organizing neural network, and uses visualization to present the data more intuitively.

Description

Financial data mining method based on cloud computing

Technical field

The present invention relates to a distributed financial data mining method, and in particular to a cloud-computing-based financial data mining method for fast clustering of large volumes of data.

Background

With the rapid development of the Internet, the World Wide Web (WWW) has become a huge information space that provides users with valuable information resources. Faced with large volumes of financial data, how to analyze and process them has become a critical problem. One approach is to reduce the high-dimensional data to two dimensions and visualize it, thereby assisting decision makers in their analysis.

The self-organizing map (SOM) is an important type of neural network based on unsupervised learning. The theory of the self-organizing map network was first proposed in 1981 by Kohonen of Helsinki University of Technology, Finland. Since then, with the rapid development of neural networks in the mid-to-late 1980s, both the theory and the applications of self-organizing maps have advanced considerably.

It is an unsupervised clustering method. It simulates the way neurons in different regions of the human brain specialize in different functions: different regions have different response characteristics, and this specialization arises automatically. A self-organizing map network classifies a set of input patterns by finding an optimal set of reference vectors. Each reference vector is the connection weight vector of one output unit. Compared with traditional pattern clustering methods, the cluster centers it forms can be mapped onto a surface or plane while the topological structure is preserved. Self-organizing maps can therefore be used to identify unknown cluster centers.

The self-organizing neural network is one of the most attractive research fields in neural networks. By learning from input samples, it can detect their regularities and the relationships among them, and it adaptively adjusts the network according to this information so that the network's subsequent responses match the input samples. The neurons of a competitive neural network can learn to recognize groups of similar input vectors; a self-organizing map neural network can likewise learn to recognize groups of similar input vectors, so that neurons close to one another in the network layer respond to similar input vectors. Unlike a competitive neural network, a self-organizing map neural network learns not only the distribution of the input vectors but also their topological structure. No single neuron is decisive for pattern classification; classification is accomplished through the cooperation of multiple neurons.

Learning vector quantization (LVQ) is a supervised learning method for training a competitive layer. A competitive-layer neural network can automatically learn to classify input vector patterns, but the classification depends only on the distances between input vectors: when two input vectors are very close, the competitive layer may assign them to the same class. The design of the competitive layer provides no mechanism for strictly determining whether any two input vectors belong to the same class or to different classes. In an LVQ network, by contrast, the user specifies the target classification, and the network learns under supervision to classify the input vector patterns accurately.

Summary of the invention

The technical problem to be solved by the present invention is to provide a cloud-computing-based financial data mining method that uses the properties of a self-organizing neural network to reduce the dimensionality of data, cluster it, and visualize it.

To solve the above technical problem, the invention provides a financial data mining method based on cloud computing, comprising the following steps:

1) performing preprocessing operations such as data migration and cleaning on the raw data;

2) determining the structure of the neuron grid according to the volume and dimensionality of the raw data;

3) performing adaptive training using the preprocessed data and the neuron grid;

4) performing convergence training using the above data and the result of the adaptive training;

5) discretizing and visualizing the data using the above training result.

The data preprocessing of step 1) comprises the following steps:

11) converting the raw data uniformly into CSV files;

12) filling in the missing data in the above files, replacing each missing value with the mean of its attribute;
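The mean substitution in step 12) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patent's implementation: it assumes a CSV file with a header row and purely numeric columns, the column names `roe` and `eps` are hypothetical, and falling back to 0.0 for a column with no values at all is an assumption.

```python
import csv
import io


def fill_missing_with_mean(csv_text):
    """Parse CSV text and replace empty cells with the mean of their column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    n_cols = len(header)

    # Column means are computed over the values that are actually present.
    means = []
    for col in range(n_cols):
        present = [float(r[col]) for r in body if r[col].strip() != ""]
        # Assumption: a column with no values at all falls back to 0.0.
        means.append(sum(present) / len(present) if present else 0.0)

    filled = [
        [float(r[col]) if r[col].strip() != "" else means[col]
         for col in range(n_cols)]
        for r in body
    ]
    return header, filled


# Hypothetical two-attribute sample with one gap in each column.
sample = "roe,eps\n0.10,1.0\n,3.0\n0.30,\n"
header, filled = fill_missing_with_mean(sample)
```

The means are taken before any value is filled in, so the order of the rows does not affect the result.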

The neuron grid in step 2) is a two-dimensional rectangular lattice whose number of nodes is 1% of the number of samples; the distance between neurons in the two-dimensional rectangular lattice is the Euclidean distance.
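As an illustration of the grid sizing rule, the sketch below derives a rectangular shape from the sample count and measures the Euclidean distance between two grid nodes. The text only requires a rectangular grid with about 1% as many nodes as samples; choosing a near-square shape and rounding the column count up are assumptions, and the function names are hypothetical.

```python
import math


def grid_shape(n_samples, fraction=0.01):
    """Pick rows x cols for a rectangular grid with about fraction * n_samples nodes."""
    n_nodes = max(1, round(n_samples * fraction))
    rows = max(1, math.isqrt(n_nodes))      # assumption: as square as possible
    cols = math.ceil(n_nodes / rows)        # round up so no node is lost
    return rows, cols


def grid_distance(a, b):
    """Euclidean distance between two grid nodes given as (row, col) pairs."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


rows, cols = grid_shape(2500)               # 1% of 2500 samples is 25 nodes
far_corner = grid_distance((0, 0), (3, 4))
```

When 1% of the sample count is not a product of the chosen row count, the grid ends up slightly larger than 1%; for example, 1000 samples give a 3 x 4 grid under these assumptions.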

Step 3) comprises the following steps:

31) setting the initial neighborhood range to the radius of the grid from step 2);

32) setting the relation between the neighborhood contraction coefficient and the initial neighborhood range (formula not reproduced in this text);

33) setting the initial value of the learning step and the step contraction constant;

34) computing the learning step and the neighborhood function at each iteration (formulas not reproduced in this text), where the neighborhood function depends on the distance between two nodes of the grid;

35) inputting the samples in turn and, for each input sample, determining the winning unit, that is, the neuron with the smallest Euclidean distance to the sample;

36) updating the weights according to the weight update formula of each neuron (formula not reproduced in this text);

37) cycling through the samples so that each sample is input at least 1000 times.
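The first training phase (steps 31 to 37) can be sketched as below. Because the formulas are not available in this text, the sketch substitutes the conventional self-organizing map choices: exponential decay of the learning step and neighborhood range, a Gaussian neighborhood function, and the update w <- w + lr * h * (x - w). These are assumptions, as are the parameter names and default values; the sketch is not the patent's exact procedure.

```python
import math
import random


def best_matching_unit(weights, sample):
    """Index of the neuron whose weight vector is closest to the sample (Euclidean)."""
    def sq_dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, sample))
    return min(range(len(weights)), key=lambda j: sq_dist(weights[j]))


def train_som(samples, rows, cols, passes, lr0=0.5, seed=0):
    """Train a rows x cols map; decay laws and Gaussian kernel are assumed."""
    rng = random.Random(seed)
    dim = len(samples[0])
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [[rng.random() for _ in range(dim)] for _ in coords]

    sigma0 = max(rows, cols) / 2.0      # initial neighborhood: grid radius
    tau = passes * len(samples)         # assumed decay constant for both schedules

    t = 0
    for _ in range(passes):
        for x in samples:
            lr = lr0 * math.exp(-t / tau)          # assumed learning-step decay
            sigma = sigma0 * math.exp(-t / tau)    # assumed neighborhood shrinkage
            win = coords[best_matching_unit(weights, x)]
            for j, (r, c) in enumerate(coords):
                d2 = (r - win[0]) ** 2 + (c - win[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))   # Gaussian neighborhood
                w = weights[j]
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])          # move toward the sample
            t += 1
    return coords, weights


# Tiny hypothetical data set; the text calls for at least 1000 passes.
data = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 0.9]]
coords, weights = train_som(data, rows=2, cols=2, passes=50)
```

In a distributed setting the inner loop over neurons is the natural unit to parallelize, but how the work is split across machines is not described in the text.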

Step 4) comprises the following steps:

41) setting the initial neighborhood range to the radius of the grid from step 2);

42) setting the relation between the neighborhood contraction coefficient and the initial neighborhood range (formula not reproduced in this text);

43) setting the initial value of the learning step and the step contraction constant;

44) computing the learning step and the neighborhood function at each iteration (formulas not reproduced in this text), where the neighborhood function depends on the distance between two nodes of the grid;

45) inputting the samples in turn and, for each input sample, determining the winning unit, that is, the neuron with the smallest Euclidean distance to the sample;

46) updating the weights according to the weight update formula of each neuron (formula not reproduced in this text);

47) cycling through the samples so that each sample is input at least 4000 times;

After step 47) is completed, two quantities (symbols not reproduced in this text) are held constant and training continues.
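The idea of holding a quantity constant once the 4000 passes are done can be sketched as a schedule that decays and then freezes. Which two quantities are frozen is not recoverable from this text, so applying the schedule to a learning step is an assumption, as are the exponential decay law, the function name and the numeric values.

```python
import math


def decayed_then_frozen(initial, tau, n_pass, freeze_pass):
    """Exponentially decayed value that stays at its freeze_pass level afterwards."""
    effective = min(n_pass, freeze_pass)    # passes beyond freeze_pass change nothing
    return initial * math.exp(-effective / tau)


# Hypothetical learning step: decays over 4000 passes, then stays constant.
lr_start = decayed_then_frozen(0.05, 2000.0, 0, 4000)
lr_at_4000 = decayed_then_frozen(0.05, 2000.0, 4000, 4000)
lr_at_6000 = decayed_then_frozen(0.05, 2000.0, 6000, 4000)
```

Freezing the schedule keeps later updates small and uniform, so continued training refines the map without reshaping it.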

In step 5), the inner product of each neuron's weight with each sample is computed, and the grid coordinates of the neuron with the largest inner product are taken as the discretization result for that sample.
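A minimal sketch of this discretization rule follows. The grid coordinates, weight vectors and samples are made-up values chosen only to show the mapping, and the function name is hypothetical.

```python
def discretize(samples, coords, weights):
    """Map each sample to the grid coordinate of the neuron with the largest inner product."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    result = []
    for x in samples:
        best = max(range(len(weights)), key=lambda j: dot(weights[j], x))
        result.append(coords[best])
    return result


# Hypothetical 2 x 2 grid with one weight vector per node.
coords = [(0, 0), (0, 1), (1, 0), (1, 1)]
weights = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
cells = discretize([[2.0, 0.1], [0.1, 3.0]], coords, weights)
```

Note that the largest inner product picks the same neuron as the smallest Euclidean distance only when all weight vectors have equal norm, so normalizing the weights beforehand may be worth considering.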

Compared with the prior art, the present invention has the following advantages:

1. it makes good use of distributed storage and computing;

2. it uses the properties of the self-organizing neural network to reduce the dimensionality of the data and cluster it;

3. it adopts visualization, which presents the results more intuitively.

Brief description of the drawings

Fig. 1 is a flow chart of the present invention.

Fig. 2 is the network topology diagram of the data.

Embodiment

The present invention is described in detail below with reference to the drawings and specific embodiments.

As shown in Fig. 1, the invention provides a financial data mining method based on cloud computing, comprising the following steps:

1) preprocessing the acquired financial data, including error correction and format conversion;

2) establishing the required neuron grid, which is arranged as a rectangle and whose number of nodes is 1% of the number of samples;

3) performing adaptive training using the established grid and the preprocessed data;

4) performing convergence training using the established grid and the preprocessed data;

5) discretizing the data using the trained neuron weights so that each sample corresponds to one neuron;

6) labeling and visualizing each discrete point.

The data preprocessing of step 1) includes:

11) converting the raw data uniformly into CSV files;

The neuron grid in step 2) is a two-dimensional rectangular lattice whose number of nodes is 1% of the number of samples. The distance between neurons in the two-dimensional rectangular lattice is the Euclidean distance.

Step 3) comprises the following steps:

31) setting the initial neighborhood range to the radius of the grid from step 2);

32) setting the relation between the neighborhood contraction coefficient and the initial neighborhood range (formula not reproduced in this text);

33) setting the initial value of the learning step and the step contraction constant;

34) computing the learning step and the neighborhood function at each iteration (formulas not reproduced in this text), where the neighborhood function depends on the distance between two nodes of the grid;

36) updating the weights according to the weight update formula of each neuron (formula not reproduced in this text);

37) cycling through the samples so that each sample is input at least 1000 times.

Step 4) comprises the following steps:

41) setting the initial neighborhood range to the radius of the grid from step 2);

42) setting the relation between the neighborhood contraction coefficient and the initial neighborhood range (formula not reproduced in this text);

43) setting the initial value of the learning step and the step contraction constant;

44) computing the learning step and the neighborhood function at each iteration (formulas not reproduced in this text), where the neighborhood function depends on the distance between two nodes of the grid;

46) updating the weights according to the weight update formula of each neuron (formula not reproduced in this text);

47) cycling through the samples so that each sample is input at least 4000 times; after step 47) is completed, two quantities (symbols not reproduced in this text) are held constant and training continues.

The technical solution of the present invention is further illustrated below with a specific example.

Financial data is very complex and comprises many indicators. The example data are the financial indicators of Xingrong Investment for a certain period (table not reproduced in this text). The common practice is to analyze the trend of each indicator separately, but this ignores the correlations among the indicators, and it also makes it difficult to analyze the financial data of all securities on the market in a unified, comprehensive way so as to determine the relationships among the securities. The present method processes the financial data of all listed companies together and compresses the data onto a two-dimensional grid for display, which vividly shows the topological relationships among the listed companies.

One broad category of the above indicators, such as profitability, is selected for analysis. The corresponding data of all listed companies are collected and processed by the algorithm to obtain the two-dimensional grid shown in Fig. 2, in which the different colors clearly reveal the network topology of the data.
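The labeling and display of discrete points on the two-dimensional grid can be illustrated with a small text rendering. The patent's figure uses colors; a plain-text grid is substituted here for simplicity, and the company labels A, B and C as well as the function name are hypothetical.

```python
from collections import defaultdict


def render_grid(rows, cols, cells, labels):
    """Group labels by grid cell and draw the grid as text, one line per grid row."""
    by_cell = defaultdict(list)
    for cell, label in zip(cells, labels):
        by_cell[cell].append(label)

    lines = []
    for r in range(rows):
        # An empty cell is shown as a dot.
        line = " | ".join(
            ",".join(by_cell.get((r, c), [])) or "." for c in range(cols)
        )
        lines.append(line)
    return "\n".join(lines)


# Three hypothetical companies; A and C fall into the same cell.
picture = render_grid(2, 2, [(0, 0), (1, 1), (0, 0)], ["A", "B", "C"])
```

Companies that share a cell or sit in neighboring cells have similar indicator profiles, which is the relationship the grid is meant to expose.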

Claims

1. A financial data mining method based on cloud computing, the method comprising the following steps:

2. The financial data mining method based on cloud computing according to claim 1, characterized in that the data preprocessing of step 1) comprises the following steps:

11) converting the raw data uniformly into CSV files;

12) filling in the missing data in the above files, replacing each missing value with the mean of its attribute.

3. The financial data mining method based on cloud computing according to claim 1, characterized in that the neuron grid in step 2) is a two-dimensional rectangular lattice whose number of nodes is 1% of the number of samples.

4. The financial data mining method based on cloud computing according to claim 3, characterized in that the distance between neurons in the two-dimensional rectangular lattice is the Euclidean distance.

5. The financial data mining method based on cloud computing according to claim 1, characterized in that step 3) comprises the following steps:

31) setting the initial neighborhood range to the radius of the grid from step 2);

32) setting the relation between the neighborhood contraction coefficient and the initial neighborhood range (formula not reproduced in this text);

33) setting the initial value of the learning step and the step contraction constant;

34) computing the learning step and the neighborhood function at each iteration (formulas not reproduced in this text), where the neighborhood function depends on the distance between two nodes of the grid;

36) updating the weights according to the weight update formula of each neuron (formula not reproduced in this text);

37) cycling through the samples so that each sample is input at least 1000 times.

6. The financial data mining method based on cloud computing according to claim 1, characterized in that step 4) comprises the following steps:

41) setting the initial neighborhood range to the radius of the grid from step 2);

42) setting the relation between the neighborhood contraction coefficient and the initial neighborhood range (formula not reproduced in this text);

43) setting the initial value of the learning step and the step contraction constant;

44) computing the learning step and the neighborhood function at each iteration (formulas not reproduced in this text), where the neighborhood function depends on the distance between two nodes of the grid;

46) updating the weights according to the weight update formula of each neuron (formula not reproduced in this text);

47) cycling through the samples so that each sample is input at least 4000 times;

After step 47) is completed, two quantities (symbols not reproduced in this text) are held constant and training continues.

7. The financial data mining method based on cloud computing according to claim 1, characterized in that, in step 5), the inner product of each neuron's weight with each sample is computed, and the grid coordinates of the neuron with the largest inner product are taken as the discretization result for that sample.