CN105354238A - Distribution-based big data mining method - Google Patents

Publication number
CN105354238A
Authority
CN
China
Prior art keywords
data
division
classification
training
collection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510651258.2A
Other languages
Chinese (zh)
Inventor
杨立波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Bo Yuan Epoch Softcom Ltd
Original Assignee
Chengdu Bo Yuan Epoch Softcom Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Bo Yuan Epoch Softcom Ltd
Priority to CN201510651258.2A
Publication of CN105354238A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/278 Data partitioning, e.g. horizontal or vertical partitioning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a distribution-based big data mining method, system and device. The method is characterized by comprising the following steps: defining the data according to the user's demands on the available data; obtaining the data from a source, preparing the data, browsing the data, and integrating and checking the data to remove wrong or inconsistent data; processing the data; and testing, verifying, deploying and updating the result. With this method, system and device, the efficiency, security and accuracy of big data mining can be improved, and computation and storage costs can be reduced.

Description

Distribution-based big data mining method
Technical field
The present invention relates to the field of big data information processing and, more specifically, to a distribution-based big data mining method, system and device.
Background
With the continuing industrialization of society and the rising level of informatization, data has replaced computation as the center of information computing, and cloud computing and big data are becoming a trend. This involves many aspects, including storage capacity, availability, I/O performance, data security and scalability. Big data refers to data sets that are very large and complex. Big data has four Vs: Volume, meaning the amount of data grows rapidly and continuously; Velocity, meaning data I/O is faster; Variety, meaning data types and sources are diverse; and Value, meaning the data has usable value in many respects. However, big data contains a massive amount of information, and the computational cost of mining a centralized database is much higher than the total cost of analyzing and processing many smaller distributed blocks of data. Distributed big data mining is therefore the preferred way to exploit the usable data resources within massive information. In addition, the data must be learned from, and distributing the learning process is a natural and effective way to improve learning, so distributed learning is the ideal approach.
In the prior art, however, many data mining methods are not efficient enough, and their security and accuracy do not reach a satisfactory level. Costs such as computation and storage have also not been sufficiently optimized. There is therefore a need in the art for a distribution-based big data mining method that effectively solves the above technical problems.
Summary of the invention
An object of the present invention is to provide a distribution-based big data mining method, system and device. With the method and the device that performs it, the efficiency, security and accuracy of big data mining can be improved, and costs such as computation and storage can be reduced.
The technical solution adopted by the present invention to solve the above technical problems is a distribution-based big data mining method, characterized by comprising the following steps: in step S1, defining the data according to the user's demands on the available data; in step S2, obtaining the data from a source, preparing the data, browsing the data, and integrating and checking the data to remove wrong or inconsistent data; in step S3, processing the data; and in step S4, testing, verifying, deploying and updating the result.
According to another aspect of the invention, a system for implementing the distribution-based big data mining method is provided.
According to another aspect of the invention, a device for implementing the distribution-based big data mining method is provided, comprising means for implementing each step.
Brief description of the drawings
Embodiments of the invention are shown in the accompanying drawings by way of example and not by way of limitation, in which:
Fig. 1 shows a flowchart of a distribution-based big data mining method according to an embodiment of the invention.
Fig. 2 shows a flowchart of the data processing according to a first embodiment of the invention.
Fig. 3 shows another flowchart of the data processing according to a second embodiment of the invention.
Fig. 4 shows another flowchart of the data processing according to a third embodiment of the invention.
Fig. 5 shows another flowchart of the data processing according to a fourth embodiment of the invention.
Detailed description
In the following description, reference is made to the accompanying drawings, which show several specific embodiments by way of illustration. It should be understood that other embodiments can be contemplated and made without departing from the scope or spirit of the present disclosure. The following detailed description should therefore not be taken in a limiting sense.
Fig. 1 illustrates a flowchart of a distribution-based big data mining method according to an embodiment of the invention. The method is applicable to, and suited for, a distribution-based big data architecture.
First, in step S1, the data is defined according to the user's demands on the available data.
Second, in step S2, the data is obtained from a source, prepared and browsed, and then integrated and checked to remove wrong or inconsistent data.
Next, in step S3, the data is processed.
Finally, in step S4, the result is tested, verified, deployed and updated.
Preferably, the method is applied in a distribution-based big data architecture.
In step S1 above, the user may be an operating entity that handles different types of big data in different fields. The user may be a person or a mechanism such as an electronic device, that is, a device with basic processing components such as a processor, memory, a bus and a power circuit. Preferably, the mechanism may also have, as needed, an input device such as a keyboard, keypad or touch screen, and a display device such as a graphical user interface. The different fields include all fields that exist now or are developed later, and may even include several fields or interdisciplinary fields at once. How the data is defined depends on the user's requirements.
In step S2, the data may be obtained in any manner, including any existing or later-developed approach. Likewise, the data may be integrated and/or checked in any manner.
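Because the description leaves the integration and checking method open, the following Python sketch shows only one possible way to do it. It merges records from several sources by an identifier and drops records that are wrong (no identifier) or inconsistent (two sources disagree). The function name `integrate_and_check`, the dictionary record layout and the `key` parameter are illustrative assumptions, not part of the patent.

```python
def integrate_and_check(sources, key):
    """Merge records from several sources and drop wrong or inconsistent ones.

    Assumptions (not from the patent): a record is a dict, a record without
    `key` counts as wrong, and records sharing a key but differing in content
    count as inconsistent.
    """
    merged = {}
    conflicted = set()
    for source in sources:
        for record in source:
            identifier = record.get(key)
            if identifier is None:
                continue  # wrong data: no identifier to integrate on
            if identifier in merged and merged[identifier] != record:
                conflicted.add(identifier)  # sources disagree about this record
            else:
                merged[identifier] = record
    # Keep only records that every source agrees on.
    return [r for k, r in merged.items() if k not in conflicted]


source_a = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
source_b = [{"id": 2, "v": "c"}, {"id": 3, "v": "d"}, {"v": "e"}]
clean = integrate_and_check([source_a, source_b], key="id")
```

Dropping every conflicting record is the most conservative choice; a real system might instead prefer the most recent source or flag the conflict for review.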
In step S4, the result may likewise be tested, verified, deployed and updated in any manner, including any existing or later-developed approach.
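The four steps can be read as a simple pipeline. The Python sketch below is a minimal illustration under stated assumptions: the names `DataDefinition`, `define_data`, `prepare_data`, `process_data`, `verify_result` and `run_pipeline` are hypothetical, the "definition" is reduced to a list of required fields, and the mining and verification logic are passed in as plain callables because the patent does not fix them.

```python
from dataclasses import dataclass


@dataclass
class DataDefinition:
    """S1 output: the fields the user needs (a hypothetical structure)."""
    required_fields: tuple


def define_data(user_demand):
    # S1: define the data according to the user's demand on available data.
    return DataDefinition(required_fields=tuple(user_demand))


def prepare_data(source, definition):
    # S2: obtain the data, then check it and drop records that lack a
    # required field (a stand-in for "wrong or inconsistent" data).
    cleaned = []
    for record in source:
        if all(record.get(f) is not None for f in definition.required_fields):
            cleaned.append({f: record[f] for f in definition.required_fields})
    return cleaned


def process_data(records, miner):
    # S3: process the data with whatever mining routine the caller supplies.
    return miner(records)


def verify_result(result, check):
    # S4: test and verify the result; raising keeps a bad result from
    # being deployed.
    if not check(result):
        raise ValueError("result failed verification")
    return result


def run_pipeline(user_demand, source, miner, check):
    definition = define_data(user_demand)
    records = prepare_data(source, definition)
    return verify_result(process_data(records, miner), check)


source = [{"x": 1, "y": 2}, {"x": None, "y": 3}, {"x": 4}]
result = run_pipeline(["x", "y"], source, miner=len, check=lambda r: r >= 0)
```

Passing the miner and the check as parameters mirrors the description's point that the concrete methods for each step are left open.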
Fig. 2 shows a flowchart of the data processing according to a first embodiment of the invention. In step S3, processing the data preferably includes the following steps:
S31: decomposing the previously integrated and checked data into valid training data;
S32: training a classification partition on the basis of the valid training data;
S33: delivering the trained classification partition to all nodes;
S34: forming a first new training set;
S35: transmitting the new subset to an individual node; and
S36: at the individual node, forming a second new training set that contains the new subsets of all the data, and retraining the overall classification partition on the basis of the second new training set.
This process greatly improves the efficiency of big data mining, improves security and accuracy, and at the same time reduces costs such as computation and storage.
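The first embodiment does not say which model is trained or how each node picks its new subset, so the following Python sketch fills those gaps with loudly labeled assumptions: the trained classification is modeled as a one-dimensional nearest-centroid classifier, the initial model is trained on the first shard only, and each node's new subset consists of the local samples the delivered model misclassifies. The names `train`, `predict` and `first_embodiment` are hypothetical.

```python
from collections import defaultdict


def train(samples):
    """Nearest-centroid stand-in for the trained classification: label -> mean x."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, label in samples:
        sums[label] += x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


def first_embodiment(data, n_nodes):
    # S31: decompose the checked data into per-node shards (round-robin).
    shards = [data[i::n_nodes] for i in range(n_nodes)]
    # S32: train an initial model; using shard 0 as the seed is an assumption.
    initial = train(shards[0])
    # S33: deliver the model to all nodes (here every shard sees `initial`).
    # S34: each node forms a new subset; choosing the misclassified samples
    # is an assumption.
    new_subsets = [[(x, y) for x, y in shard if predict(initial, x) != y]
                   for shard in shards]
    # S35: the new subsets are sent to one individual node (a plain list here).
    # S36: that node builds the second training set and retrains the model.
    second_set = shards[0] + [s for subset in new_subsets for s in subset]
    return initial, train(second_set)


data = [(0.0, "a"), (1.0, "a"), (2.0, "a"), (6.0, "a"), (10.0, "b"), (11.0, "b")]
initial, final = first_embodiment(data, n_nodes=2)
```

In this toy run only the hard samples travel to the individual node, which is one way such a scheme could save on transmission and storage compared with shipping all data to a central site.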
Fig. 3 shows another flowchart of the data processing according to a second embodiment of the invention. Alternatively, step S3 may preferably include the following steps:
S31': decomposing the previously integrated and checked data into valid training data;
S32': training a classification partition on the basis of the valid training data;
S33': delivering the trained classification partition to all nodes;
S34': forming a first set from the valid set by applying a new decision method;
S35': transmitting the new subset of the data to an individual node; and
S36': at the individual node, forming a second set that contains the new subsets of all the data, and retraining the overall classification partition on the basis of the second set.
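The second embodiment differs from the first in that the first set is formed by applying a decision method to the valid set. The patent does not define that method, so the Python sketch below treats it as a pluggable callable and shows one hypothetical choice: keep samples whose margin between the two nearest class centroids falls below a threshold. The model layout (label mapped to a centroid value) and the names `margin`, `low_margin`, `form_first_set` and `gather` are assumptions for illustration.

```python
def margin(model, x):
    """Gap between the two nearest centroids; a small gap means an uncertain sample.

    Assumes `model` maps at least two labels to centroid values.
    """
    distances = sorted(abs(centroid - x) for centroid in model.values())
    return distances[1] - distances[0]


def low_margin(threshold):
    """Build a decision method that selects samples near the class boundary."""
    def decide(model, sample):
        x, _label = sample
        return margin(model, x) < threshold
    return decide


def form_first_set(valid_set, model, decide):
    # Forming the first set: apply the decision method to the node's valid set.
    return [sample for sample in valid_set if decide(model, sample)]


def gather(first_sets):
    # At the individual node: combine the subsets received from all nodes
    # into the second set used for retraining.
    return [sample for subset in first_sets for sample in subset]


model = {"a": 1.0, "b": 10.0}
node_data = [[(0.0, "a"), (5.0, "a")], [(6.0, "b"), (11.0, "b")]]
decide = low_margin(3.0)
second_set = gather(form_first_set(d, model, decide) for d in node_data)
```

Keeping the decision method as a parameter makes it easy to swap in another rule, such as the misclassification test, without touching the transmission and gathering steps.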
Fig. 4 shows another flowchart of the data processing according to a third embodiment of the invention. Alternatively, step S3 may preferably include the following steps:
S31'': training a classification partition on the basis of a subset of the data;
S32'': delivering the trained classification partition to all nodes;
S33'': forming a first set from the valid set by applying a new decision method;
S34'': transmitting the new subset of the data to an individual node; and
S35'': at the individual node, forming a second set that contains the new subsets of all the data, and retraining the overall classification partition on the basis of the second set.
Fig. 5 shows another flowchart of the data processing according to a fourth embodiment of the invention. Alternatively, step S3 may preferably include the following steps:
S31''': decomposing the previously integrated and checked data into valid training data;
S32''': training a classification partition on the basis of the valid training data;
S33''': delivering the trained classification partition to all nodes;
S34''': computing the errors of the classification partition on the valid data set;
S35''': transmitting the computed errors of the classification partition to an individual node; and
S36''': at the individual node, forming a second set that contains the computed classification partition errors for all the data, and retraining the overall classification partition on the basis of the second set.
With these three alternative processes, the efficiency of big data mining is likewise greatly improved, security and accuracy are improved, and costs such as computation and storage are reduced.
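In the fourth embodiment the nodes send computed errors rather than selected subsets. How the errors drive retraining is not specified, so this Python sketch assumes a boosting-style reweighting: each node flags the samples the delivered model misclassifies, and the individual node retrains a weighted nearest-centroid model in which flagged samples count more. The record layout `(x, label, error)`, the `boost` parameter and the function names are hypothetical.

```python
def predict(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


def node_errors(model, local_data):
    # At each node: error flag is 1 where the delivered model misclassifies
    # a local sample and 0 otherwise. Carrying the sample along with its
    # flag is an assumption.
    return [(x, label, int(predict(model, x) != label)) for x, label in local_data]


def retrain_weighted(error_sets, boost=2.0):
    # At the individual node: the second set is the union of all error
    # records; misclassified samples get weight 1 + boost (an assumption).
    sums, weights = {}, {}
    for records in error_sets:
        for x, label, err in records:
            w = 1.0 + boost * err
            sums[label] = sums.get(label, 0.0) + w * x
            weights[label] = weights.get(label, 0.0) + w
    return {label: sums[label] / weights[label] for label in sums}


model = {"a": 1.0, "b": 10.0}
nodes = [[(0.0, "a"), (2.0, "a")], [(6.0, "a"), (10.0, "b")]]
error_sets = [node_errors(model, local) for local in nodes]
retrained = retrain_weighted(error_sets)
```

Here the sample at 6.0 is misclassified by the delivered model, receives triple weight, and pulls the "a" centroid toward it, so the retrained model classifies it correctly.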
It should be understood that examples and embodiments of the invention can be realized in the form of hardware, software, or a combination of hardware and software. As described above, any software that performs this method can be stored in volatile or non-volatile storage, such as a storage device like a ROM, whether erasable or rewritable or not; in the form of memory, such as RAM, memory chips, devices or integrated circuits; or on an optically or magnetically readable medium, such as a CD, DVD, magnetic disk or magnetic tape. It should be understood that these storage devices and storage media are examples of machine-readable storage suitable for storing one or more programs that, when executed, implement examples of the invention. Examples of the invention may be conveyed electronically via any medium, such as a communication signal carried over a wired or wireless connection, and the examples suitably encompass the same.
It should be noted that the invention solves the technical problems described above, adopts technical means that a person skilled in the computer and communications fields can understand from its teachings after reading this description, and obtains the described technical effects. The solution claimed in the following claims is therefore a technical solution within the meaning of patent law. In addition, because the technical solution claimed in the claims can be manufactured or used in industry, it possesses practical applicability.
The above is only a preferred embodiment of the invention, but the scope of protection of the invention is not limited to it. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. Unless expressly stated otherwise, each feature disclosed is only one example of a generic series of equivalent or similar features. The scope of protection of the invention shall therefore be defined by the scope of the claims.

Claims (10)

1. A distribution-based big data mining method, characterized by comprising the following steps:
in step S1, defining the data according to the user's demands on the available data;
in step S2, obtaining the data from a source, preparing the data, browsing the data, and integrating and checking the data to remove wrong or inconsistent data;
in step S3, processing the data; and
in step S4, testing, verifying, deploying and updating the result.
2. The method of claim 1, wherein the user is an operating entity that handles different types of big data in different fields, and may be a person or a mechanism, the mechanism being a device with basic processing components such as a processor, memory, a bus and a power circuit.
3. The method of claim 2, wherein the mechanism may also have an input device as needed, and also has a display device.
4. The method of claim 2, wherein the different fields include all fields that exist now or are developed later, and may include several fields or interdisciplinary fields at once.
5. The method of any preceding claim, wherein processing the data comprises the following steps:
S31: decomposing the previously integrated and checked data into valid training data;
S32: training a classification partition on the basis of the valid training data;
S33: delivering the trained classification partition to all nodes;
S34: forming a first new training set;
S35: transmitting the new subset to an individual node; and
S36: at the individual node, forming a second new training set that contains the new subsets of all the data, and retraining the overall classification partition on the basis of the second new training set.
6. The method of any one of claims 1-4, wherein processing the data comprises the following steps:
S31': decomposing the previously integrated and checked data into valid training data;
S32': training a classification partition on the basis of the valid training data;
S33': delivering the trained classification partition to all nodes;
S34': forming a first set from the valid set by applying a new decision method;
S35': transmitting the new subset of the data to an individual node; and
S36': at the individual node, forming a second set that contains the new subsets of all the data, and retraining the overall classification partition on the basis of the second set.
7. The method of any one of claims 1-4, wherein processing the data comprises the following steps:
S31'': training a classification partition on the basis of a subset of the data;
S32'': delivering the trained classification partition to all nodes;
S33'': forming a first set from the valid set by applying a new decision method;
S34'': transmitting the new subset of the data to an individual node; and
S35'': at the individual node, forming a second set that contains the new subsets of all the data, and retraining the overall classification partition on the basis of the second set.
8. The method of any one of claims 1-4, wherein processing the data comprises the following steps:
S31''': decomposing the previously integrated and checked data into valid training data;
S32''': training a classification partition on the basis of the valid training data;
S33''': delivering the trained classification partition to all nodes;
S34''': computing the errors of the classification partition on the valid data set;
S35''': transmitting the computed errors of the classification partition to an individual node; and
S36''': at the individual node, forming a second set that contains the computed classification partition errors for all the data, and retraining the overall classification partition on the basis of the second set.
9. A system for implementing the distribution-based big data mining method of any one of claims 1-8.
10. A distribution-based big data mining device for implementing the method of any one of claims 1-8, comprising means for implementing each step.
CN201510651258.2A 2015-10-10 2015-10-10 Distribution-based big data mining method Pending CN105354238A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510651258.2A CN105354238A (en) 2015-10-10 2015-10-10 Distribution-based big data mining method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510651258.2A CN105354238A (en) 2015-10-10 2015-10-10 Distribution-based big data mining method

Publications (1)

Publication Number Publication Date
CN105354238A true CN105354238A (en) 2016-02-24

Family

ID=55330211

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510651258.2A Pending CN105354238A (en) 2015-10-10 2015-10-10 Distribution-based big data mining method

Country Status (1)

Country Link
CN (1) CN105354238A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040215598A1 (en) * 2002-07-10 2004-10-28 Jerzy Bala Distributed data mining and compression method and system
CN101048732A (en) * 2004-08-31 2007-10-03 国际商业机器公司 Object oriented architecture for data integration service
CN103714139A (en) * 2013-12-20 2014-04-09 华南理工大学 Parallel data mining method for identifying a mass of mobile client bases
CN104679773A (en) * 2013-11-29 2015-06-03 中国科学院深圳先进技术研究院 Mass transaction data frequent itemset mining method and querying method
CN104933053A (en) * 2014-03-18 2015-09-23 中国银联股份有限公司 Classification of class-imbalanced data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
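韦艳艳 (Wei Yanyan): "分布式数据挖掘的分类器组合问题及相关技术研究" [Research on the classifier combination problem in distributed data mining and related techniques], 《中国优秀硕士学位论文全文数据库 信息科技辑》 [China Master's Theses Full-text Database, Information Science and Technology] *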

Similar Documents

Publication Publication Date Title
TWI718643B (en) Method and device for identifying abnormal groups
Oñorbe et al. How to zoom: bias, contamination and Lagrange volumes in multimass cosmological simulations
CN105373800A (en) Classification method and device
CN107680003A (en) The node tree generation method and device of project supervision task
CN105718848A (en) Quality evaluation method and apparatus of fingerprint images
CN104850500A (en) Data processing method and device used for data storage
CN106033425A (en) A data processing device and a data processing method
CN108734304A (en) A kind of training method of data model, device and computer equipment
CN104573434A (en) Account protection method, device and system
CN104424256A (en) Method and device for generating Bloom filter
CN103957116A (en) Decision-making method and system of cloud failure data
CN104731843A (en) Balancing provenance and accuracy tradeoffs in data modeling
CN112465141A (en) Model compression method, model compression device, electronic device and medium
US20170199912A1 (en) Behavior topic grids
CN111768096A (en) Rating method and device based on algorithm model, electronic equipment and storage medium
Galathiya et al. Classification with an improved decision tree algorithm
CN108131127B (en) Method and device for obtaining gas-oil ratio of production of foam oil type extra heavy oil field
CN105808748A (en) MIB (Management Information Base) version contrast method and device
CN104657475A (en) Method and system for data analysis
US9904922B2 (en) Efficient tail calculation to exploit data correlation
KR20210097204A (en) Methods and devices for outputting information
CN107133072A (en) One kind operation performs method and apparatus
CN105354238A (en) Distribution-based big data mining method
CN105260448A (en) Big data information analysis method
CN104536887A (en) Communication data detection method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160224