CN105354238A

CN105354238A - Distribution-based big data mining method

Info

Publication number: CN105354238A
Application number: CN201510651258.2A
Authority: CN
Inventors: 杨立波
Original assignee: Chengdu Bo Yuan Epoch Softcom Ltd
Current assignee: Chengdu Bo Yuan Epoch Softcom Ltd
Priority date: 2015-10-10
Filing date: 2015-10-10
Publication date: 2016-02-24

Abstract

The invention provides a distribution-based big data mining method, system and device. The method comprises the following steps: defining data according to the user's demand for available data; obtaining the data from a source, preparing and browsing the data, and integrating and checking the data to remove erroneous or inconsistent data; processing the data; and testing, verifying, deploying and updating the result. With the method, system and device, the efficiency, security and accuracy of big data mining can be improved, and computation and storage costs can be reduced.

Description

Distribution-based big data mining method

Technical field

The present invention relates to the field of big data information processing, and more specifically to a distribution-based big data mining method, system and device.

Background art

With the industrialization of society and the steady rise in the level of informatization, data has replaced computation as the center of information processing, and cloud computing and big data are becoming a trend. This involves many aspects, such as storage capacity, availability, I/O performance, data security and scalability. Big data refers to data sets that are very large in scale and complex. Big data is characterized by the 4 Vs: Volume (the amount of data grows continuously and rapidly); Velocity (data I/O is faster); Variety (data types and sources are diverse); and Value (the data has usable value in many respects). However, big data contains a massive amount of information, and the computational cost of mining a centralized database is much higher than the combined cost of analyzing and processing many smaller distributed data blocks. Distributed big data mining is therefore the preferred way to exploit the available data resources within massive information. Moreover, because the data must be learned from, and distributing the learning process is a natural and effective way to improve learning, distributed learning is the desirable approach.

However, in the prior art, many data mining methods are not efficient enough, and their security and accuracy cannot both reach a satisfactory level; costs such as computation and storage have also not been sufficiently optimized. There is therefore a need in the art for a distribution-based big data mining method that can effectively solve the above technical problems.

Summary of the invention

An object of the present invention is to provide a distribution-based big data mining method, system and device. With the method and the device performing the method, the efficiency, security and accuracy of big data mining can be improved, and costs such as computation and storage can be reduced.

The technical solution adopted by the present invention to solve the above technical problems is a distribution-based big data mining method, characterized by comprising the following steps: in step S1, defining data according to the user's demand for available data; in step S2, obtaining the data from a source, preparing and browsing the data, and integrating and checking the data to remove erroneous or inconsistent data; in step S3, processing the data; and in step S4, testing, verifying, deploying and updating the result.

According to a further aspect of the invention, a system for implementing the distribution-based big data mining method is provided.

According to a further aspect of the invention, a distribution-based big data mining apparatus for implementing the method is provided, comprising means for implementing each step.

Brief description of the drawings

Embodiments of the invention are illustrated in the accompanying drawings by way of example and not by way of limitation, in which:

Fig. 1 illustrates a flowchart of a distribution-based big data mining method according to an embodiment of the present invention.

Fig. 2 illustrates a flowchart of the data processing according to a first embodiment of the present invention.

Fig. 3 illustrates another flowchart of the data processing according to a second embodiment of the present invention.

Fig. 4 illustrates another flowchart of the data processing according to a third embodiment of the present invention.

Fig. 5 illustrates another flowchart of the data processing according to a fourth embodiment of the present invention.

Detailed description of the embodiments

In the following description, reference is made to the accompanying drawings, which show several specific embodiments by way of illustration. It is to be appreciated that other embodiments can be envisaged and made without departing from the scope or spirit of the present disclosure. The following detailed description should therefore not be taken in a limiting sense.

Fig. 1 illustrates a flowchart of a distribution-based big data mining method according to an embodiment of the present invention, where the method is applicable to and suitable for a distribution-based big data framework.

First, in step S1, the data is defined according to the user's demand for available data.

Second, in step S2, the data is obtained from a source, prepared and browsed, and then integrated and checked to remove erroneous or inconsistent data.

Next, in step S3, the data is processed.

Finally, in step S4, the result is tested, verified, deployed and updated.

Preferably, the method is applied in a distribution-based big data framework.
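As a rough illustration of how steps S1 to S4 chain together, the following Python sketch wires the four stages into one pipeline. Every name in it (`DataDefinition`, `acquire_and_clean`, `run_pipeline`, the record fields) is hypothetical, and the cleaning rule (drop records that lack a required field) is only one assumed way to remove erroneous data; the method itself leaves these choices open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataDefinition:
    """S1: the fields the user needs (hypothetical representation)."""
    required_fields: tuple


def acquire_and_clean(source, definition):
    """S2: read records from a source and keep only complete ones.

    Assumption: a record is 'erroneous' when a required field is absent or None.
    """
    cleaned = []
    for record in source:
        if all(record.get(name) is not None for name in definition.required_fields):
            # Keep only the fields named in the definition.
            cleaned.append({name: record[name] for name in definition.required_fields})
    return cleaned


def run_pipeline(source, definition, mine, check):
    """Chain S2 (acquire/clean), S3 (process) and S4 (verify) for a given S1 definition."""
    records = acquire_and_clean(source, definition)   # S2
    result = mine(records)                            # S3: caller-supplied processing
    if not check(result):                             # S4: caller-supplied verification
        raise ValueError("result failed verification")
    return result
```

A caller would supply the processing and verification functions, for example `run_pipeline(records, DataDefinition(("user", "amount")), mine=..., check=...)`, so the same skeleton can host any of the processing variants of step S3.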

In step S1 above, the user may be the operating subject for different types of big data in different fields. The user may be a person, or a mechanism such as an electronic device, the mechanism being a device with basic processing components such as a processor, memory, a bus and a power supply circuit. Preferably, the mechanism may also have an input device such as a keyboard, keypad or touch screen as required, and may further have a display device such as a graphical user interface. The different fields include fields that exist now and fields developed later, and may even include multiple fields or cross-disciplinary fields at the same time. The definition of the data depends on the user's requirements.

In step S2, the data may be obtained in any manner, including manners that exist now and manners developed later. Likewise, the data may be integrated and/or checked in any manner.
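Since the manner of integrating and checking is left open, the sketch below shows just one possible reading in Python: records from several sources are merged by key, records that fail a validity test are treated as erroneous, and keys on which the sources disagree are treated as inconsistent. The function name `integrate_and_check`, the `key` field and the `is_valid` predicate are assumptions made for illustration only.

```python
def integrate_and_check(sources, key, is_valid):
    """Merge records from several sources, dropping erroneous and inconsistent ones.

    Assumptions: 'erroneous' means is_valid(record) is False; 'inconsistent'
    means two sources report different records for the same key.
    """
    merged = {}
    conflicted = set()
    for source in sources:
        for record in source:
            if not is_valid(record):
                continue                      # erroneous record: skip it
            k = record[key]
            if k in conflicted:
                continue                      # key already known to be inconsistent
            if k in merged and merged[k] != record:
                conflicted.add(k)             # sources disagree: drop the key entirely
                del merged[k]
            else:
                merged[k] = record            # new key or exact duplicate
    return list(merged.values())
```

Dropping a conflicting key outright is the most conservative choice; a real deployment might instead prefer the most recent or most trusted source.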

In step S4, the result may likewise be tested, verified, deployed and updated in any manner, including manners that exist now and manners developed later.

Fig. 2 illustrates a flowchart of the data processing according to a first embodiment of the present invention. In step S3, the processing of the data preferably includes the following steps: S31: decomposing the previously integrated and checked data into valid training data; S32: training a classification partition on the basis of the valid training data; S33: delivering the trained classification partition to all nodes; S34: forming a first new training set; S35: transmitting new subsets to the individual nodes; S36: at each individual node, forming a second new training set comprising the new subsets of all the data, and retraining the overall classification partition on the basis of the second new training set.

Through the above processing, the efficiency of big data mining is greatly improved, security and accuracy are also improved, and costs such as computation and storage are reduced.
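The first embodiment can be sketched in Python as follows. This is an interpretation under stated assumptions, not the patented implementation: the "division of the classification" is read as a trained classifier, a nearest-centroid classifier stands in for the unspecified learning algorithm, each node is assumed to hold one "new subset", and the second training set is assumed to be the first set plus the new subsets of all nodes. The names `Node`, `train_centroids`, `predict` and `run_round` are hypothetical.

```python
from collections import defaultdict


def train_centroids(samples):
    """Train a nearest-centroid classifier: one mean feature vector per label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


class Node:
    """A worker holding one new subset of data (assumed layout)."""
    def __init__(self, new_subset):
        self.new_subset = list(new_subset)
        self.model = None
        self.training_set = []


def run_round(valid_training_data, nodes):
    model = train_centroids(valid_training_data)        # S31-S32: train on valid data
    for node in nodes:                                  # S33: deliver to all nodes
        node.model = model
    first_set = list(valid_training_data)               # S34: first new training set
    pooled = [s for node in nodes for s in node.new_subset]
    for node in nodes:                                  # S35: subsets reach each node
        node.training_set = first_set + pooled          # S36: second new training set
        node.model = train_centroids(node.training_set) # S36: retrain overall classifier
    return nodes
```

In this reading every node ends the round with the same retrained classifier, because each node builds the same second training set; in a real cluster the "delivery" and "transmission" steps would be network messages rather than in-process assignments.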

Fig. 3 illustrates another flowchart of the data processing according to a second embodiment of the present invention. Alternatively, step S3 may preferably include the following steps: S31': decomposing the previously integrated and checked data into valid training data; S32': training a classification partition on the basis of the valid training data; S33': delivering the trained classification partition to all nodes; S34': forming a first set from the valid set by applying a new decision method; S35': transmitting new subsets of the data to the individual nodes; S36': at each individual node, forming a second set comprising the new subsets of all the data, and retraining the overall classification partition on the basis of the second set.

Fig. 4 illustrates another flowchart of the data processing according to a third embodiment of the present invention. Alternatively, step S3 may preferably include the following steps: S31'': training a classification partition on the basis of subsets of the data; S32'': delivering the trained classification partition to all nodes; S33'': forming a first set from the valid set by applying a new decision method; S34'': transmitting new subsets of the data to the individual nodes; S35'': at each individual node, forming a second set comprising the new subsets of all the data, and retraining the overall classification partition on the basis of the second set.

Fig. 5 illustrates another flowchart of the data processing according to a fourth embodiment of the present invention. Alternatively, step S3 may preferably include the following steps: S31''': decomposing the previously integrated and checked data into valid training data; S32''': training a classification partition on the basis of the valid training data; S33''': delivering the trained classification partition to all nodes; S34''': computing the errors of the classification partition on the valid data set; S35''': transmitting the computed errors of the classification partition to the individual nodes; S36''': at each individual node, forming a second set comprising the computed classification-partition errors for all the data, and retraining the overall classification partition on the basis of the second set.
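The error-driven variant of the fourth embodiment might look like the Python sketch below. It rests on several assumptions that the text does not state: the classifier is a one-dimensional threshold placed midway between the two class means, the "errors" are the samples that the delivered classifier misclassifies on each node's data, and the second set keeps the original training data alongside the pooled errors so that both classes remain represented. All function names are hypothetical.

```python
from statistics import mean


def train_threshold(samples):
    """Toy 1-D classifier: threshold midway between the means of class 0 and class 1.

    Assumes both classes are present in the samples.
    """
    low = mean(x for x, y in samples if y == 0)
    high = mean(x for x, y in samples if y == 1)
    return (low + high) / 2


def predict(threshold, x):
    """Predict class 1 above the threshold, class 0 otherwise."""
    return 1 if x > threshold else 0


def classification_errors(threshold, samples):
    """S34''': the samples that the current classifier gets wrong."""
    return [(x, y) for x, y in samples if predict(threshold, x) != y]


def retrain_on_errors(training_set, node_data):
    first = train_threshold(training_set)               # S31'''-S32''': initial training
    # S33''': the same threshold is delivered to every node.
    errors_per_node = [classification_errors(first, shard) for shard in node_data]
    # S35'''-S36''': pool the errors of all nodes into the second set and retrain.
    second_set = list(training_set) + [e for errs in errors_per_node for e in errs]
    return first, train_threshold(second_set), second_set
```

Only the misclassified samples travel between nodes in this sketch, which is one plausible way the variant could reduce transmission and storage compared with exchanging whole subsets.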

Through the above three alternative processes, the efficiency of big data mining is greatly improved, security and accuracy are also improved, and costs such as computation and storage are reduced.

It is to be appreciated that examples and embodiments of the present invention can be realized in the form of hardware, software, or a combination of hardware and software. As described above, any software carrying out the method may be stored in the form of volatile or non-volatile storage, such as a storage device like a ROM, whether erasable or rewritable or not; or in the form of memory, such as RAM, memory chips, devices or integrated circuits; or on an optically or magnetically readable medium, such as a CD, DVD, magnetic disk or magnetic tape. It is to be appreciated that the storage devices and storage media are examples of machine-readable storage suitable for storing one or more programs which, when executed, implement examples of the present invention. Examples of the present invention may be conveyed electronically via any medium, such as a communication signal carried over a wired or wireless connection, and the examples suitably encompass the same.

It should be noted that, because the present invention solves the technical problems described above, employs technical means that a person skilled in the computer and communications fields can understand from its teaching after reading this description, and achieves the described technical effects, the solution claimed in the appended claims is a technical solution within the meaning of patent law. In addition, because the technical solution claimed in the appended claims can be made or used in industry, it has practical applicability.

The above is merely a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Unless expressly stated otherwise, each feature disclosed is only one example of a generic series of equivalent or similar features. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.

Claims

1. A distribution-based big data mining method, characterized by comprising the following steps:

in step S1, defining data according to the user's demand for available data;

in step S2, obtaining the data from a source, preparing and browsing the data, and integrating and checking the data to remove erroneous or inconsistent data;

in step S3, processing the data; and

in step S4, testing, verifying, deploying and updating the result.

2. The method of claim 1, wherein the user is the operating subject for different types of big data in different fields, and may be a person or a mechanism, the mechanism being a device with basic processing components such as a processor, memory, a bus and a power supply circuit.

3. The method of claim 2, wherein the mechanism may also have an input device as required, and may also have a display device.

4. The method of claim 2, wherein the different fields include fields that exist now and fields developed later, and may include multiple fields or cross-disciplinary fields at the same time.

5. The method of any preceding claim, wherein the processing of the data comprises the following steps:

S31: decomposing the previously integrated and checked data into valid training data;

S32: training a classification partition on the basis of the valid training data;

S33: delivering the trained classification partition to all nodes;

S34: forming a first new training set;

S35: transmitting new subsets to the individual nodes; and

S36: at each individual node, forming a second new training set comprising the new subsets of all the data, and retraining the overall classification partition on the basis of the second new training set.

6. The method of any one of claims 1-4, wherein the processing of the data comprises the following steps:

S31': decomposing the previously integrated and checked data into valid training data;

S32': training a classification partition on the basis of the valid training data;

S33': delivering the trained classification partition to all nodes;

S34': forming a first set from the valid set by applying a new decision method;

S35': transmitting new subsets of the data to the individual nodes; and

S36': at each individual node, forming a second set comprising the new subsets of all the data, and retraining the overall classification partition on the basis of the second set.

7. The method of any one of claims 1-4, wherein the processing of the data comprises the following steps:

S31'': training a classification partition on the basis of subsets of the data;

S32'': delivering the trained classification partition to all nodes;

S33'': forming a first set from the valid set by applying a new decision method;

S34'': transmitting new subsets of the data to the individual nodes; and

S35'': at each individual node, forming a second set comprising the new subsets of all the data, and retraining the overall classification partition on the basis of the second set.

8. The method of any one of claims 1-4, wherein the processing of the data comprises the following steps:

S31''': decomposing the previously integrated and checked data into valid training data;

S32''': training a classification partition on the basis of the valid training data;

S33''': delivering the trained classification partition to all nodes;

S34''': computing the errors of the classification partition on the valid data set;

S35''': transmitting the computed errors of the classification partition to the individual nodes; and

S36''': at each individual node, forming a second set comprising the computed classification-partition errors for all the data, and retraining the overall classification partition on the basis of the second set.

9. A system for implementing the distribution-based big data mining method of any one of claims 1-8.

10. A distribution-based big data mining apparatus for implementing the method of any one of claims 1-8, comprising means for implementing each step.