CN106971011A - A big data analysis method based on a cloud platform - Google Patents

A big data analysis method based on a cloud platform


Publication number
CN106971011A
Authority
CN
China
Prior art keywords
data
analysis
framework
big data
big
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201710356074.2A
Other languages
Chinese (zh)
Inventor
陈彬强
蔡勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhaoqing Chicco Motor Co Ltd
Original Assignee
Zhaoqing Chicco Motor Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhaoqing Chicco Motor Co Ltd
Priority to CN201710356074.2A
Publication of CN106971011A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention discloses a big data analysis method based on a cloud platform. The method includes: determining a data analysis target and plan; creating, according to the determined target and plan, a cloud-platform-based big data analysis framework; obtaining the big data to be analyzed and preparing and processing it; filtering the data to obtain complete and non-duplicate data; clustering the data and analyzing it; and testing, verifying, evaluating, and deploying the results. The embodiment improves the accuracy, timeliness, and flexibility of big data analysis.

Description

A big data analysis method based on a cloud platform
Technical field
The present invention relates to the technical field of big data analysis, and in particular to a big data analysis method based on a cloud platform.
Background
With the continuing industrialization of society and the steady rise in the level of informatization, data has replaced computation as the center of information processing, and cloud computing and big data have become a trend. This shift touches many aspects, including storage capacity, availability, I/O performance, data security, and scalability. Big data refers to data sets that are very large and complex. Big data is characterized by the four Vs: Volume, the amount of data grows rapidly and continuously; Velocity, data I/O is faster; Variety, data types and sources are diverse; and Value, the data has usable value in many respects. Because big data contains a massive amount of information, distributed big data analysis and mining is the best way to exploit the data resources within it. However, prior-art distributed data systems and their associated databases cannot keep up with growing data volumes and with the demand for analysis and mining. Their data processing is not efficient enough and their responses are not timely enough, because they cannot effectively acquire, store, manage, mine, and analyze data with these characteristics, which makes it difficult to achieve accuracy, timeliness, and flexibility in data processing.
Therefore, to meet the challenges of the big data era and to improve the accuracy, timeliness, and flexibility of big data analysis, and in particular the accuracy, timeliness, flexibility, and quality of the analysis results, the art needs a big data information analysis method that effectively solves the above technical problems.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a big data analysis method based on a cloud platform that improves the accuracy, timeliness, and flexibility of big data analysis.
To achieve the above purpose, an embodiment of the present invention discloses a big data analysis method based on a cloud platform, the method comprising:
determining a data analysis target and plan;
creating, according to the determined data analysis target and plan, a cloud-platform-based big data analysis framework;
obtaining the big data to be analyzed, and preparing and processing the data;
filtering the data to obtain complete and non-duplicate data;
clustering the data and analyzing the data;
testing, verifying, evaluating, and deploying the results.
Optionally, the analysis requirements and attribute objects are mined according to the different features, characteristics, and/or attributes of different data.
Optionally, the analysis framework may use a central data processing framework or a distributed data processing framework.
Optionally, the analysis framework may be a framework of any form that is based on the characteristics of the big data.
Optionally, obtaining the big data to be analyzed and preparing and processing the data comprises:
posting the data first, so that it can be processed;
storing the data;
converting the data into a format, the format being a pair of values in binary format;
obtaining the identifier of the data and the corresponding description;
updating the data at every predetermined time period, while ensuring that not all of the data is posted.
Optionally, the time period is set manually or automatically by machine, according to need or to the characteristics of the data.
Optionally, clustering the data and analyzing the data comprises:
identifying associated data;
determining each data point to be processed;
reducing the data volume using a clustering machine learning algorithm;
analyzing the data set using the clustering machine learning algorithm.
Optionally, clustering the data and analyzing the data comprises:
generating, for each data point to be processed, a pair of values in binary format;
the pair of values in binary format further comprising a cluster identifier and the coordinate values corresponding to the data point;
generating, for each cluster, the sum of the inputs;
sending the values that relate to the same cluster;
storing the clustering results as unrelated data.
Optionally, the machine learning algorithm is a mean algorithm.
Optionally, filtering the data to obtain complete and non-duplicate data comprises:
filtering the data using Hadoop in distributed mode to obtain complete and non-duplicate data.
It can be seen that the big data analysis method based on a cloud platform provided by the embodiments of the present invention determines a data analysis target and plan; creates, according to the determined target and plan, a cloud-platform-based big data analysis framework; obtains the big data to be analyzed and prepares and processes it; filters the data to obtain complete and non-duplicate data; clusters the data and analyzes it; and tests, verifies, evaluates, and deploys the results. The method can thereby meet the challenges of the big data era and improve the accuracy, timeliness, and flexibility of big data analysis.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a big data analysis method based on a cloud platform according to an embodiment of the present invention.
Fig. 2 is a flow chart of step S103 in Fig. 1 according to an embodiment of the present invention.
Fig. 3 is a flow chart of step S105 in Fig. 1 according to an embodiment of the present invention.
Fig. 4 is another flow chart of step S105 in Fig. 1 according to an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments that a person of ordinary skill in the art obtains from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a schematic flow chart of a big data analysis method based on a cloud platform according to an embodiment of the present invention. As shown in Fig. 1, the method may include the following steps:
S101: determine a data analysis target and plan;
S102: create, according to the determined data analysis target and plan, a cloud-platform-based big data analysis framework;
S103: obtain the big data to be analyzed, and prepare and process the data;
S104: filter the data to obtain complete and non-duplicate data;
S105: cluster the data and analyze the data;
S106: test, verify, evaluate, and deploy the results.
According to an embodiment of the present invention, first, in step S101, the data analysis target and plan are determined. The analysis requirements and attribute objects are mined according to the different features, characteristics, and/or attributes of different data, because different data have different features, characteristics, and/or attributes. For example, social media big data is based on interactions between people; military news big data implies or concentrates data on military weapons or military trends; social news big data reflects the direction of public opinion, including the leanings of self-media publishers; and big data on the science and technology news of a particular country, region, or research institution contains its research priorities, its allocation of personnel and funding, its output efficiency, its possible range of application, and its leading role in or influence on research and application fields. Given these different contexts, analysis requirements and attribute objects need to be mined separately for different data. This makes the big data analysis more targeted and lays a solid foundation for the accuracy of the subsequent cluster analysis.
Second, in step S102, an analysis framework based on big data is created according to the determined data analysis target and plan. Specifically, the analysis framework may be a framework of any form that is based on the characteristics of the big data. Because different data have different features, characteristics, and/or attributes, the framework can be built in a targeted way around them. The framework may be based on any architecture, for example, but not limited to, a central data processing framework or a distributed data processing framework. Other forms of framework may also be used, provided that they are based on the characteristics of the big data.
Next, in step S103, the big data to be analyzed is obtained, and the data is prepared and processed. Fig. 2 is a flow chart of S103 according to an embodiment of the present invention. As shown in Fig. 2, preparing the data provides a safeguard for the subsequent analysis. Specifically, the step may include the following: A1: post the data first, so that it can be processed; A2: store the data; A3: convert the data into a format, the format being a pair of values in binary format; A4: obtain the identifier of the data and the corresponding description; A5: update the data at every predetermined time period, while ensuring that not all of the data is posted. The time period may be set manually or automatically by machine, according to need or to the characteristics of the data. Through these steps the data receives preliminary processing in preparation for accurate analysis.
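Steps A3 to A5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the patent: the record layout, the `struct` encoding, and the names `encode_record`, `decode_record`, `describe`, and `refresh` are all hypothetical, and "posting" is read loosely as loading a record into a working store.

```python
import struct

# Assumed layout: key = 8-byte unsigned record id, value = two 8-byte floats.
KEY_FMT = ">Q"
VALUE_FMT = ">dd"


def encode_record(record_id, x, y):
    """A3: convert a record into a (key, value) pair of binary strings."""
    return struct.pack(KEY_FMT, record_id), struct.pack(VALUE_FMT, x, y)


def decode_record(key, value):
    """Inverse of encode_record, used when the data is read back."""
    (record_id,) = struct.unpack(KEY_FMT, key)
    x, y = struct.unpack(VALUE_FMT, value)
    return record_id, x, y


def describe(key, descriptions):
    """A4: return the identifier of a record and its description."""
    (record_id,) = struct.unpack(KEY_FMT, key)
    return record_id, descriptions.get(record_id, "")


def refresh(store, changed_records):
    """A5: periodic update that re-posts only changed records, not all data."""
    for record_id, x, y in changed_records:
        key, value = encode_record(record_id, x, y)
        store[key] = value
    return len(changed_records)


store = {}
refresh(store, [(1, 0.5, 2.0), (2, 1.5, -1.0)])  # initial load
refresh(store, [(2, 1.75, -1.0)])                # later period: one change
```

Re-posting only the changed records in `refresh` is one possible reading of the requirement that not all data be posted on each update; the source does not say how the changed subset is chosen.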
Next, in step S104, the data may be filtered using Hadoop in distributed mode to obtain complete and non-duplicate data.
Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. Users can develop distributed programs without knowing the low-level details of distribution, and can make full use of the power of a cluster for high-speed computation and storage.
Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suited to applications with large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream (streaming access).
The core of Hadoop's design is HDFS and MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over massive data.
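The filtering of step S104 fits the map, shuffle, and reduce pattern. The sketch below imitates that pattern in memory with plain Python and does not call any Hadoop API. The schema in `REQUIRED_FIELDS` and the function names `map_record`, `reduce_group`, and `filter_records` are assumptions made for illustration.

```python
import json
from collections import defaultdict

REQUIRED_FIELDS = ("id", "source", "value")  # assumed schema


def map_record(record):
    """Map: drop incomplete records, key the rest by their full content."""
    if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return []  # incomplete record is filtered out
    key = json.dumps(record, sort_keys=True)  # identical records share a key
    return [(key, record)]


def reduce_group(key, records):
    """Reduce: keep a single representative per group of duplicates."""
    return records[0]


def filter_records(records):
    """Run map, shuffle (group by key), and reduce in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return [reduce_group(key, values) for key, values in groups.items()]


raw = [
    {"id": 1, "source": "a", "value": 3.0},
    {"id": 1, "source": "a", "value": 3.0},   # exact duplicate
    {"id": 2, "source": "b", "value": None},  # incomplete
    {"id": 3, "source": "c", "value": 1.5},
]
clean = filter_records(raw)
```

Keying each record by its serialized content sends all copies of a duplicate to the same reducer, which is why a single pass is enough to remove them.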
Next, in step S105, the data is clustered and analyzed. Fig. 3 is a flow chart of S105 according to an embodiment of the present invention, showing how the data is clustered and analyzed. Specifically, the step may include the following: B1: identify associated data; B2: determine each data point to be processed; B3: reduce the data volume using a clustering machine learning algorithm; B4: analyze the data set using the clustering machine learning algorithm.
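One way clustering can reduce data volume, as in step B3, is to replace the raw points with one summary row per cluster. The sketch below assumes k-means (Lloyd's algorithm) as the clustering algorithm; the source names only a "mean algorithm", so this choice, the deterministic initialization, and the function names are assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=10):
    """Lloyd's algorithm with the first k points as initial centroids."""
    centroids = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def summarize(points, centroids):
    """B3: replace the raw points with one (centroid, count) row per cluster."""
    counts = [0] * len(centroids)
    for point in points:
        counts[nearest(point, centroids)] += 1
    return list(zip(centroids, counts))


points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centroids = kmeans(points, k=2)
summary = summarize(points, centroids)
```

Here six points are reduced to two rows, each holding a centroid and the number of points it represents; later analysis (step B4) can then work on the summary instead of the full data set.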
In addition, Fig. 4 is another flow chart of S105 according to an embodiment of the present invention. As shown in Fig. 4, clustering the data and analyzing the data may include the following steps: B1: identify associated data; B2: determine each data point to be processed; B3: reduce the data volume using a clustering machine learning algorithm; B4: analyze the data set using the clustering machine learning algorithm; B5: for each data point to be processed, generate a pair of values in binary format; B6: the pair of values in binary format further comprises a cluster identifier and the coordinate values corresponding to the data point; B7: for each cluster, generate the sum of the inputs; B8: send the values that relate to the same cluster; B9: store the clustering results as unrelated data. Through these steps the data obtained from the big data is analyzed in detail, which greatly improves the accuracy of big data analysis. Preferably, in steps B3 and B4, the machine learning algorithm may be, for example, a mean algorithm.
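Steps B5 to B9 resemble a single map, combine, and reduce pass of a mean-based clustering update. The sketch below is one possible reading under that assumption, written in plain Python without any Hadoop API. The function names, the fixed starting centroids, and the split of the input across two workers are hypothetical, and the binary encoding of the pairs is omitted for readability.

```python
from collections import defaultdict


def map_point(point, centroids):
    """B5/B6: emit (cluster identifier, coordinates) for one data point."""
    cluster_id = min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )
    return cluster_id, point


def combine(pairs):
    """B7: per cluster, accumulate the coordinate sums and the point count."""
    sums = {}
    for cluster_id, point in pairs:
        total, count = sums.get(cluster_id, ((0.0,) * len(point), 0))
        sums[cluster_id] = (tuple(t + c for t, c in zip(total, point)), count + 1)
    return sums


def shuffle(partials):
    """B8: route the partial sums of the same cluster to the same reducer."""
    grouped = defaultdict(list)
    for partial in partials:
        for cluster_id, value in partial.items():
            grouped[cluster_id].append(value)
    return grouped


def reduce_cluster(values):
    """Merge the partial sums of one cluster into its new centroid."""
    dims = len(values[0][0])
    total = [0.0] * dims
    count = 0
    for partial_sum, partial_count in values:
        total = [t + s for t, s in zip(total, partial_sum)]
        count += partial_count
    return tuple(t / count for t in total)


centroids = [(0.0, 0.0), (10.0, 10.0)]
splits = [  # two input splits, as if processed on two worker nodes
    [(0.0, 1.0), (10.0, 11.0)],
    [(1.0, 0.0), (11.0, 10.0), (2.0, 2.0)],
]
partials = [combine(map_point(p, centroids) for p in split) for split in splits]
# B9: each cluster's result is stored under its own key.
new_centroids = {cid: reduce_cluster(vals)
                 for cid, vals in shuffle(partials).items()}
```

Summing per cluster on each worker before the shuffle means only one small partial result per cluster crosses the network, rather than every data point.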
Finally, in step S106, the results are tested, verified, evaluated, and deployed. The manner of testing, verifying, evaluating, and deploying the results in step S106 is not restricted; any existing or later-developed method may be used.
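Since the source leaves the evaluation method open, the following is only one illustrative option for step S106: score a clustering result by its within-cluster sum of squared errors and accept it only if it is no worse than a baseline. The metric, the acceptance rule, and the names `sse` and `accept` are assumptions.

```python
def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for point in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids
        )
    return total


def accept(points, centroids, baseline_centroids):
    """Verification: the result must not be worse than a baseline clustering."""
    return sse(points, centroids) <= sse(points, baseline_centroids)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
candidate = [(0.0, 1.0), (10.0, 11.0)]
baseline = [(5.0, 6.0)]  # a single cluster at the overall mean
```

A lower score means tighter clusters; comparing against a trivial single-cluster baseline is a cheap sanity check before a result is deployed.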
It can be seen that, through the above processing, the big data information analysis method can fully meet the challenges of the big data era and improve the accuracy, timeliness, and flexibility of big data analysis.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, device embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc.
The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention falls within its scope of protection.

Claims (10)

1. A big data analysis method based on a cloud platform, characterized in that the method comprises:
determining a data analysis target and plan;
creating, according to the determined data analysis target and plan, a cloud-platform-based big data analysis framework;
obtaining the big data to be analyzed, and preparing and processing the data;
filtering the data to obtain complete and non-duplicate data;
clustering the data and analyzing the data;
testing, verifying, evaluating, and deploying the results.
2. The method according to claim 1, characterized in that the analysis requirements and attribute objects are mined according to the different features, characteristics, and/or attributes of different data.
3. The method according to claim 2, characterized in that the analysis framework may use a central data processing framework or a distributed data processing framework.
4. The method according to claim 2, characterized in that the analysis framework may be a framework of any form that is based on the characteristics of the big data.
5. The method according to any one of claims 1-4, characterized in that obtaining the big data to be analyzed and preparing and processing the data comprises:
posting the data first, so that it can be processed;
storing the data;
converting the data into a format, the format being a pair of values in binary format;
obtaining the identifier of the data and the corresponding description;
updating the data at every predetermined time period, while ensuring that not all of the data is posted.
6. The method according to claim 5, characterized in that the time period is set manually or automatically by machine, according to need or to the characteristics of the data.
7. The method according to any one of claims 1-4, characterized in that clustering the data and analyzing the data comprises:
identifying associated data;
determining each data point to be processed;
reducing the data volume using a clustering machine learning algorithm;
analyzing the data set using the clustering machine learning algorithm.
8. The method according to claim 7, characterized in that clustering the data and analyzing the data comprises:
generating, for each data point to be processed, a pair of values in binary format;
the pair of values in binary format further comprising a cluster identifier and the coordinate values corresponding to the data point;
generating, for each cluster, the sum of the inputs;
sending the values that relate to the same cluster;
storing the clustering results as unrelated data.
9. The method according to claim 7 or 8, characterized in that the machine learning algorithm is a mean algorithm.
10. The method according to any one of claims 1-9, characterized in that filtering the data to obtain complete and non-duplicate data comprises:
filtering the data using Hadoop in distributed mode to obtain complete and non-duplicate data.
CN201710356074.2A 2017-05-19 2017-05-19 A kind of big data analysis method based on cloud platform Withdrawn CN106971011A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710356074.2A CN106971011A (en) 2017-05-19 2017-05-19 A kind of big data analysis method based on cloud platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710356074.2A CN106971011A (en) 2017-05-19 2017-05-19 A kind of big data analysis method based on cloud platform

Publications (1)

Publication Number Publication Date
CN106971011A true CN106971011A (en) 2017-07-21

Family

ID=59325805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710356074.2A Withdrawn CN106971011A (en) 2017-05-19 2017-05-19 A kind of big data analysis method based on cloud platform

Country Status (1)

Country Link
CN (1) CN106971011A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107741879A (en) * 2017-10-19 2018-02-27 郑州云海信息技术有限公司 A kind of big data processing method and its device
CN108038228A (en) * 2017-12-25 2018-05-15 佛山市车品匠汽车用品有限公司 A kind of method for digging and device based on database

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104320460A (en) * 2014-10-24 2015-01-28 西安未来国际信息股份有限公司 Big data processing method
CN105260448A (en) * 2015-10-10 2016-01-20 成都博元时代软件有限公司 Big data information analysis method
CN106202192A (en) * 2016-06-28 2016-12-07 浪潮软件集团有限公司 Workflow-based big data analysis method
CN106339439A (en) * 2016-08-22 2017-01-18 成都众易通科技有限公司 Big data analysis method

Similar Documents

Publication Publication Date Title
CN110363449B (en) Risk identification method, device and system
CN102591917B (en) Data processing method and system and related device
CN106709012A (en) Method and device for analyzing big data
CN113157448A (en) System and method for managing feature processing
CN105843841A (en) Small file storing method and system
US20150100596A1 (en) System and method for performing set operations with defined sketch accuracy distribution
US10268749B1 (en) Clustering sparse high dimensional data using sketches
CN107577724A (en) A kind of big data processing method
CN107748752A (en) A kind of data processing method and device
CN112765468A (en) Personalized user service customization method and device
CN106971011A (en) A kind of big data analysis method based on cloud platform
Chen Higher mathematics teaching resource scheduling system based on cloud computing
CN107871055A (en) A kind of data analysing method and device
US11783221B2 (en) Data exposure for transparency in artificial intelligence
Gupta et al. Feature selection: an overview
Pranav et al. Data mining in cloud computing
CN108256694A (en) Based on Fuzzy time sequence forecasting system, the method and device for repeating genetic algorithm
CN112613562B (en) Data analysis system and method based on multi-center cloud computing
CN109032940A (en) A kind of test scene input method, device, equipment and storage medium
Ma The Research of Stock Predictive Model based on the Combination of CART and DBSCAN
CN108268620A (en) A kind of Document Classification Method based on hadoop data minings
Zhang et al. Self‐Adaptive K‐Means Based on a Covering Algorithm
CN106528872B (en) A kind of data search method under big data environment
CN108090182B (en) A kind of distributed index method and system of extensive high dimensional data
Han et al. Research on data mining and visualization technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20170721