CN106971011A

CN106971011A - A big data analysis method based on a cloud platform

Info

Publication number: CN106971011A
Application number: CN201710356074.2A
Authority: CN
Inventors: 陈彬强; 蔡勇
Original assignee: Zhaoqing Chicco Motor Co Ltd
Current assignee: Zhaoqing Chicco Motor Co Ltd
Priority date: 2017-05-19
Filing date: 2017-05-19
Publication date: 2017-07-21

Abstract

An embodiment of the present invention discloses a big data analysis method based on a cloud platform. The method includes: determining a data analysis target and plan; creating a cloud-platform-based big data analysis framework according to the determined target and plan; obtaining the big data to be analyzed and performing data preparation and processing; filtering the data to obtain complete, non-duplicated data; clustering the data and analyzing it; and testing, verifying, assessing, and disposing of the results. Applying the embodiment improves the accuracy, timeliness, and flexibility of big data analysis.

Description

A big data analysis method based on a cloud platform

Technical field

The present invention relates to the technical field of big data analysis, and in particular to a big data analysis method based on a cloud platform.

Background art

With the continuing industrialization of society and the rising level of informatization, data has replaced computation as the center of information processing, and cloud computing and big data have become a trend. This touches many aspects, including storage capacity, availability, I/O performance, data security, and scalability. Big data refers to data sets that are extremely large and complex. Big data is characterized by the four Vs: Volume, meaning the amount of data grows continuously and rapidly; Velocity, meaning data I/O is ever faster; Variety, meaning data types and sources are diverse; and Value, meaning the data has usable value in many respects. Because big data contains a massive amount of information, distributed big data analysis and mining is the best way to exploit the data resources available in that information. However, prior-art distributed data systems and their associated databases cannot keep up with the growing data volume and the demand for analysis and mining. Their data processing is not efficient enough and their response is not timely enough, because they cannot effectively acquire, store, manage, mine, and analyze data with these characteristics, so they struggle to deliver accurate, timely, and flexible data processing.

Therefore, to meet the challenges of the big data era and to improve the accuracy, timeliness, and flexibility of big data analysis, and in particular the accuracy, timeliness, flexibility, and quality of the analysis results, the art needs a big data information analysis method that effectively solves the above technical problems.

Summary of the invention

The purpose of the embodiments of the present invention is to provide a big data analysis method based on a cloud platform that improves the accuracy, timeliness, and flexibility of big data analysis.

To achieve the above purpose, an embodiment of the present invention discloses a big data analysis method based on a cloud platform, the method including:

determining a data analysis target and plan;

creating a cloud-platform-based big data analysis framework according to the determined data analysis target and plan;

obtaining the big data to be analyzed, and performing data preparation and processing;

filtering the data to obtain complete, non-duplicated data;

clustering the data and analyzing it;

testing, verifying, assessing, and disposing of the results.

Optionally, analysis requirements and attribute objects are mined according to the different features, characteristics, and/or attributes of different data.

Optionally, the analysis framework may be a centralized data processing framework or a distributed data processing framework.

Optionally, the analysis framework may be a framework of any form that is based on the characteristics of big data.

Optionally, obtaining the big data to be analyzed and performing data preparation and processing includes:

to process the data, first posting the data;

storing the data;

converting the data into a format, the format being a pair of values in binary format;

obtaining the identifier of the data and its corresponding description;

updating the data every predetermined time period, while ensuring that not all of the data is posted.

Optionally, the time period is set manually or automatically by machine according to need or the characteristics of the data.

Optionally, clustering the data and analyzing it includes:

identifying the associated data;

determining each data point to be processed;

using a clustering machine learning algorithm to reduce the data volume;

using the clustering machine learning algorithm to analyze the data set;

for each data point to be processed, generating a pair of values in binary format;

the pair of values in binary format further including a cluster identifier and the coordinate values corresponding to the data point;

for each cluster, generating the sum of the inputs;

sending the values related to the same cluster;

storing the clustering results as uncorrelated data.

Optionally, the machine learning algorithm is a mean algorithm.

Optionally, filtering the data to obtain complete, non-duplicated data includes:

filtering the data in Hadoop distributed mode to obtain complete, non-duplicated data.

It can be seen that, with the big data analysis method based on a cloud platform provided by the embodiments of the present invention, a data analysis target and plan are determined; a cloud-platform-based big data analysis framework is created according to the determined target and plan; the big data to be analyzed is obtained, and data preparation and processing are performed; the data is filtered to obtain complete, non-duplicated data; the data is clustered and analyzed; and the results are tested, verified, assessed, and disposed of. The method can thus meet the challenges of the big data era and improve the accuracy, timeliness, and flexibility of big data analysis.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of a big data analysis method based on a cloud platform according to an embodiment of the present invention.

Fig. 2 is a flow chart of step S103 in Fig. 1 according to an embodiment of the present invention.

Fig. 3 is a flow chart of step S105 in Fig. 1 according to an embodiment of the present invention.

Fig. 4 is another flow chart of step S105 in Fig. 1 according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments that a person of ordinary skill in the art obtains from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Fig. 1 is a schematic flow chart of a big data analysis method based on a cloud platform according to an embodiment of the present invention. As shown in Fig. 1, the method may include the following steps:

S101: determine a data analysis target and plan;

S102: create a cloud-platform-based big data analysis framework according to the determined data analysis target and plan;

S103: obtain the big data to be analyzed, and perform data preparation and processing;

S104: filter the data to obtain complete, non-duplicated data;

S105: cluster the data and analyze it;

S106: test, verify, assess, and dispose of the results.

According to an embodiment of the present invention, first, in step S101, the data analysis target and plan are determined. Analysis requirements and attribute objects are mined according to the different features, characteristics, and/or attributes of different data, because different data have different features, characteristics, and/or attributes. For example, social media big data is based on interaction between people; military news big data implies or concentrates data on military weapons or military trends; social news big data reflects spin, including the ideological leanings of self-media publishers; and technology news big data for a given country, region, or research institution contains its research priorities, its allocation of personnel and funding, its output efficiency, its possible range of application, and its leading role in or influence on fields of research and application. Given these differing contexts, analysis requirements and attribute objects need to be mined specifically for each kind of data. This makes the big data analysis more targeted and lays a solid foundation for the accuracy of the subsequent clustering.

Second, in step S102, a big-data-based analysis framework is created according to the determined data analysis target and plan. Specifically, the analysis framework may be a framework of any form that is based on the characteristics of big data. Because different data have different features, characteristics, and/or attributes, the framework can be built specifically around them. The framework may be based on any architecture, for example, but not limited to, a centralized data processing framework or a distributed data processing framework. Frameworks of other forms may also be used, provided they are based on the characteristics of big data.

Next, in step S103, the big data to be analyzed is obtained, and data preparation and processing are performed. Fig. 2 is a flow chart of S103 according to an embodiment of the present invention. As shown in Fig. 2, preparing the data safeguards the subsequent analysis. Specifically, the step may include: A1: to process the data, first post the data; A2: store the data; A3: convert the data into a format, the format being a pair of values in binary format; A4: obtain the identifier of the data and its corresponding description; A5: update the data every predetermined time period, while ensuring that not all of the data is posted. The time period may be set manually or automatically by machine according to need or the characteristics of the data. Through these steps, the data receives preliminary processing in preparation for accurate analysis.
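Steps A1 to A5 can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `PreparedRecord`, `to_binary_pair`, and `RecordStore` are hypothetical, JSON encoded as UTF-8 bytes is merely one assumed binary format, and the sketch reads the A5 constraint as "re-post only the records that changed in each period", which is one possible interpretation. Scheduling of the period itself is left out.

```python
import json
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PreparedRecord:
    key: bytes         # identifier, binary-encoded (first value of the pair)
    value: bytes       # payload, binary-encoded (second value of the pair)
    description: str   # description stored alongside the identifier (A4)


def to_binary_pair(identifier: str, payload: dict) -> Tuple[bytes, bytes]:
    """A3: convert one record into a pair of values in binary format."""
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return identifier.encode("utf-8"), encoded


class RecordStore:
    """Hypothetical in-memory stand-in for cloud platform storage (A1, A2)."""

    def __init__(self) -> None:
        self._records: Dict[bytes, PreparedRecord] = {}

    def post(self, identifier: str, description: str, payload: dict) -> PreparedRecord:
        """A1 and A2: post the record and store it."""
        key, value = to_binary_pair(identifier, payload)
        record = PreparedRecord(key, value, description)
        self._records[key] = record
        return record

    def describe(self, identifier: str) -> Tuple[str, str]:
        """A4: return the identifier together with its description."""
        return identifier, self._records[identifier.encode("utf-8")].description

    def update(self, candidates: Dict[str, Tuple[str, dict]]) -> int:
        """A5 (assumed reading): called once per period; re-posts only
        records that are new or whose encoded bytes changed."""
        reposted = 0
        for identifier, (description, payload) in candidates.items():
            key, value = to_binary_pair(identifier, payload)
            current = self._records.get(key)
            if current is None or current.value != value:
                self._records[key] = PreparedRecord(key, value, description)
                reposted += 1
        return reposted
```

In this sketch, calling `update` with one unchanged record and one new record re-posts only the new one, so a periodic update never has to touch the whole data set.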

Next, in step S104, the data may be filtered in Hadoop distributed mode to obtain complete, non-duplicated data.

Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. Users can develop distributed programs without understanding the low-level details of distribution, making full use of the power of a cluster for high-speed computation and storage.

Hadoop implements a distributed file system, the Hadoop Distributed File System, abbreviated HDFS. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and suits applications with large data sets. HDFS relaxes some POSIX requirements to allow streaming access to file system data.

The core of the Hadoop framework design is HDFS and MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over it.
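The filtering of step S104 maps naturally onto the MapReduce model described above. The following is a minimal single-process Python sketch of that idea, not Hadoop code and not the patent's implementation: the field names in `REQUIRED_FIELDS` are a hypothetical schema, and `shuffle` merely imitates the grouping that a MapReduce runtime performs between the map and reduce phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical schema: a record counts as complete when all of these are present.
REQUIRED_FIELDS = ("id", "timestamp", "value")

Record = Dict[str, object]


def map_record(record: Record) -> Iterator[Tuple[tuple, Record]]:
    """Map phase: drop incomplete records, key the rest by their content."""
    if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return  # incomplete record is filtered out
    key = tuple(record[field] for field in REQUIRED_FIELDS)
    yield key, record


def shuffle(pairs: Iterable[Tuple[tuple, Record]]) -> Dict[tuple, List[Record]]:
    """Stand-in for the runtime's shuffle: group values that share a key."""
    groups: Dict[tuple, List[Record]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key: tuple, records: List[Record]) -> Record:
    """Reduce phase: all records in a group are duplicates, so keep one."""
    return records[0]


def filter_records(records: Iterable[Record]) -> List[Record]:
    """Return the complete, non-duplicated records."""
    pairs = (pair for record in records for pair in map_record(record))
    return [reduce_group(key, group) for key, group in shuffle(pairs).items()]
```

Keying each record by its content means duplicates land in the same reduce group, which is why deduplication needs no cross-group coordination and can be distributed.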

Next, in step S105, the data is clustered and analyzed. Fig. 3 is a flow chart of S105 according to an embodiment of the present invention, illustrating how the data is clustered and analyzed. Specifically, the step may include: B1: identify the associated data; B2: determine each data point to be processed; B3: use a clustering machine learning algorithm to reduce the data volume; B4: use the clustering machine learning algorithm to analyze the data set.
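As a rough illustration of B3 and B4, the Python sketch below clusters points and then replaces them with one (centroid, count) summary per cluster, which is one way clustering can reduce data volume. It assumes a k-means style algorithm with caller-supplied initial centroids; the patent does not fix the algorithm at this point, and all function names here are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def mean(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points: Sequence[Point], centroids: Sequence[Point],
            iterations: int = 10) -> List[Point]:
    """Plain k-means (Lloyd's iteration) with a fixed number of rounds."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters: List[List[Point]] = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # A cluster that received no points keeps its previous centroid.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def summarize(points: Sequence[Point],
              centroids: Sequence[Point]) -> List[Tuple[Point, int]]:
    """B3: replace the raw points with one (centroid, count) pair per cluster."""
    counts = [0] * len(centroids)
    for p in points:
        counts[nearest(p, centroids)] += 1
    return list(zip(centroids, counts))
```

The summaries are much smaller than the raw data, so later analysis (B4) can work on cluster centers and sizes instead of every point.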

Fig. 4 is another flow chart of S105 according to an embodiment of the present invention. As shown in Fig. 4, clustering the data and analyzing it may include the following steps: B1: identify the associated data; B2: determine each data point to be processed; B3: use a clustering machine learning algorithm to reduce the data volume; B4: use the clustering machine learning algorithm to analyze the data set; B5: for each data point to be processed, generate a pair of values in binary format; B6: the pair of values in binary format further includes a cluster identifier and the coordinate values corresponding to the data point; B7: for each cluster, generate the sum of the inputs; B8: send the values related to the same cluster; B9: store the clustering results as uncorrelated data. Through these steps, the data obtained from the big data is analyzed in detail, which greatly improves the accuracy of big data analysis. Preferably, in steps B3 and B4, the machine learning algorithm may be, for example, a mean algorithm.
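Steps B5 to B8 resemble one iteration of k-means expressed in map/reduce form. The single-process Python sketch below follows that reading; it assumes that the "mean algorithm" is a k-means style procedure, which the text does not state explicitly, and the function names are hypothetical. Binary encoding of the pairs is omitted for readability.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Point = Tuple[float, ...]


def map_point(point: Point, centroids: Sequence[Point]) -> Tuple[int, Point]:
    """B5/B6: emit a (cluster identifier, coordinates) pair for one point."""
    cluster_id = min(
        range(len(centroids)),
        key=lambda i: sum((x - y) ** 2 for x, y in zip(point, centroids[i])),
    )
    return cluster_id, point


def combine(pairs: Iterable[Tuple[int, Point]]) -> Dict[int, Tuple[List[float], int]]:
    """B7: per cluster, the coordinate-wise sum of its inputs and their count."""
    sums: Dict[int, Tuple[List[float], int]] = {}
    for cluster_id, point in pairs:
        total, count = sums.get(cluster_id, ([0.0] * len(point), 0))
        sums[cluster_id] = ([t + x for t, x in zip(total, point)], count + 1)
    return sums


def reduce_cluster(total: List[float], count: int) -> Point:
    """B8: values for the same cluster arrive together; average them."""
    return tuple(t / count for t in total)


def iterate(points: Sequence[Point], centroids: Sequence[Point]) -> List[Point]:
    """One map/combine/reduce round that returns the updated centroids."""
    sums = combine(map_point(p, centroids) for p in points)
    return [
        reduce_cluster(*sums[i]) if i in sums else centroids[i]
        for i in range(len(centroids))
    ]
```

Sending only per-cluster sums and counts, rather than every point, keeps the data moved between phases proportional to the number of clusters, which is the usual motivation for a combine step.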

Finally, in step S106, the results are tested, verified, assessed, and disposed of. Specifically, the manner of testing, verifying, assessing, and disposing of the results in step S106 is arbitrary; any existing or later-developed manner may be used.

It can be seen that, with the above processing, the big data information analysis method can fully meet the challenges of the big data era and improve the accuracy, timeliness, and flexibility of big data analysis.

It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The embodiments in this specification are described in a related manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, a device embodiment is described relatively briefly because it is substantially similar to the method embodiment; for the relevant parts, refer to the description of the method embodiment.

A person of ordinary skill in the art will appreciate that all or part of the steps in the above method embodiments can be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention falls within its scope of protection.

Claims

1. A big data analysis method based on a cloud platform, characterized in that the method includes:

determining a data analysis target and plan;

clustering the data and analyzing it;

testing, verifying, assessing, and disposing of the results.

2. The method according to claim 1, characterized in that analysis requirements and attribute objects are mined according to the different features, characteristics, and/or attributes of different data.

3. The method according to claim 2, characterized in that the analysis framework may be a centralized data processing framework or a distributed data processing framework.

4. The method according to claim 2, characterized in that the analysis framework may be a framework of any form that is based on the characteristics of big data.

5. The method according to any one of claims 1-4, characterized in that obtaining the big data to be analyzed and performing data preparation and processing includes:

to process the data, first posting the data;

storing the data;

obtaining the identifier of the data and its corresponding description;

6. The method according to claim 5, characterized in that the time period is set manually or automatically by machine according to need or the characteristics of the data.

7. The method according to any one of claims 1-4, characterized in that clustering the data and analyzing it includes: