CN107153710A

CN107153710A - A big data processing method and system

Info

Publication number: CN107153710A
Application number: CN201710356324.2A
Authority: CN
Inventors: 陈彬强; 蔡勇
Original assignee: Zhaoqing Chicco Motor Co Ltd
Current assignee: Zhaoqing Chicco Motor Co Ltd
Priority date: 2017-05-19
Filing date: 2017-05-19
Publication date: 2017-09-12

Abstract

An embodiment of the invention discloses a big data processing method and system. The method includes: collecting data based on user behavior such as past browsing history and purchase records; filtering the data collected by the data acquisition module using Hadoop's distributed mode to obtain complete, non-duplicated data; converting the complete, non-duplicated data output by the data filtering module into computer language and storing it in a database; and calling the information stored in the database and processing the retrieved data using cloud computing. With this embodiment, distributed data mining of big data relies on cloud computing, so website user behavior data can be mined effectively and processed in the cloud in real time.

Description

A big data processing method and system

Technical field

The present invention relates to the technical field of big data processing, and in particular to a big data processing method and system.

Background art

In recent years the internet has developed ever more rapidly and its use has become increasingly widespread. When people use the internet for everyday activities such as shopping online, watching programs, or looking up information and products, they generate large amounts of data. These data are very valuable to e-commerce websites and internet media websites, and processing this big data can yield considerable commercial value.

Big data is widely used in internet projects and is of great significance to websites. Mass data processing combined with cloud computing can bring the greatest improvement to the advertising systems of internet media websites and to the big data product push systems of e-commerce websites. On internet media websites, big data advertising is pushed according to users' reading preferences: mass data is processed by cloud computing and the results are pushed to visitors in various advertisement formats. On e-commerce websites, big data product recommendations are pushed to online shoppers: the system processes users' click behavior, purchase behavior, product correlations, preferences, and usage time patterns, and pushes the corresponding products and promotional information.

The emergence of big data is triggering profound technological and business change worldwide. Technically, big data has changed the conventional way of extracting information from data. Machine learning, which plays an important role in search engines and online advertising, is regarded as the field where big data delivers real value. By statistically extracting patterns such as people's behavior and habits from massive data, it helps advertisers find precise potential customers as far as possible, improving advertising effectiveness and subsequent purchases.

However, current big data applications have many shortcomings, for example:

1. Data processing requires the accumulation of massive data. Current big data processing is based on millions of users and their historical behavior, while the vast majority of platforms and enterprises lack such a big data foundation and often hold only small or medium data; data such as behavioral habits, purchase records, and browsing records are scarcer still.

2. Data processing requires powerful software and hardware support. Big data computation currently has a high threshold, so it is not yet widespread. Big data computation today falls mainly into two ecosystems: the open-source big data ecosystem and the commercial big data ecosystem.

3. Data processing depends on interpretation by a large number of professionals. Behavioral modeling of big data places strong demands on mathematical statistics and computer modeling, and such talent is still lacking domestically; for example, it requires proficiency with database management systems, probability and statistics, and so on.

4. Data processing results may also be misjudged. Big data results often lack real-time relevance and specificity; differences in the sampling precision of the raw data and in statistical methods, as well as errors in the model structure, can all cause processing errors. In addition, different usage scenarios can yield entirely different results.

Summary of the invention

The purpose of the embodiments of the present invention is to provide a big data processing method and system that relies on cloud computing to perform distributed data mining on big data, so that website user behavior data can be mined effectively and processed in the cloud in real time.

To achieve the above purpose, an embodiment of the invention discloses a big data processing method, the method including:

collecting data based on user behavior such as past browsing history and purchase records;

filtering, using Hadoop's distributed mode, the data collected by the data acquisition module to obtain complete, non-duplicated data;

converting the complete, non-duplicated data output by the data filtering module into computer language and storing it in a database;

calling the information stored in the database, and processing the retrieved data using cloud computing.

To achieve the above purpose, an embodiment of the invention also discloses a big data processing system, the system including:

a big data acquisition module, configured to collect data based on user behavior such as past browsing history and purchase records;

a big data filtering module, configured to filter, using Hadoop's distributed mode, the data collected by the data acquisition module to obtain complete, non-duplicated data;

a collector, configured to convert the complete, non-duplicated data output by the data filtering module into computer language;

a database, in which the computer language produced by the collector from the complete, non-duplicated data output by the data filtering module can be stored;

an operating system, through which the data stored in the database can be called;

a cloud computing module, which can process the data in the database.

Optionally, the data processing system may further include:

a web server, through which the data in multiple databases can be linked together to provide a larger body of data.

Optionally, the operating system is the Linux operating system.

Optionally, the web server is the Apache web server.

Optionally, the database is a MySQL database.

Optionally, the collector is the Perl, PHP, or Python programming language.

Optionally, the data collected by the data acquisition module undergoes distributed data mining through the cloud computing, thereby effectively mining the required data.

Optionally, the data processing system may further include:

a Storm topology architecture, through which deviations in data processing can be corrected in real time without the need for professionals.

Optionally, the data processing system may further include:

a simple Storm topology with MapReduce functionality, which can correct deviations in data processing in real time.

It can be seen that the big data processing method and system provided by the embodiments of the present invention can improve the precision of a website's advertisement pushing and of a store's product display. By means of the big data processing technology, the platform can quickly learn users' behavioral habits and preferences and their real-time dynamic interaction during use, so that advertisements and products of interest are shown at the appropriate time in a user-friendly form on the website, solving the problem of imprecise conventional advertising and product display. It addresses domestic enterprises' shortcomings in software and hardware and their operators' lack of experience, helps the platform overcome problems such as disordered raw data, big data modeling, and data processing and prediction, and provides real-time, relatively efficient data support. Relying on cloud computing, distributed data mining can be performed on big data, so that website user behavior data is mined effectively and processed in the cloud in real time. In addition, the included Storm topology can correct data processing deviations in real time without the need for professionals.

Of course, any product or method implementing the present invention need not achieve all of the above advantages at the same time.

Brief description of the drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a big data processing method provided by an embodiment of the present invention.

Fig. 2 is a schematic diagram of distributed data mining provided by an embodiment of the present invention.

Fig. 3 is a schematic diagram of a Storm topology architecture provided by an embodiment of the present invention.

Fig. 4 is a schematic diagram of a simple Storm topology with MapReduce functionality provided by an embodiment of the present invention.

Fig. 5 is a schematic diagram of a Hadoop cloud architecture configuration scheme provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the invention.

Referring to Fig. 1, which is a schematic flowchart of a big data processing method provided by an embodiment of the present invention, the method may include the following steps:

S101: collect data based on user behavior such as past browsing history and purchase records;

S102: using Hadoop's distributed mode, filter the data collected by the data acquisition module to obtain complete, non-duplicated data; Hadoop's distributed mode is prior art and is not described further here;

S103: convert the complete, non-duplicated data output by the data filtering module into computer language and store it in a database;

S104: call the information stored in the database, and process the retrieved data using cloud computing.
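The filtering in step S102 can be pictured with a minimal single-process sketch. The patent runs this step in Hadoop's distributed mode; the version below only illustrates what "complete and non-duplicated" might mean for one batch. The field names in `REQUIRED_FIELDS` and the sample records are hypothetical and do not come from the source.

```python
from typing import Iterable, List

# Hypothetical record layout: these field names are illustrative only.
REQUIRED_FIELDS = ("user_id", "action", "item_id", "timestamp")


def is_complete(record: dict) -> bool:
    """A record is complete when every required field is present and non-empty."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def filter_records(records: Iterable[dict]) -> List[dict]:
    """Keep complete records and drop exact duplicates, preserving input order."""
    seen = set()
    kept = []
    for record in records:
        if not is_complete(record):
            continue
        key = tuple(record[field] for field in REQUIRED_FIELDS)
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept


raw = [
    {"user_id": 1, "action": "view", "item_id": "A1", "timestamp": 100},
    {"user_id": 1, "action": "view", "item_id": "A1", "timestamp": 100},  # duplicate
    {"user_id": 2, "action": "buy", "item_id": "", "timestamp": 101},     # incomplete
    {"user_id": 2, "action": "buy", "item_id": "B7", "timestamp": 102},
]
clean = filter_records(raw)
```

In a distributed setting the same two checks would typically be split across workers, with the duplicate check keyed on the record fields so that identical records meet on the same worker.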

The big data processing system is applied to e-commerce websites, for example to store A. The big data processing system of store A mainly performs large-scale, timely processing of user behavior such as past browsing history and purchase records, forming a large dynamic data warehouse for the store. Based on purchase preferences and purchase frequency, it uses data mining to push product information to users promptly, and automatically and periodically sends product advertisements in various forms, including EDM, SMS, and in-site messages. Further, the big data processing system also serves as the basis for assessing product popularity and layout in the store: popular, frequently purchased products are automatically moved to the most prominent positions by the system. Depending on the customer's access channel (visitors are generally identified by IP address or by account, in strict compliance with security and confidentiality principles), the recommended and popular products on the website are updated quickly as the user acts, matching products the user is interested in and thereby maximizing precise sales of the site's products.

To implement the functions of store A's big data processing system, the big data processing system provided by the invention uses a distributed computing architecture (LAMP). The LAMP architecture comprises the Linux operating system, the Apache web server, the MySQL database, and the Perl, PHP, or Python programming language. All components are open-source software, and the stack is an internationally mature architecture. Compared with the Java/J2EE architecture, LAMP features abundant web resources, light weight, and security; compared with Microsoft's .NET architecture, LAMP has the advantages of generality, cross-platform support, and high performance. At the same time, store data is backed up in real time and transactions are processed quickly, providing complete data processing functions. In addition, through cloud computing, working with massively parallel processing (MPP) databases, distributed databases, and the like, the system can process store users' buying habits quickly, in volume, and accurately, and push matching products that are presented to buyers in diverse forms, effectively raising the probability and frequency of purchases.

The big data processing system is also applied to internet media websites, for example to website a. The big data processing system of website a, and in particular its big data advertising system, automatically helps paying advertisers on the site match potential customers as closely as possible: by processing large volumes of user behavior data in the cloud, it pushes relevant advertising to visitors within a short time. This encourages online users to browse, click, and view advertisements in the categories they are interested in, and is the core internet technology for maximizing advertising value. The advertising system of website a also supports the vast majority of internet advertisement formats, including text links, display ads, and video ads. It has a sound advertisement scheduling mechanism and can accurately count advertisement page views (PV), click performance, and other statistics. It has an advertiser bidding system and can charge under various models such as CPC, CPM, CPA, CPS, and CPV.

To implement the functions of website a's big data advertising system, the big data processing system provided by the invention uses a distributed computing architecture (LAMP). The LAMP architecture comprises the Linux operating system, the Apache web server, the MySQL database, and the Perl, PHP, or Python programming language. All components are open-source software, and the stack is an internationally mature architecture. Compared with the Java/J2EE architecture, LAMP features abundant web resources, light weight, and security; compared with Microsoft's .NET architecture, LAMP has the advantages of generality, cross-platform support, and high performance. In addition, through cloud computing, working with massively parallel processing (MPP) databases, distributed databases, and the like, the system can process advertising information quickly, in volume, and accurately, and present it to users in diverse forms.
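The charging models named above can be illustrated with a short sketch. It covers only three of them, using their commonly understood meanings (CPM: price per thousand impressions; CPC: price per click; CPA: price per completed action). The function name, parameters, and all rates are hypothetical; the source does not describe how the bidding system computes charges.

```python
# Rates and event counts used with this function are made-up illustration values.
def campaign_charge(model: str, rate: float, impressions: int = 0,
                    clicks: int = 0, actions: int = 0) -> float:
    """Return the charge for one campaign under a given pricing model.

    cpm: rate is the price per 1000 impressions
    cpc: rate is the price per click
    cpa: rate is the price per completed action
    """
    if model == "cpm":
        return rate * impressions / 1000
    if model == "cpc":
        return rate * clicks
    if model == "cpa":
        return rate * actions
    raise ValueError(f"unsupported pricing model: {model}")


cpm_charge = campaign_charge("cpm", 5.0, impressions=20000)
cpc_charge = campaign_charge("cpc", 0.5, clicks=40)
cpa_charge = campaign_charge("cpa", 2.0, actions=3)
```

CPS and CPV would follow the same pattern, with the rate applied to sales or views respectively.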

As shown in Fig. 2, distributed data mining relies on cloud computing's distributed processing, distributed databases (PaaS), cloud storage, and virtualization technology (IaaS). The results computed in the cloud are presented on mobile terminals and PCs. Website user behavior data can be mined effectively and processed in the cloud in real time, returning the advertising information and products that interest the user.

With the arrival of the cloud era, big data has attracted growing attention. Big data is commonly used to describe the large volumes of unstructured and semi-structured data a company creates, data that would take too much time and money to load into a relational database for analysis. Big data processing is often associated with cloud computing, because real-time analysis of large data sets requires a framework such as MapReduce to distribute the work across tens, hundreds, or even thousands of computers.

Big data requires special techniques to process large amounts of data effectively within a tolerable elapsed time. Technologies applicable to big data include massively parallel processing (MPP) databases, data mining grids, distributed file systems, distributed databases, cloud computing platforms, the internet, and scalable storage systems.

As shown in Fig. 3, the Storm topology architecture is fast and efficient: using Storm, deviations in big data processing can be corrected in real time, and more accurate results can be obtained without specialist personnel. Storm is more than a traditional big data processing system; it is an example of a complex event processing (CEP) system. CEP systems are usually classified as computation-oriented or detection-oriented, and either kind can be implemented in Storm through user-defined algorithms. Notably, one of Storm's most important features is its focus on fault tolerance and management. Storm implements guaranteed message processing, so every tuple is fully processed through the topology; if a tuple is found to be unprocessed, it is automatically replayed from the spout. Storm also implements task-level fault detection: when a task fails, its messages are automatically reassigned so that processing restarts quickly. Storm includes more intelligent process management than Hadoop; processes are managed by supervisors to ensure that resources are fully used.
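The replay behavior described above can be pictured with a toy, single-process sketch. It is not the Storm API: the class `ReplayingSpout`, the driver `run`, and the failing function `flaky_upper` are hypothetical names, and the method names are only loosely modeled on a spout's emit, acknowledge, and fail cycle.

```python
from collections import deque


class ReplayingSpout:
    """Toy spout that re-emits a tuple until it is acknowledged."""

    def __init__(self, messages):
        self.queue = deque(enumerate(messages))  # (message id, payload)
        self.pending = {}                        # emitted but not yet acked

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        del self.pending[msg_id]

    def fail(self, msg_id):
        # Put the failed tuple back so that it is emitted again later.
        self.queue.append((msg_id, self.pending.pop(msg_id)))


def run(spout, process):
    """Drive the spout until every tuple has been processed successfully."""
    results = []
    while (item := spout.next_tuple()) is not None:
        msg_id, payload = item
        try:
            results.append(process(payload))
            spout.ack(msg_id)
        except RuntimeError:
            spout.fail(msg_id)
    return results


attempts = {}


def flaky_upper(text):
    """Fails the first time it sees 'b', then succeeds on the replay."""
    attempts[text] = attempts.get(text, 0) + 1
    if text == "b" and attempts[text] == 1:
        raise RuntimeError("simulated task failure")
    return text.upper()


spout = ReplayingSpout(["a", "b", "c"])
output = run(spout, flaky_upper)
```

Because the failed tuple is re-queued behind the remaining ones, replay guarantees that every message is eventually processed but does not preserve the original order.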

Specifically, Storm implements a data flow model in which data flows continuously through a network of transformation entities, as shown in Fig. 3. The abstraction for a data flow is called a stream, which is an unbounded sequence of tuples. A tuple is like a structure that can represent standard data types (such as integers, floats, and byte arrays) or user-defined types with some additional serialization code. Each stream is defined by a unique ID, which can be used to build topologies of data sources and sinks. Streams originate from spouts (message sources), which bring data from external sources into the Storm topology. A spout emits tuple streams to bolts (message processors); a bolt can perform operations such as filtering, aggregation, and database queries, can process the tuple stream stage by stage, and can perform stream transformations.

Fig. 4 shows a simple Storm topology that implements MapReduce functionality. For an ordinary platform or enterprise, a simpler Storm model is better suited to processing small and medium data flows and has a wider range of applications. A sink (or an entity that provides a transformation) is called a bolt. A bolt implements a single transformation on a stream, and bolts carry out all processing within a Storm topology. Bolts can implement traditional functions such as MapReduce as well as more complex operations (single-step functions) such as filtering, aggregation, or communication with external entities such as databases. A typical Storm topology implements multiple transformations and therefore requires multiple bolts with independent tuple streams. Spouts and bolts are each implemented as one or more tasks in the system.

It is worth noting that Storm can easily implement MapReduce functionality for word frequency counting. As shown in Fig. 4, the spout generates a stream of text data and a bolt implements the Map function (tokenizing the stream into individual words). The stream from the "map" bolt then flows into a bolt that implements the Reduce function (aggregating the words into counts).
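The word-frequency topology can be imitated in plain Python with generators standing in for the spout and the two bolts. This is only a sketch of the data flow, not Storm code; the function names and the sample sentences are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator


def sentence_spout() -> Iterator[str]:
    """Spout: emits a stream of text lines (a fixed sample here)."""
    yield from ["the cloud stores data", "the cloud processes data"]


def split_bolt(lines: Iterable[str]) -> Iterator[str]:
    """'Map' bolt: tokenizes each line and emits one word per tuple."""
    for line in lines:
        for word in line.lower().split():
            yield word


def count_bolt(words: Iterable[str]) -> Counter:
    """'Reduce' bolt: aggregates the words into per-word totals."""
    totals = Counter()
    for word in words:
        totals[word] += 1
    return totals


word_counts = count_bolt(split_bolt(sentence_spout()))
```

In a real topology the stream is unbounded, so the reduce stage would keep running totals and emit updates rather than return once at the end.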

As shown in Fig. 5, the Hadoop cloud architecture configuration scheme mainly illustrates how cloud computing is implemented: efficient data processing is achieved through configuration in the cloud. Hadoop MapReduce uses a master/slave structure. The master is the sole global manager of the whole cluster; its functions include job management, status monitoring, and task scheduling, and it corresponds to the JobTracker (job controller) in MapReduce. The slaves are responsible for executing tasks and reporting task status, and correspond to the TaskTrackers (task executors) in MapReduce.
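The division of labor between master and slaves can be sketched as a small in-process simulation. The two classes below borrow the names JobTracker and TaskTracker from the text but are not Hadoop code; the round-robin scheduling, the task dictionary layout, and the status strings are assumptions made for illustration.

```python
from collections import deque


class TaskTracker:
    """Slave side: executes a task and reports its status."""

    def __init__(self, name):
        self.name = name

    def execute(self, task):
        result = task["func"](*task["args"])
        return {"task_id": task["id"], "worker": self.name,
                "status": "SUCCEEDED", "result": result}


class JobTracker:
    """Master side: schedules tasks round-robin and records status reports."""

    def __init__(self, trackers):
        self.trackers = deque(trackers)
        self.status = {}

    def run_job(self, tasks):
        for task in tasks:
            tracker = self.trackers[0]
            self.trackers.rotate(-1)  # the next task goes to the next slave
            self.status[task["id"]] = tracker.execute(task)
        return [self.status[task["id"]]["result"] for task in tasks]


# Slave addresses follow the four slave IPs used in the example cluster.
slaves = [TaskTracker(f"192.168.1.{n}") for n in range(200, 204)]
master = JobTracker(slaves)
job = [{"id": i, "func": sum, "args": ([i, i + 1],)} for i in range(6)]
results = master.run_job(job)
```

The master holds the only global view (`status`), while each slave knows just the task it was handed, which mirrors the single-manager structure described above.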

Hadoop core is write using Java language, but supports the data processing write using various language to answer Use program.The realization of newest application program employs more abstruse route, to make full use of modern languages and their spy Property.

The specific operating steps are as follows. First, the Hadoop architecture is implemented on five machines.

Their IP addresses are, in order:

192.168.1.199(master)

192.168.1.200(slave)

192.168.1.201(slave)

192.168.1.202(slave)

192.168.1.203(slave)

First log in to the 199 server:

[root@localhost ~]# uname -ar

Linux localhost 2.6.18-92.el5 #1 SMP Tue Jun 10 18:49:47 EDT 2008 i686 i686 i386 GNU/Linux

Ensure that each computer name is globally unique:

hadoop1.test.com-----192.168.1.203

hadoop2.test.com-----192.168.1.202

hadoop3.test.com-----192.168.1.201

hadoop4.test.com-----192.168.1.200

hadoop5.test.com-----192.168.1.199
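The name-to-address mapping above can be checked for uniqueness and rendered as hosts-file lines with a short script. The host names and addresses are the ones listed in the text; the helper names are hypothetical, and whether every node needs all five entries in its /etc/hosts is an assumption, since the source only shows the master's own entry.

```python
# Host names and addresses are the ones listed in the text.
CLUSTER = {
    "hadoop1.test.com": "192.168.1.203",
    "hadoop2.test.com": "192.168.1.202",
    "hadoop3.test.com": "192.168.1.201",
    "hadoop4.test.com": "192.168.1.200",
    "hadoop5.test.com": "192.168.1.199",
}


def check_unique(cluster: dict) -> None:
    """Raise if two host names share an address."""
    addresses = list(cluster.values())
    if len(set(addresses)) != len(addresses):
        raise ValueError("duplicate IP address in cluster map")


def hosts_lines(cluster: dict) -> list:
    """Render /etc/hosts style lines, sorted by the last octet of the address."""
    ordered = sorted(cluster.items(), key=lambda kv: int(kv[1].rsplit(".", 1)[1]))
    return [f"{ip} {name}" for name, ip in ordered]


check_unique(CLUSTER)
lines = hosts_lines(CLUSTER)
```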

Set the hostname:

hostname hadoop5.test.com

[root@localhost ~]# vi /etc/hosts

127.0.0.1 localhost.localdomain localhost

192.168.1.199 hadoop5.test.com

[root@localhost ~]# uname -ar

Linux hadoop5.test.com 2.6.18-92.el5 #1 SMP Tue Jun 10 18:49:47 EDT 2008 i686 i686 i386 GNU/Linux

[root@localhost ~]# vi /etc/sysconfig/network

NETWORKING=yes

NETWORKING_IPV6=no

#HOSTNAME=localhost.localdomain

HOSTNAME=hadoop5.test.com

GATEWAY=192.168.1.254

Setting up passwordless SSH login:

Set up SSH trust from the master to each slave. Because the master starts Hadoop on all slaves over SSH, a one-way or two-way key trust is needed so that no password has to be entered when commands are run. On the master and on all slave machines, run: ssh-keygen -t rsa

When running this command, simply press Enter at each prompt. This generates the key file id_rsa.pub under /root/.ssh/. Copy this file from the master to each slave with scp (remember to change the file name), for example:

scp /root/.ssh/id_rsa.pub root@192.168.1.200:/root/.ssh/authorized_keys

Once the authorized_keys file is in place, it can be opened for inspection: the RSA public key serves as the key and user@IP as the value. At this point it can be verified that ssh from the master to a slave no longer requires a password. The reverse direction, from slave to master, is set up in the same way. As for why the reverse direction is needed: if Hadoop is always started and stopped from the master, there is no need for it; it is needed only if Hadoop should also be stoppable from a slave.
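Repeating the copy for every slave can be scripted. The sketch below only builds the shell command strings and does not run them. Note that it is a variation on the text, not the same command: the scp form shown above replaces any existing authorized_keys file on the slave, whereas this version appends the key. The function and constant names are hypothetical.

```python
import shlex

MASTER_PUBKEY = "/root/.ssh/id_rsa.pub"
SLAVES = ["192.168.1.200", "192.168.1.201", "192.168.1.202", "192.168.1.203"]


def key_distribution_commands(slaves, pubkey=MASTER_PUBKEY, user="root"):
    """Build one shell command per slave that appends the master's public key."""
    commands = []
    for host in slaves:
        # Runs on the slave: create ~/.ssh if needed, then append stdin to the file.
        remote = "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
        commands.append(
            f"cat {shlex.quote(pubkey)} | ssh {user}@{host} {shlex.quote(remote)}"
        )
    return commands


commands = key_distribution_commands(SLAVES)
```

Each command still asks for the slave's password once when it is run, since the trust is not yet established at that point.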

The specific steps for implementing advertisement pushing on internet media websites and product display on e-commerce websites are as follows:

(a) collect behavioral information such as the user's past browsing history and purchase records through the data acquisition module;

(b) convert the collected information into computer language and store it in a database; and

(c) when the user clicks on a web page, perform distributed data mining on the information in the database using cloud computing, and return the advertising information and products that interest the user.

Step (b) includes the following steps:

(b.1) convert the collected information into computer language using a collector;

(b.2) store the information converted into computer language in a database;

(b.3) link the information in multiple databases through a web server to form the big data; and

(b.4) call the collected information at any time through an operating system.

In step (b), the operating system is preferably the Linux operating system, the web server is preferably the Apache web server, the database is preferably a MySQL database, and the collector is preferably the Perl, PHP, or Python programming language.
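Steps (b.2) and (c) can be sketched with Python's built-in `sqlite3` module standing in for MySQL, so that the example runs without a database server. The table layout, the sample events, and the `top_category` helper are hypothetical; the source does not specify a schema or a mining algorithm, and a simple frequency count is used here only as a placeholder for the mining step.

```python
import sqlite3

# sqlite3 is a stand-in here so that the sketch runs without a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE behavior (user_id INTEGER, action TEXT, category TEXT)"
)

# Made-up behavior records: (user, action, product category).
events = [
    (1, "view", "shoes"), (1, "buy", "shoes"), (1, "view", "books"),
    (2, "view", "books"), (2, "view", "books"), (2, "buy", "toys"),
]
conn.executemany("INSERT INTO behavior VALUES (?, ?, ?)", events)


def top_category(user_id: int):
    """Return the category this user interacted with most often, or None."""
    row = conn.execute(
        "SELECT category, COUNT(*) AS n FROM behavior "
        "WHERE user_id = ? GROUP BY category ORDER BY n DESC, category LIMIT 1",
        (user_id,),
    ).fetchone()
    return row[0] if row else None


preferred_1 = top_category(1)
preferred_2 = top_category(2)
```

The category returned for a user could then drive which advertisements or products are shown when that user next clicks on a page.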

In summary, the big data processing system builds on the various existing big data solutions to form a concise and efficient technical approach. It is suitable for small and medium-sized enterprises, media platforms, and e-commerce platforms, is cost-effective, and can provide the data processing support needed for day-to-day operations, helping enterprises earn better returns.

It should be noted that, in this document, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the device embodiment is described relatively briefly because it is substantially similar to the method embodiment; for the relevant parts, refer to the description of the method embodiment.

Those of ordinary skill in the art will understand that all or part of the steps in the above method embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims

1. A big data processing method, characterized in that the method includes:

2. A big data processing system, characterized in that the system includes:

3. The big data processing system according to claim 2, characterized in that the data processing system may further include:

4. The big data processing system according to claim 2, characterized in that the operating system is the Linux operating system.

5. The big data processing system according to claim 3, characterized in that the web server is the Apache web server.

6. The big data processing system according to claim 2, characterized in that the database is a MySQL database.

7. The big data processing system according to claim 2, characterized in that the collector is the Perl, PHP, or Python programming language.

8. The big data processing system according to any one of claims 2 to 7, characterized in that the data collected by the data acquisition module undergoes distributed data mining through the cloud computing, thereby effectively mining the required data.

9. The big data processing system according to any one of claims 2 to 7, characterized in that the data processing system may further include:

10. The big data processing system according to any one of claims 2 to 7, characterized in that the data processing system may further include: