CN105930462A - Cloud computing platform based massive data processing method - Google Patents

Cloud computing platform based massive data processing method

Info

Publication number
CN105930462A
Authority
CN
China
Prior art keywords
information
data
cloud computing
computing platform
degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610255566.8A
Other languages
Chinese (zh)
Inventor
范东来
何宏靖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Business Big Data Technology Co Ltd
Original Assignee
Chengdu Business Big Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Business Big Data Technology Co Ltd
Priority to CN201610255566.8A
Publication of CN105930462A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the technical field of Internet information processing, and in particular to a massive data processing method based on a cloud computing platform. The method comprises: setting filter conditions such as fields, extracting the key information units from each document in the raw data, and forming a corresponding data record; storing each data record in a database; and, on this basis, using a big data processing framework on the cloud computing platform to abstract the association relationships between information units according to the identical information units contained in different data records. The method can, as required, analyze the associated information and corresponding association paths hidden behind massive target information in massive Internet information, and provide the analysis results to the user through a query port. This greatly reduces the time and manpower the user spends compiling and analyzing related information, and provides effective technical support for target background analysis, marketing, market segmentation, risk prediction, risk prevention and control, and the like.

Description

Massive data processing method based on a cloud computing platform
Technical field
The present invention relates to the field of Internet technology, and in particular to a massive data processing method based on a cloud computing platform.
Background art
With the development of society and the progress of science and technology, connections between individuals and groups have become ever closer, and these close connections have driven the rapid spread and growth of information. The world has entered the information age, and with the explosive growth and accumulation of information, the era of big data has arrived. The basic features of big data can be described by four "V"s: large data volume (Volume), wide variety (Variety), low value density (Value), and high speed and timeliness (Velocity). The two most important of these features, large volume and low value density, are exactly what makes such massive data hard to mine and use: finding precisely the information people care about within massive data is like looking for a needle in a haystack. At the same time, faced with massive information, how to analyze the correlation between certain kinds of information, and thereby uncover the value behind that information, reflects the value of data at a higher and deeper level. In big data analysis, correlation matters more than causation, yet quickly and accurately working out the association relationships within data at this scale is very difficult.
In fact, in a vast and complicated ocean of information, some pieces of information are far more closely connected to each other than to the rest, and such closely connected information often reflects particular real-life relationships between people or between groups. These relationships can cause the parties to influence or constrain one another in the relevant social or economic activities. From the viewpoint of information propagation on networks, identifying key information-connection nodes is of great benefit to social management and business activities, because information (or risk) spreads faster and further from these important nodes than from other information points. Such analysis can be applied in fields such as public-opinion monitoring, control of disease transmission, and advertising placement.
From another angle, for a specific information target, analyzing the association relationships between that target and other targets is of practical significance in many fields. Targets with association relationships often have a larger sphere of influence in their activities than an isolated individual, and when associated targets carry out activities externally, the mutual constraint or support arising from their internal relationships makes their course of events more complex than that of a single target. In real life the association relationships between information targets are extremely complex and usually hidden: people cannot perceive them from surface activity or surface information, and it is even harder to find out whether a target is associated with other targets and in what way. In such circumstances these implicit association relationships can bring considerable potential value or risk to people's social and economic activities. Analyzing them becomes even harder in the face of massive data; if the work were done case by case by hand, it would consume enormous manpower and time. A processing method is urgently needed to carry out this large and tedious computation for analysts and provide the analysis results.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a massive data processing method based on a cloud computing platform. The raw data to be processed is extracted from a database, and a big data processing framework on the cloud computing platform uses the identical information units found in different data records to work out the association relationships among massive target information. With the method, analysis targets can be set as required within massive Internet information, and it can then be determined whether different targets are associated and what kind of association they have. This provides a reliable and convenient route to the in-depth mining and application of data, and a new and effective approach to target background analysis, marketing, market segmentation, risk prediction, and risk prevention and control.
To achieve the above object, the present invention provides a massive data processing method based on a cloud computing platform: by setting filter conditions such as fields, the key information units in each document of the raw data are extracted; the extracted key information units are arranged in a set order into one data record; and each data record is stored in a database (usually a non-relational database). On this basis, according to the identical information units contained in different data records, a distributed processing model under the cloud computing framework is used to abstract the association relationships between information units.
Specifically, the method comprises the following implementation steps:
(1) From each piece of basic data in the raw data, extract the corresponding information according to the set fields to form a corresponding data record;
(2) A first data record contains first information and second information, where the second information is first-degree associated information of the first information; a second data record contains the second information and third information, where the third information is first-degree associated information of the second information; through the distributed processing framework on the cloud computing platform, the third information is made second-degree associated information of the first information, and the association path from the first information through the second information to the third information is extracted;
(3) If a third data record contains fourth information and the third information, where the fourth information is first-degree associated information of the third information, the distributed processing framework on the cloud computing platform extends the fourth information to be third-degree associated information of the first information, and the association path from the first information through the second information and the third information to the fourth information is extracted;
By analogy, the N-degree associated information with the first information as the starting point, and the corresponding association paths, are extracted, where N ≥ 1.
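The degree-by-degree expansion in steps (1) to (3) can be sketched on a single machine as follows. This is only an illustration under stated assumptions, not the patented implementation: the function names, the tuple representation of a data record, and the choice to keep only paths that reach a unit at its lowest degree are all hypothetical, and no distributed framework is involved.

```python
from collections import defaultdict


def first_degree_index(records):
    """Map each information unit to the units that share a data record with it."""
    index = defaultdict(set)
    for record in records:
        for unit in record:
            index[unit].update(u for u in record if u != unit)
    return index


def association_paths(records, start, max_degree):
    """Return {degree: [path, ...]} for association paths that begin at `start`.

    Assumption: a unit is reported only at the lowest degree at which it is reached.
    """
    index = first_degree_index(records)
    paths = {0: [(start,)]}
    seen = {start}
    for degree in range(1, max_degree + 1):
        next_paths, reached = [], set()
        for path in paths[degree - 1]:
            # The N-th degree is the first degree of the (N-1)-th degree.
            for neighbor in sorted(index[path[-1]]):
                if neighbor not in seen:
                    next_paths.append(path + (neighbor,))
                    reached.add(neighbor)
        seen |= reached
        if not next_paths:
            break
        paths[degree] = next_paths
    del paths[0]
    return paths


# Three hypothetical data records, each holding two information units.
records = [("A", "B"), ("B", "C"), ("C", "D")]
result = association_paths(records, "A", 3)
```

Here `result[2]` holds the path from A through B to C, and `result[3]` extends it to D, mirroring the first, second, third and fourth information in the steps above.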
The first information, second information, third information and fourth information refer to the content of the information and do not indicate any ordering of the information. With the method, the target information can be taken as the starting point (the starting point is chosen according to the needs of the analysis), and the other information associated with the target, together with the corresponding association paths, is found degree by degree; the association paths clearly show the specific route by which the analysis target is associated with each piece of associated information. The association computation is carried out by the big data processing framework of the cloud computing platform, which can process massive numbers of targets in parallel; that is, from the basic data through to the N-degree associated information, multiple targets are processed simultaneously. As the association degree N increases, the complexity of the computation and the dimensionality of the data keep growing, and this complex processing is carried out quickly by the big data processing framework of the cloud computing platform (for example MapReduce under Hadoop, or Spark). With frameworks such as MapReduce and Spark, the user only needs to write upper-layer instructions against the interfaces the framework provides, without concern for the underlying execution: the framework automatically calls the relevant internal resources according to those instructions, splits the task automatically, and distributes it to multiple internal nodes for processing, achieving efficient parallel computation, and after processing it automatically integrates the results and returns them to the user. The task is completed with a high degree of automation, which greatly saves manpower and improves data processing efficiency. By using the big data processing framework of the cloud computing platform, the invention provides a fast and reliable way to analyze the association background of massive numbers of targets.
The raw data in the present invention is stored in a database. The raw data may be data crawled from the Internet as required: the Internet contains extremely rich information sources, and crawling relevant information from it as required and processing the acquired information in depth provides a new approach to the refined processing and effective application of information.
Further, the computation of the N-degree association relationship is always based on the first-degree association relationship; that is, during the tracking (computation) of associated information described above, the N-th degree associated information is the first-degree associated information of the (N-1)-th degree associated information. Tracking associated information degree by degree in this way keeps the computation logic clear and the operation simple, and helps ensure the accuracy of the results.
Further, the data extracted in step (1) may first be preprocessed by cleaning.
Further, in step (1), the information units in a data record (an information unit is the content corresponding to a field) are separated by a separator, such as a space, a comma, or a pause mark (the Chinese enumeration comma). Separating the information units with a separator prevents the contents of different information units from running together, and provides the basis for the subsequent extraction and computation of associated information.
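As a small illustration of separator-delimited data records, the sketch below serializes the configured fields of one document into a single line and splits it back into information units. The field names, the comma separator and both function names are hypothetical choices made for this example; the patent does not prescribe them.

```python
# Hypothetical field list and separator; the patent leaves both configurable.
FIELDS = ["company", "legal_representative", "shareholder"]
SEPARATOR = ","


def to_record_line(document, fields=FIELDS, sep=SEPARATOR):
    """Extract the configured fields from one document and join them with `sep`."""
    units = []
    for field in fields:
        value = str(document.get(field, "")).strip()
        if sep in value:
            # A separator inside a value would make two units run together.
            raise ValueError(f"separator {sep!r} found in field {field!r}")
        units.append(value)
    return sep.join(units)


def from_record_line(line, sep=SEPARATOR):
    """Split a stored record line back into its information units."""
    return line.split(sep)


document = {"company": "A", "legal_representative": "B", "shareholder": "C", "noise": "x"}
line = to_record_line(document)
units = from_record_line(line)
```

Rejecting values that contain the separator is one simple way to keep units from merging; escaping or choosing a rarer separator would be alternatives.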
Further, the fields and contents of the data extracted in step (1) are treated as key-value pairs, with the field as the key and the content corresponding to the field as the value. According to the needs of the analysis, the content corresponding to any one field is chosen as the starting point of association tracking (the associated-to information, that is, the target), and the contents corresponding to the other fields in each data record are taken as first-degree associated information of that target, which completes the computation of the first-degree association relationship. The computation of first-degree associated information is the basis for the subsequent computation of N-degree associated information.
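The key-value view of a data record can be sketched as a plain dictionary. In this hedged example the field names and the function `first_degree_pairs` are hypothetical; the only behavior taken from the text is that one field's content is chosen as the starting point and the contents of the remaining fields become its first-degree associated information.

```python
def first_degree_pairs(record, start_field):
    """Pair the chosen starting content with the content of every other field."""
    target = record[start_field]
    return [
        (target, value)
        for field, value in record.items()
        if field != start_field and value  # skip the start field and empty values
    ]


# Field names are illustrative only.
record = {"company": "A", "legal_representative": "B", "shareholder": "C"}
pairs = first_degree_pairs(record, "company")
```

Choosing a different `start_field` for the same record yields a different set of first-degree pairs, which matches the statement that the starting point is selected according to the needs of the analysis.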
Further, the first-degree associated information formed in step (2) is stored in a set structure and order. Storing it this way gives the first-degree associated information formed for different targets a uniform storage format, which facilitates data processing in subsequent steps.
Further, the first-degree associated information formed in step (2) may be stored in the order: target (origin information), first-degree associated information, relation label. The relation label describes the relationship between this first-degree associated information and the target information, and provides a concise, intuitive description when association data is queried.
Further, the second-degree associated information formed in step (3) is stored in the order: first-degree associated data, second-degree associated data; the information units belonging to different association degrees are given corresponding marks, and the internal data storage structure of the first-degree and second-degree association relationships is the same as in the previous step. With these marks, information belonging to different association degrees can be distinguished easily, which simplifies data extraction and differentiation during the degree-by-degree computation of associated information.
Further, the N-degree associated information is stored in order of increasing association degree, and the information units belonging to different association degrees are given corresponding marks.
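One possible row layout for this storage order is sketched below. The column names `target` and `degree_N` stand in for the marks mentioned above and are assumptions made for this example; the patent does not specify the mark format.

```python
def path_to_row(path):
    """Lay out one association path as a row ordered by increasing degree.

    The keys act as the per-degree marks; their names are hypothetical.
    """
    row = {"target": path[0]}
    for degree, unit in enumerate(path[1:], start=1):
        row[f"degree_{degree}"] = unit
    return row


row = path_to_row(("A", "B", "C"))
```

Because every row uses the same key order, rows for different targets share one structure, and a later step can pick out the units of a given degree by key.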
Preferably, the data records and the N-degree associated information are stored in the form of data tables. Storing data as tables gives a regular storage structure and facilitates querying and further computation.
Further, the data records and the N-degree associated information are stored in a non-relational (NoSQL) database, such as HBase, CouchDB, Cassandra or MongoDB. Compared with traditional relational databases, non-relational databases are simple to operate, free to use, open source, available for download at any time, and cheap to apply; moreover, traditional relational database storage cannot meet the demand posed by rapidly growing volumes of multi-dimensional, weakly structured data such as audio and video data.
Further, the association relationships formed in steps (2) and (3) are stored in the distributed file system underlying the non-relational database (such as HDFS). HDFS, the distributed file system of Hadoop, is highly fault tolerant, is suited to deployment on inexpensive machines, and has low operating and maintenance costs. HDFS is also well suited to storing large data sets; using it to store the data to be processed meets the need for massive, highly fault-tolerant storage and makes it convenient to use the other distributed computing frameworks of Hadoop.
Preferably, the association relationships in steps (2) and (3) are computed with the MapReduce computing framework under Hadoop.
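To show how a second-degree association could be phrased in the map, shuffle and reduce style, the sketch below simulates the three phases in plain Python. It does not use the Hadoop API; the function names and the "in"/"out" tagging scheme are assumptions made for illustration, and a real job would run the same logic across many nodes.

```python
from collections import defaultdict
from itertools import product


def map_phase(pairs):
    """Emit each first-degree pair twice, keyed by the unit it could be joined on."""
    for source, neighbor in pairs:
        yield neighbor, ("in", source)   # `source` arrives at `neighbor`
        yield source, ("out", neighbor)  # `neighbor` can be reached from `source`


def shuffle(mapped):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(lambda: {"in": [], "out": []})
    for key, (side, value) in mapped:
        groups[key][side].append(value)
    return groups


def reduce_phase(groups):
    """For each shared middle unit, connect every incoming unit to every outgoing one."""
    for middle in sorted(groups):
        sides = groups[middle]
        for source, far in product(sides["in"], sides["out"]):
            if source != far:
                yield (source, middle, far)


# Hypothetical first-degree pairs (target, first-degree associated information).
pairs = [("A", "B"), ("B", "C"), ("B", "D")]
second_degree = list(reduce_phase(shuffle(map_phase(pairs))))
```

Each output triple is an association path from a starting unit, through the shared unit, to its second-degree associated information.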
Further, the second-degree association relationships in steps (2) and (3) are computed with the Spark computing framework. Spark, used here as the big data processing framework, is an alternative to MapReduce that is compatible with the HDFS distributed storage layer and integrates into the Hadoop ecosystem. Spark can be used to build an in-memory computing platform for big data and makes full use of in-memory computation to process massive data quickly.
Further, the second-degree association relationship in step (3) is computed with SQL statements in the Spark computing framework, specifically using the join operation in SQL. For example, suppose the data tables contain structured two-column information: (first information, second information) and (second information, third information). A join then readily connects the first information and the third information through the second information, producing a new result of the form (first information, second information, third information).
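The join described above can be tried locally with a self-join over a two-column table. As a stand-in for Spark SQL, this sketch uses Python's built-in `sqlite3` module so that it runs without a cluster; the table name `first_degree` and its column names are hypothetical, and the statement is not claimed to be the one used in the patent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE first_degree (source TEXT, neighbor TEXT)")
conn.executemany(
    "INSERT INTO first_degree VALUES (?, ?)",
    [("A", "B"), ("B", "C"), ("B", "D")],
)

# Join the table to itself on the shared unit: (first, second) with (second, third).
rows = conn.execute(
    """
    SELECT l.source, l.neighbor, r.neighbor
    FROM first_degree AS l
    JOIN first_degree AS r ON l.neighbor = r.source
    WHERE r.neighbor <> l.source
    ORDER BY l.source, l.neighbor, r.neighbor
    """
).fetchall()
conn.close()
```

Each returned row has the form (first information, second information, third information). Repeating the join against the result would extend the paths by one more degree.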
Compared with the prior art, the beneficial effects of the present invention are as follows. The massive data processing method based on a cloud computing platform extracts, from massive basic data, the key information related to the analysis target and uses the identical information units in different data records to mine out related information with implicit connections; the closeness of the association between pieces of information is quantified by the association degree; and by tracking associated information degree by degree, the association paths and the manner of association are outlined clearly and concisely, providing a fast and reliable channel for tracing related issues and for background searches.
Moreover, the invention uses the big data processing framework of the cloud computing platform to process massive association data in parallel. The user only needs to write upper-layer instructions against the interfaces the computing framework provides, without concern for the underlying execution; the splitting of tasks and the calling of resources are handled automatically by the framework, which integrates the results after processing and returns them to the user. The task is completed with a high degree of automation, which greatly saves manpower and improves data processing efficiency.
In short, the method provides a reliable and convenient route to the in-depth mining and application of data, and effective technical support for target background analysis, marketing, market segmentation, risk prediction, and risk prevention and control.
Brief description of the drawings:
Fig. 1 is a schematic flowchart of the massive data processing method based on a cloud computing platform.
Fig. 2 is a schematic diagram of the association algorithm of the massive data processing method based on a cloud computing platform.
Fig. 3 is a schematic diagram of the 3 data records extracted in step (1) of Embodiment 1.
Fig. 4 is a schematic diagram of the data-table storage structure of the first-degree association relationships formed in step (2) of Embodiment 1.
Fig. 5 is a schematic diagram of the data-table storage structure of the second-degree association relationships formed in step (3) of Embodiment 1.
Fig. 6 is a schematic diagram of the data-table storage structure of the third-degree association relationships in Embodiment 1.
Fig. 7 is a schematic diagram of the first-degree association path in Embodiment 1 with target A as the starting point.
Fig. 8 is a schematic diagram of the first-degree association path in Embodiment 1 with target C as the starting point.
Fig. 9 is a schematic diagram of the first-degree association path in Embodiment 1 with target H as the starting point.
Fig. 10 is a schematic diagram of the second-degree association path in Embodiment 1 with target A as the starting point.
Fig. 11 is a schematic diagram of the second-degree association path in Embodiment 1 with target C as the starting point.
Fig. 12 is a schematic diagram of the second-degree association path in Embodiment 1 with target H as the starting point.
Fig. 13 is a schematic diagram of the third-degree association path in Embodiment 1 with target A as the starting point.
Fig. 14 is a schematic diagram of the third-degree association path in Embodiment 1 with target C as the starting point.
Fig. 15 is a schematic diagram of the third-degree association path in Embodiment 1 with target H as the starting point.
It should be understood that the drawings of the present invention are schematic and do not represent specific steps or paths.
Detailed description of the invention
The present invention is described in further detail below with reference to test examples and specific embodiments. This should not be understood as limiting the scope of the above subject matter of the invention to the following embodiments; all techniques realized on the basis of the content of the invention fall within its scope.
The object of the present invention is to overcome the deficiencies of the prior art by providing a massive data processing method based on a cloud computing platform. The raw data to be processed is extracted from a database, and a big data processing framework on the cloud computing platform uses the identical information units found in different data records to work out the association relationships among massive target information. With the method, analysis targets can be set as required within massive Internet information, and it can then be determined whether different targets are associated and what kind of association they have. This provides a reliable and convenient route to the in-depth mining and application of data, and a new and effective approach to target background analysis, marketing, market segmentation, risk prediction, and risk prevention and control.
To achieve the above object, the present invention provides a massive data processing method based on a cloud computing platform: by setting filter conditions such as fields, the key information units in each document of the raw data are extracted; the extracted key information units are arranged in a set order into one data record; and each data record is stored in a database (usually a non-relational database). On this basis, according to the identical information units contained in different data records, a distributed processing model under the cloud computing framework is used to abstract the association relationships between information units.
Specifically, the method comprises the implementation steps shown in Fig. 1:
(1) From each piece of basic data in the raw data, extract the corresponding information according to the set fields to form a corresponding data record;
(2) A first data record contains first information and second information, where the second information is first-degree associated information of the first information; a second data record contains the second information and third information, where the third information is first-degree associated information of the second information; through the distributed processing framework on the cloud computing platform, the third information is made second-degree associated information of the first information, and the association path from the first information through the second information to the third information is extracted;
(3) If a third data record contains fourth information and the third information, where the fourth information is first-degree associated information of the third information, the distributed processing framework on the cloud computing platform extends the fourth information to be third-degree associated information of the first information, and the association path from the first information through the second information and the third information to the fourth information is extracted;
By analogy, the N-degree associated information with the first information as the starting point, and the corresponding association paths, are extracted, where N > 1.
The first information, second information and third information refer to the content of the information and do not indicate any ordering of the information. With the method, the target information can be taken as the starting point (the starting point is chosen according to the needs of the analysis), and the other information associated with the target, together with the corresponding association paths, is found degree by degree; the association paths clearly show the specific route by which the analysis target is associated with each piece of associated information. The association computation is carried out by the big data processing framework of the cloud computing platform, which can process massive numbers of targets in parallel; that is, from the basic data through to the N-degree associated information, multiple targets are processed simultaneously. As the association degree N increases, the complexity of the computation and the dimensionality of the data keep growing, and this complex processing is carried out quickly by the big data processing framework of the cloud computing platform (for example MapReduce under Hadoop, or Spark). With frameworks such as MapReduce and Spark, the user only needs to write upper-layer instructions against the interfaces the framework provides, without concern for the underlying execution: the framework automatically calls the relevant internal resources according to those instructions, splits the task automatically, and distributes it to different internal nodes for processing, achieving efficient parallel computation, and after processing it automatically integrates the results and returns them to the user. The task is completed with a high degree of automation, which greatly saves manpower and improves data processing efficiency. By using the big data processing framework of the cloud computing platform, the invention provides a fast and reliable way to analyze the association background of massive numbers of targets.
The raw data in the present invention is stored in a database. The raw data may be data crawled from the Internet as required: the Internet contains extremely rich information sources, and crawling relevant information from it as required and processing the acquired information in depth provides a new approach to the refined processing and effective application of information.
Further, the computation of the N-degree association relationship is always based on the first-degree association relationship; that is, during the tracking (computation) of associated information described above, the N-th degree associated information is the first-degree associated information of the (N-1)-th degree associated information. Tracking associated information degree by degree in this way keeps the computation logic clear and the operation simple, and helps ensure the accuracy of the results.
Further, the data extracted in step (1) may first be preprocessed by cleaning. The data extracted from the basic data according to fields is generally of JSON type; the correlation within such data is weak, and its structure may be irregular or the data may not be clean (containing irrelevant, useless or erroneous data), which is what is meant by weakly structured data. To abstract such weakly structured data into first-degree associated information, the data must first be put in order, and this is the data preprocessing step. Preprocessing may use methods including field cleaning, field derivation, null-value handling, sampling and screening of records, record aggregation, record appending, record merging and record sorting, and it addresses problems such as missing values, redundancy and inconsistency in the data. Put simply, data cleaning is an ETL (extract-transform-load) process applied to the basic data according to the needs of the analysis.
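A minimal sketch of such preprocessing on JSON input is given below. It covers only a few of the listed methods (field cleaning, null-value handling, removal of redundant records and record sorting); the required field names and the function `clean_records` are hypothetical, and a production ETL job would be considerably more involved.

```python
import json

# Hypothetical fields that every usable record must carry.
REQUIRED_FIELDS = ("company", "legal_representative")


def clean_records(raw_lines, fields=REQUIRED_FIELDS):
    """Parse JSON lines, drop malformed or incomplete ones, and de-duplicate."""
    cleaned, seen = [], set()
    for line in raw_lines:
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            continue  # erroneous data: not valid JSON
        if not isinstance(doc, dict):
            continue  # irregular structure: not a field/value document
        # Field cleaning: treat null as empty and trim surrounding whitespace.
        record = tuple(str(doc.get(f) or "").strip() for f in fields)
        if not all(record):
            continue  # null-value handling: skip records with a missing field
        if record in seen:
            continue  # redundancy: skip exact duplicates
        seen.add(record)
        cleaned.append(record)
    return sorted(cleaned)  # record sorting


raw = [
    '{"company": " A ", "legal_representative": "B"}',
    "not json",
    '{"company": "A", "legal_representative": "B"}',
    '{"company": "C"}',
    '{"company": "C", "legal_representative": null}',
]
cleaned = clean_records(raw)
```

Of the five input lines, only one distinct, complete record survives, which is then ready to be turned into first-degree associated information.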
Further, in step (1), the information units in the data record (an information unit being the content corresponding to each field) are separated by a delimiter, such as a space, a comma or an enumeration comma. Separating the information units with a delimiter prevents the contents of different information units from running together and provides the basis for the subsequent extraction and calculation of associated information.
Further, the fields and contents of the data information extracted in step (1) are treated as key-value pairs, in which the field is the "key" and the content corresponding to the field is the "value". According to the needs of the analysis, the content corresponding to any one field is chosen as the starting point for tracing associated information (the information being associated), and the contents corresponding to the other fields in each data record are taken as the first-degree associated information of that starting point, which completes the calculation of the first-degree association relationship. The calculation of first-degree associated information is the basis for the subsequent calculation of N-degree associated information.
Further, the first-degree associated information formed in step (2) is stored in a set structure and order. Storing first-degree associated information in a set structure and order gives the first-degree associated information formed for different targets a uniform storage format, which simplifies data processing in the subsequent steps.
Further, the first-degree associated information formed in step (2) may be stored in the structural order: target (origin information), first-degree associated information, relation tag. The relation tag is a description of the first-degree association relationship between the associated information and the target information, and provides a concise and intuitive description for queries on the associated data.
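The key-value treatment of a data record and the (origin, first-degree associated information, relation tag) storage order described above might be sketched as follows. This is a minimal illustration under stated assumptions: the function name `first_degree`, the example field names, and the choice of using the field name itself as the relation tag are all hypothetical and are not taken from the description.

```python
def first_degree(record, origin_field):
    """Turn one key-value data record into (origin, associated, tag) triples.

    `record` maps field names (keys) to contents (values); the content of
    `origin_field` is taken as the starting point of the association trace.
    """
    origin = record[origin_field]
    triples = []
    for field, value in record.items():
        if field == origin_field or not value:
            continue  # skip the origin itself and empty contents
        # Assumption: the field name doubles as the relation tag.
        triples.append((origin, value, field))
    return triples


# Hypothetical record with three fields; "person" is chosen as the origin.
example = {"person": "A", "company": "B", "city": "D"}
triples = first_degree(example, "person")
```

Because every record yields triples in the same column order, the results for different origins can be appended to a single table without further conversion.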
Further, the second-degree associated information formed in step (2) is stored in the order: first-degree associated data, second-degree associated data; and the information units belonging to different degrees of association are marked with corresponding labels. The internal storage structure of the first-degree and second-degree association relationships is the same as in the previous step. With these labels, information belonging to different degrees of association can be distinguished easily, which facilitates extracting and distinguishing data during the step-by-step calculation of associated information.
Further, the N-degree associated information is stored in order of increasing degree of association, and the information units belonging to different degrees of association are marked with corresponding labels.
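One possible way to store a path in order of increasing degree with a label on each information unit is sketched below. The label format `degree_i` (with `degree_0` denoting the origin) and the helper names `label_path` and `to_line` are assumptions chosen for illustration only.

```python
def label_path(path):
    """Map [origin, first-degree unit, second-degree unit, ...] to a labelled row."""
    return {f"degree_{i}": unit for i, unit in enumerate(path)}


def to_line(path, sep=","):
    """Serialise the labelled row, in order of increasing degree, with a delimiter."""
    row = label_path(path)
    return sep.join(f"{label}={unit}" for label, unit in row.items())
```

For example, `to_line(["A", "B", "C"])` gives `degree_0=A,degree_1=B,degree_2=C`, so a later step can pick out the units of any one degree by label.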
Preferably, the data records and the N-degree associated information are stored in the form of data tables. Storing the data as data tables gives a regular storage structure that is convenient for querying and further calculation.
Further, the data records and the N-degree associated information are stored in a non-relational database, such as HBase, CouchDB, Cassandra or MongoDB. Compared with traditional relational databases, non-relational databases are simple to operate, free to use, open source, available for download at any time and cheap to deploy; moreover, when faced with rapidly growing volumes of multidimensional unstructured data, such as audio data and video data, the storage of traditional relational databases cannot meet demand.
Further, the association relationships formed in steps (2) and (3) are stored in the distributed file system of the non-relational database (such as HDFS). HDFS, the distributed file system of Hadoop, is highly fault tolerant and suited to deployment on inexpensive machines, so running and maintenance costs are relatively low. HDFS is also well suited to large data sets; storing the data to be processed in HDFS meets the needs of mass data storage and high fault tolerance, and makes it convenient to use other Hadoop processing modes.
Preferably, the association relationships in steps (2) and (3) are computed using the MapReduce computing framework under Hadoop.
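The map and reduce roles in such a computation can be illustrated with a small pure-Python simulation. This is not Hadoop code and does not use any Hadoop API; the phase functions, the in-memory shuffle, and the choice of keying each first-degree pair by its shared information unit are assumptions made to show the idea. The input pairs are the first-degree relationships of Embodiment 1.

```python
from collections import defaultdict
from itertools import permutations

# First-degree pairs (origin, associated unit) from Embodiment 1.
PAIRS = [("A", "B"), ("A", "D"), ("A", "E"),
         ("C", "B"), ("C", "F"), ("C", "G"),
         ("H", "F"), ("H", "I")]


def map_phase(pairs):
    """Emit (shared unit, origin) so that origins sharing a unit meet in one group."""
    for origin, associated in pairs:
        yield associated, origin


def shuffle(mapped):
    """Group mapped values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each shared unit, link every ordered pair of distinct origins."""
    paths = []
    for shared, origins in sorted(groups.items()):
        for start, end in permutations(sorted(set(origins)), 2):
            paths.append((start, shared, end))
    return paths


def second_degree_paths(pairs):
    return reduce_phase(shuffle(map_phase(pairs)))
```

On the embodiment data this yields the paths A-B-C, C-B-A, C-F-H and H-F-C. In a real cluster each reduce group would be processed on a separate worker, which is what allows the calculation to run in parallel over many targets.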
Further, the second-degree association relationships in steps (2) and (3) are computed using the Spark computing framework. As an alternative to MapReduce, Spark is compatible with the HDFS distributed storage layer and can be integrated into the Hadoop ecosystem. Spark can serve as a platform for in-memory big-data computation and makes full use of in-memory computation to achieve real-time processing of massive data.
Further, the second-degree association relationship in step (3) is implemented with SQL statements in the Spark computing framework, specifically with the join operation in SQL. For example, suppose a data table contains two structured columns of information, with one row holding the first information and the second information and another row holding the second information and the third information. The join operation can then conveniently connect the first information and the third information through the second information, forming a new data result of first information, second information, third information.
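The join described above can be tried out on a single machine with Python's built-in `sqlite3` module, used here purely as a stand-in for Spark SQL. The table name `assoc`, the column names `origin` and `associated`, and the function name `second_degree_by_join` are hypothetical; a comparable self-join could presumably be issued through Spark SQL, but that is an assumption and is not shown here.

```python
import sqlite3


def second_degree_by_join(pairs):
    """Self-join a two-column association table on the shared middle unit."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE assoc (origin TEXT, associated TEXT)")
    conn.executemany("INSERT INTO assoc VALUES (?, ?)", pairs)
    rows = conn.execute(
        # Row a ends where row b begins, so a.origin reaches b.associated.
        "SELECT a.origin, a.associated, b.associated "
        "FROM assoc AS a JOIN assoc AS b ON a.associated = b.origin"
    ).fetchall()
    conn.close()
    return rows


rows = second_degree_by_join([
    ("first information", "second information"),
    ("second information", "third information"),
])
```

The single result row holds the first, second and third information in that order, which is the new data result the paragraph describes.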
Embodiment 1
Three data records are used below as a small example to illustrate the analysis of association relationships. Assume the raw data has been extracted by field (the fields set being a first field, a second field, a third field and a fourth field), and the extracted data comprise the three data records shown in Fig. 3. In the first data record, the information contents corresponding to the first, second, third and fourth fields are, in order: A, B, D and E. In the second data record they are, in order: C, B, F and G. In the third data record they are, in order: H, F, I. Taking the content corresponding to the first field as the starting point of the association analysis, the first data record forms the first-degree association relationships A-B, A-D and A-E, where B, D and E are first-degree associated information of A, and A is likewise first-degree associated information of B, D and E. The second data record forms the first-degree association relationships C-B, C-F and C-G, where B, F and G are first-degree associated information of C, and C is likewise first-degree associated information of B, F and G. The third data record forms the first-degree association relationships H-F and H-I, where F and I are first-degree associated information of H, and H is likewise first-degree associated information of F and I. Storing the first-degree association relationships in a tabular storage format gives the two structured columns shown in Fig. 4.
On the basis of the first-degree associations above, the information unit B shared by the first-degree association relationships A-B and C-B makes C second-degree associated information of A; with A as the starting point, the association path A-B-C is formed. Likewise, the information unit B shared by C-B and A-B makes A second-degree associated information of C; with C as the starting point, the association path C-B-A is formed. The information unit F shared by C-F and H-F makes H second-degree associated information of C; with C as the starting point, the association path C-F-H is formed. The information unit F shared by H-F and C-F makes C second-degree associated information of H; with H as the starting point, the association path H-F-C is formed. The associated data forming the second-degree association relationships may be stored as a data table with the storage structure shown in Fig. 5.
Further, on the basis of the above second-degree and first-degree association relationships, with the first information A as the starting point, the first-degree associated information of A's second-degree associated information C yields the association paths A-B-C-F and A-B-C-G, where F and G are third-degree associated information of A. With C as the starting point, the first-degree associated information of C's second-degree associated information A and H yields the association paths C-B-A-E, C-B-A-D and C-F-H-I, where D, E and I are third-degree associated information of C. Similarly, with H as the starting point, the first-degree associated information of its second-degree associated information C yields the association paths H-F-C-B and H-F-C-G, where B and G are third-degree associated information of H. The data table storing the third-degree associated information is shown in Fig. 6.
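The step-by-step expansion in Embodiment 1 can be reproduced with a short single-machine sketch. The helper names `build_adjacency`, `extend` and `paths_of_degree` are hypothetical, and the in-memory adjacency list stands in for whatever distributed storage an actual deployment would use. Each step extends every (N-1)-degree path by one first-degree association and skips any unit already on the path, so that closed paths never arise.

```python
from collections import defaultdict

# The three data records of Embodiment 1; the first field is the origin.
RECORDS = [["A", "B", "D", "E"], ["C", "B", "F", "G"], ["H", "F", "I"]]


def build_adjacency(records):
    """First-degree associations are mutual, so store them in both directions."""
    adjacency = defaultdict(list)
    for origin, *others in records:
        for unit in others:
            adjacency[origin].append(unit)
            adjacency[unit].append(origin)
    return adjacency


def extend(paths, adjacency):
    """Extend each path by one first-degree step, skipping units already on it."""
    return [path + [nxt]
            for path in paths
            for nxt in adjacency[path[-1]]
            if nxt not in path]


def paths_of_degree(adjacency, start, degree):
    """Return all association paths of the given degree starting at `start`."""
    paths = [[start]]
    for _ in range(degree):
        paths = extend(paths, adjacency)
    return ["-".join(path) for path in paths]


ADJ = build_adjacency(RECORDS)
second_from_c = paths_of_degree(ADJ, "C", 2)
third_from_c = paths_of_degree(ADJ, "C", 3)
```

With C as the starting point this gives the second-degree paths C-B-A and C-F-H and the third-degree paths C-B-A-D, C-B-A-E and C-F-H-I, in agreement with the paths listed in the embodiment.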
It should be noted that closed paths must be removed during the calculation of associated information, in order to avoid erroneous loops in the calculation.
In this embodiment, the association paths obtained by merging the first-degree association relationship data with A, C and H as starting points are shown in Fig. 7, Fig. 8 and Fig. 9 respectively; the second-degree association paths are shown in Fig. 10, Fig. 11 and Fig. 12; and the third-degree association paths are shown in Fig. 13, Fig. 14 and Fig. 15.
This embodiment only schematically illustrates the calculation of associated information; in practice the number of targets to be analysed can reach the order of ten thousand, one hundred thousand or one million. As the above embodiment shows, the volume of data to be calculated grows sharply as the degree of association increases, and the computation of multidimensional associated information for a massive number of targets is larger still. The present invention uses the big-data processing framework of a cloud computing platform so that a massive number of targets can be calculated in parallel by the above method, thereby achieving association-relationship analysis and mining of massive target information.
Although illustrative embodiments of the present invention have been described above so that those skilled in the art can understand the invention, it should be clear that the invention is not limited to the scope of those embodiments. To those of ordinary skill in the art, as long as various changes fall within the spirit and scope of the invention as defined and determined by the appended claims, those changes are obvious, and all innovations making use of the inventive concept fall within the scope of protection.

Claims (16)

1. A mass data processing method based on a cloud computing platform, characterized by comprising the following process:
(1) from each item of basic data in the raw data, extracting the corresponding information according to the set fields to form a corresponding data record;
(2) a first data record comprises first information and second information, wherein the second information is first-degree associated information of the first information; a second data record comprises the second information and third information, wherein the third information is first-degree associated information of the second information; by means of the distributed processing framework under the cloud computing platform, the third information is made second-degree associated information of the first information, and the association path from the first information through the second information to the third information is abstracted;
(3) if a third data record comprises fourth information and the third information, wherein the fourth information is first-degree associated information of the third information, the fourth information is extended, by means of the distributed processing framework under the cloud computing platform, into third-degree associated information of the first information, and the association path from the first information through the second information and the third information to the fourth information is abstracted;
and so on, abstracting the N-degree associated information with the first information as the starting point and the corresponding association paths, wherein N > 1.
2. The mass data processing method based on a cloud computing platform as claimed in claim 1, characterized in that, in the calculation of the associated information, the N-th degree associated information on the calculated path is the first-degree associated information of the (N-1)-th degree associated information.
3. The mass data processing method based on a cloud computing platform as claimed in claim 1, characterized in that the raw data is crawled from relevant web pages on the Internet as required.
4. The mass data processing method based on a cloud computing platform as claimed in claim 3, characterized in that the data records undergo data preprocessing by cleaning.
5. The mass data processing method based on a cloud computing platform as claimed in claim 4, characterized in that the data cleaning is performed by field cleaning, field derivation, null-value handling, data record sampling and screening, record aggregation, record appending, record merging and/or record sorting.
6. The mass data processing method based on a cloud computing platform as claimed in claim 1, characterized in that the information units in each data record of step (1) are stored in a unified structural order.
7. The mass data processing method based on a cloud computing platform as claimed in claim 6, characterized in that, in step (1), the data records are stored in the form of data tables.
8. The mass data processing method based on a cloud computing platform as claimed in claim 6, characterized in that, in step (1), the information units in the data record are separated by a delimiter.
9. The mass data processing method based on a cloud computing platform as claimed in claim 6, characterized in that the first-degree associated information formed in step (2) is stored in the structural order: origin information, first-degree associated information, association description.
10. The mass data processing method based on a cloud computing platform as claimed in claim 6, characterized in that the N-degree associated information is stored in order of increasing degree of association.
11. The mass data processing method based on a cloud computing platform as claimed in claim 10, characterized in that the information units belonging to different degrees of association are marked with corresponding labels.
12. The mass data processing method based on a cloud computing platform as claimed in claim 1, characterized in that the N-degree associated data are stored in separate databases.
13. The mass data processing method based on a cloud computing platform as claimed in claim 12, characterized in that the N-degree associated data are stored in the distributed file systems of the different databases.
14. The mass data processing method based on a cloud computing platform as claimed in any one of claims 1 to 13, characterized in that the N-degree association relationship in step (2) is calculated by the MapReduce computing framework under Hadoop.
15. The mass data processing method based on a cloud computing platform as claimed in any one of claims 1 to 13, characterized in that the N-degree association relationship is implemented by the Spark computing framework.
16. The mass data processing method based on a cloud computing platform as claimed in claim 15, characterized in that the association relationships of degree N ≥ 2 in step (3) are implemented by join statements in the Spark computing framework.
CN201610255566.8A 2016-04-21 2016-04-21 Cloud computing platform based massive data processing method Pending CN105930462A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610255566.8A CN105930462A (en) 2016-04-21 2016-04-21 Cloud computing platform based massive data processing method

Publications (1)

Publication Number Publication Date
CN105930462A true CN105930462A (en) 2016-09-07

Family

ID=56839725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610255566.8A Pending CN105930462A (en) 2016-04-21 2016-04-21 Cloud computing platform based massive data processing method

Country Status (1)

Country Link
CN (1) CN105930462A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109669996A (en) * 2018-12-29 2019-04-23 恒睿(重庆)人工智能技术研究院有限公司 Information dynamic updating method and device
CN113094415A (en) * 2019-12-23 2021-07-09 北京懿医云科技有限公司 Data extraction method and device, computer readable medium and electronic equipment
CN113094415B (en) * 2019-12-23 2024-03-29 北京懿医云科技有限公司 Data extraction method, data extraction device, computer readable medium and electronic equipment

Similar Documents

Publication Publication Date Title
Osman A novel big data analytics framework for smart cities
US9251277B2 (en) Mining trajectory for spatial temporal analytics
Yan et al. A hybrid model and computing platform for spatio-semantic trajectories
CN105930466A (en) Massive data processing method
CN107590250A (en) A kind of space-time orbit generation method and device
CN108737492A (en) A method of the navigation based on big data system and location-based service
CN105956016A (en) Associated information visualization processing system
CN105930465A (en) Data mining processing method
Roth et al. Event data warehousing for complex event processing
CN113779169B (en) Space-time data stream model self-enhancement method
CN105956018A (en) Massive associated data analysis and visualization implementation method based on cloud computing platform
CN102609501B (en) Data cleaning method based on real-time historical database
CN109977125A (en) A kind of big data safety analysis plateform system based on network security
CN109062769A (en) The method, apparatus and equipment of IT system performance risk trend prediction
CN104219088A (en) Hive-based network alarm information OLAP method
Singh et al. Analysis on data mining models for Internet Of Things
Arora et al. Big data: A review of analytics methods & techniques
CN103150470A (en) Visualization method for concept drift of data stream in dynamic data environment
CN105930462A (en) Cloud computing platform based massive data processing method
Jiang et al. Spatial and spatiotemporal big data science
CN109828995A (en) A kind of diagram data detection method, the system of view-based access control model feature
CN105956019A (en) Big data analysis processing method
CN105930463A (en) Cloud computing platform based big data processing method
CN107562909A (en) A kind of big data analysis system and its analysis method for merging search and calculating
CN110019453A (en) A kind of method and system that tax data is handled based on distributed system infrastructure platform

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160907