CN105930462A - Cloud computing platform based massive data processing method - Google Patents
Cloud computing platform based massive data processing method
- Publication number
- CN105930462A; CN201610255566.8A
- Authority
- CN
- China
- Prior art keywords
- information
- data
- cloud computing
- computing platform
- degree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the technical field of Internet information processing, and in particular to a massive data processing method based on a cloud computing platform. The method comprises: setting filter conditions, such as fields, to extract the key information elements from each file in the original data and form a corresponding data record; storing each data record in a database; and, on this basis, using a big data processing framework on the cloud computing platform to abstract the association relations between information elements from the identical information elements contained in different data records. The disclosed method can, as required, analyze the associated information and corresponding association paths hidden behind massive target information in massive Internet information and provide the analysis results to a user through a query port. This greatly reduces the time and labor the user spends compiling and analyzing related information, and provides effective technical support for target background analysis, marketing, market segmentation, risk prediction, risk prevention and control, and the like.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a massive data processing method based on a cloud computing platform.
Background art
With the development of society and the progress of science and technology, contact between individuals and groups has become ever closer, and this close contact drives the rapid spread and growth of information. The world has entered the information age, and with the explosive growth and accumulation of information, the era of big data has arrived. The basic characteristics of big data can be described by four "V"s: large data volume (Volume), wide variety (Variety), low value density (Value), and high speed and timeliness (Velocity). The most important of these, large volume combined with low value density, are exactly what make mining and using such massive data difficult: accurately finding the information people care about within massive data is like searching for a needle in a haystack. At the same time, the value of data lies at a higher and deeper level, in how the correlation between certain kinds of information is analyzed and how the intrinsic value behind the information is derived from it. In big data analysis, correlation matters more than causation, yet analyzing the association relations within such massive data quickly and accurately is very difficult.
In fact, in the vast and complex ocean of information, the links between some pieces of information are often much closer than their links with other information, and such closely linked information often reflects particular real-life relations between people or between groups. These relations can cause the parties to influence or constrain one another in the relevant social or economic activities. From the perspective of information propagation over networks, grasping certain key information link nodes has great positive significance for social management and business activity, because information (or risk) spreads faster and reaches further through these important nodes than through other information points. Such analysis can be applied in fields such as public opinion monitoring, disease transmission control, and advertising.
From another perspective, analyzing the association relations between a specific information target and other targets has practical significance in many fields. Targets that have association relations tend to have a wider influence when carrying out various activities than isolated individuals, and when such targets carry out activities externally, the mutual constraint or support arising from their internal association relations makes their behavior more complex than that of a single target. In real life, the association relations between information targets are extremely complex and generally hidden: they cannot be perceived from surface activity or surface information, and it is even harder to find out whether a target is associated with other targets, or how. Under these circumstances, such implicit association relations can bring great potential value or risk to people's social and economic activities. Analyzing these implicit association relations becomes even harder in the face of massive data, and doing the work manually, item by item, would consume enormous labor and time. A processing method is urgently needed that helps analysts carry out this large and tedious computation and provides the analysis results.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a massive data processing method based on a cloud computing platform. The method extracts the original data to be processed from a database and, through the big data processing framework of the cloud computing platform, uses the identical information units in different data records to derive the association relations between massive numbers of target information items. The method can set analysis targets as required within massive Internet information, and then determine whether different targets are associated and what kind of association they have. It provides a reliable and convenient approach to the in-depth mining and application of data, and a new and effective approach to target background analysis, marketing, market segmentation, risk prediction, risk prevention and control, and the like.
To achieve the above object, the present invention provides a massive data processing method based on a cloud computing platform: by setting filter conditions such as fields, the key information units in each document of the original data are extracted; the extracted key information units are arranged into one data record in a set order; and each data record is stored in a database (usually a non-relational database). On this basis, the distributed processing model under the cloud computing framework is applied to abstract the association relations between information units from the identical information units contained in different data records.
Specifically, the method comprises the following steps:
(1) From each item of basic data in the original data, extract the corresponding information according to the set fields to form a corresponding data record;
(2) A first data record contains first information and second information, where the second information is first-degree associated information of the first information; a second data record contains the second information and third information, where the third information is first-degree associated information of the second information. Through the distributed processing framework on the cloud computing platform, the third information is made second-degree associated information of the first information, and the association path from the first information through the second information to the third information is abstracted;
(3) If a third data record contains fourth information and the third information, where the fourth information is first-degree associated information of the third information, then through the distributed processing framework on the cloud computing platform the fourth information is extended to third-degree associated information of the first information, and the association path from the first information through the second information and the third information to the fourth information is abstracted;
By analogy, the N-degree associated information with the first information as the starting point, and the corresponding association paths, are abstracted, where N ≥ 1.
Here, the first information, second information, third information, and fourth information refer to the content of the information and do not indicate any order. With this method, starting from target information (the starting point is chosen according to the needs of the analysis), other information associated with the target and the corresponding association paths can be found step by step, and the association paths clearly show the specific routes by which the analysis target is linked to the associated information. The association relations are computed with the big data processing framework of the cloud computing platform, which can process massive numbers of targets in parallel; that is, from the basic data through to the N-degree associated information, the computation runs for multiple targets at the same time. As the association degree N increases step by step, the complexity of the computation and the dimensionality of the data keep increasing, and such a complex data processing procedure can be carried out quickly by the big data processing framework of the cloud computing platform (for example MapReduce under Hadoop, or Spark). With frameworks such as MapReduce and Spark, the user only needs to write upper-layer instructions against the interfaces that the framework provides and need not be concerned with the underlying operation: the framework automatically calls the relevant internal resources according to those instructions, splits the task automatically, assigns the pieces to multiple internal nodes for processing, and so computes over the data efficiently in parallel. When processing is complete, the results are automatically integrated and provided to the user. Task execution is highly automated, which greatly saves labor and improves data processing efficiency. By using the big data processing framework of the cloud computing platform, the present invention provides a fast and reliable way to analyze the association background of massive numbers of targets.
The original data in the present invention is stored in a database. The original data may be data crawled from the Internet as required; the Internet contains extremely rich information sources. Crawling relevant information from the Internet as required and processing the acquired information in depth provides a new approach to the refined processing and effective application of information.
Further, the computation of the N-degree association relations is always based on first-degree association relations: in the tracing (computation) of associated information described above, the N-th degree associated information is the first-degree associated information of the (N-1)-th degree associated information. Tracing associated information step by step in this way keeps the computation logic clear and the operation simple, which ensures the accuracy of the results.
Further, the data extracted in step (1) may first be preprocessed by data cleaning.
Further, in step (1), the information units in a data record (an information unit is the content corresponding to one field) are separated by a delimiter, such as a space, a comma, or a pause mark (enumeration comma). Separating the information units with a delimiter prevents the contents of different information units from running together, and provides the basis for the subsequent extraction and computation of associated information.
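As a minimal sketch of the delimiter-separated record described above, the following Python turns an extracted document into one record line and back. The field list `FIELDS`, the sample values, and the helper names `to_record` and `from_record` are hypothetical and chosen only for illustration; the source does not specify them.

```python
FIELDS = ["name", "company", "phone"]  # hypothetical field order set by the analyst


def to_record(document, fields=FIELDS, sep=","):
    """Arrange the contents of the chosen fields, in the set order, into one line."""
    values = []
    for field in fields:
        value = str(document.get(field, "")).strip()
        if sep in value:
            # A value containing the delimiter would make units run together.
            raise ValueError(f"field {field!r} contains the delimiter {sep!r}")
        values.append(value)
    return sep.join(values)


def from_record(line, fields=FIELDS, sep=","):
    """Split a record line back into its information units, keyed by field."""
    return dict(zip(fields, line.split(sep)))


document = {"name": "Alice", "phone": "555-0100", "company": "Acme", "note": "ignored"}
line = to_record(document)
parsed = from_record(line)
print(line)
print(parsed)
```

Rejecting values that contain the delimiter is one possible design choice; escaping or choosing a delimiter that cannot occur in the data would serve the same purpose.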
Further, the fields and contents of the data extracted in step (1) are treated as key-value pairs, where the field is the "key" and the content corresponding to the field is the "value". According to the needs of the analysis, the content corresponding to any one of the fields is chosen as the starting point for tracing associated information (the starting information), and the contents corresponding to the other fields in each data record are taken as first-degree associated information of that starting point, which completes the computation of the first-degree association relations. The computation of first-degree associated information is the basis for the subsequent computation of N-degree associated information.
Further, the first-degree associated information formed in step (2) is stored in a set structure and order. Storing it this way gives the first-degree associated information formed for different targets a uniform storage format, which simplifies data processing in subsequent steps.
Further, the first-degree associated information formed in step (2) may be stored in the order: target (starting information), first-degree associated information, relation label. The relation label describes the association relation between the first-degree associated information and the target information, and gives a concise and intuitive description when associated data is queried.
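The key-value view and the storage order described above might be sketched like this. The helper `first_degree_rows`, the sample record, and the choice to reuse the field name as the relation label are assumptions made for the example; the source does not say how relation labels are derived.

```python
def first_degree_rows(record, start_field):
    """Build (target, first-degree associated information, relation label) rows.

    `record` maps each field (key) to its content (value). The content of
    `start_field` is the starting point; every other non-empty content is
    treated as its first-degree associated information.
    """
    target = record[start_field]
    rows = []
    for field, value in record.items():
        if field == start_field or value in (None, ""):
            continue
        # Assumption: the field name serves as the relation label.
        rows.append((target, value, field))
    return rows


record = {"name": "Alice", "company": "Acme", "phone": "555-0100"}
rows = first_degree_rows(record, "name")
for row in rows:
    print(row)
```

Choosing a different `start_field` for the same record yields the first-degree associations seen from another starting point, which matches the statement that the starting point is selected according to the needs of the analysis.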
Further, the second-degree associated information formed in step (3) is stored in the order: first-degree associated data, second-degree associated data. Information units belonging to different association degrees are marked with corresponding tags, and within the first-degree and second-degree association relations the data storage structure is the same as in the previous step. The tags make it easy to distinguish information belonging to different association degrees, which facilitates extracting and distinguishing data as the associated information is computed step by step.
Further, the N-degree associated information is stored in order of increasing association degree, and information units belonging to different association degrees are marked with corresponding tags.
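One way to picture storage in order of increasing association degree, with a tag on each information unit, is the sketch below. The row layout (a list of `(degree, item)` pairs with the target at degree 0) and the helper names are assumptions for illustration only, not a layout given in the source.

```python
def tag_by_degree(path):
    """Tag each unit on an association path with its degree (target = degree 0)."""
    return list(enumerate(path))


def items_at_degree(tagged_rows, degree):
    """Use the tags to pull out every unit that sits at the given degree."""
    return sorted({item for row in tagged_rows for d, item in row if d == degree})


paths = [["A", "B", "C"], ["A", "B", "D"], ["A", "E", "F"]]
tagged_rows = [tag_by_degree(path) for path in paths]
print(tagged_rows[0])
print(items_at_degree(tagged_rows, 1))
print(items_at_degree(tagged_rows, 2))
```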
Preferably, the data records and N-degree associated information are stored in the form of data tables. Storing data as tables gives a regular storage structure that is convenient for querying and further computation.
Further, the data records and N-degree associated information are stored in a non-relational (NoSQL) database, such as HBase, CouchDB, Cassandra, or MongoDB. Compared with traditional relational databases, non-relational databases are simple to operate, entirely free, open source, available for download at any time, and cheap to deploy. Moreover, traditional relational databases cannot meet the storage demands of multidimensional, weakly structured data whose volume is growing sharply, such as audio and video data.
Further, the association relations formed in steps (2) and (3) are stored in the distributed file system underlying the non-relational database (for example HDFS). As the distributed file system of Hadoop, HDFS is highly fault tolerant and suitable for deployment on inexpensive machines, so operating and maintenance costs are low. HDFS is also well suited to storing large data sets. Storing the data to be processed in HDFS meets the needs of massive data storage and high fault tolerance, and makes it convenient to use the other distributed computing frameworks of Hadoop.
Preferably, the association relations in steps (2) and (3) are computed with the MapReduce computing framework under Hadoop.
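The source does not give the MapReduce job itself. As a hedged illustration of how a second-degree association could be expressed in map and reduce terms, the sketch below simulates a reduce-side self-join in plain Python. It does not use the Hadoop API; the function names, the "L"/"R" tags, and the directed `(source, target)` pair layout are assumptions for the example.

```python
from collections import defaultdict


def map_phase(first_degree_pairs):
    """Emit each pair twice, keyed by the item that could act as the middle node."""
    for source, target in first_degree_pairs:
        yield target, ("L", source)  # a link that ends at the key
        yield source, ("R", target)  # a link that starts at the key


def reduce_phase(key, values):
    """Join links ending at `key` with links starting at `key`."""
    lefts = [value for tag, value in values if tag == "L"]
    rights = [value for tag, value in values if tag == "R"]
    for left in lefts:
        for right in rights:
            if left != right:
                yield (left, key, right)  # path: left -> key -> right


def run_job(first_degree_pairs):
    groups = defaultdict(list)  # stands in for the framework's shuffle step
    for key, value in map_phase(first_degree_pairs):
        groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_phase(key, groups[key]))
    return output


pairs = [("A", "B"), ("B", "C"), ("B", "D")]
two_degree = run_job(pairs)
print(two_degree)
```

In a real cluster the grouping by key would be performed by the framework across nodes; here a dictionary plays that role so the logic can be read and run locally.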
Further, the second-degree association relations in steps (2) and (3) are computed with the Spark computing framework. Spark, as an alternative to MapReduce, is compatible with the HDFS distributed storage layer and can be integrated into the Hadoop ecosystem. Spark is a platform for in-memory big data computation and makes full use of memory to process massive data quickly.
Further, the second-degree association relations in step (3) are computed with SQL statements in the Spark computing framework, specifically with the SQL join operation. For example, suppose a data table holds structured two-column rows: (first information, second information) and (second information, third information). A join can then conveniently connect the first information and the third information through the second information, forming a new result row of first information, second information, third information.
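The join described above can be tried out with any SQL engine. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, purely so that it runs without a cluster; the table name `first_degree` and the column names `source` and `target` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE first_degree (source TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO first_degree VALUES (?, ?)",
    [("first", "second"), ("second", "third")],
)

# Self-join: the target of one row must equal the source of another row,
# so the shared second information links the first and third information.
rows = conn.execute(
    """
    SELECT a.source, a.target, b.target
    FROM first_degree AS a
    JOIN first_degree AS b ON a.target = b.source
    WHERE a.source <> b.target
    ORDER BY a.source, a.target, b.target
    """
).fetchall()
conn.close()
print(rows)
```

Under Spark, a comparable query could presumably be issued through `SparkSession.sql` against a table registered as a temporary view; the exact statement used by the inventors is not given in the source.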
Compared with the prior art, the beneficial effects of the present invention are as follows. The massive data processing method based on a cloud computing platform extracts the key information relevant to the analysis targets from massive basic data and uses the identical information units in different data records to mine out related information that has implicit links. The closeness of the association between pieces of information is quantified by the association degree, and by tracing associated information step by step, the association paths and modes of association of the related information are outlined clearly and concisely, providing a fast and reliable channel for tracing related issues and for background searches.
Moreover, the present invention uses the big data processing framework of the cloud computing platform to process massive associated data in parallel. The user only needs to write upper-layer instructions against the interfaces that the computing framework provides and need not be concerned with the underlying operation; task splitting and resource scheduling are handled automatically by the big data processing framework, and when processing is complete the results are automatically integrated and provided to the user. Task execution is highly automated, which greatly saves labor and improves data processing efficiency.
In short, the method provides a reliable and convenient approach to the in-depth mining and application of data, and provides effective technical support for target background analysis, marketing, market segmentation, risk prediction, risk prevention and control, and the like.
Brief description of the drawings:
Fig. 1 is a flow chart of the massive data processing method based on a cloud computing platform.
Fig. 2 is a schematic diagram of the association algorithm of the massive data processing method based on a cloud computing platform.
Fig. 3 is a schematic diagram of the three data records extracted in step (1) of Embodiment 1.
Fig. 4 is a schematic diagram of the data table storage structure of the first-degree association relations formed in step (2) of Embodiment 1.
Fig. 5 is a schematic diagram of the data table storage structure of the second-degree association relations formed in step (3) of Embodiment 1.
Fig. 6 is a schematic diagram of the data table storage structure of the third-degree association relations in Embodiment 1.
Fig. 7 is a schematic diagram of the first-degree association path with target A as the starting point in Embodiment 1.
Fig. 8 is a schematic diagram of the first-degree association path with target C as the starting point in Embodiment 1.
Fig. 9 is a schematic diagram of the first-degree association path with target H as the starting point in Embodiment 1.
Fig. 10 is a schematic diagram of the second-degree association path with target A as the starting point in Embodiment 1.
Fig. 11 is a schematic diagram of the second-degree association path with target C as the starting point in Embodiment 1.
Fig. 12 is a schematic diagram of the second-degree association path with target H as the starting point in Embodiment 1.
Fig. 13 is a schematic diagram of the third-degree association path with target A as the starting point in Embodiment 1.
Fig. 14 is a schematic diagram of the third-degree association path with target C as the starting point in Embodiment 1.
Fig. 15 is a schematic diagram of the third-degree association path with target H as the starting point in Embodiment 1.
It should be understood that the drawings of the present invention are schematic and do not represent specific steps or paths.
Detailed description of the invention
The present invention is described in further detail below with reference to test examples and specific embodiments. This should not be understood as limiting the scope of the above subject matter of the present invention to the following embodiments; all techniques realized on the basis of the content of the present invention fall within the scope of the present invention.
The object of the present invention is to overcome the deficiencies of the prior art by providing a massive data processing method based on a cloud computing platform. The method extracts the original data to be processed from a database and, through the big data processing framework of the cloud computing platform, uses the identical information units in different data records to derive the association relations between massive numbers of target information items. The method can set analysis targets as required within massive Internet information, and then determine whether different targets are associated and what kind of association they have. It provides a reliable and convenient approach to the in-depth mining and application of data, and a new and effective approach to target background analysis, marketing, market segmentation, risk prediction, risk prevention and control, and the like.
To achieve the above object, the present invention provides a massive data processing method based on a cloud computing platform: by setting filter conditions such as fields, the key information units in each document of the original data are extracted; the extracted key information units are arranged into one data record in a set order; and each data record is stored in a database (usually a non-relational database). On this basis, the distributed processing model under the cloud computing framework is applied to abstract the association relations between information units from the identical information units contained in different data records.
Specifically, the method comprises the steps shown in Fig. 1:
(1) From each item of basic data in the original data, extract the corresponding information according to the set fields to form a corresponding data record;
(2) A first data record contains first information and second information, where the second information is first-degree associated information of the first information; a second data record contains the second information and third information, where the third information is first-degree associated information of the second information. Through the distributed processing framework on the cloud computing platform, the third information is made second-degree associated information of the first information, and the association path from the first information through the second information to the third information is abstracted;
(3) If a third data record contains fourth information and the third information, where the fourth information is first-degree associated information of the third information, then through the distributed processing framework on the cloud computing platform the fourth information is extended to third-degree associated information of the first information, and the association path from the first information through the second information and the third information to the fourth information is abstracted;
By analogy, the N-degree associated information with the first information as the starting point, and the corresponding association paths, are abstracted, where N > 1.
Here, the first information, second information, and third information refer to the content of the information and do not indicate any order. With this method, starting from target information (the starting point is chosen according to the needs of the analysis), other information associated with the target and the corresponding association paths can be found step by step, and the association paths clearly show the specific routes by which the analysis target is linked to the associated information. The association relations are computed with the big data processing framework of the cloud computing platform, which can process massive numbers of targets in parallel; that is, from the basic data through to the N-degree associated information, the computation runs for multiple targets at the same time. As the association degree N increases step by step, the complexity of the computation and the dimensionality of the data keep increasing, and such a complex data processing procedure can be carried out quickly by the big data processing framework of the cloud computing platform (for example MapReduce under Hadoop, or Spark). With frameworks such as MapReduce and Spark, the user only needs to write upper-layer instructions against the interfaces that the framework provides and need not be concerned with the underlying operation: the framework automatically calls the relevant internal resources according to those instructions, splits the task automatically, assigns the pieces to different internal nodes for processing, and so computes over the data efficiently in parallel. When processing is complete, the results are automatically integrated and provided to the user. Task execution is highly automated, which greatly saves labor and improves data processing efficiency. By using the big data processing framework of the cloud computing platform, the present invention provides a fast and reliable way to analyze the association background of massive numbers of targets.
The original data in the present invention is stored in a database. The original data may be data crawled from the Internet as required; the Internet contains extremely rich information sources. Crawling relevant information from the Internet as required and processing the acquired information in depth provides a new approach to the refined processing and effective application of information.
Further, the computation of the N-degree association relations is always based on first-degree association relations: in the tracing (computation) of associated information described above, the N-th degree associated information is the first-degree associated information of the (N-1)-th degree associated information. Tracing associated information step by step in this way keeps the computation logic clear and the operation simple, which ensures the accuracy of the results.
Further, the data extracted in step (1) may first be preprocessed by data cleaning. The data extracted from the basic data by field is generally of JSON type; the relations among the data are weak, and some of it may have an irregular structure or be insufficiently clean (containing irrelevant, useless, or erroneous data). This is what is called weakly structured data. To abstract first-degree associated information from such weakly structured data, the data must first be put in order, and this tidying is the data preprocessing. The preprocessing may use methods including field cleaning, field derivation, null value handling, record sampling and filtering, record aggregation, record appending, record merging, and record sorting, and can resolve problems such as missing values, redundancy, and inconsistency in the data. In simple terms, the data cleaning is an ETL (extract, transform, load) process applied to the basic data according to the needs of the analysis.
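A few of the preprocessing methods listed above (field cleaning, null value handling, and removal of redundant records) can be sketched as follows. This is a minimal illustration under assumed inputs: the function `clean_records`, the field names, and the sample lines are hypothetical, and the other listed methods such as field derivation or record aggregation are not shown.

```python
import json


def clean_records(raw_lines, required_fields):
    """Parse JSON lines, keep only the required fields, and drop bad records."""
    seen = set()
    cleaned = []
    for line in raw_lines:
        try:
            document = json.loads(line)
        except json.JSONDecodeError:
            continue  # erroneous data: not valid JSON
        if not isinstance(document, dict):
            continue  # irregular structure: not a field/content mapping
        record = {}
        for field in required_fields:
            value = document.get(field)
            # Field cleaning: trim stray whitespace around string contents.
            record[field] = value.strip() if isinstance(value, str) else value
        # Null value handling: here, records with a missing field are dropped.
        if any(record[field] in (None, "") for field in required_fields):
            continue
        key = json.dumps(record, sort_keys=True)
        if key in seen:
            continue  # redundancy: identical record already kept
        seen.add(key)
        cleaned.append(record)
    return cleaned


raw_lines = [
    '{"name": " Alice ", "company": "Acme"}',
    '{"name": "Alice", "company": "Acme", "extra": 1}',
    "not json",
    '{"name": "Bob"}',
    "[1, 2]",
]
cleaned = clean_records(raw_lines, ["name", "company"])
print(cleaned)
```

Dropping incomplete records is only one possible null-handling policy; filling in a default value would be another, and the source does not say which is used.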
Information unit (institute further, in described step (1), in described data record
State the content that information unit refers to that each field is corresponding) between use separator separate, such as
Space, comma, pause mark.Separate using separator between information unit, it is to avoid different information
Location contents inter-adhesive, for the extraction of subsequent association information content with calculate and provide basis.
Further, the fields and contents of the data information extracted in step
(1) are treated as key-value pairs, where the field is the "key" and the
content corresponding to the field is the "value". According to the needs
of the analysis, the content corresponding to any one field is selected as
the starting point of related-information tracing (the associated
information), and the contents corresponding to the other fields in each
data record are taken as the first-degree related information of the
associated information, which completes the calculation of the
first-degree association relations. The calculation of first-degree related
information is the basis for the subsequent calculation of N-degree related
information.
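The key-value view of a data record can be sketched as follows. The field
names (`field1` to `field4`) and the `first_degree_pairs` helper are
hypothetical; the contents A, B, D, E mirror the first data record used
later in Embodiment 1.

```python
def first_degree_pairs(record, start_field):
    """Pair the start field's content with the content of every other field."""
    origin = record[start_field]
    return [(origin, value)
            for field, value in record.items()
            if field != start_field and value is not None]


# One data record as key-value pairs: field ("key") -> content ("value").
record = {"field1": "A", "field2": "B", "field3": "D", "field4": "E"}
pairs = first_degree_pairs(record, "field1")
```

Choosing a different `start_field` changes the starting point of the
tracing without changing the record itself.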
Further, the first-degree related information formed in step (2) is stored
according to a set structural order. Storing first-degree related
information in a set structure and order gives the first-degree related
information formed for different targets a unified storage format, which
facilitates data processing in subsequent steps.
Further, the first-degree related information formed in step (2) may be
stored in the structural order: target (origin information), first-degree
related information, relation label. The relation label describes the
first-degree association between the related information and the target
information, and provides a concise, intuitive description for
associated-data queries.
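One possible row layout for this storage order is sketched below. The
patent does not say what a relation label contains, so building the label
from the two field names is purely an assumption, as are the field names
`person`, `phone`, and `email`.

```python
from typing import NamedTuple


class FirstDegreeRow(NamedTuple):
    origin: str   # target (origin information)
    related: str  # first-degree related information
    label: str    # relation label describing the association


def to_rows(record, start_field):
    """Build one (origin, related, label) row per non-start field."""
    origin = record[start_field]
    return [FirstDegreeRow(origin, value, f"{start_field}->{field}")
            for field, value in record.items()
            if field != start_field]


rows = to_rows({"person": "A", "phone": "B", "email": "D"}, "person")
```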
Further, the second-degree related information formed in step (2) is
stored in the sequential structure: first-degree associated data,
second-degree associated data; and information units belonging to
different degrees of association are marked with corresponding labels. The
internal data storage structure of the first-degree and second-degree
association relations is the same as in the previous step. With the
corresponding labels, information belonging to different degrees of
association can easily be told apart, which facilitates data extraction
and differentiation during the step-by-step calculation of related
information.
Further, the N-degree related information is stored in a sequential
structure in which the degree of association increases successively, and
information units belonging to different degrees of association are marked
with corresponding labels.
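A small sketch of degree-ordered, labeled storage: each information unit in
an associated path is stored under a key naming its degree of association.
The `degree_i` naming scheme and the `mark_path` helper are assumptions
made for illustration.

```python
def mark_path(path):
    """Label each unit in a path with its degree (0 is the starting point)."""
    return {f"degree_{i}": unit for i, unit in enumerate(path)}


row = mark_path(("A", "B", "C", "F"))
```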
As a preference, the data records and the N-degree related information are
stored in the form of data tables; storing data as data tables gives a
regular storage structure and facilitates querying and further
calculation.
Further, the data records and the N-degree related information are stored
in a non-relational database, such as HBase, CouchDB, Cassandra, or
MongoDB. Compared with traditional relational databases, non-relational
databases are simple to operate, completely free, open source,
downloadable at any time, and inexpensive to apply; moreover, when faced
with multidimensional unstructured data whose volume is growing sharply,
such as audio and video data, the storage of traditional relational
databases can no longer meet demand.
Further, the association relations formed in steps (2) and (3) are stored
in the distributed file system of a non-relational database (such as
HDFS). HDFS, the distributed file system under Hadoop, is highly fault
tolerant, is suitable for deployment on inexpensive machines, and has low
operating and maintenance costs. HDFS is also well suited to large-scale
data sets; storing the data to be processed in HDFS meets the needs of mass
data storage and high fault tolerance, and makes it convenient to use the
other processing modes of Hadoop.
As a preference, the association relations in step (2) and step (3) are
computed with the MapReduce computing framework under Hadoop.
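How a map/reduce formulation could derive second-degree paths is sketched
below as a single-process Python simulation. It does not use the Hadoop
API; the three phase functions and the sample pairs are hypothetical. In
this sketch every information unit is treated as a possible starting point.

```python
from collections import defaultdict
from itertools import permutations


def map_phase(pairs):
    """Emit each first-degree pair keyed by both of its information units."""
    for a, b in pairs:
        yield a, b
        yield b, a


def shuffle(mapped):
    """Group emitted values by key, as a framework's shuffle step would."""
    groups = defaultdict(set)
    for key, value in mapped:
        groups[key].add(value)
    return groups


def reduce_phase(groups):
    """For each shared unit, link every ordered pair of its neighbours."""
    paths = []
    for shared, neighbours in groups.items():
        for start, end in permutations(sorted(neighbours), 2):
            paths.append((start, shared, end))
    return sorted(paths)


pairs = [("X", "Y"), ("Z", "Y")]  # hypothetical first-degree pairs
paths = reduce_phase(shuffle(map_phase(pairs)))
```

Keying on the shared unit means each reducer only needs the neighbours of
one unit, which is what lets the work be split across machines.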
Further, the second-degree association relations in step (2) and step (3)
are computed with the Spark computing framework. When the Spark big data
processing framework is used to compute association relations, Spark
serves as an alternative to MapReduce: it is compatible with the HDFS
distributed storage layer and can be integrated into the Hadoop ecosystem.
Spark is a platform for building large-scale in-memory computing; it makes
full use of in-memory computation to achieve real-time processing of
massive data.
Further, the second-degree association relations in step (3) are
implemented with SQL statements in the Spark computing framework,
specifically using the join algorithm in SQL. For example, suppose a data
table contains two structured columns of information, with rows such as:
first information, second information; second information, third
information. With the join algorithm, the first information and the third
information can then easily be connected through the second information,
forming a new data result of first information, second information, third
information.
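The join described above can be tried out with a self-join on a two-column
table. To stay self-contained, this sketch uses Python's built-in `sqlite3`
module rather than Spark SQL; the table name `first_degree` and the column
names `src` and `dst` are assumptions. The query is plain SQL, so a similar
statement should be expressible in Spark SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE first_degree (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO first_degree VALUES (?, ?)",
    [("first", "second"), ("second", "third")],
)

# Self-join: the dst of one row must equal the src of another row.
rows = conn.execute(
    """
    SELECT a.src, a.dst, b.dst
    FROM first_degree AS a
    JOIN first_degree AS b ON a.dst = b.src
    """
).fetchall()
conn.close()
```

Each result row is one associated path: starting information, shared
information, and the information reached through it.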
Embodiment 1
Below, 3 data records are used as a small example to illustrate the
analysis of association relations. Assume that the original data has been
extracted by field (the set fields are: first field, second field, third
field, and fourth field), and that the extracted data comprises the 3 data
records shown in Fig. 3. The information content corresponding to the
first, second, third, and fourth fields of the first data record is, in
order: A, B, D, and E. The information content corresponding to the first,
second, third, and fourth fields of the second data record is, in order:
C, B, F, and G. The information content corresponding to the first,
second, third, and fourth fields of the third data record is, in order:
H, F, I. Assume that the content corresponding to the first field is taken
as the starting point of the association analysis. The first data record
then forms the first-degree association relations A-B, A-D, A-E, where B,
D, E are first-degree related information of A, and A is likewise
first-degree related information of B, D, E. The second data record forms
the first-degree association relations C-B, C-F, C-G, where B, F, G are
first-degree related information of C, and C is likewise first-degree
related information of B, F, G. The third data record forms the
first-degree association relations H-F, H-I, where F, I are first-degree
related information of H, and H is likewise first-degree related
information of F, I. Storing the first-degree association relations in a
table structure yields the two structured columns shown in Fig. 4.
On the basis of the above first-degree associations, according to the
information unit B shared by the first-degree association relations A-B
and C-B, C is abstracted as second-degree related information of A; with A
as the starting point, the associated path A-B-C is formed. According to
the information unit B shared by the first-degree association relations
C-B and A-B, information A is abstracted as second-degree related
information of C; with C as the starting point, the associated path C-B-A
is formed. According to the information unit F shared by the first-degree
association relations C-F and H-F, H is abstracted as second-degree
related information of C; with C as the starting point, the associated
path C-F-H is formed. According to the information unit F shared by the
first-degree association relations H-F and C-F, C is abstracted as
second-degree related information of H; with H as the starting point, the
associated path H-F-C is formed. The associated data forming the
second-degree association relations may be stored in data-table form,
using the storage structure shown in Fig. 5.
Further, on the basis of the above second-degree and first-degree
association relations, with the first information A as the starting point
and according to the first-degree related information of A's second-degree
related information C, the associated paths A-B-C-F and A-B-C-G can be
abstracted, where F and G are third-degree related information of A. With
C as the starting point, according to the first-degree related information
of C's second-degree related information A and H, the associated paths
C-B-A-E, C-B-A-D, and C-F-H-I can be abstracted, where D, E, I are
third-degree related information of C. Likewise, with H as the starting
point, according to the first-degree related information of its
second-degree related information C, the associated paths H-F-C-B and
H-F-C-G can be formed, where B and G are third-degree related information
of H. The storage table for third-degree related information is shown in
Fig. 6.
It should be noted that closed paths must be removed during the
calculation of related information, so as to avoid erroneous loops in the
calculation.
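The step-by-step expansion in this embodiment, including the removal of
closed paths, can be reproduced with a short Python sketch. The
first-degree pairs are the ones listed in Embodiment 1; the function names
and the in-memory adjacency map are assumptions, and a real deployment
would run the equivalent logic on a distributed framework.

```python
from collections import defaultdict

# First-degree association relations listed in Embodiment 1.
FIRST_DEGREE = [("A", "B"), ("A", "D"), ("A", "E"),
                ("C", "B"), ("C", "F"), ("C", "G"),
                ("H", "F"), ("H", "I")]


def build_adjacency(pairs):
    """Map each information unit to its first-degree related information."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def paths_of_degree(adjacency, start, degree):
    """Extend paths one degree at a time, skipping units already on the path."""
    paths = [(start,)]
    for _ in range(degree):
        paths = [path + (nxt,)
                 for path in paths
                 for nxt in sorted(adjacency.get(path[-1], ()))
                 if nxt not in path]  # removes closed paths
    return paths


adjacency = build_adjacency(FIRST_DEGREE)
second_from_a = paths_of_degree(adjacency, "A", 2)
third_from_c = paths_of_degree(adjacency, "C", 3)
```

The `nxt not in path` check is what prevents a path such as A-B-A from
being produced, since B's first-degree related information includes A.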
In the embodiment, the associated paths formed from the first-degree
association data with A, C, and H as starting points, after merging and
integration, are shown in Fig. 7, Fig. 8, and Fig. 9 respectively; the
second-degree associated paths are shown in Fig. 10, Fig. 11, and Fig. 12;
and the third-degree associated paths are shown in Fig. 13, Fig. 14, and
Fig. 15.
This embodiment only schematically illustrates the calculation of related
information; in practice, the number of targets to be analyzed can reach
the order of ten thousand, one hundred thousand, or a million. As the above
embodiment shows, the amount of data to be computed increases sharply as
the degree of association increases, and the computation of multi-degree
related information for a massive number of targets is larger still. The
present invention uses the big data processing framework of a cloud
computing platform, which can process a massive number of targets in
parallel according to the above method, thereby achieving association
analysis and mining of massive target information.
Although illustrative embodiments of the present invention have been
described above so that those skilled in the art can understand the
invention, it should be clear that the invention is not limited to the
scope of the specific embodiments. To those of ordinary skill in the art,
as long as various changes fall within the spirit and scope of the
invention as defined and determined by the appended claims, these changes
are obvious, and all inventions and creations that make use of the concept
of the present invention are protected.
Claims (16)
1. A mass data processing method based on a cloud computing platform,
characterized by comprising the following process:
(1) from each piece of basic data in the original data, extracting the
corresponding information according to the set fields to form a
corresponding data record;
(2) a first data record contains first information and second
information, wherein the second information is first-degree related
information of the first information; a second data record contains the
second information and third information, wherein the third information is
first-degree related information of the second information; by means of
the distributed processing framework under the cloud computing platform,
the third information is expanded into second-degree related information
of the first information, and the associated path from the first
information through the second information to the third information is
abstracted;
(3) if a third data record contains fourth information and the third
information, wherein the fourth information is first-degree related
information of the third information, the fourth information is expanded,
by means of the distributed processing framework under the cloud computing
platform, into third-degree related information of the first information,
and the associated path from the first information through the second
information and the third information to the fourth information is
abstracted;
and so on, abstracting the N-degree related information with the first
information as the starting point, together with the corresponding
associated paths, where N > 1.
2. The mass data processing method based on a cloud computing platform as
claimed in claim 1, characterized in that, during the calculation of the
related information, the Nth-degree related information in a calculated
path is first-degree related information of the (N-1)th-degree related
information.
3. The mass data processing method based on a cloud computing platform as
claimed in claim 1, characterized in that the original data is crawled
from relevant web pages on the Internet as needed.
4. The mass data processing method based on a cloud computing platform as
claimed in claim 3, characterized in that the data records are
preprocessed through cleaning.
5. The mass data processing method based on a cloud computing platform as
claimed in claim 4, characterized in that the data cleaning is performed
by field cleaning, field derivation, null-value handling, record sampling
and filtering, record aggregation, record appending, record merging,
and/or record sorting.
6. The mass data processing method based on a cloud computing platform as
claimed in claim 1, characterized in that the information units in each
data record of step (1) are stored in a uniform structural order.
7. The mass data processing method based on a cloud computing platform as
claimed in claim 6, characterized in that, in step (1), the data records
are stored in the form of data tables.
8. The mass data processing method based on a cloud computing platform as
claimed in claim 6, characterized in that, in step (1), the information
units in the data records are separated by delimiters.
9. The mass data processing method based on a cloud computing platform as
claimed in claim 6, characterized in that the first-degree related
information formed in step (2) is stored in the structural order: origin
information, first-degree related information, association description.
10. The mass data processing method based on a cloud computing platform as
claimed in claim 6, characterized in that the N-degree related information
is stored in a sequential structure in which the degree of association
increases successively.
11. The mass data processing method based on a cloud computing platform as
claimed in claim 10, characterized in that information units belonging to
different degrees of association are marked with corresponding labels.
12. The mass data processing method based on a cloud computing platform as
claimed in claim 1, characterized in that the N-degree associated data is
stored separately in different databases.
13. The mass data processing method based on a cloud computing platform as
claimed in claim 12, characterized in that the N-degree associated data is
stored separately in the distributed file systems of different databases.
14. The mass data processing method based on a cloud computing platform as
claimed in any one of claims 1 to 13, characterized in that the N-degree
association relations in step (2) are computed with the MapReduce
computing framework under Hadoop.
15. The mass data processing method based on a cloud computing platform as
claimed in any one of claims 1 to 13, characterized in that the N-degree
association relations are computed with the Spark computing framework.
16. The mass data processing method based on a cloud computing platform as
claimed in claim 15, characterized in that the association relations of
degree N >= 2 in step (3) are implemented with join statements in the
Spark computing framework.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610255566.8A CN105930462A (en) | 2016-04-21 | 2016-04-21 | Cloud computing platform based massive data processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105930462A true CN105930462A (en) | 2016-09-07 |
Family
ID=56839725
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610255566.8A Pending CN105930462A (en) | 2016-04-21 | 2016-04-21 | Cloud computing platform based massive data processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105930462A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109669996A (en) * | 2018-12-29 | 2019-04-23 | 恒睿(重庆)人工智能技术研究院有限公司 | Information dynamic updating method and device |
CN113094415A (en) * | 2019-12-23 | 2021-07-09 | 北京懿医云科技有限公司 | Data extraction method and device, computer readable medium and electronic equipment |
CN113094415B (en) * | 2019-12-23 | 2024-03-29 | 北京懿医云科技有限公司 | Data extraction method, data extraction device, computer readable medium and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Osman | A novel big data analytics framework for smart cities | |
US9251277B2 (en) | Mining trajectory for spatial temporal analytics | |
Yan et al. | A hybrid model and computing platform for spatio-semantic trajectories | |
CN105930466A (en) | Massive data processing method | |
CN107590250A (en) | A kind of space-time orbit generation method and device | |
CN108737492A (en) | A method of the navigation based on big data system and location-based service | |
CN105956016A (en) | Associated information visualization processing system | |
CN105930465A (en) | Data mining processing method | |
Roth et al. | Event data warehousing for complex event processing | |
CN113779169B (en) | Space-time data stream model self-enhancement method | |
CN105956018A (en) | Massive associated data analysis and visualization implementation method based on cloud computing platform | |
CN102609501B (en) | Data cleaning method based on real-time historical database | |
CN109977125A (en) | A kind of big data safety analysis plateform system based on network security | |
CN109062769A (en) | The method, apparatus and equipment of IT system performance risk trend prediction | |
CN104219088A (en) | Hive-based network alarm information OLAP method | |
Singh et al. | Analysis on data mining models for Internet Of Things | |
Arora et al. | Big data: A review of analytics methods & techniques | |
CN103150470A (en) | Visualization method for concept drift of data stream in dynamic data environment | |
CN105930462A (en) | Cloud computing platform based massive data processing method | |
Jiang et al. | Spatial and spatiotemporal big data science | |
CN109828995A (en) | A kind of diagram data detection method, the system of view-based access control model feature | |
CN105956019A (en) | Big data analysis processing method | |
CN105930463A (en) | Cloud computing platform based big data processing method | |
CN107562909A (en) | A kind of big data analysis system and its analysis method for merging search and calculating | |
CN110019453A (en) | A kind of method and system that tax data is handled based on distributed system infrastructure platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20160907 |