CN110378550A

CN110378550A - Method for processing large-scale multi-source food data based on a distributed architecture

Info

Publication number: CN110378550A
Application number: CN201910477552.4A
Authority: CN
Inventors: 曹云; 龙安杰
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2019-06-03
Filing date: 2019-06-03
Publication date: 2019-10-25

Abstract

The invention discloses a method for processing large-scale multi-source food data based on a distributed architecture. A distributed computing platform is built on a distributed architecture, and data are stored and processed in a distributed manner using the Map-Reduce model. For large-scale multi-source food data, key indices and the spatio-temporal data anomaly problem are defined; the proposed key-index analysis method for spatio-temporal big data clusters and partitions the data and computes the key indices, thereby achieving anomaly detection. The method comprises the following steps: building the distributed computing platform; importing the data into a distributed file system; preprocessing the data as input to the algorithm; clustering the data objects and computing each object's index in different dimensions; and visualizing the index analysis results. The invention effectively addresses the problems of traditional single-object time-series anomaly detection methods, improving both detection accuracy and speed.

Description

Method for processing large-scale multi-source food data based on a distributed architecture

Technical field

The invention belongs to the field of computer technology, and in particular relates to a key-index analysis algorithm and system for large-scale multi-source food data based on a distributed architecture.

Background art

In today's highly information-driven society, the relevant regulatory authorities can effectively integrate the large-scale data resources of the various departments along the food industry chain, for example food inspection data and food and drug complaint data. On the whole, such data resources are large in scale, complex in organizational structure, and diverse in form of expression. With the development of artificial intelligence, the advantages of traditional machine learning algorithms in big data processing have been demonstrated in many industrial fields. However, the fusion and collaborative analysis of multi-source heterogeneous data from multiple databases still face difficulties such as high data redundancy and inconsistent forms of expression (for example, text descriptions versus numerical statistics). For classical spatio-temporal big data with the characteristics of "multi-source heterogeneous stream data", the open question is how to use key technologies such as data integration and spatio-temporal data analysis to build models that connect the "information islands" of the individual departments and fully uncover potential risks.

In data research, spatio-temporal big data refers to the data generated at different moments by several spatially independent objects; the data that each object generates at different moments form a time series, also called stream data. Traditional single-object time-series anomaly detection considers only one object and relatively few data dimensions, so it cannot fully uncover the useful information hidden in the data. Stand-alone (single-machine) algorithms also have problems in both detection accuracy and speed. A more accurate and more efficient detection algorithm is therefore needed.

Summary of the invention

Technical problem: In view of the problems and shortcomings of the prior art described above, the present invention provides a key-index analysis algorithm for large-scale multi-source food data based on a distributed architecture. It implements distributed storage, reading and computation for large-scale multi-source data and, using the concept of a key index, models the data and performs data analysis and anomaly detection according to the key indices of the food data.

Technical solution: To achieve the above object, the method of the present invention for processing large-scale multi-source food data based on a distributed architecture comprises the following steps:

Step 1: write code to build a distributed computing platform;

Step 2: import the large-scale multi-source food inspection stream data from a MySQL database into the distributed file system HDFS (Hadoop Distributed File System), stored as text files, for use in data processing and analysis;

Step 3: preprocess the data stored in HDFS: divide the data objects of each batch into different time periods according to their temporal features, and at the same time pad the dimensions so that every object has the same time-period dimension;

Step 4: perform cluster analysis on each time period of each object using the K-Means clustering algorithm, obtaining the corresponding cluster centres and number of clusters;

Step 5: for each object, compute the index of each of the n clusters in each time period. The index is defined here as the degree of data deviation. Compute the proportion of each object's data in each time period that falls in the different clusters, finally obtaining the index sequence of each object;

Step 6: visualize the resulting index sequences so that the fluctuation of each object can be observed intuitively, thereby achieving data analysis and anomaly detection.

Wherein:

Step 1: write code to build a distributed computing platform. Based on the Hadoop environment, a distributed computing platform comprising one master node and three slave nodes is configured on the server side to implement distributed parallel computing.

Step 2: write a data import script that imports the large-scale multi-source food inspection stream data from the original MySQL database into the distributed file system HDFS, stored as text files, for use in data processing and analysis.

Step 3: preprocess the data stored in HDFS. The preprocessing is as follows: divide the data objects of each batch into different time periods according to their temporal features, and at the same time pad the dimensions so that every object has the same time-period dimension.

Step 4: perform cluster analysis on each time period of each object. The clustering method is as follows: using the K-Means clustering algorithm, first initialize the cluster centres for each time period, then iteratively compute new cluster centres until convergence, obtaining the corresponding final cluster centres and number of clusters.

Step 5: for each object, compute the index of each of the n clusters in each time period, the index being defined here as the degree of data deviation. The calculation is as follows: for each cluster of each object in each time period, compute a weighted sum over the cluster centre and use it as the evaluation index of that cluster; then compute the proportion of each object's data in each time period across the different clusters, calculated by dividing the number of clusters in which the object lies by the total number of clusters; finally obtain the index sequence of each object.

Step 6: visualize the resulting index sequences. The visualization method is as follows: the resulting index data are sent in JSON format to the web front end of the system and, after parsing, displayed using Echarts. The display shows a bar chart for each object, in which the horizontal axis represents the time period and the vertical axis represents the corresponding key index value, so that the fluctuation of each object can be observed intuitively, thereby achieving data analysis and anomaly detection.

Beneficial effects: The invention proposes an efficient solution to the problems of traditional food data analysis techniques. It builds its services on a distributed computing platform and, on that basis, provides a key-index-based data analysis and anomaly detection algorithm. The invention improves detection accuracy and, because it runs on a distributed computing platform, also improves processing speed and efficiency compared with stand-alone algorithms. It effectively solves the analysis and anomaly detection problem for large-scale multi-source food data.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention;

Fig. 2 is an overall architecture diagram of the risk assessment system built using the method of the present invention;

Fig. 3 is a simplified schematic diagram of the method of the present invention;

Fig. 4 shows the visualization result of the method of the present invention.

Detailed description of the embodiments

The present invention is further explained below with reference to the drawings and specific embodiments. It should be understood that these embodiments are intended only to illustrate the invention and not to limit its scope; after reading the present disclosure, modifications in various equivalent forms made by those skilled in the art all fall within the scope defined by the claims appended to this application.

Step 1: write code to build the distributed computing platform.

Step 2: import the data from MySQL into the distributed file system HDFS, stored as text files, for use in data processing and analysis. The fields of the raw text data are separated by "=", where the 2nd, 3rd, 5th and 6th fields are, respectively, the inspection time, the enterprise ID, the detected value for a given item, and the standard value. There are about 60,000 records, consisting of the daily inspection data for June, August, September, October, November and December 2017.
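The record layout described in Step 2 can be illustrated with a minimal parsing sketch. Only the "=" separator and the positions of the 2nd, 3rd, 5th and 6th fields come from the embodiment; the `Inspection` type, the `parse_line` helper, and the sample line are hypothetical names and values introduced here for illustration, and the contents of the other fields are not specified in the source.

```python
from typing import NamedTuple, Optional


class Inspection(NamedTuple):
    """Hypothetical container for the four fields named in Step 2."""
    inspected_at: str   # 2nd field: inspection time
    enterprise_id: str  # 3rd field: enterprise ID
    detected: float     # 5th field: detected value
    standard: float     # 6th field: standard value


def parse_line(line: str) -> Optional[Inspection]:
    """Parse one '='-separated text record; return None if it is malformed."""
    fields = line.rstrip("\n").split("=")
    if len(fields) < 6:
        return None
    try:
        return Inspection(fields[1], fields[2], float(fields[4]), float(fields[5]))
    except ValueError:
        # Non-numeric detected or standard value: skip the record.
        return None


# Made-up sample line; fields 1 and 4 are placeholders.
record = parse_line("r1=2017-06-01=E001=item=0.12=0.50")
```

Returning `None` for malformed lines is a design choice made for this sketch, so that a later processing stage can filter bad records without raising.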

Step 3: preprocess the data stored in HDFS: divide the data objects of each batch into different time periods according to their temporal features, and at the same time pad the dimensions so that every object has the same time-period dimension. The data are integrated according to this division; the fields are separated by "," and represent, respectively, the enterprise ID, the time (to the day), and the data in the different dimensions. After integration there are 3665 records, each with 11 dimensions.
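The integration and dimension padding in Step 3 might be sketched as follows. The source does not say how the 11 dimensions are formed or what value is used for padding, so this sketch assumes that the values recorded for one enterprise on one day become the dimensions and that missing dimensions are filled with 0.0; `pad` and `integrate` are hypothetical helper names.

```python
from collections import defaultdict

DIMENSIONS = 11   # per-record dimension reported in the embodiment
FILL_VALUE = 0.0  # assumption: the source does not state the padding value


def pad(values, size=DIMENSIONS, fill=FILL_VALUE):
    """Pad (or truncate) a list of values to exactly `size` entries."""
    return (list(values) + [fill] * size)[:size]


def integrate(records):
    """Group (enterprise_id, day, value) tuples into ','-separated rows.

    Each output row has the form: enterprise ID, day, then DIMENSIONS values.
    """
    grouped = defaultdict(list)
    for enterprise_id, day, value in records:
        grouped[(enterprise_id, day)].append(value)
    rows = []
    for (enterprise_id, day), values in sorted(grouped.items()):
        fields = [enterprise_id, day] + [f"{v:g}" for v in pad(values)]
        rows.append(",".join(fields))
    return rows


rows = integrate([("E1", "2017-06-01", 0.1), ("E1", "2017-06-01", 0.2)])
```

Padding every row to the same length is what allows the clustering step to treat all records as points in one fixed-dimensional space.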

Step 4: perform cluster analysis on each time period of each object using the K-Means clustering algorithm, obtaining the corresponding cluster centres and number of clusters. Let the time period T be one month, and cluster the processed data for each period with the K-Means algorithm. This yields new data with 4 fields: the time period (to the month), the cluster centre of the cluster to which the record belongs, the enterprise ID, and the enterprise's 11-dimensional data. The fields are separated by ",", and the number of records remains 3665.
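The clustering in Step 4 can be illustrated with a minimal single-machine K-Means sketch in plain Python. It follows the stated procedure (initialize the centres, then recompute them iteratively until they stop changing), but it is not the patent's distributed Map-Reduce implementation; the random initialization, the `max_iter` cap, and the function names are assumptions made for this sketch. In the embodiment it would be run once per monthly period.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length lists) into k clusters.

    Returns (centres, labels), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    # Assumption: initial centres are k distinct input points chosen at random.
    centres = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centres.append(centres[j])  # keep the centre of an empty cluster
        if new_centres == centres:  # converged: centres no longer move
            break
        centres = new_centres
    return centres, labels


centres, labels = kmeans([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]], k=2)
```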

Step 5: for each object, compute the index of each of the n clusters in each time period. The index is defined here as the degree of data deviation. Compute the proportion of each object's data in each time period that falls in the different clusters, finally obtaining the index sequence of each object. In this embodiment, the indices of the 3 clusters in each time period and the proportion of each enterprise's data in each time period across the different clusters are computed, finally yielding the index sequences of the different enterprises over the different time periods. The output data have 3 fields: the enterprise ID, the time period, and the index for the corresponding time period. There are 231 records in total.
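One possible reading of the Step 5 calculation is sketched below. The source states only that a cluster's evaluation index is a weighted sum over its cluster centre and that each object's share across clusters is then computed; the weights, the interpretation of the share as the fraction of an object's records assigned to each cluster, and the final combination of shares and cluster scores into a single period index are all assumptions of this sketch, and the function names are hypothetical.

```python
def cluster_score(centre, weights):
    """Weighted sum of a cluster centre's coordinates (weights are assumed)."""
    return sum(w * c for w, c in zip(weights, centre))


def cluster_shares(labels, n_clusters):
    """Fraction of one object's records that fall in each cluster."""
    total = len(labels)
    return [labels.count(j) / total for j in range(n_clusters)]


def period_index(centres, labels, weights):
    """Assumed combination: share-weighted average of the cluster scores."""
    scores = [cluster_score(c, weights) for c in centres]
    shares = cluster_shares(labels, len(centres))
    return sum(share * score for share, score in zip(shares, scores))


# Two clusters in two dimensions with equal weights; the object has two
# records in each cluster for this period.
index = period_index([[1.0, 1.0], [3.0, 3.0]], [0, 0, 1, 1], [0.5, 0.5])
```

Repeating this for every enterprise and every monthly period would give one (enterprise ID, period, index) row per pair, matching the three-field output described above.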

Step 6: visualize the resulting index sequences so that the fluctuation of each object can be observed intuitively, thereby achieving data analysis and anomaly detection. The output data have two fields: the ID of the anomalous enterprise and the fluctuation of that enterprise's index.
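The visualization step, in which index data are sent to the web front end as JSON and drawn with Echarts as one bar chart per object, could be prepared on the server side roughly as follows. The sketch builds an ECharts-style option object with a category x-axis of periods and one bar series per enterprise; the function name `build_bar_option`, the row layout, and the sample values are assumptions, and how the JSON is transported to the page is left out.

```python
import json


def build_bar_option(index_rows):
    """Build an ECharts-style bar chart option from index rows.

    index_rows: list of (enterprise_id, period, index_value) tuples.
    """
    periods = sorted({period for _, period, _ in index_rows})
    by_enterprise = {}
    for enterprise_id, period, value in index_rows:
        by_enterprise.setdefault(enterprise_id, {})[period] = value
    series = [
        # A period with no data becomes None (null in JSON).
        {"name": eid, "type": "bar", "data": [values.get(p) for p in periods]}
        for eid, values in sorted(by_enterprise.items())
    ]
    return {
        "xAxis": {"type": "category", "data": periods},  # horizontal axis: period
        "yAxis": {"type": "value"},                      # vertical axis: index value
        "series": series,
    }


# Made-up sample rows for illustration only.
rows = [("E1", "2017-06", 1.5), ("E1", "2017-08", 2.0), ("E2", "2017-06", 0.5)]
payload = json.dumps(build_bar_option(rows))
```

On the page, the parsed payload would be handed to the chart instance; keeping the option construction on the server keeps the front-end code to parsing and rendering only.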

The above is only a preferred embodiment of the present invention. The scope of protection of the invention is not limited to the above embodiment; all technical solutions within the inventive concept fall within its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principles of the invention are also to be regarded as within its scope of protection.

Claims

1. A method for processing large-scale multi-source food data based on a distributed architecture, characterized in that the method comprises the following steps:

Step 1: writing code to build a distributed computing platform;

Step 2: importing the large-scale multi-source food inspection stream data from a MySQL database into the distributed file system HDFS, stored as text files, for use in data processing and analysis;

Step 3: preprocessing the data stored in HDFS, in which the data objects of each batch are divided into different time periods according to their temporal features and the dimensions are padded at the same time so that every object has the same time-period dimension;

Step 4: performing cluster analysis on each time period of each object using the K-Means clustering algorithm to obtain the corresponding cluster centres and number of clusters;

Step 5: for each object, computing the index of each of the n clusters in each time period, the index being defined here as the degree of data deviation; computing the proportion of each object's data in each time period that falls in the different clusters, and finally obtaining the index sequence of each object;

Step 6: visualizing the resulting index sequences so that the fluctuation of each object can be observed intuitively, thereby achieving data analysis and anomaly detection.

2. The method for processing large-scale multi-source food data based on a distributed architecture according to claim 1, characterized in that in Step 1, code is written to build the distributed computing platform: based on the Hadoop environment, a distributed computing platform comprising one master node and three slave nodes is configured on the server side to implement distributed parallel computing.

3. The method for processing large-scale multi-source food data based on a distributed architecture according to claim 1, characterized in that in Step 2, a data import script is written that imports the large-scale multi-source food inspection stream data from the original MySQL database into the distributed file system HDFS, stored as text files, for use in data processing and analysis.

4. The method for processing large-scale multi-source food data based on a distributed architecture according to claim 1, characterized in that in Step 3, the data stored in HDFS are preprocessed as follows: the data objects of each batch are divided into different time periods according to their temporal features, and the dimensions are padded at the same time so that every object has the same time-period dimension.

5. The method for processing large-scale multi-source food data based on a distributed architecture according to claim 1, characterized in that in Step 4, cluster analysis is performed on each time period of each object as follows: using the K-Means clustering algorithm, the cluster centres are first initialized for each time period, and new cluster centres are then computed iteratively until convergence, obtaining the corresponding final cluster centres and number of clusters.

6. The method for processing large-scale multi-source food data based on a distributed architecture according to claim 1, characterized in that in Step 5, the index of each of the n clusters in each time period is computed for each object, the index being defined here as the degree of data deviation, calculated as follows: for each cluster of each object in each time period, a weighted sum over the cluster centre is computed and used as the evaluation index of that cluster; the proportion of each object's data in each time period across the different clusters is then computed by dividing the number of clusters in which the object lies by the total number of clusters; the index sequence of each object is finally obtained.

7. The method for processing large-scale multi-source food data based on a distributed architecture according to claim 1, characterized in that in Step 6, the resulting index sequences are visualized as follows: the resulting index data are sent in JSON format to the web front end of the system and, after parsing, displayed using Echarts; the display shows a bar chart for each object, in which the horizontal axis represents the time period and the vertical axis represents the corresponding key index value, so that the fluctuation of each object can be observed intuitively, thereby achieving data analysis and anomaly detection.