CN107016501A - An efficient multidimensional analysis method for industrial big data

Info

Publication number: CN107016501A
Application number: CN201710190553.1A
Authority: CN
Inventors: 李律
Original assignee: ZHEJIANG LITAI TECHNOLOGY CO LTD
Current assignee: ZHEJIANG LITAI TECHNOLOGY CO LTD
Priority date: 2017-03-28
Filing date: 2017-03-28
Publication date: 2017-08-04

Abstract

The present invention relates to an efficient multidimensional analysis method for industrial big data, comprising the following steps: (1) an OPC extraction program reads data from an OPC server and writes it to the Hadoop distributed file system; (2) the data stored in the Hadoop distributed file system is converted into a detail table in a Hive data warehouse; (3) the required rows and columns of the detail table are filtered out to form a fact table; (4) the resulting fact table is joined with dimension tables to form a subject-oriented wide table, and multidimensional analysis is performed on the wide table to obtain analysis results. The dimension tables come from external systems and must be imported into the Hive data warehouse manually by the user. The method achieves elastic storage, elastic computing, high availability, and a general and efficient analysis method.

Description

An efficient multidimensional analysis method for industrial big data

Technical field

The present invention relates to the technical field of databases, and in particular to an efficient multidimensional analysis method for industrial big data.

Background

With the spread of industrial informatization, factories increasingly use automated control equipment and intelligent instruments in production, and these devices generate large amounts of real-time data. OPC is an industrial standard protocol supported by most automated control equipment and intelligent instruments; through the unified OPC interface that these devices provide, an application can readily obtain the large volume of real-time data generated during production. This data reflects the various states of the production process, and analyzing it can help optimize production flows, prevent defects and failures, reduce production costs, and improve production efficiency.

Existing industrial real-time data analysis is usually performed with traditional relational databases. Compared with a Hadoop-based distributed storage and computing platform, a traditional relational database has small storage capacity, weak computing power, and high expansion cost. As the scale of industrial real-time data keeps growing, analysis with a traditional relational database has to resort to coarse-grained sampling, discarding large amounts of field data, which affects the accuracy of the analysis results. Moreover, as industrial real-time data analysis tasks become more diverse, the computing power of the traditional relational database becomes a bottleneck that limits the expansion and timeliness of the analysis business.

As for analysis methods for industrial real-time data, the traditional approach is usually to develop separately for each individual business need, with each using its own data model and presentation, ignoring what data analysis tasks have in common. As the number of data analysis tasks grows, both maintenance and extension become increasingly difficult. Furthermore, real data analysis is often exploratory, so an analysis approach customized per business need is not flexible enough: it constrains the analyst's divergent thinking, tends to create fixed mindsets, and hinders the discovery of more valuable information in massive data.

Summary of the invention

The present invention aims to overcome the above shortcomings by providing an efficient multidimensional analysis method for industrial big data. The invention builds a data warehouse on Hadoop, an open-source distributed computing technology, imports large-scale industrial real-time data into the data warehouse in full, and performs multidimensional analysis modeling for different data analysis tasks according to a unified flow. The data volume it accommodates is large and complete, the analysis method is general and efficient, and the whole system is easy to maintain and extend. The method achieves elastic storage, elastic computing, high availability, and a general and efficient analysis method.

The present invention achieves the above purpose through the following technical solution: an efficient multidimensional analysis method for industrial big data, comprising the following steps:

(1) an OPC extraction program reads data from an OPC server and writes it to the Hadoop distributed file system;

(2) the data stored in the Hadoop distributed file system is converted into a detail table in a Hive data warehouse;

(3) the required rows and columns of the detail table are filtered out to form a fact table;

(4) the resulting fact table is joined with dimension tables to form a subject-oriented wide table, and multidimensional analysis is performed on the wide table to obtain analysis results; wherein the dimension tables come from external systems and must be imported into the Hive data warehouse manually by the user.

Preferably, the OPC extraction program reads industrial real-time data from the OPC server according to the standard OPC protocol and stamps it with the time of reading; this constitutes one read. Suppose the data points provided by the OPC server are key1, key2 and key3, with corresponding data value1, value2 and value3, and the timestamp is denoted by timestamp, with a value of the form yyyy-MM-dd-HH-mm-ss, where y denotes the year, M the month, d the day, H the hour, m the minute and s the second. The data of one read by the OPC extraction program is described in JSON format, forming the following string:

{"key1":"value1","key2":"value2","key3":"value3","timestamp":"yyyy-MM-dd-HH-mm-ss"}

Preferably, the extraction interval of the OPC extraction program is on the order of seconds. The extracted data is written to the Hadoop distributed file system as one JSON string per line; the JSON strings are merged into one or more files by the hour, and the files belonging to each hour are placed in the same folder. After the OPC extraction program has written the data of one hour, it generates a file of size 0 named _SUCCESS in the corresponding folder; the presence of _SUCCESS is the criterion for whether the data in the folder is complete.

Preferably, the detail table in the Hive data warehouse in step (2) is obtained as follows: (a) using the workflow scheduling tool Oozie provided by the Hadoop distributed file system, JSON files older than the buffer period are deleted from the Hadoop distributed file system, the buffer period being preset;

(b) according to the merge time unit of the data, Oozie starts a scheduled task that converts the JSON-format data into a two-dimensional detail table and loads it into the Hive data warehouse on the Hadoop platform for subsequent data processing.

Preferably, the subsequent data processing is as follows:

(I) create a temporary table containing only one string column on top of the JSON data;

(II) use Hive's JSON parsing function json_tuple to convert the keys in the JSON strings into column names and the values into column values.

Preferably, column filtering of the detail table excludes the data points unrelated to the subject, and row filtering narrows the value range of the data points related to the subject; in step (3), filtering is performed by converting the columns of the detail table into rows and the rows into columns, keeping the timestamp column unchanged, and adding a new column that holds the name of the data point.

Preferably, the fact table and the dimension tables are joined on one or more columns to form the subject-oriented wide table.

Preferably, multidimensional analysis of the wide table is implemented through Impala, the real-time SQL query tool provided by the Hadoop platform: external data visualization tools access the data of the wide table through the standard SQL driver provided by Impala to complete the multidimensional analysis.

The beneficial effects of the present invention are: (1) the OPC extraction program can write the full volume of industrial real-time data into the data warehouse; because the data is stored in the Hadoop distributed file system, which scales out easily, no data has to be discarded by sampling because of storage capacity limits; (2) the subject-oriented wide table is generated with a Hadoop-based distributed computing engine, so the modeling process can be completed in a relatively short time even with large data volumes and intensive computation; (3) because the storage capacity of the Hadoop distributed file system is very large, a certain degree of data redundancy can be tolerated in the wide table, so the data visualization system can avoid multi-table joins and provide low-latency multidimensional analysis; (4) with a Hadoop-based workflow scheduling engine, the modeling computation is fault-tolerant and stable; (5) the method has a clear flow and is highly general, and can readily be extended to the various multidimensional analysis tasks of industrial big data.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the method of the present invention;

Fig. 2 is a schematic diagram of the system architecture implementing the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to a specific embodiment, but the scope of protection of the present invention is not limited thereto:

Embodiment: As shown in Fig. 1, an efficient multidimensional analysis method for industrial big data comprises the following steps:

(1) Data is read from the OPC server and written to the Hadoop distributed file system;

The OPC extraction program reads all industrial real-time data from the OPC server according to the standard OPC protocol and then stamps it with the time of reading; this constitutes one read. Assume the data points that the OPC server can provide are key1, key2 and key3, with corresponding data value1, value2 and value3. The timestamp is denoted by timestamp, with a value of the form yyyy-MM-dd-HH-mm-ss, where y denotes the year, M the month, d the day, H the hour, m the minute and s the second. The data of one read by the OPC extraction program is described in JSON format, forming the following string:

{"key1":"value1","key2":"value2","key3":"value3","timestamp":"yyyy-MM-dd-HH-mm-ss"}
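As a minimal sketch of the record format above, the following Python builds one JSON line from a single read. The function name `build_record` and the sample values are hypothetical; the OPC read itself is not shown, because the source does not name a client library.

```python
import json
from datetime import datetime


def build_record(points, read_time):
    """Return one JSON line for a single OPC read.

    points: mapping of data point name -> value (hypothetical sample input).
    read_time: the moment at which the read was performed.
    """
    record = dict(points)
    # The patent's yyyy-MM-dd-HH-mm-ss pattern corresponds to this strftime format.
    record["timestamp"] = read_time.strftime("%Y-%m-%d-%H-%M-%S")
    return json.dumps(record, ensure_ascii=False)


line = build_record(
    {"key1": "value1", "key2": "value2", "key3": "value3"},
    datetime(2017, 3, 28, 14, 5, 9),
)
print(line)
```

Each call yields a single line with no embedded newline, which matches the one-JSON-string-per-line layout used when writing to the file system.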

The extraction interval of the OPC extraction program is on the order of seconds. The extracted data is written to the Hadoop distributed file system as one JSON string per line; the JSON strings are merged into one or more files by the hour, and the files belonging to each hour are placed in the same folder. After the OPC extraction program has written the data of one hour, it generates a file of size 0 named _SUCCESS in the corresponding folder; subsequent data processing programs can use the presence of _SUCCESS as the criterion for whether the data in the folder is complete. The extraction interval and the merge time unit can be changed according to actual business needs, but in general the extraction interval should be at least an order of magnitude smaller than the merge time unit. The high extraction frequency serves to capture as much real-time data as possible so that the physical state represented by the data is not lost; the relatively coarse merge time unit lets the distributed computing engine process the data more efficiently while still preserving the timeliness of the data.
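The hourly folder layout and the _SUCCESS marker can be sketched as follows. This is only an illustration under stated assumptions: a local temporary directory stands in for the Hadoop distributed file system, the folder name yyyy-MM-dd-HH and the file name `part-0.json` are hypothetical, and all function names are invented for the example.

```python
import tempfile
from pathlib import Path


def hour_folder(root, timestamp):
    """Map a yyyy-MM-dd-HH-mm-ss timestamp to its hourly folder (yyyy-MM-dd-HH)."""
    return Path(root) / timestamp[:13]


def append_record(root, timestamp, json_line):
    """Append one JSON line to the file of the hour the record belongs to."""
    folder = hour_folder(root, timestamp)
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / "part-0.json", "a", encoding="utf-8") as f:
        f.write(json_line + "\n")


def mark_complete(root, hour):
    """Create the empty _SUCCESS marker once the hour has been fully written."""
    (Path(root) / hour / "_SUCCESS").touch()


def is_complete(root, hour):
    """Downstream jobs treat a folder as complete only if _SUCCESS exists."""
    return (Path(root) / hour / "_SUCCESS").is_file()


root = tempfile.mkdtemp()  # local stand-in for the distributed file system
append_record(
    root,
    "2017-03-28-14-05-09",
    '{"key1": "value1", "timestamp": "2017-03-28-14-05-09"}',
)
before = is_complete(root, "2017-03-28-14")
mark_complete(root, "2017-03-28-14")
after = is_complete(root, "2017-03-28-14")
print(before, after)
```

A downstream job would call something like `is_complete` before processing an hour, so that a partially written folder is never converted.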

(2) The raw data is converted into a detail table in the Hive data warehouse;

Industrial real-time data is stored in the Hadoop distributed file system in the above JSON format because JSON is logically a natural fit for industrial real-time data and is convenient for the OPC extraction program to handle. However, JSON data is highly redundant: within the data set of a single OPC server, the data point keys are repeated many times, which wastes storage resources at large data volumes. The JSON-format OPC data is therefore treated only as a buffer layer. A buffer period is set, for example 3 months, and the workflow scheduling tool Oozie provided by the Hadoop distributed file system starts a scheduled task every day that deletes JSON files older than the buffer period from the Hadoop distributed file system.
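The selection logic of such a daily cleanup task might look like the sketch below. It assumes hourly folders named yyyy-MM-dd-HH and approximates the 3-month buffer period as 90 days; the function name `expired_hours` is hypothetical, and the Oozie scheduling and the actual deletion are not shown.

```python
from datetime import datetime, timedelta


def expired_hours(hour_names, now, buffer_days=90):
    """Return the hourly folder names whose data is older than the buffer period."""
    cutoff = now - timedelta(days=buffer_days)
    expired = []
    for name in hour_names:
        # Folder names are assumed to follow the yyyy-MM-dd-HH pattern.
        hour_start = datetime.strptime(name, "%Y-%m-%d-%H")
        if hour_start < cutoff:
            expired.append(name)
    return expired


folders = ["2016-12-01-00", "2017-03-27-23", "2017-03-28-14"]
to_delete = expired_hours(folders, now=datetime(2017, 3, 29, 0, 0, 0))
print(to_delete)
```

Only folders that start before the cutoff are selected, so data still inside the buffer period remains available for recomputation.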

After the OPC extraction program has written data to the Hadoop distributed file system, Oozie starts a scheduled task according to the merge time unit of the data; the task converts the JSON-format data into a two-dimensional detail table and loads it into the Hive data warehouse on the Hadoop platform for subsequent data processing. Converting the JSON-format OPC data into a Hive detail table takes two main steps: first, a temporary table containing only one string column is created on top of the JSON data; then Hive's JSON parsing function json_tuple is used to convert the keys in the JSON strings into column names and the values into column values. The result of converting the JSON string of step (1) into a Hive detail table is shown in Table 1:

key1	key2	key3	timestamp
value1	value2	value3	yyyy-MM-dd-HH-mm-ss

Table 1

In Table 1, the first row holds the column names, which are the names of the OPC data points plus the timestamp at which the data points were collected; the second row is the industrial real-time data actually stored in the Hadoop distributed file system. If the data is merged by a unit of time, for example one hour, the detail table will contain multiple rows for that unit; these rows are treated as one partition of the detail table, with the partition named yyyy-MM-dd-HH. Partitioning in this way makes incremental updates of the detail table convenient and also allows recomputation partition by partition after an exception occurs.
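The conversion and partitioning described above can be illustrated in plain Python. In Hive the key extraction would be done by json_tuple; the code below only emulates that behaviour for illustration, and the function names `to_detail_rows` and `partition_by_hour` as well as the sample timestamps are hypothetical.

```python
import json
from collections import defaultdict


def to_detail_rows(json_lines, columns):
    """Emulate json_tuple: pull the named keys out of each JSON line as columns."""
    rows = []
    for line in json_lines:
        record = json.loads(line)
        # A missing key becomes None, comparable to a NULL column value.
        rows.append(tuple(record.get(col) for col in columns))
    return rows


def partition_by_hour(rows, ts_index):
    """Group detail rows into partitions named yyyy-MM-dd-HH."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[ts_index][:13]].append(row)
    return dict(partitions)


columns = ["key1", "key2", "key3", "timestamp"]
lines = [
    '{"key1": "value11", "key2": "value12", "key3": "value13", "timestamp": "2017-03-28-14-00-00"}',
    '{"key1": "value21", "key2": "value22", "key3": "value23", "timestamp": "2017-03-28-14-00-30"}',
    '{"key1": "value31", "key2": "value32", "key3": "value33", "timestamp": "2017-03-28-15-00-00"}',
]
detail = to_detail_rows(lines, columns)
partitions = partition_by_hour(detail, ts_index=3)
print(sorted(partitions))
```

Because each partition holds exactly one merge unit, a failed hour can be dropped and rebuilt without touching the rest of the detail table.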

(3) The required rows and columns of the detail table are filtered out to form a fact table;

A detail table reflects all the industrial real-time data that one OPC server can provide. In practice, an OPC server may have thousands of data points, but the user cares only about the data points related to a particular subject. The rows and columns of the detail table therefore need to be filtered to generate a fact table. Column filtering excludes the data points unrelated to the subject; row filtering further narrows the value range of the data points related to the subject. In general, column filtering is indispensable, while row filtering is optional. So that the fact table can be joined with the dimension tables on the data point key in the subsequent step, the columns of the detail table are converted into rows and the rows into columns during filtering; the timestamp column is kept unchanged, and a new column is added to hold the name of the data point. As shown in Table 2, assume the detail table contains three data points key1, key2 and key3, and one partition of the detail table has three rows of data with timestamps time1, time2 and time3. Column filtering excludes the column for key3, row filtering excludes the row for time3, and the row-column conversion then generates the fact table shown in Table 3. The names of the OPC data points are now no longer column names but the key values in the fact table, while the timestamp column remains consistent with the detail table.

key1	key2	key3	timestamp
value11	value12	value13	time1
value21	value22	value23	time2
value31	value32	value33	time3

Table 2

key	value	timestamp
key1	value11	time1
key1	value21	time2
key2	value12	time1
key2	value22	time2

Table 3
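The filtering and row-column conversion from Table 2 to Table 3 can be sketched in Python as follows. The function name `to_fact_rows` and its parameters are hypothetical; in the described system this step would run on the Hadoop-based computing engine rather than in memory.

```python
def to_fact_rows(detail_rows, columns, keep_points, keep_row, ts_column="timestamp"):
    """Filter a detail table and unpivot it into (key, value, timestamp) rows.

    keep_points: data point columns related to the subject (column filter).
    keep_row: predicate on a row dict (row filter).
    """
    fact = []
    for point in keep_points:
        for values in detail_rows:
            row = dict(zip(columns, values))
            if keep_row(row):
                # The data point name moves from a column header into the key column.
                fact.append((point, row[point], row[ts_column]))
    return fact


columns = ["key1", "key2", "key3", "timestamp"]
detail_rows = [
    ("value11", "value12", "value13", "time1"),
    ("value21", "value22", "value23", "time2"),
    ("value31", "value32", "value33", "time3"),
]
fact_rows = to_fact_rows(
    detail_rows,
    columns,
    keep_points=["key1", "key2"],                      # column filter drops key3
    keep_row=lambda row: row["timestamp"] != "time3",  # row filter drops time3
)
for fact_row in fact_rows:
    print(fact_row)
```

With the Table 2 data as input, the resulting rows correspond to Table 3.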

OPC servers and detail tables are in a one-to-one relationship, whereas detail tables and fact tables are in a one-to-many relationship: one detail table can generate multiple fact tables, but a fact table can come from only one detail table. The purpose of this arrangement is to avoid multi-table joins while generating fact tables, which simplifies the flow and improves execution efficiency.

(4) The fact table and the dimension tables are joined to form a subject-oriented wide table;

The fact table is derived directly from industrial real-time data, whereas the dimension tables come from other external systems and must be imported into the Hive data warehouse manually by the user. For example, the user can edit a dimension table in a relational database and then import it into the Hive data warehouse with Sqoop, the relational data import tool provided by the Hadoop platform. In general, dimension tables change infrequently and are far smaller than fact tables. One or more fact tables, together with one or more dimension tables, can be joined on one or more columns to form a wide table for a particular subject. Typically the join columns include the data point key; during the join, multiple columns may take part in a computation according to some calculation formula, and rows or columns may be filtered. Assume a fact table as shown in Table 4 and a dimension table as shown in Table 5. The key values of both tables are the names of OPC data points, value is the value of the data point, and dimension1 and dimension2 are two dimensions related to the data point, such as workshop and process. Joining on the data point key produces the wide table shown in Table 6. A wide table is oriented to one business subject and should contain as much of the data related to that subject as possible, following the principle that redundancy is preferable to omission. Once formed, the wide table can serve multidimensional analysis externally: through Impala, the real-time SQL query tool provided by the Hadoop platform, external data visualization tools can access the data of the wide table using the standard SQL driver provided by Impala and offer users multidimensional analysis functions such as ad hoc queries and dashboards.

key	value	timestamp
key1	value1	time1
key2	value2	time1

Table 4

key	dimension1	dimension2
key1	dim11	dim21
key2	dim12	dim22
key3	dim13	dim23

Table 5

key	value	dimension1	dimension2	timestamp
key1	value1	dim11	dim21	time1
key2	value2	dim12	dim22	time1

Table 6
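The join of Table 4 and Table 5 into Table 6 can be sketched as below. The source does not state the join type; an inner join on the data point key is assumed here because it reproduces Table 6. The function names `build_wide_table` and `group_by_dimension` are hypothetical, and the grouping helper only hints at the kind of slicing a visualization tool would perform through SQL.

```python
from collections import defaultdict


def build_wide_table(fact_rows, dimension_rows):
    """Inner-join fact rows with dimension rows on the data point key (assumed join type)."""
    dims_by_key = {key: (dim1, dim2) for key, dim1, dim2 in dimension_rows}
    wide = []
    for key, value, timestamp in fact_rows:
        if key in dims_by_key:
            dim1, dim2 = dims_by_key[key]
            wide.append((key, value, dim1, dim2, timestamp))
    return wide


def group_by_dimension(wide_rows, dim_index):
    """Collect the values of the wide table per member of one dimension column."""
    groups = defaultdict(list)
    for row in wide_rows:
        groups[row[dim_index]].append(row[1])
    return dict(groups)


fact_table = [("key1", "value1", "time1"), ("key2", "value2", "time1")]
dimension_table = [
    ("key1", "dim11", "dim21"),
    ("key2", "dim12", "dim22"),
    ("key3", "dim13", "dim23"),
]
wide_table = build_wide_table(fact_table, dimension_table)
by_dimension1 = group_by_dimension(wide_table, dim_index=2)
print(wide_table)
print(by_dimension1)
```

Because the dimension columns are already materialized in each wide-table row, slicing by any dimension needs no further join at query time.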

The architecture of a system applying the method of the present invention is shown in Fig. 2.

The present invention was applied at a large non-ferrous metals manufacturing plant. By collecting real-time data on the production line, the power consumption of the plant's entire production process was analyzed in order to uncover potential causes of wasted energy, raise alerts on excessive energy consumption, and achieve visualized management of energy consumption. The specific implementation steps are as follows:

1. The OPC extraction program acquires all data points on the production line once every 30 seconds; the data is aggregated into table partitions by the hour to generate the detail table in the data warehouse.

2. The data points in the detail table related to equipment electricity use are filtered out to generate the power consumption fact table.

3. Dimension tables relating to data point name, device name, device type, workshop, process, shift and time are edited externally and imported into the data warehouse.

4. The fact table and the dimension tables are joined, applying the corresponding electric energy calculation formula, to generate a wide table whose subject is equipment power consumption; data visualization tools can then analyze power consumption on the production line in real time along different dimensions.

The above describes a specific embodiment of the present invention and the technical principles applied. Any change made according to the conception of the present invention, whose resulting function does not go beyond the spirit covered by the specification and the drawings, shall fall within the scope of protection of the present invention.

Claims

1. An efficient multidimensional analysis method for industrial big data, characterized by comprising the following steps:

(1) an OPC extraction program reads data from an OPC server and writes it to the Hadoop distributed file system;

(2) the data stored in the Hadoop distributed file system is converted into a detail table in a Hive data warehouse;

(3) the required rows and columns of the detail table are filtered out to form a fact table;

(4) the resulting fact table is joined with dimension tables to form a subject-oriented wide table, and multidimensional analysis is performed on the wide table to obtain analysis results; wherein the dimension tables come from external systems and must be imported into the Hive data warehouse manually by the user.

2. The efficient multidimensional analysis method for industrial big data according to claim 1, characterized in that: the OPC extraction program reads industrial real-time data from the OPC server according to the standard OPC protocol and stamps it with the time of reading, which constitutes one read; if the data points provided by the OPC server are key1, key2 and key3, with corresponding data value1, value2 and value3, and the timestamp is denoted by timestamp, its value is of the form yyyy-MM-dd-HH-mm-ss, where y denotes the year, M the month, d the day, H the hour, m the minute and s the second; the data of one read by the OPC extraction program is described in JSON format, forming the following string:

{"key1":"value1","key2":"value2","key3":"value3","timestamp":"yyyy-MM-dd-HH-mm-ss"}

3. The efficient multidimensional analysis method for industrial big data according to claim 2, characterized in that: the extraction interval of the OPC extraction program is on the order of seconds; the extracted data is written to the Hadoop distributed file system as one JSON string per line; the JSON strings are merged into one or more files by the hour, and the files belonging to each hour are placed in the same folder; after the OPC extraction program has written the data of one hour, it generates a file of size 0 named _SUCCESS in the corresponding folder, and the presence of _SUCCESS is the criterion for whether the data in the folder is complete.

4. The efficient multidimensional analysis method for industrial big data according to claim 1, characterized in that: the detail table in the Hive data warehouse in step (2) is obtained as follows: (a) using the workflow scheduling tool Oozie provided by the Hadoop distributed file system, JSON files older than the buffer period are deleted from the Hadoop distributed file system, the buffer period being preset;

(b) according to the merge time unit of the data, Oozie starts a scheduled task that converts the JSON-format data into a two-dimensional detail table and loads it into the Hive data warehouse on the Hadoop platform for subsequent data processing.

5. The efficient multidimensional analysis method for industrial big data according to claim 4, characterized in that: the subsequent data processing is as follows:

(I) create a temporary table containing only one string column on top of the JSON data;

(II) use Hive's JSON parsing function json_tuple to convert the keys in the JSON strings into column names and the values into column values.

6. The efficient multidimensional analysis method for industrial big data according to claim 1, characterized in that: column filtering of the detail table excludes the data points unrelated to the subject, and row filtering narrows the value range of the data points related to the subject; in step (3), filtering is performed by converting the columns of the detail table into rows and the rows into columns, keeping the timestamp column unchanged, and adding a new column that holds the name of the data point.

7. The efficient multidimensional analysis method for industrial big data according to claim 1, characterized in that: the fact table and the dimension tables are joined on one or more columns to form the subject-oriented wide table.

8. The efficient multidimensional analysis method for industrial big data according to claim 7, characterized in that: multidimensional analysis of the wide table is implemented through Impala, the real-time SQL query tool provided by the Hadoop platform; external data visualization tools access the data of the wide table through the standard SQL driver provided by Impala to complete the multidimensional analysis.