CN114416705A - Multi-source heterogeneous data fusion modeling method

Info

Publication number: CN114416705A
Application number: CN202111318577.3A
Authority: CN
Inventors: 李忱; 陈忠国; 周鑫; 江何; 门殿春; 孟繁荣; 姚志强
Original assignee: Beijing Testor Technology Co ltd; Beijing Tongtech Co Ltd
Current assignee: Beijing Testor Technology Co ltd; Beijing Tongtech Co Ltd
Priority date: 2021-11-09
Filing date: 2021-11-09
Publication date: 2022-04-29

Abstract

The invention discloses a multi-source heterogeneous data fusion modeling method and relates to the technical field of heterogeneous data processing in manufacturing. The method uses the Hibernate ORM core, with its complete JPA support, to read and write multiple databases of different types faster and in a unified manner. It applies data descriptions corresponding to the different types of original data, together with protocol parsing rules, to achieve decision-level fusion modeling of key feature data during data fusion modeling. Data are extracted through multiple protocol parsing engines and the two-dimensional relationship of the data, so that decision-level fusion of key features is achieved across data of different types, which improves fault tolerance and interference resistance. The two-dimensional relationship between the multiple protocol parsing engines and the metadata also compensates for the low accuracy that the low data precision of traditional decision-level modeling causes, so the method achieves fast and accurate decision modeling.

Description

Multi-source heterogeneous data fusion modeling method

Technical Field

The invention relates to the technical field of manufacturing heterogeneous data processing, in particular to a multi-source heterogeneous data fusion modeling method.

Background

Multi-source heterogeneous data come from multiple data sources, including data sets collected by different database systems and by different devices in operation. Data sources differ in operating system and management system, in the storage mode and logical structure of their data, and in the generation time, place of use, and coding protocol of the data, which gives the data their "multi-source" character. This is the current situation in the manufacturing industry, particularly for data generated during product manufacturing. Such data are huge in volume, rich in sources, varied in type, and complex in structure. Because the sources and storage forms of data differ between departments and systems within a manufacturer, the data sources are also heterogeneous, distributed, and autonomous. The data types include structured data such as numeric and relational data as well as unstructured data such as images and audio. Modeling the production data after the whole process allows them to be displayed more intuitively and facilitates decision-making and deployment.

Because of the multi-source character of the data, the quality of the acquired data is difficult to guarantee during data integration. Missing, erroneous, inconsistent, and other invalid data that do not meet the specification are common, and the formats of data from different systems are not uniform, all of which makes effective analysis of the data difficult. An efficient processing and integration means is therefore needed to improve the integration efficiency of the various heterogeneous data. For decision-level modeling, traditional multi-source heterogeneous data suffer a certain degree of data loss during data fusion, which affects the accuracy of the model during feature extraction, so the modeling content cannot be controlled accurately while fast decision-making is achieved.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides a multi-source heterogeneous data fusion modeling method. By using the core of Hibernate ORM and its complete JPA support, the method reads and writes multiple databases of different types faster and in a unified manner and ensures the stability and efficiency of the read-write process. Data cleaning improves the overall quality of the data and ensures that the workload of the data conversion process is spent on valid data, thereby increasing real-time data processing speed and improving data integration efficiency.

To achieve this purpose, the invention provides the following technical scheme: a multi-source heterogeneous data fusion modeling method comprising a data acquisition process, a data integration process, and a data analysis process, and specifically comprising the following steps:

Step one: in the data acquisition process, the original data are acquired accurately and in real time to provide an original data source for the data integration stage; a data description is made of the original data source, and the corresponding multiple protocol parsing engines are established.

Step two: according to the various data sources, HBase and NoSQL databases are used for distributed storage of the data from each subsystem.

Step three: Hibernate OGM is loaded and a unified HBase and NoSQL database access model is established on it, so that the two databases are read and written under the same framework according to unified rules, completing integrated data access.

Step four: error data are processed by same-class mean interpolation: the standard deviation method of statistical analysis is first used to identify suspected error values, and the identified error data are cleared, completing the screening of the data.

Step five: after cleaning, the data are screened, processed, and converted through Extract-Transform-Load and then loaded into a data warehouse model for storage.

Step six: the data in the data warehouse model are extracted and analyzed with a parallel FP-Growth algorithm, the associated information is marked, and it is imported into the corresponding modeling algorithm.

As a further scheme of the invention: the HBase and NoSQL databases in step two may be replaced by any of MySQL, Oracle, DB2, SQL Server, Redis, HBase, MongoDB, and Neo4j.

As a further scheme of the invention: the Extract-Transform-Load data warehouse technology in step five includes Datastage, Informatica, and Kettle.

As a further scheme of the invention: the distributed storage in step two keeps a hash-table-based index structure in memory, that is, the hash table stores the position index of the data on disk, and the disk stores the actual contents of the primary key and the value.

As a further scheme of the invention: while the data are screened in step four, potential errors in inconsistent data are detected and repaired based on the consistency between associated data, completing the cleaning of data from multiple data sources.

As a further scheme of the invention: the original data source includes multiple kinds of heterogeneous data information, and the data description of the original data source combines the extraction of key feature data with a protocol parsing rule.

As a further scheme of the invention: the multiple protocol parsing engines use the listening, pulling, and crawling modes of the relevant protocols, for the protocols configured in the data description, to establish a two-dimensional relationship after parsing the data and place it in a message queue; entries in the message queue are then stored in order into the corresponding HBase and NoSQL databases.

As a further scheme of the invention: a fixed-capacity recycle-bin temporary storage strategy is applied to the error data cleared in step four.

The invention has the beneficial effects that:

1. The invention uses the core of Hibernate ORM, with its complete JPA support, to read and write multiple databases of different types faster and in a unified manner while ensuring stability and efficiency. Data cleaning improves the overall quality of the data and ensures that the workload of the data conversion process is spent on valid data, thereby increasing real-time data processing speed and improving data integration efficiency. Data descriptions corresponding to the different types of original data describe the data features directly, and protocol parsing rules enable decision-level fusion modeling of key feature data during data fusion modeling. Detailed data are extracted by listening, pulling, and crawling through the multiple protocol parsing engines and the two-dimensional relationship of the data, so that decision-level fusion of key features is achieved across data of different types. This reduces the amount of computation to a certain extent and improves fault tolerance and interference resistance. The two-dimensional relationship between the multiple protocol parsing engines and the metadata also compensates for the low modeling accuracy that the low data precision of traditional decision-level modeling causes, so fast and accurate decision-level modeling is achieved.

2. The invention uses HBase and NoSQL databases for distributed storage of the data from each subsystem and adopts a hash-table-based index structure, in which the hash table stores the position index of the data on disk. Companies in multiple aggregation intervals can therefore remain unchanged at the physical level while multi-source heterogeneous data are retrieved at the software level. Together with the unified HBase and NoSQL database access model, this provides integrated retrieval authority over, and integration of, the data as a whole. Combining the standard deviation method of statistical analysis for identifying suspected error values with same-class mean interpolation for processing error data has a marked effect on overall data quality. After this preliminary quality improvement, an Extract-Transform-Load tool processes the required data further, identifying key data as a whole according to their key features, and the data are stored in and called from the unified data warehouse model. This prepares the modeling data, increases data access speed during modeling, and ensures that dirty data are handled.

Drawings

FIG. 1 is a schematic diagram of the overall architecture of the present invention;

FIG. 2 is a schematic block diagram of the system of the present invention;

FIG. 3 is a block diagram of the process of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1:

A multi-source heterogeneous data fusion modeling method comprises a data acquisition process, a data integration process, and a data analysis process, and specifically comprises the following steps:

Step four: error data are processed by same-class mean interpolation: the standard deviation method of statistical analysis is first used to identify suspected error values, and the identified error data are cleared, completing the screening of the data.

Step five: after cleaning, the data are screened, processed, and converted through an Extract-Transform-Load tool and then loaded into a data warehouse model for storage.

With the standard deviation method, the mean and standard deviation of a given sample are calculated, and the cut-off for distinguishing abnormal values is set at a chosen number of standard deviations from the mean. Values beyond the resulting lower and upper limits are treated as abnormal values. This identifies the error values, facilitates data cleaning, and improves data quality.
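The screening of step four can be illustrated with a minimal Python sketch. It is only one possible reading of the text: the function names, the `(class_label, value)` record layout, and the cut-off `k` are assumptions made for illustration, and replacing a flagged value with the mean of the remaining values of the same class is an assumed interpretation of same-class mean interpolation.

```python
from statistics import mean, stdev


def find_outliers(values, k=2.0):
    """Return indices of values more than k standard deviations from the mean."""
    if len(values) < 2:
        return []  # a sample standard deviation needs at least two values
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    lower, upper = mu - k * sigma, mu + k * sigma
    return [i for i, v in enumerate(values) if v < lower or v > upper]


def impute_with_class_mean(records, k=2.0):
    """records is a list of (class_label, value) pairs (hypothetical layout).

    Within each class, flagged values are replaced by the mean of the
    remaining, unflagged values of that same class.
    """
    by_class = {}
    for idx, (label, value) in enumerate(records):
        by_class.setdefault(label, []).append((idx, value))

    cleaned = list(records)
    for label, items in by_class.items():
        values = [v for _, v in items]
        bad = set(find_outliers(values, k))
        good = [v for j, v in enumerate(values) if j not in bad]
        if not bad or not good:
            continue  # nothing to repair, or nothing to repair it with
        fill = mean(good)
        for j in bad:
            cleaned[items[j][0]] = (label, fill)
    return cleaned


# Usage: one temperature reading is far from the rest of its class.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 95.0]
cleaned = impute_with_class_mean([("temperature", v) for v in readings])
```

The example uses `k=2.0` because, with only seven readings, no single value can lie three sample standard deviations from the mean; larger samples can use a wider cut-off.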

In other embodiments, the HBase and NoSQL databases in step two may be replaced by any of MySQL, Oracle, DB2, SQL Server, Redis, HBase, MongoDB, and Neo4j. Because the databases in step two are selectable and replaceable among various database types, the method can suit the data that different manufacturing industries need to store, and the most appropriate storage mode can be selected, which improves the compatibility of the method.
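The idea of reading and writing replaceable databases under one set of rules can be sketched as follows. Hibernate OGM is a Java library, and this Python sketch does not reproduce its API; the `RecordStore` interface, the two in-memory backends, and the `UnifiedAccess` router are hypothetical stand-ins that only illustrate the design of a single access model over interchangeable stores.

```python
import json
from typing import Any, Dict, Optional, Protocol


class RecordStore(Protocol):
    """The unified rule: every backend offers the same put/get operations."""

    def put(self, key: str, record: Dict[str, Any]) -> None: ...

    def get(self, key: str) -> Optional[Dict[str, Any]]: ...


class InMemoryWideColumnStore:
    """Stand-in for a wide-column backend: one cell per field of a row."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, record: Dict[str, Any]) -> None:
        self._rows.setdefault(key, {}).update(record)

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        row = self._rows.get(key)
        return dict(row) if row is not None else None


class InMemoryDocumentStore:
    """Stand-in for a document backend: each record is kept as a JSON string."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def put(self, key: str, record: Dict[str, Any]) -> None:
        self._docs[key] = json.dumps(record)

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        raw = self._docs.get(key)
        return json.loads(raw) if raw is not None else None


class UnifiedAccess:
    """Routes each data source to its backend; callers never see the difference."""

    def __init__(self, routes: Dict[str, RecordStore]) -> None:
        self._routes = routes

    def write(self, source: str, key: str, record: Dict[str, Any]) -> None:
        self._routes[source].put(key, record)

    def read(self, source: str, key: str) -> Optional[Dict[str, Any]]:
        return self._routes[source].get(key)


# Usage: swapping a backend only changes the routing table.
access = UnifiedAccess({
    "sensors": InMemoryWideColumnStore(),
    "orders": InMemoryDocumentStore(),
})
access.write("sensors", "s1", {"temp": 20.1, "unit": "C"})
access.write("orders", "o1", {"part": "P1", "qty": 3})
```

Because callers depend only on the shared interface, replacing one database type with another does not change the code that reads and writes data.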

In other embodiments, the Extract-Transform-Load data warehouse technology in step five includes Datastage, Informatica, and Kettle. Screening and converting the data through Extract-Transform-Load processes them further on the basis of step four and further improves their quality. Loading the processed data into the same data warehouse model ensures a good data integration effect and lets the modeling process read the data directly, which safeguards modeling speed and quality.
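The extract, transform, and load stages can be pictured with a small hand-written Python sketch. It does not use Datastage, Informatica, or Kettle; the row layout, the `fact_measurement` table, and the in-memory SQLite database standing in for the data warehouse model are all assumptions for illustration.

```python
import sqlite3


def extract(rows):
    """Extract: hand over the cleaned source rows one by one."""
    yield from rows


def transform(rows):
    """Transform: drop rows without a value and normalize names and types."""
    for row in rows:
        if row.get("value") is None:
            continue  # screened out
        yield (row["source"], row["metric"].strip().lower(), float(row["value"]))


def load(conn, rows):
    """Load: write the converted rows into a single warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_measurement "
        "(source TEXT, metric TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO fact_measurement VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn, raw_rows):
    load(conn, transform(extract(raw_rows)))


# Usage with an in-memory database as the warehouse stand-in.
warehouse = sqlite3.connect(":memory:")
run_etl(warehouse, [
    {"source": "line-1", "metric": " Temp ", "value": "20.5"},
    {"source": "line-1", "metric": "RPM", "value": None},
    {"source": "line-2", "metric": "temp", "value": 21},
])
```

Every source ends up in the same table with the same column types, which is what lets a later modeling step read the data directly.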

In other embodiments, the distributed storage in step two keeps a hash-table-based index structure in memory: the hash table stores the position index of the data on disk, and the disk stores the actual contents of the primary key and the value. With distributed storage, a distribution can be selected for each type of data while the data indexes are unified. A hash storage engine is used: old data and deletion operations are merged periodically so that only the latest data are retained, and an index record is also kept on disk. The index record is generated during the periodic merge, and after a power failure the index is rebuilt in memory directly from it, which ensures data security.
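A hash-indexed storage engine of this kind can be sketched in Python as an append-only data file plus an in-memory dictionary of byte offsets. The class name, the JSON line format, and the use of `None` as a deletion marker are assumptions for illustration. As a simplification, this sketch rebuilds the index by rescanning the data file rather than from a separate on-disk index record.

```python
import json
import os
import tempfile


class HashIndexStore:
    """Append-only data file on disk plus an in-memory hash index of offsets."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the latest record for the key
        open(path, "ab").close()  # create the data file if it is missing
        self._rebuild_index()

    def _rebuild_index(self):
        """Scan the data file once to restore the index, e.g. after a restart."""
        self.index.clear()
        offset = 0
        with open(self.path, "rb") as f:
            for line in f:
                key, value = json.loads(line)
                if value is None:
                    self.index.pop(key, None)  # deletion marker
                else:
                    self.index[key] = offset
                offset += len(line)

    def _append(self, key, value):
        line = (json.dumps([key, value]) + "\n").encode("utf-8")
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(line)
        return offset

    def put(self, key, value):
        if value is None:
            raise ValueError("None is reserved as the deletion marker")
        self.index[key] = self._append(key, value)

    def delete(self, key):
        if key in self.index:
            self._append(key, None)
            del self.index[key]

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)  # one seek, one read: the index gives the position
            _, value = json.loads(f.readline())
        return value

    def compact(self):
        """Merge step: keep only the latest live record for each key."""
        live = {key: self.get(key) for key in self.index}
        tmp_path = self.path + ".compact"
        new_index = {}
        with open(tmp_path, "wb") as out:
            for key, value in live.items():
                new_index[key] = out.tell()
                out.write((json.dumps([key, value]) + "\n").encode("utf-8"))
        os.replace(tmp_path, self.path)
        self.index = new_index


# Usage: overwritten and deleted records disappear from disk after compact().
store = HashIndexStore(os.path.join(tempfile.mkdtemp(), "data.log"))
store.put("sensor-1", {"temp": 20.1})
store.put("sensor-1", {"temp": 20.4})
store.compact()
```

Reads cost one dictionary lookup and one disk seek, while writes are sequential appends; the periodic merge is what keeps the data file from growing without bound.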

In other embodiments, while the data are screened in step four, potential errors in inconsistent data are detected and repaired based on the consistency between associated data, completing the cleaning of data from multiple data sources. Judging and repairing possible errors based on the consistency between associated data allows the data to be matched and sorted during step four, which increases the speed of data integration.
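The text does not state which consistency rule is applied, so the following Python sketch assumes a simple one: records that share a key are treated as associated, and a field value that disagrees with a strict majority of its associated records is repaired to the majority value. The function name and record layout are hypothetical.

```python
from collections import Counter


def repair_by_consistency(records, key_field, check_fields):
    """Repair fields that disagree with the majority of associated records.

    Returns the repaired records and a log of (key, field, old, new) repairs.
    """
    groups = {}
    for rec in records:
        groups.setdefault(rec[key_field], []).append(rec)

    repaired, log = [], []
    for rec in records:
        fixed = dict(rec)
        for field in check_fields:
            counts = Counter(
                r[field] for r in groups[rec[key_field]]
                if r.get(field) is not None
            )
            if not counts:
                continue
            top, n = counts.most_common(1)[0]
            # Repair only when a strict majority of associated records agree.
            if n * 2 > sum(counts.values()) and fixed.get(field) != top:
                log.append((rec[key_field], field, fixed.get(field), top))
                fixed[field] = top
        repaired.append(fixed)
    return repaired, log


# Usage: three sources describe part P1; one of them has a typo.
rows = [
    {"part": "P1", "material": "steel", "source": "erp"},
    {"part": "P1", "material": "steel", "source": "mes"},
    {"part": "P1", "material": "stee1", "source": "plm"},
    {"part": "P2", "material": "aluminum", "source": "erp"},
]
fixed_rows, repair_log = repair_by_consistency(rows, "part", ["material"])
```

Keeping a repair log alongside the repaired records makes each automatic correction traceable to the source value it replaced.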

In other embodiments, the original data source includes multiple kinds of heterogeneous data information, and the data description of the original data source combines the extraction of key feature data with a protocol parsing rule. By pairing key feature data with the protocol parsing rule, the original data can be represented simply by their key feature data, and together with the index provided by the two-dimensional relationship, processing can operate on the key feature data. During modeling, the original data are brought in and completed through the index, so decision-level modeling is reached quickly and accurate modeling follows once the data index is complete.

In other embodiments, the multiple protocol parsing engines use the listening, pulling, and crawling modes of the relevant protocols, for the protocols configured in the data description, to establish a two-dimensional relationship after parsing the data and place it in a message queue; entries in the message queue are then stored in order into the corresponding HBase and NoSQL databases. Establishing the two-dimensional relationship after parsing allows the index between feature data and original data to be completed through listening, pulling, and crawling.
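The flow from data description to parsing engine to message queue to database can be sketched in Python. Several things here are assumptions: the two-dimensional relationship is interpreted as a link between key features and the identifier of the original record, the `json` and `kv` protocols and the source names are invented, and the listening, pulling, and crawling acquisition modes are not modeled (messages are passed in directly).

```python
import json
import queue
from dataclasses import dataclass


@dataclass
class Relation:
    """Links extracted key features back to the original record they came from."""
    source: str
    raw_id: str
    features: dict


def parse_json(payload, feature_keys):
    doc = json.loads(payload)
    return {k: doc[k] for k in feature_keys if k in doc}


def parse_key_value(payload, feature_keys):
    # Payload format assumed here: "temp=20.5;rpm=1500"
    pairs = dict(part.split("=", 1) for part in payload.split(";") if "=" in part)
    return {k: pairs[k] for k in feature_keys if k in pairs}


# One parsing engine per protocol (hypothetical protocol names).
ENGINES = {"json": parse_json, "kv": parse_key_value}

# Data description: protocol parsing rule plus key features per source.
DESCRIPTIONS = {
    "plc-line-1": {"protocol": "kv", "features": ["temp", "rpm"], "store": "hbase"},
    "mes-api": {"protocol": "json", "features": ["order_id", "status"], "store": "document"},
}


def ingest(messages, descriptions, out_queue):
    """Parse each (source, raw_id, payload) message and enqueue its relation."""
    for source, raw_id, payload in messages:
        desc = descriptions[source]
        features = ENGINES[desc["protocol"]](payload, desc["features"])
        out_queue.put(Relation(source, raw_id, features))


def drain(in_queue, descriptions, stores):
    """Store queued relations, in arrival order, into the configured store."""
    while True:
        try:
            rel = in_queue.get_nowait()
        except queue.Empty:
            break
        stores[descriptions[rel.source]["store"]][rel.raw_id] = rel.features


# Usage: dictionaries stand in for the two databases.
pending = queue.Queue()
ingest([("plc-line-1", "r1", "temp=20.5;rpm=1500")], DESCRIPTIONS, pending)
databases = {"hbase": {}, "document": {}}
drain(pending, DESCRIPTIONS, databases)
```

The queue decouples parsing from storage, so a slow database write does not block the parsing engines.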

In other embodiments, a fixed-capacity recycle-bin temporary storage strategy is applied to the error data cleared in step four. Under this strategy, cleared error data are held temporarily in the recycle bin, and once the bin is full its contents are purged in chronological order. This prevents mistaken deletions from becoming unrecoverable and improves the fault tolerance of the overall operation.
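A fixed-capacity recycle bin with oldest-first purging can be sketched in a few lines of Python. The class and method names are hypothetical; the sketch only assumes that each cleared record has a key and that capacity is counted in records.

```python
from collections import OrderedDict


class RecycleBin:
    """Holds cleared records up to a fixed capacity, purging the oldest first."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # insertion order is chronological order

    def discard(self, key, record):
        """Move a cleared record into the bin; return any records purged to make room."""
        self._items.pop(key, None)  # re-discarding a key refreshes its age
        self._items[key] = record
        purged = []
        while len(self._items) > self.capacity:
            purged.append(self._items.popitem(last=False))  # oldest entry
        return purged

    def restore(self, key):
        """Take a record back out of the bin, or return None if it was purged."""
        return self._items.pop(key, None)

    def __len__(self):
        return len(self._items)


# Usage: a bin of two records; the third discard purges the oldest one.
trash = RecycleBin(capacity=2)
trash.discard("r1", {"temp": 95.0})
trash.discard("r2", {"temp": -40.0})
purged = trash.discard("r3", {"temp": 300.0})
recovered = trash.restore("r2")
```

A record flagged in error can be restored as long as it has not yet been purged, which is the fault-tolerance benefit described above.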

Example 2:

Step one: in the data acquisition process, the original data are acquired accurately and in real time to provide an original data source for the data integration stage.

Step two: according to the various data sources, HBase and NoSQL databases are used for distributed storage of the data from each subsystem; a data description is made of the original data source, and the corresponding multiple protocol parsing engines are established.

Step three: error data are processed by same-class mean interpolation: the standard deviation method of statistical analysis is first used to identify suspected error values, and the identified error data are cleared, completing the screening of the data.

Step four: after cleaning, the data are screened, processed, and converted through Extract-Transform-Load and then loaded into a data warehouse model for storage.

Step five: the data in the data warehouse model are extracted and analyzed with a parallel FP-Growth algorithm, the associated information is marked, and it is imported into the corresponding modeling algorithm.

The Extract-Transform-Load data warehouse technology in step five includes Datastage, Informatica, and Kettle.

The distributed storage in step two keeps a hash-table-based index structure in memory, that is, the hash table stores the position index of the data on disk, and the disk stores the actual contents of the primary key and the value.

While the data are screened in step four, potential errors in inconsistent data are detected and repaired based on the consistency between associated data, completing the cleaning of data from multiple data sources.

Example 3:

Step two: according to the various data sources, HBase and NoSQL databases are used for distributed storage of the data from each subsystem.

Step three: Hibernate OGM is loaded and a unified HBase and NoSQL database access model is established on it, so that the two databases are read and written under the same framework according to unified rules, completing integrated data access.

Step four: the data are screened, processed, and converted through Extract-Transform-Load and then loaded into a data warehouse model for storage.

The HBase and NoSQL databases in step two may be replaced by any of MySQL, Oracle, DB2, SQL Server, Redis, HBase, MongoDB, and Neo4j.

In conclusion, comparing the embodiments shows that Hibernate OGM and the distributed database storage mode work together, so that the distributed database makes reading and storing data more convenient while keeping the data uniform. Same-class mean interpolation, combined with data cleaning and repair based on the consistency between associated data, improves data quality while ensuring efficient fusion of the data and making them easy for a modeling algorithm to read directly.

Finally, it should be noted that, although the invention has been described in detail with reference to the general description and the specific embodiments, the above embodiments serve only to illustrate the technical solutions of the invention, not to limit them. Those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features equivalently replaced, and that such modifications or substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A multi-source heterogeneous data fusion modeling method, characterized by comprising the processes of data acquisition, data integration, and data analysis, and specifically comprising the following steps:

step one: in the data acquisition process, the original data are acquired accurately and in real time to provide an original data source for the data integration stage; a data description is made of the original data source, and the corresponding multiple protocol parsing engines are established;

step two: according to the various types of data sources, HBase and NoSQL databases are used for distributed storage of the data from each subsystem;

step three: Hibernate OGM is loaded and a unified HBase and NoSQL database access model is established on it, so that the two databases are read and written under the same framework according to unified rules, completing integrated data access;

step four: error data are processed by same-class mean interpolation: the standard deviation method of statistical analysis is first used to identify suspected error values, and the identified error data are cleared, completing the screening of the data;

step five: after cleaning, the data are screened, processed, and converted through an Extract-Transform-Load tool and then loaded into a data warehouse model for storage.

2. The multi-source heterogeneous data fusion modeling method according to claim 1, characterized in that: the HBase and NoSQL databases in step two may be replaced by any of MySQL, Oracle, DB2, SQL Server, Redis, HBase, MongoDB, and Neo4j.

3. The multi-source heterogeneous data fusion modeling method according to claim 1, characterized in that: the Extract-Transform-Load tool in step five is any one of Datastage, Informatica, and Kettle.

4. The multi-source heterogeneous data fusion modeling method according to claim 1, characterized in that: the distributed storage in step two keeps a hash-table-based index structure in memory, that is, the hash table stores the position index of the data on disk, and the disk stores the actual contents of the primary key and the value.

5. The multi-source heterogeneous data fusion modeling method according to claim 1, characterized in that: while the data are screened in step four, potential errors in inconsistent data are detected and repaired based on the consistency between associated data, completing the cleaning of data from multiple data sources.

6. The multi-source heterogeneous data fusion modeling method according to claim 1, characterized in that: the original data source includes multiple kinds of heterogeneous data information, and the data description of the original data source combines the extraction of key feature data with a protocol parsing rule.

7. The multi-source heterogeneous data fusion modeling method according to claim 1, characterized in that: the multiple protocol parsing engines use the listening, pulling, and crawling modes of the relevant protocols, for the protocols configured in the data description, to establish a two-dimensional relationship after parsing the data and place it in a message queue; entries in the message queue are then stored in order into the corresponding HBase and NoSQL databases.

8. The multi-source heterogeneous data fusion modeling method according to claim 1, characterized in that: a fixed-capacity recycle-bin temporary storage strategy is applied to the error data cleared in step four.