CN111666263A - Method for realizing heterogeneous data management in data lake environment

Info

Publication number
CN111666263A
Authority
CN
China
Prior art keywords
data
pool
information
heterogeneous
data pool
Prior art date
Legal status
Pending
Application number
CN202010399269.7A
Other languages
Chinese (zh)
Inventor
吴奇锋
王燕
王明
高振宇
Current Assignee
iReadyIT Beijing Co Ltd
Original Assignee
iReadyIT Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by iReadyIT Beijing Co Ltd
Priority to CN202010399269.7A
Publication of CN111666263A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2291 User-Defined Types; Storage management thereof
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0643 Management of files
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools

Abstract

The invention discloses a method for realizing heterogeneous data management in a data lake environment, comprising the following steps: S1, dividing the data lake to provide an original data pool, wherein the original data pool comprises an analog data pool, an application data pool, an object data pool and a document data pool; S2, uploading all heterogeneous data information collected from each system to the original data pool, where a classification program classifies the uploaded heterogeneous data information; S3, the original data pool transmits the classified heterogeneous data information to the analog data pool, the application data pool, the object data pool and the document data pool, respectively, for storage; S4, connecting the data transmission ports of the analog data pool, the application data pool, the object data pool and the document data pool to the data search port of the data lake system, and searching for heterogeneous data information in the data lake through that data search port.

Description

Method for realizing heterogeneous data management in data lake environment
Technical Field
The invention relates to the field of data processing, in particular to a method for realizing heterogeneous data management in a data lake environment.
Background
With the popularization of big data applications, the variety and volume of data that must be managed keep growing. The data comprise not only traditional structured data but also unstructured data such as text, images and video, as well as secondary data extracted and mined from them. In addition, data sources are becoming more diverse. For example, the description of a device's working condition includes both the time series data collected by sensors on the device and the inspection and overhaul records that users enter into the system. Using such multi-source heterogeneous data poses a great challenge to existing data management work, so a method for managing this kind of big data is needed.
Disclosure of Invention
The invention aims to solve the above problem by providing a method for realizing heterogeneous data management in a data lake environment, so that heterogeneous data can be managed effectively.
To achieve this purpose, the technical solution of the invention is as follows:
A method for realizing heterogeneous data management in a data lake environment comprises the following steps:
S1, dividing the data lake to provide an original data pool for storing raw data; the original data pool comprises an analog data pool for storing monitoring data, an application data pool for storing temporary data generated when an application is executed, an object data pool for storing text, images, audio and video, and a document data pool for storing data that cannot be classified into the analog data pool, the application data pool or the object data pool;
S2, uploading all heterogeneous data information collected from each system to the original data pool, where a classification program classifies the uploaded heterogeneous data information;
S3, the original data pool transmits the classified heterogeneous data information to the analog data pool, the application data pool, the object data pool and the document data pool, respectively, for storage;
S4, connecting the data transmission ports of the analog data pool, the application data pool, the object data pool and the document data pool to the data search port of the data lake system, and searching for heterogeneous data information in the data lake through that data search port.
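Steps S1 to S4 can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the `Pool` enum, the `DataLake` class, its method names, and the use of plain strings as data items are hypothetical and do not come from the patent, which does not specify an implementation.

```python
from enum import Enum
from typing import Callable, Dict, List


class Pool(Enum):
    """The four sub-pools named in step S1."""
    ANALOG = "analog"
    APPLICATION = "application"
    OBJECT = "object"
    DOCUMENT = "document"


class DataLake:
    """Toy model: one original pool feeding four sub-pools (hypothetical design)."""

    def __init__(self, classify: Callable[[str], Pool]) -> None:
        self.classify = classify        # S2: pluggable classification program
        self.original: List[str] = []   # S1: original data pool (landing area)
        self.pools: Dict[Pool, List[str]] = {p: [] for p in Pool}

    def upload(self, item: str) -> None:
        """S2: every collected item lands in the original pool first."""
        self.original.append(item)

    def distribute(self) -> None:
        """S3: move each item from the original pool to its classified sub-pool."""
        while self.original:
            item = self.original.pop(0)
            self.pools[self.classify(item)].append(item)

    def search(self, keyword: str) -> Dict[Pool, List[str]]:
        """S4: one search entry point that reaches all four sub-pools."""
        return {
            pool: [item for item in items if keyword in item]
            for pool, items in self.pools.items()
        }
```

Passing the classifier in as a function keeps the pool layout independent of whatever classification rules a real deployment would use.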
Further, classifying the uploaded data information by the classification program in step S2 comprises the following steps:
S21, the classification program analyzes the format of the uploaded heterogeneous data information and judges from that format whether the information is in a temporary file format; if so, the information is classified into the application data pool;
S22, judging whether the remaining heterogeneous data information is text, image, audio or video data; if so, it is classified into the object data pool;
S23, judging whether the remaining heterogeneous data information is monitoring data; if so, it is classified into the analog data pool; otherwise, it is classified into the document data pool.
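The order of the checks in S21 to S23 matters, because each step only examines what the previous step left over. A minimal sketch of that cascade follows. It assumes the format can be read from a file name suffix, and the suffix sets are hypothetical examples; the patent does not enumerate formats or say how monitoring data is recognized.

```python
from pathlib import PurePath

# Hypothetical suffix sets, chosen only for illustration.
TEMP_SUFFIXES = {".tmp", ".temp", ".swp"}
OBJECT_SUFFIXES = {".txt", ".jpg", ".png", ".mp3", ".wav", ".mp4"}
MONITORING_SUFFIXES = {".log", ".csv"}


def classify(filename: str) -> str:
    """Return the name of the pool a file would be routed to."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in TEMP_SUFFIXES:        # S21: temporary file format
        return "application"
    if suffix in OBJECT_SUFFIXES:      # S22: text, image, audio or video
        return "object"
    if suffix in MONITORING_SUFFIXES:  # S23: monitoring data
        return "analog"
    return "document"                  # S23: everything that remains
```

For instance, `classify("cache.TMP")` yields `"application"`, while a file with no recognized suffix falls through to `"document"`.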
Further, searching for heterogeneous data information in the data lake through the data search port of the data lake system in step S4 comprises the following steps:
S41, determining, by a retrieval program, the format information of the heterogeneous data information to be searched for;
S42, determining from the format information the data pool category to which that information belongs;
S43, searching, by the retrieval program, for the information in the data pool of that category.
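The point of S41 to S43 is that a query is routed to a single pool instead of scanning the whole lake. A minimal sketch, assuming exact-name lookup and a hypothetical suffix-to-pool table (`POOL_BY_SUFFIX`, `pool_for` and `search` are illustrative names, not part of the patent):

```python
from typing import Dict, List

# Hypothetical mapping from file format to pool category.
POOL_BY_SUFFIX = {
    ".tmp": "application",
    ".jpg": "object",
    ".mp4": "object",
    ".csv": "analog",
}


def pool_for(name: str) -> str:
    """S41 + S42: derive the format from the name, then map it to a pool."""
    dot = name.rfind(".")
    suffix = name[dot:].lower() if dot != -1 else ""
    return POOL_BY_SUFFIX.get(suffix, "document")


def search(pools: Dict[str, List[str]], name: str) -> List[str]:
    """S43: look only in the pool the format maps to."""
    target = pool_for(name)
    return [item for item in pools.get(target, []) if item == name]
```

Because only one pool is consulted, an item that was filed in the wrong pool would not be found; the sketch relies on classification and search using the same format rules.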
Compared with the prior art, the invention has the following advantages and positive effects:
The original data pool is divided into the analog data pool, the application data pool, the object data pool and the document data pool. The original data pool receives the raw data information uploaded to the data lake and classifies it into these four pools, so that the raw data information in the data lake is partitioned quickly. The data search port of the data lake system can then quickly find the required data information by looking in the appropriate classified data pool, which effectively improves the management of heterogeneous data information in the data lake.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a block diagram of the framework of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived from the embodiments of the present invention by a person skilled in the art without any creative effort, should be included in the protection scope of the present invention.
As shown in FIG. 1, the method for realizing heterogeneous data management in a data lake environment in this embodiment comprises the following steps:
S1, dividing the data lake to provide an original data pool for storing raw data; the original data pool comprises an analog data pool for storing monitoring data, an application data pool for storing temporary data generated when an application is executed, an object data pool for storing text, images, audio and video, and a document data pool for storing data that cannot be classified into the analog data pool, the application data pool or the object data pool;
S2, uploading all heterogeneous data information collected from each system to the original data pool, where a classification program classifies the uploaded heterogeneous data information;
S21, the classification program analyzes the format of the uploaded heterogeneous data information and judges from that format whether the information is in a temporary file format; if so, the information is classified into the application data pool;
S22, judging whether the remaining heterogeneous data information is text, image, audio or video data; if so, it is classified into the object data pool;
S23, judging whether the remaining heterogeneous data information is monitoring data; if so, it is classified into the analog data pool; otherwise, it is classified into the document data pool;
S3, the original data pool transmits the classified heterogeneous data information to the analog data pool, the application data pool, the object data pool and the document data pool, respectively, for storage;
S4, connecting the data transmission ports of the analog data pool, the application data pool, the object data pool and the document data pool to the data search port of the data lake system, and searching for heterogeneous data information in the data lake through that data search port;
S41, determining, by a retrieval program, the format information of the heterogeneous data information to be searched for;
S42, determining from the format information the data pool category to which that information belongs;
S43, searching, by the retrieval program, for the information in the data pool of that category.
A data lake is a large warehouse that stores a wide variety of raw data of an enterprise, where the data is available for access, processing, analysis, and transmission. The data lake obtains raw data from multiple data sources of the enterprise, and for different purposes, there may be multiple copies of the same raw data that satisfy a particular internal model format. Thus, the data processed in the data lake may be any type of information, from structured data to completely unstructured data.
The data lake sorts raw data into different data pools; within each data pool, the data are then integrated and converted into a uniform storage format that is easy to analyze. This makes it much more convenient for users to analyze and use the data, thereby generating economic benefit.
The data pools are mainly used for storing data. The pools and the data they hold are as follows:
1. Original data pool
The original data pool is a single pool used for storing a large amount of raw data. Without any processing, it is difficult to extract the desired data from it.
2. Analog data pool
The analog data pool is dedicated to storing analog data, which are mainly data generated by mechanical equipment, generally measurements such as temperature and humidity. Such data are typically stored as records or logs.
3. Application data pool
Application data is primarily data generated when an application or transaction is executed, such as sales data, payment data, manufacturing process control data, and the like. Such a data pool is exclusively responsible for storing application data.
4. Object data pool
As the name implies, the object data pool is responsible for storing object data. The raw data may be text data of different sources and forms, as well as audio recordings, mail, and even data generated by some physical devices. Within this data pool, data can be stored by sentiment classification: different sentiment types are preset in the data pool, and when new text, audio, video or pictures enter the pool, their sentiment is determined from context, tags marking the attributes of each object are generated, and the object is stored under the corresponding type.
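The tag-and-store idea can be sketched as follows. The patent does not say how sentiment is determined, so the cue-word lookup here is a deliberately naive stand-in, and `SENTIMENT_CUES`, `tag_sentiment` and `ObjectPool` are hypothetical names introduced only for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, List

# Hypothetical preset sentiment types and cue words.
SENTIMENT_CUES = {
    "positive": {"good", "great", "excellent"},
    "negative": {"bad", "fault", "failure"},
}


def tag_sentiment(context: str) -> str:
    """Pick the first preset type whose cue words appear in the context."""
    words = set(context.lower().split())
    for label, cues in SENTIMENT_CUES.items():
        if words & cues:
            return label
    return "neutral"


class ObjectPool:
    """Stores object names grouped by their sentiment tag."""

    def __init__(self) -> None:
        self.by_sentiment: DefaultDict[str, List[str]] = defaultdict(list)

    def store(self, name: str, context: str) -> str:
        """Tag the object from its context and file it under that type."""
        label = tag_sentiment(context)
        self.by_sentiment[label].append(name)
        return label
```

A real system would replace `tag_sentiment` with a proper classifier; the storage side only needs the returned label.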
5. Document data pool
The document data pool mainly stores data that do not belong to the application data pool, the analog data pool or the object data pool.
The data lake is implemented on Hadoop; as it evolves, the data sets are linked with programs, operation rules, displays and history records to achieve the purpose of the data lake.
The data lake stores raw data by category, and within each data pool the data can be converted into a uniform, directly extractable format. This has great commercial value and contributes greatly to big data analysis.
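As an illustration of converting data inside one pool into a uniform format, the sketch below normalizes two differently shaped monitoring records into the same dictionary layout and serializes it as JSON. Both input layouts, the field names and the choice of JSON are assumptions made for this example; the patent does not name a storage format.

```python
import json
from typing import Dict


def from_csv_line(line: str) -> Dict[str, object]:
    """Parse a hypothetical 'timestamp,metric,value' record."""
    timestamp, metric, value = line.strip().split(",")
    return {"timestamp": timestamp, "metric": metric, "value": float(value)}


def from_key_value(line: str) -> Dict[str, object]:
    """Parse a hypothetical 'ts=...;metric=...;value=...' record."""
    fields = dict(part.split("=", 1) for part in line.strip().split(";"))
    return {
        "timestamp": fields["ts"],
        "metric": fields["metric"],
        "value": float(fields["value"]),
    }


def to_uniform_json(record: Dict[str, object]) -> str:
    """Serialize a normalized record with a stable key order."""
    return json.dumps(record, sort_keys=True)
```

Once both sources produce the same record shape, downstream analysis can read the pool without knowing which device or system wrote each entry.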
According to the invention, the original data pool is divided into the analog data pool, the application data pool, the object data pool and the document data pool. The original data pool receives the raw data information uploaded to the data lake and classifies it into these four pools, so that the raw data information in the data lake is partitioned quickly. The data search port of the data lake system can then quickly find the required data information by looking in the appropriate classified data pool, which effectively improves the management of heterogeneous data information in the data lake.

Claims (3)

1. A method for realizing heterogeneous data management in a data lake environment, characterized by comprising the following steps:
S1, dividing the data lake to provide an original data pool for storing raw data; the original data pool comprises an analog data pool for storing monitoring data, an application data pool for storing temporary data generated when an application is executed, an object data pool for storing text, images, audio and video, and a document data pool for storing data that cannot be classified into the analog data pool, the application data pool or the object data pool;
S2, uploading all heterogeneous data information collected from each system to the original data pool, where a classification program classifies the uploaded heterogeneous data information;
S3, the original data pool transmits the classified heterogeneous data information to the analog data pool, the application data pool, the object data pool and the document data pool, respectively, for storage;
S4, connecting the data transmission ports of the analog data pool, the application data pool, the object data pool and the document data pool to the data search port of the data lake system, and searching for heterogeneous data information in the data lake through that data search port.
2. The method for realizing heterogeneous data management in a data lake environment according to claim 1, characterized in that classifying the uploaded data information by the classification program in step S2 comprises the following steps:
S21, the classification program analyzes the format of the uploaded heterogeneous data information and judges from that format whether the information is in a temporary file format; if so, the information is classified into the application data pool;
S22, judging whether the remaining heterogeneous data information is text, image, audio or video data; if so, it is classified into the object data pool;
S23, judging whether the remaining heterogeneous data information is monitoring data; if so, it is classified into the analog data pool; otherwise, it is classified into the document data pool.
3. The method for realizing heterogeneous data management in a data lake environment according to claim 2, characterized in that searching for heterogeneous data information in the data lake through the data search port of the data lake system in step S4 comprises the following steps:
S41, determining, by a retrieval program, the format information of the heterogeneous data information to be searched for;
S42, determining from the format information the data pool category to which that information belongs;
S43, searching, by the retrieval program, for the information in the data pool of that category.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010399269.7A CN111666263A (en) 2020-05-12 2020-05-12 Method for realizing heterogeneous data management in data lake environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010399269.7A CN111666263A (en) 2020-05-12 2020-05-12 Method for realizing heterogeneous data management in data lake environment

Publications (1)

Publication Number Publication Date
CN111666263A true CN111666263A (en) 2020-09-15

Family

ID=72383462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010399269.7A Pending CN111666263A (en) 2020-05-12 2020-05-12 Method for realizing heterogeneous data management in data lake environment

Country Status (1)

Country Link
CN (1) CN111666263A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883091A (en) * 2021-01-12 2021-06-01 平安资产管理有限责任公司 Factor data acquisition method and device, computer equipment and storage medium
CN113157742A (en) * 2021-04-27 2021-07-23 华录智达科技股份有限公司 Data lake management method and system for intelligent bus

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109298840A (en) * 2018-11-19 2019-02-01 平安科技(深圳)有限公司 Data integrating method, server and storage medium based on data lake
CN110909072A (en) * 2018-09-18 2020-03-24 阿里巴巴集团控股有限公司 Data table establishing method, device and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
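李光荣 et al.: "Research on Enterprise Shared Big Data Fusion Based on the Internet of Things", Journal of Nanjing Institute of Technology (Natural Science Edition) *
李曼寻: "Application of Data Lake Technology in the Co-construction of Archival Information Resources", Shanxi Archives *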

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200915