CN113157742A - Data lake management method and system for intelligent bus

Info

Publication number: CN113157742A
Application number: CN202110457293.6A
Authority: CN
Inventors: 张世强; 孙宏飞; 钱贵涛; 李峰巍; 赵岩
Original assignee: Hualu Zhida Technology Co Ltd
Current assignee: Hualu Zhida Technology Co Ltd
Priority date: 2021-04-27
Filing date: 2021-04-27
Publication date: 2021-07-23

Abstract

The invention discloses a data lake management method and system for an intelligent bus. Data generated during management of a bus system is classified, the data lake is divided into a plurality of data pools according to data type, and the classified data is stored in the data pool of the corresponding category. A data set for standardizing the data is established in each data pool. When a user needs to query data, the data set is parsed according to the query request to obtain query conditions, and a query index list is generated based on the query conditions, which facilitates index search when the data is called. The invention thereby provides a strategy for applying a data lake, instead of a relational database, to bus management so as to improve the utilization rate of bus management data.

Description

Data lake management method and system for intelligent bus

Technical Field

The invention relates to the technical field of intelligent bus operation management, in particular to a data lake management method and system for an intelligent bus.

Background

With the continuous development of big data analysis technology, data has become an important asset for public transportation enterprises and organizations. To manage data effectively, most users currently adopt a big data platform. Existing big data platforms, however, struggle with the storage, effective management and centralized management of original data, and in particular with data tracing and calling. A form of data management better suited to intelligent buses therefore needs to be studied, so as to provide the storage and computing capacity that intelligent buses require for processing large-scale data and to offer users multi-mode data processing. In addition, most data lakes in use today are unidirectional: they only store data, and the data in them is neither classified nor integrated, so it cannot be extracted and utilized.

Disclosure of Invention

The invention provides a data lake management method and a data lake management system for an intelligent bus, which aim to overcome the above technical problems.

The invention discloses a data lake management method of an intelligent bus, which comprises the following steps:

acquiring a data packet uploaded by a public transport system, and classifying data in the data packet into different data types; the data packet is a data set generated in the management process of the public transportation system; the data types in the data packet comprise: structured data, semi-structured data, and unstructured data;

dividing the data lake into different data pools according to different data types, and storing the data in the data packet into the corresponding data pools according to different data types;

establishing a data set in a data pool according to the data in the data packet; the data set, comprising: target data, pool metadata, a meta-processing procedure, data transformation standards, pool descriptions and pool targets;

after a user initiates a query request, analyzing the data set according to the query request to obtain a query condition, and generating a query index list based on the query condition;

and judging whether matched data exist or not based on the query index list, if so, packaging the matched data and sending the packaged matched data to a user, and otherwise, feeding back query failure to the user.

Further, the dividing the data lake into different data pools according to different data types includes: dividing a data lake into a structured data pool, a semi-structured data pool and an unstructured data pool; the structured data pool is used for storing bus basic data, bus configuration data, driving area region data and user personal information data; the semi-structured data pool is used for storing HTML page files and log files with file formats of CSV, XML and JSON; the unstructured data pool is used for storing e-mails, documents, graphics, audios and videos, and message and instruction data in the public transport office system.

Further, the storing the data in the data packet into the corresponding data pool according to the different data types includes:

splitting the data packet, wherein the splitting principle is that the data packet is split into at least one sub data packet based on the data type;

carrying out type attribute information identification on the split sub-data packets one by one, and forming a plurality of primary data storage forms after adding time authentication information;

setting a plurality of storage position forms stored in corresponding data pool positions;

acquiring and storing a storage position mapping table of each primary data storage form; the storage location mapping table is used for representing the storage location of the primary data storage form on the storage location form.

Further, the target data is data which is stored in the data pool and can be really analyzed and used; the pool metadata is data describing physical characteristics of data in the data pool; the meta-processing process is a file that illustrates the steps of converting raw data in a data pool into usable standardized data;

converting the original data in the data pool into a file of available standardized data by formula (1);

in the formula, I is the input original data and a is the file of usable standardized data; n is the number of times the data is processed, with n = 3; the data is processed for the t-th time by using a zipper algorithm; W is a linear regression matrix; ω is a weight; and f(I) denotes converting the data by convolution;

the data conversion standard is a file which indicates a standard to be followed when converting the original data; the pool description includes: external and internal descriptions of the data pool; the pool target is a file representing the direction of application of the data.

Further, the target data are searched out in a data lake through a machine learning and concept search method, and after the target data are eliminated, data with unclear standards are obtained;

finding out the target data in a data lake through an equation (2);

in the formula, t is the target data and f is the data lake; m is the total amount of data in the data lake; l(·) denotes the features extracted by the convolutional network; f(·) denotes the serialization of the data; po is the probability that the data is target data, and if po is greater than a preset threshold, the data is target data.

Further, the determining whether there is matched data based on the query index list includes:

judging whether matched data exist or not through the formula (3);

where x is the data set and y is the query index list; V_x and V_y are confidence coefficients with values in the range [0,1]; l(·) denotes the features extracted by the convolutional network; f(·) denotes the serialization of the data; dis is the degree of match, and the match is considered successful when dis is greater than a set threshold.

Further, after the user initiates a query request, parsing the data set according to the query request to obtain a query condition, and generating a query index list based on the query condition, including:

a user initiates a query request;

resolving the query request into a plurality of fields to form a plurality of query conditions; generating a query index list based on the query condition; the query index list at least comprises the analyzed field and the matched type attribute information corresponding to the field; the type attribute information is obtained based on a fuzzy matching algorithm.

Further, establishing a data pool to be processed; and storing the data with unclear standards into the data pool to be processed, and using the data after the data is standardized again.

A data lake management system for an intelligent bus comprises:

the system comprises a data packet processing unit, a data pool processing unit and a data query unit;

the data packet processing unit is used for acquiring data packets uploaded by a public transport system and classifying data in the data packets into different data types; the data packet is a data set generated in the management process of the public transportation system; the data types in the data packet comprise: structured data, semi-structured data, and unstructured data;

the data pool processing unit is used for dividing the data lake into different data pools according to different data types and storing the data in the data packet into the corresponding data pools according to different data types; establishing a data set in a data pool according to the data in the data packet; the data set, comprising: target data, pool metadata, a meta-processing procedure, data transformation standards, pool descriptions and pool targets;

the data query unit is used for analyzing the data set according to the query request to obtain a query condition after a user initiates the query request, and generating a query index list based on the query condition; and judging whether matched data exist or not based on the query index list, if so, packaging the matched data and sending the packaged matched data to a user, and otherwise, feeding back query failure to the user.

In the invention, data generated during management of the public transportation system is classified, the data lake is divided into a plurality of data pools according to data type, and the classified data is stored in the data pool of the corresponding category. A data set that standardizes the data is then established in each data pool. When a user needs to query data, the data set is parsed according to the query request to obtain query conditions, and a query index list is generated based on the query conditions, which facilitates index search when the data is called. The invention thereby provides a strategy for applying a data lake, instead of a relational database, to public transportation management: data islands are eliminated, data standards are unified, data conversion is accelerated, and the utilization rate of public transportation management data is improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart of a data lake management method for intelligent buses;

FIG. 2 is a schematic structural diagram of a data lake management system for intelligent buses.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, the present embodiment provides a data lake management method and system for an intelligent bus, including:

101. acquiring a data packet uploaded by a public transport system, and classifying data in the data packet into different data types; the data packet is a data set generated in the management process of the public transportation system; types of data in the data packet, including: structured data, semi-structured data, and unstructured data;

Specifically, a data lake can be built by using Blu-ray storage (optical-magnetic fusion storage) technology or a cloud platform, and a data lake operation platform for the public transportation system is built on it for applications such as data integration, data cleaning, data management and intelligent services. At present, a common means of implementing a data lake is Hadoop. The evolved Hadoop data management architecture relies on the Apache Falcon data management platform, which links data groups with programs, operation rules, presentation and history records to achieve the intended use of the data lake. The data uploaded by the public transportation system comprises structured data, semi-structured data and unstructured data, and all of it is stored in the data lake as the water source of the data lake.

Structured data is data that can be represented with a uniform structure. Generally, such data can be logically expressed by a two-dimensional table structure, and the data stored in a relational database in the public transportation system is structured data. Semi-structured data lies between strictly defined structured data and completely unstructured data, and mainly comprises HTML page files and log files in CSV, XML and JSON formats. Unstructured data is data that cannot conveniently be represented by a two-dimensional logical table of a database, and comprises office documents in all formats, text, pictures, XML (a subset of Standard Generalized Markup Language), reports, images, audio/video information and the like.

102. Dividing the data lake into different data pools according to different data types, and storing the data in the data packet into corresponding data pools according to different data types;

specifically, if the data lake data is not classified or integrated, the data lake data cannot be extracted and utilized. The solution strategy adopted by the method is to divide the data lake into a structured data pool, a semi-structured data pool and an unstructured data pool. The data pools in the data lake are closely connected, one data is classified into different data pools according to the data type of the data after entering the data lake, and the different data pools are respectively used for storing different types of data and establishing a relationship among the different types of data to share information. The structured data pool is used for storing bus basic data, bus configuration data, driving area region data and user personal information data; the public transportation basic data mainly comprises a plurality of groups of basic data with invariable values, such as information of vehicle numbers, line names, line numbers, IP addresses and ports of vehicle-mounted terminals and the like; the public transport configuration data mainly comprises information such as a vehicle-mounted terminal system configuration parameter IP address and port, engine parameters and the like; the driving area region data mainly comprises bus stops and longitude and latitude on lines; the user personal information mainly comprises driver information, service personnel information and other staff information.

The semi-structured data pool is used for storing HTML page files and log files, namely data obtained through application programming interfaces (APIs), such as the running logs and scheduling logs of the vehicle-mounted terminal system; the file format can be CSV (comma-separated values), XML (Extensible Markup Language) or JSON (JavaScript Object Notation);

the unstructured data pool is used for storing messages and instructions issued in the public transportation office system, such as e-mails, documents and PDFs, as well as graphics, audio and video collected during bus operation, such as images/videos of passengers in the carriage and of road conditions.
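The assignment of incoming items to the three pools can be sketched as follows. This is only an illustrative sketch: the function name `classify_item`, the `is_table_row` flag and the extension list are assumptions made for the example, and the source does not prescribe a particular classification rule.

```python
from pathlib import Path

# Hypothetical list of extensions treated as semi-structured, based on the
# formats named in the text (HTML pages and CSV/XML/JSON log files).
SEMI_STRUCTURED_EXTENSIONS = {".html", ".htm", ".csv", ".xml", ".json", ".log"}


def classify_item(name: str, is_table_row: bool = False) -> str:
    """Return the name of the data pool that an incoming item belongs to."""
    # Rows exported from a relational database are treated as structured data.
    if is_table_row:
        return "structured"
    suffix = Path(name).suffix.lower()
    if suffix in SEMI_STRUCTURED_EXTENSIONS:
        return "semi_structured"
    # E-mails, documents, images, audio and video fall through to here.
    return "unstructured"
```

A file such as `dispatch.JSON` would be routed to the semi-structured pool, while an in-cabin video file would be routed to the unstructured pool.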

The step of storing the data in the data packet into the data lake is as follows:

1. splitting the data packet, wherein the splitting principle is that the data packet is split into at least one sub data packet based on the data type;

2. carrying out type attribute information identification on the split sub-data packets one by one, and forming a plurality of primary data storage forms after adding time authentication information;

3. setting a plurality of storage position forms stored in corresponding data pool positions;

4. acquiring and storing a storage position mapping table of each primary data storage form; the storage location mapping table is used for representing the storage location of the primary data storage form on the storage location form.
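The four storage steps above can be sketched as follows. This is a minimal in-memory sketch under stated assumptions: the function `store_packet`, the dictionary layout of a packet item (`type`, `payload`), the use of a UTC timestamp as the time authentication information, and the use of list positions as storage locations are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def store_packet(packet, pools, location_map):
    """Split a packet by data type, stamp each sub-packet, and record where it lands."""
    # Step 1: split the packet into sub-packets, one per data type present.
    sub_packets = defaultdict(list)
    for item in packet:
        sub_packets[item["type"]].append(item["payload"])

    stored_ids = []
    for data_type, payloads in sub_packets.items():
        # Step 2: attach the type attribute and a timestamp to form a
        # primary data storage form.
        form = {
            "type": data_type,
            "stored_at": datetime.now(timezone.utc).isoformat(),
            "payloads": payloads,
        }
        # Step 3: place the form in the pool that matches its data type.
        pool = pools.setdefault(data_type, [])
        pool.append(form)
        slot = len(pool) - 1
        # Step 4: record the storage location in the mapping table.
        form_id = f"{data_type}-{slot}"
        location_map[form_id] = (data_type, slot)
        stored_ids.append(form_id)
    return stored_ids
```

With this layout, a stored form can later be retrieved through the mapping table as `pools[data_type][slot]`.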

103. Establishing a data set in a data pool according to data in the data packet; a data set comprising: target data, pool metadata, a meta-processing procedure, data transformation standards, pool descriptions and pool targets;

Specifically, target data can be found in the data lake through machine learning and concept search methods; after the target data is set aside, the data that remains is data with unclear standards. There are many ways to find the data: for example, first identify the limiting factors of the data, then check the data tags, and finally locate the bulk of the data.

The target data is data which is stored in the data pool and can be really analyzed and used, and the data can be directly used without processing; the pool metadata is data describing physical characteristics of data in the data pool; the meta-processing procedure is a file illustrating the steps of converting raw data in the data pool into usable standardized data; the data conversion standard is a file for explaining the standard to be followed when converting the original data; the pool description includes: external description and internal description of the data pool, the external description comprising: function, size of the data pool; the internal description comprises the source, volume, updating frequency, extraction, conversion, standard of data in the data pool and the relation between the data; the pool target is a file representing the direction of application of the data. The public traffic system data lake operation platform converts non-target data into usable target data through a data cleaning function according to pool metadata, a metadata processing process, a data conversion standard, pool description and a pool target, and stores the usable target data in a uniform standard format.
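The six components of the data set can be pictured as one record per data pool. The sketch below is an assumption-laden illustration: the class and field names, and the idea that the data conversion standard contains a `field_names` renaming table, are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PoolDescription:
    external: dict = field(default_factory=dict)  # e.g. function and size of the pool
    internal: dict = field(default_factory=dict)  # e.g. source, volume, update frequency


@dataclass
class PoolDataSet:
    target_data: list = field(default_factory=list)          # directly usable records
    pool_metadata: dict = field(default_factory=dict)        # physical characteristics
    meta_process: list = field(default_factory=list)         # ordered conversion steps
    conversion_standard: dict = field(default_factory=dict)  # rules for conversion
    description: Optional[PoolDescription] = None
    pool_target: str = ""                                     # intended application


def clean_into_target(raw_records, data_set):
    """Rename raw fields to their standard names and store the result as target data."""
    names = data_set.conversion_standard.get("field_names", {})
    for raw in raw_records:
        data_set.target_data.append({names.get(k, k): v for k, v in raw.items()})
    return data_set.target_data
```

Field renaming stands in here for the cleaning function of the operation platform; a real cleaning step would also apply the meta-processing procedure and the rest of the conversion standard.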

In formula (1), I is the input original data and a is the file of usable standardized data; n is the number of times the data is processed, with n = 3.

The target data is found in the data lake by formula (2).

In addition, even after cleaning, much of the data still cannot be used because its standard is unclear, and such data cannot be recovered once it is discarded. A to-be-processed data pool can therefore be established, the data with unclear standards can be stored in the to-be-processed data pool, and the data can be used after it is standardized again.
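The role of the to-be-processed data pool can be sketched as follows. The criterion used here for "unclear standard" (a required field that is missing or empty) and the function names are assumptions chosen for the example; the source does not define how unclear data is detected or how it is standardized again.

```python
def route_cleaned_records(records, required_fields):
    """Separate usable records from records whose standard is unclear."""
    target, pending = [], []
    for record in records:
        # Assumption: a record is usable when every required field is filled in.
        complete = all(record.get(name) not in (None, "") for name in required_fields)
        (target if complete else pending).append(record)
    return target, pending


def restandardize(pending, fill_values, required_fields):
    """Fill gaps in pending records with agreed values, then route them again."""
    repaired = [
        {**fill_values, **{k: v for k, v in record.items() if v not in (None, "")}}
        for record in pending
    ]
    return route_cleaned_records(repaired, required_fields)
```

Records that still fail after `restandardize` stay in the pending list instead of being discarded, which mirrors the intent that unclear data remains retrievable.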

104. After a user initiates a query request, analyzing the data set according to the query request to obtain a query condition, and generating a query index list based on the query condition;

specifically, a user initiates a query request, and then the query request is analyzed into a plurality of fields to form a plurality of query conditions; generating a query index list based on the query condition; the query index list at least comprises the analyzed field and the matched type attribute information corresponding to the field; the type attribute information is obtained based on a fuzzy matching algorithm.
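Parsing a query request into fields and building the query index list can be sketched as follows. The query syntax (`field=value` pairs separated by semicolons), the attribute vocabulary `KNOWN_ATTRIBUTES` and the use of Python's `difflib.get_close_matches` as the fuzzy matching algorithm are assumptions made for this example; the source does not name a specific algorithm.

```python
from difflib import get_close_matches

# Hypothetical vocabulary of type attributes recorded when data was stored.
KNOWN_ATTRIBUTES = ["vehicle_number", "line_name", "line_number", "terminal_ip", "stop_name"]


def build_query_index(query, known=KNOWN_ATTRIBUTES, cutoff=0.6):
    """Parse 'field=value; field=value' into a query index list."""
    index = []
    for condition in query.split(";"):
        if "=" not in condition:
            continue  # skip fragments that are not field=value conditions
        name, value = (part.strip() for part in condition.split("=", 1))
        # Fuzzy-match the user's field name against the known type attributes.
        normalized = name.lower().replace(" ", "_")
        matches = get_close_matches(normalized, known, n=1, cutoff=cutoff)
        index.append(
            {"field": name, "value": value, "type_attribute": matches[0] if matches else None}
        )
    return index
```

Each entry keeps the parsed field together with its matched type attribute, so a misspelled field such as `line nam` can still be resolved to `line_name`.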

105. And judging whether matched data exist or not based on the query index list, if so, packaging the matched data and sending the packaged matched data to the user, and otherwise, feeding back the query failure to the user.

Specifically, whether matched data exists is judged through the formula (3);

in the formula, x is the data set and y is the query index list; V_x and V_y are confidence coefficients with values in the range [0,1]; l(·) denotes the features extracted by the convolutional network; f(·) denotes the serialization of the data; dis is the degree of match, and the match is considered successful when dis is greater than a set threshold.
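The threshold decision in step 105 can be sketched as follows. Note that the scoring function here is a deliberately simple stand-in (the share of query conditions a record satisfies) and is not formula (3), which relies on convolutional features and confidence coefficients; the function names, the entry layout and the default threshold are all hypothetical.

```python
def match_degree(record, query_index):
    """Stand-in for dis: the share of query conditions that the record satisfies."""
    if not query_index:
        return 0.0
    hits = sum(
        1 for entry in query_index
        if str(record.get(entry["type_attribute"])) == entry["value"]
    )
    return hits / len(query_index)


def answer_query(records, query_index, threshold=0.5):
    """Package matched data for the user, or report a query failure."""
    matched = [r for r in records if match_degree(r, query_index) > threshold]
    if matched:
        return {"status": "ok", "data": matched}
    return {"status": "query_failed", "data": []}
```

Only the comparison against the threshold and the two possible responses follow the text directly; any scoring function that returns a match degree could be substituted for `match_degree`.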

In addition, the invention describes only the query method of the client and does not describe management-side operations such as adding, deleting, querying and modifying data, because those operations are not the focus of the invention; they can, however, be developed as an extension of the invention.

As shown in FIG. 2, this embodiment provides a data lake management system for an intelligent bus, including:

the data packet processing unit is used for acquiring data packets uploaded by the public transportation system and classifying data in the data packets into different data types; the data packet is a data set generated in the management process of the public transportation system; types of data in the data packet, including: structured data, semi-structured data, and unstructured data;

the data pool processing unit is used for dividing the data lake into different data pools according to different data types and storing the data in the data packet into the corresponding data pools according to different data types; establishing a data set in the data pool according to the data in the data packet; a data set comprising: target data, pool metadata, a meta-processing procedure, data transformation standards, pool descriptions and pool targets;

the data query unit is used for analyzing the data set according to the query request to obtain a query condition after a user initiates the query request, and generating a query index list based on the query condition; and judging whether matched data exist or not based on the query index list, if so, packaging the matched data and sending the packaged matched data to the user, and otherwise, feeding back the query failure to the user.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A data lake management method of an intelligent bus is characterized by comprising the following steps:

2. The method for managing the data lake of the intelligent bus according to claim 1, wherein the dividing the data lake into different data pools according to different data types comprises:

dividing a data lake into a structured data pool, a semi-structured data pool and an unstructured data pool;

the structured data pool is used for storing bus basic data, bus configuration data, driving area region data and user personal information data;

the semi-structured data pool is used for storing HTML page files and log files with file formats of CSV, XML and JSON;

the unstructured data pool is used for storing e-mails, documents, graphics, audios and videos, and message and instruction data in the public transport office system.

3. The method for managing the data lake of the intelligent bus according to claim 2, wherein the step of storing the data in the data packet into the corresponding data pool according to different data types comprises:

4. The data lake management method of the intelligent bus according to claim 3,

the target data is data which is stored in the data pool and can be really analyzed and used;

the pool metadata is data describing physical characteristics of data in the data pool;

the meta-processing process is a file that illustrates the steps of converting raw data in a data pool into usable standardized data;

the data conversion standard is a file which indicates a standard to be followed when converting the original data;

the pool description includes: external and internal descriptions of the data pool;

the pool target is a file representing the direction of application of the data.

5. The data lake management method of the intelligent bus according to claim 4, wherein the target data is found in the data lake through a machine learning and concept search method, and data with unclear standards is obtained after the target data is eliminated;

finding out the target data in a data lake through an equation (2);

6. The method as claimed in claim 5, wherein the step of determining whether there is matched data based on the query index list comprises:

judging whether matched data exist or not through the formula (3);

7. The method for managing the data lake of the intelligent bus according to claim 1, wherein after the user initiates a query request, the data set is analyzed according to the query request to obtain a query condition, and a query index list is generated based on the query condition, comprising:

a user initiates a query request;

8. The data lake management method of the intelligent bus according to claim 5, wherein a to-be-processed data pool is established; and storing the data with unclear standards into the data pool to be processed, and using the data after the data is standardized again.

9. A data lake management system of an intelligent bus, characterized by comprising: