CN112684986B - Mass data processing method

Info

Publication number: CN112684986B
Application number: CN202110008538.7A
Authority: CN
Inventors: 张金宝; 沈党云; 田野
Original assignee: Cccc Intelligent Transportation Co ltd
Current assignee: Cccc Intelligent Transportation Co ltd; China Highway Engineering Consultants Corp; CHECC Data Co Ltd
Priority date: 2021-01-05
Filing date: 2021-01-05
Publication date: 2023-01-24
Anticipated expiration: 2041-01-05
Also published as: CN112684986A

Abstract

A mass data processing method: the granularity and data fields of the data to be stored are determined, a storage table is organized and generated from dynamically expandable columns, and each storage unit is stored as the finest-grained data. When data is stored, it is first held in memory as a cache; when the defined cache stack exceeds a threshold, the data held in memory is written to a file and flushed to disk. A storage file is composed of an index area, which records the data contained in the storage file, and a data area, which stores the data objects themselves. When stored data is read, the cache stack is queried first. When data is searched for in the underlying file storage, the file's index area is consulted first to determine whether the file contains the data; if it does, the data is located by means of the index area. The invention addresses slow data queries, large-scale data redundancy and excessive use of storage space.

Description

Mass data processing method

Technical Field

The invention relates to the technical field of data storage, in particular to a mass data processing method.

Background

At present, mass positioning data is usually stored in a relational database. Taking vehicle positioning data as an example, one record generally holds one piece of positioning information, which includes the following common fields:

Longitude, latitude, time, license plate number, license plate color (code), speed, direction, altitude, mileage, province (code), city (code), district (code), address description, and other identification fields. The data is stored in the relational database row by row; each row holds the information of one track point, and rows are assembled into tracks according to the query conditions.

Storing each track point as one row leads to a large amount of data redundancy. Although structured data is convenient to query, its storage structure and storage mode cause slow queries, heavy redundancy and excessive storage consumption when the data volume is extremely large.

Taking vehicle positioning data as an example, assume one positioning record is produced per minute. If the same vehicle travels within the same urban area for one hour, the following occurs:

Of the 60 records generated, only the fields longitude, latitude, time, speed, direction, altitude, mileage and address description differ (assuming the vehicle does not stop), while fields such as license plate number, license plate color (code) and province (code) are repeated in every record, which is redundant data.

Because of the nature of positioning data, a system typically holds tens of millions or even hundreds of millions of records. The traditional storage mode is clearly unsuitable: composing tracks is slow, a large amount of data is redundant, and too much storage space is occupied.

In summary, a new mass data storage technical scheme is needed.

Disclosure of Invention

Therefore, the invention provides a mass data processing method, which solves the problems of low data query speed, redundancy of mass data and excessive occupation of storage space.

In order to achieve the above purpose, the invention provides the following technical scheme: a mass data processing method in which the granularity and data fields of the data to be stored are determined, a storage table is organized and generated from dynamically expandable columns, the intersection of one column and one row of the storage table serves as a storage unit, and the storage unit is stored as the finest-grained data;

when data is stored, the data to be stored is first held in memory as a cache, and when the defined cache stack exceeds a threshold, the data held in memory is written to a file and flushed to disk;

a storage file comprises an index area and a data area; the index area records the data contained in the storage file, and the data area stores the data objects;

when stored data is read, the cache stack is queried first; on a hit the data is returned directly, otherwise the data is searched for in the underlying file storage;

when data is searched for in the underlying file storage, the index area of a file is consulted first to determine whether the file contains the data; if it does not, the file is skipped directly, and if it does, the data is located according to the index area.

As a preferred scheme of the mass data processing method, the type of the data is vehicle positioning data, the data granularity is one track point, and the storage unit is composed of one track point.

As a preferred scheme of the mass data processing method, the track points stored in the storage table use time stamps as column names, and the columns of the storage table expand dynamically over time.

As a preferred scheme of the processing method of the mass data, the storage of the data adopts a distributed storage architecture, and the data is finally stored in a disk in a file form;

and performing metadata management and maintenance on the distributed storage architecture, and determining service nodes and file directories in which data fall through the metadata management.

As a preferred scheme of the mass data processing method, the size of the storage files in the disk is scanned at regular time according to preset time, and when a preset merging condition is reached, a merging mechanism is triggered to merge a plurality of storage files.

As a preferred scheme of the mass data processing method, the sizes of the storage files on disk are scanned periodically at a preset interval, and when a preset splitting condition is met, a splitting mechanism is triggered to split a storage file.

As a preferred scheme of the mass data processing method, when two time stamps used as columns are the same, the data stored earlier is overwritten by the data stored later.

As a preferred scheme of the mass data processing method, a mark bit is maintained in the storage unit to indicate whether the data is valid; when the merging mechanism is triggered, data marked invalid is identified and is not merged.

As a preferred scheme of the mass data processing method, the query modes for vehicle positioning data comprise: obtaining the latest track point of a specified vehicle, obtaining the latest track points of vehicles in a specified time interval, and obtaining the track of a specified vehicle in a specified time interval.

As a preferred scheme of the mass data processing method, when data is searched for in the underlying file storage, a row of data is located by its row code, the row is filtered by column, and the matching data is returned.

The invention has the following advantages:

According to the invention, through the dynamic expansion of rows and columns, the large volume of redundant data is separated into frequently changing data and infrequently changing data; the frequently changing data is stored in the storage units, and the infrequently changing data forms the row codes and columns. Dynamic column expansion suits the nature of positioning data: because data sources deliver data irregularly, positioning data is stored for a moment only if data exists for that moment, and nothing is stored otherwise. Using time stamps as columns reduces data redundancy, allows dynamic expansion and ordered storage, and saves a large amount of disk space.

The invention stores data in lexicographic (dictionary) order of the row codes, so the date information in the row codes is stored in chronological order. The row code serves as the basis for data queries and uniquely identifies one row of data.

Sorted storage of the row codes is well suited to track queries. A track is composed of the track points of a continuous period of time; storing row codes in order ensures that track points are stored chronologically, and the time stamp columns are likewise stored chronologically. Query efficiency improves accordingly: the data of a given vehicle for a given period can be located quickly and then filtered by time stamp column to return the positioning data of the specified interval.

The memory-based cache stack reads and writes data far faster than files do. When data is searched for, the cache stack is queried first, and the disk files are queried only on a miss. File-based storage is also equipped with corresponding file merging and file splitting mechanisms, so that files are optimized and managed automatically and stay neither too small nor too large, keeping storage in a stable and controllable state.

Through metadata management and distributed storage, the data files are stored on different server nodes, which avoids overloading a single server; the system can also be scaled horizontally, storing more data by adding servers.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.

Fig. 1 is a diagram of a distributed storage architecture in a method for processing mass data according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of data composition in a method for processing mass data according to an embodiment of the present invention;

Fig. 3 is a flowchart of obtaining the latest track point of a specified vehicle in a method for processing mass data according to an embodiment of the present invention;

Fig. 4 is a flowchart of obtaining the latest track points of vehicles in a specified time interval in a method for processing mass data according to an embodiment of the present invention;

Fig. 5 is a flowchart of obtaining the track of a specified vehicle in a specified time interval in a method for processing mass data according to an embodiment of the present invention.

Detailed Description

The present invention is described in terms of specific embodiments, and other advantages and benefits of the present invention will become apparent to those skilled in the art from the following disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to Fig. 1 and Fig. 2, a mass data processing method is provided. The method determines the granularity and data fields of the data to be stored, organizes and generates a storage table from dynamically expandable columns, takes the intersection of one column and one row of the storage table as a storage unit, and stores the storage unit as the finest-grained data.

Specifically, whereas a conventional relational database organizes storage tables by rows, the present technical solution organizes the storage table by columns. The columns are dynamically expandable: column names and the number of columns need not be defined in advance and can be specified dynamically by the program. The intersection of a column and a row corresponds to one storage unit; a storage unit is stored as the finest-grained data, while a row represents a higher-level granularity of business logic.

In this embodiment, the granularity of data to be stored and the data field are determined first, and taking vehicle positioning data as an example, the common fields are as follows:

Longitude, latitude, time, license plate number, license plate color (code), speed, direction, altitude, mileage, province (code), city (code), district (code), address description, and other identification fields.

In this embodiment, the data type is vehicle positioning data, the data granularity is one track point, and each storage unit holds one track point. Only the necessary fields are extracted: longitude, latitude, time, license plate number, license plate color (code), speed, direction, altitude and mileage. The remaining fields can be maintained through association with a separate dimension table to reduce redundancy.

Because the volume of positioning data is large and tracks are usually composed by querying continuous ranges, the data is stored with one "data row" per day. After the necessary fields have been extracted, the fields that together uniquely identify a data row are extracted: time, license plate number and license plate color (code). These three fields form a row code (ID) that uniquely identifies the set of positioning data of a particular vehicle on a particular day.

Specifically, the track points stored in the storage table use time stamps as column names, and the columns of the storage table expand dynamically over time. Longitude, latitude, speed, direction, altitude and mileage, whose values differ at almost every track point, are stored in the storage unit; the column name of each point is determined by the time field and is in practice the stored timestamp.

Because the columns are dynamically expandable, the data of a given vehicle for a given day (one row) is a set containing all positioning data of that day. Using the time stamp as the column name accommodates the unpredictable number of positioning records in a day: if data exists for a moment it is stored, and if not, no storage unit exists for that moment.
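The sparse layout described above can be pictured with a small sketch. It is an illustration only, assuming a nested dictionary stands in for the storage table; the names `StorageTable`, `put` and `row` are hypothetical and do not come from the patent, and the track point values are placeholders.

```python
from collections import defaultdict


class StorageTable:
    """Illustrative sparse table: row code -> {timestamp column -> storage unit}."""

    def __init__(self):
        # Columns are never declared up front; a column exists for a row
        # only once a track point with that timestamp has been written.
        self._rows = defaultdict(dict)

    def put(self, row_code, timestamp, track_point):
        """Store one track point (the finest-grained unit) under its timestamp column."""
        self._rows[row_code][timestamp] = track_point

    def row(self, row_code):
        """Return the stored columns of one row, ordered by timestamp."""
        return dict(sorted(self._rows.get(row_code, {}).items()))


table = StorageTable()
table.put("Jin XX1111|2|2020-10-20", 1603156990, (102, 76))
table.put("Jin XX1111|2|2020-10-20", 1603156870, (102, 72))
table.put("Jin XX2222|2|2020-10-21", 1603243990, (104, 72))
```

Only timestamps that were actually written appear in a row, so a moment without data costs no storage in this model.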

The data structure can be more intuitively presented through table 1.

TABLE 1 data Structure

The data structure is shown by way of data example by table 2.

Table 2 Data structure, using vehicles as an example

| Row code | 1603156870 | 1603156990 | 1603243990 | 1603244590 |
| --- | --- | --- | --- | --- |
| Jin XX1111\|2\|2020-10-20 | 102,72,... | 102,76,... | | |
| Jin XX2222\|2\|2020-10-20 | 103,72,... | | | |
| Jin XX2222\|2\|2020-10-21 | | | 104,72,... | 104,74,... |

A table structure is not easy to express in document form. In practice an empty storage unit is not stored at all and therefore occupies no storage space. The row with row code Jin XX1111|2|2020-10-20 therefore has no columns 1603243990 and 1603244590 and no corresponding storage units.

For positioning data, the data volume is determined mainly by the acquisition frequency, and the track time recorded at acquisition is not very precise. In particular, when data comes from a third-party interface, the interval between track points is often irregular. Using the time stamp as the column therefore fits the nature of positioning data and shows the benefit of dynamic column expansion: a track point is stored under whatever time it actually has.

With reference to Fig. 2, in this embodiment the data is stored using a distributed storage architecture and is finally stored on disk in the form of files;

metadata of the distributed storage architecture is managed and maintained, and the service node and file directory in which data is placed are determined through the metadata management.

Data storage adopts a distributed storage architecture, and all data is ultimately written to disk in the form of files. The data storage system maintains metadata management, which must know which data resides on which service nodes and in which file directories. The core of the storage mechanism consists of two parts: one part is held in memory as a cache for fast reads and writes; the other part exists as files on disk.
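As a rough illustration of what such metadata management might track, the sketch below records a service node and a file directory for each row code. The patent does not specify a placement policy, so the hash-based node choice, the per-date directory layout and the names `MetadataCatalog` and `locate` are all assumptions made for this example.

```python
import zlib


class MetadataCatalog:
    """Hypothetical metadata manager: decides and remembers where a row's files live."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.placement = {}  # row code -> (service node, file directory)

    def locate(self, row_code):
        """Return (node, directory) for a row code, assigning one on first use."""
        if row_code not in self.placement:
            # Assumed policy: a stable hash of the row code picks the node.
            node = self.nodes[zlib.crc32(row_code.encode("utf-8")) % len(self.nodes)]
            # Assumed layout: one directory per date (last field of the row code).
            directory = "/data/" + row_code.rsplit("|", 1)[-1]
            self.placement[row_code] = (node, directory)
        return self.placement[row_code]


catalog = MetadataCatalog(["node-a", "node-b", "node-c"])
node, directory = catalog.locate("Jin XX1111|2|2020-10-20")
```

Both the write path and the read path would consult such a catalog, so that a query only contacts the node that holds the row.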

In this embodiment, the sizes of the storage files on disk are scanned periodically at a preset interval. When a preset merging condition is met, a merging mechanism is triggered to merge multiple storage files; when a preset splitting condition is met, a splitting mechanism is triggered to split a storage file.
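The decision made by one timed scan could look like the following sketch. The patent only speaks of "preset conditions", so the two size thresholds and the function name `plan_compaction` are assumptions chosen for illustration.

```python
def plan_compaction(file_sizes, min_size, max_size):
    """Return (files_to_merge, files_to_split) for one timed scan.

    file_sizes maps file name -> size in bytes. The thresholds are
    assumed parameters; the patent only says "preset conditions".
    """
    small = sorted(name for name, size in file_sizes.items() if size < min_size)
    large = sorted(name for name, size in file_sizes.items() if size > max_size)
    # Merging only makes sense when at least two small files exist.
    to_merge = small if len(small) >= 2 else []
    return to_merge, large


sizes = {"f1.dat": 4_000, "f2.dat": 6_000, "f3.dat": 900_000, "f4.dat": 50_000}
to_merge, to_split = plan_compaction(sizes, min_size=10_000, max_size=500_000)
```

A timer would call such a function at the preset interval and hand the two lists to the merging and splitting mechanisms.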

Specifically, when data is written it is first stored in memory, and when the defined cache stack exceeds a threshold, the data is written to a file and flushed to disk. Over time the files become more numerous and the directory grows, so a timer periodically scans the sizes of the disk files. When the specified conditions are met, a merging mechanism is triggered to merge small files into a large file; correspondingly, when a file becomes too large, a splitting mechanism is triggered to split it into suitably sized smaller files. All of these operations depend on metadata management.
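The write path described above can be sketched as a buffer that flushes itself once it grows past a threshold. This is a minimal sketch under stated assumptions: the threshold counts buffered track points, files are written as JSON, and the names `CacheStack`, `write` and `flush` are hypothetical.

```python
import json
import os
import tempfile


class CacheStack:
    """Illustrative write buffer that is flushed to a file once it grows past a threshold."""

    def __init__(self, directory, threshold):
        self.directory = directory
        self.threshold = threshold  # assumed unit: number of buffered track points
        self.buffer = {}            # (row code, timestamp) -> track point
        self.flushed_files = []

    def write(self, row_code, timestamp, track_point):
        self.buffer[(row_code, timestamp)] = track_point
        if len(self.buffer) > self.threshold:
            self.flush()

    def flush(self):
        """Write the buffered points to a new file in sorted order and clear the buffer."""
        path = os.path.join(self.directory, f"segment-{len(self.flushed_files)}.json")
        records = [[rc, ts, tp] for (rc, ts), tp in sorted(self.buffer.items())]
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(records, fh)
        self.flushed_files.append(path)
        self.buffer.clear()


workdir = tempfile.mkdtemp()
cache = CacheStack(workdir, threshold=2)
for ts in (1603156870, 1603156990, 1603157110):
    cache.write("Jin XX1111|2|2020-10-20", ts, [102, 72])
```

With a threshold of two, the third write pushes the buffer over the limit, so all three points end up in one file and the buffer is empty again.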

Specifically, each of these files is itself composed of two parts: an index area and a data area. The index area records which data the file contains, which enables fast lookup, and the data area stores the actual data.
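One possible byte layout for such a file is sketched below. The patent does not define a concrete format, so the length prefix, the JSON encoding and the function names `build_file` and `lookup` are assumptions for illustration only.

```python
import json
import struct


def build_file(rows):
    """Serialize rows into bytes: [4-byte index length][index area][data area].

    rows maps row code -> {timestamp: track point}. The index area maps each
    row code to the (offset, length) of its record inside the data area.
    """
    index, data = {}, b""
    for row_code in sorted(rows):
        record = json.dumps(rows[row_code]).encode("utf-8")
        index[row_code] = (len(data), len(record))
        data += record
    index_bytes = json.dumps(index).encode("utf-8")
    return struct.pack(">I", len(index_bytes)) + index_bytes + data


def lookup(blob, row_code):
    """Consult the index area first; touch the data area only on a hit."""
    (index_len,) = struct.unpack(">I", blob[:4])
    index = json.loads(blob[4:4 + index_len])
    if row_code not in index:
        return None  # this file does not contain the row, so it is skipped
    offset, length = index[row_code]
    start = 4 + index_len + offset
    return json.loads(blob[start:start + length])


blob = build_file({"Jin XX1111|2|2020-10-20": {"1603156870": [102, 72]}})
```

Because the index area is small and sits at the front, a reader can rule a file out without decoding any of its data area.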

In this embodiment, when two time stamps used as columns are the same, the data stored earlier is overwritten by the data stored later. A mark bit is maintained in the storage unit to indicate whether the data is valid; when the merging mechanism is triggered, data marked invalid is identified and is not merged.

Updates and deletions use an upsert mechanism. Taking positioning data as an example, when two time stamps used as columns are the same, the later data overwrites the earlier data; that is, for track points with the same time, only the last one received is kept. For deletion, a mark bit is maintained in the underlying storage unit to indicate whether the data is valid. When the merging mechanism is triggered, data marked invalid is identified and left out of the merge, which achieves the deletion.
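The overwrite and mark-bit behaviour described above can be sketched as follows. The dictionary representation of a storage unit and the function names `upsert`, `delete` and `merge` are assumptions for this example, not the patent's implementation.

```python
def upsert(row, timestamp, track_point):
    """Write a track point; a later write to the same timestamp column replaces the earlier one."""
    row[timestamp] = {"value": track_point, "valid": True}


def delete(row, timestamp):
    """Logical delete: only flip the mark bit, the unit stays until the next merge."""
    if timestamp in row:
        row[timestamp]["valid"] = False


def merge(*rows):
    """Merge versions of one row (oldest first), dropping units marked invalid."""
    merged = {}
    for row in rows:
        merged.update(row)  # newer versions overwrite older ones
    return {ts: unit for ts, unit in sorted(merged.items()) if unit["valid"]}


older, newer = {}, {}
upsert(older, 1603156870, (102, 72))
upsert(older, 1603156990, (102, 76))
upsert(newer, 1603156870, (102, 73))   # same timestamp: last write wins
delete(older, 1603156990)              # marked invalid, removed at merge time
result = merge(older, newer)
```

Deferring physical removal to merge time keeps deletes cheap, since no file has to be rewritten when a point is marked invalid.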

Specifically, regarding the storage order of the data, the set of positioning data of a particular vehicle on a particular day is uniquely identified by a row code. The row code is the key to retrieving data, and it is formed by concatenating the following fields:

license plate number + license plate color (code) + date (year, month, day)

Row codes are sorted when stored, in lexicographic (dictionary) order. This serves track queries, which are queries over continuous time, very well.
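A row code of this shape can be built and ordered as in the sketch below. The "|" separator and the ISO date format are assumptions based on the example table, and `make_row_code` is a hypothetical helper name.

```python
from datetime import date


def make_row_code(plate_number, plate_color_code, day):
    """Join plate number, plate color code and date into one row code.

    The "|" separator and ISO date format are assumptions based on the
    example table; ISO dates sort chronologically as plain strings.
    """
    return f"{plate_number}|{plate_color_code}|{day.isoformat()}"


codes = [
    make_row_code("Jin XX2222", 2, date(2020, 10, 21)),
    make_row_code("Jin XX1111", 2, date(2020, 10, 20)),
    make_row_code("Jin XX2222", 2, date(2020, 10, 20)),
]
ordered = sorted(codes)  # lexicographic order
```

Since the date is the last component, all rows of one vehicle sit next to each other in the sorted order, and within a vehicle they run from the earliest day to the latest.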

In this embodiment, when data is searched for in the underlying file storage, a row of data is located by its row code, the row is filtered by column, and the matching data is returned. When data is read, the cache stack is queried first; on a hit the data is returned directly, otherwise the underlying file storage is searched. When a file is searched, its index area is consulted first to see whether the file contains the data; if not, the file is skipped directly, and if so, the data is located quickly by means of the index.

To retrieve the required data quickly, a row of data is first located by its row code, then filtered by column, and the result is returned.

Because row codes are stored in order, the data rows of a given time interval can be located very quickly at query time, whether the interval lies within one day or spans several days. The rows are then filtered by time stamp column, and the positioning data matching the specified time parameters (precise to hours, minutes and seconds) is returned.
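The read order described above (cache first, then each file's index area) can be sketched as follows. Files are modelled as plain dictionaries with an `index` and a `data` entry, which is an assumption made to keep the example short; `read_row` is a hypothetical name.

```python
def read_row(row_code, cache, files):
    """Look in the memory cache first; on a miss, check each file's index area."""
    if row_code in cache:
        return cache[row_code], "cache"
    for f in files:
        if row_code not in f["index"]:
            continue  # index says the file lacks this row: skip without reading data
        position = f["index"][row_code]
        return f["data"][position], "file"
    return None, "miss"


cache = {"Jin XX1111|2|2020-10-20": {1603156870: (102, 72)}}
files = [
    {"index": {"Jin XX2222|2|2020-10-20": 0}, "data": [{1603156870: (103, 72)}]},
    {"index": {"Jin XX2222|2|2020-10-21": 0}, "data": [{1603243990: (104, 72)}]},
]
```

The second element of the returned pair only reports where the row was found, which makes the two paths easy to tell apart when experimenting.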
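Because the row codes are kept sorted, a date range for one vehicle can be found by binary search instead of a scan. The sketch below assumes row codes of the form plate|color|ISO date and uses Python's standard `bisect` module; the helper names `rows_in_date_range` and `filter_columns` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def rows_in_date_range(sorted_codes, plate, color, first_day, last_day):
    """Binary-search the sorted row codes for one vehicle's rows in a date range."""
    lo = bisect_left(sorted_codes, f"{plate}|{color}|{first_day}")
    hi = bisect_right(sorted_codes, f"{plate}|{color}|{last_day}")
    return sorted_codes[lo:hi]


def filter_columns(row, start_ts, end_ts):
    """Keep only the timestamp columns that fall inside [start_ts, end_ts]."""
    return {ts: tp for ts, tp in sorted(row.items()) if start_ts <= ts <= end_ts}


codes = sorted([
    "Jin XX1111|2|2020-10-20",
    "Jin XX2222|2|2020-10-20",
    "Jin XX2222|2|2020-10-21",
    "Jin XX3333|1|2020-10-20",
])
hits = rows_in_date_range(codes, "Jin XX2222", 2, "2020-10-20", "2020-10-21")
kept = filter_columns({1603243990: (104, 72), 1603244590: (104, 74)}, 1603244000, 1603245000)
```

The first step narrows the search to whole days, and the second step applies the exact time bounds within those days.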

With reference to Fig. 3, in this embodiment one query mode for vehicle positioning data is obtaining the latest track point of a specified vehicle. Using the license plate number and license plate color (code), all row codes of the vehicle can be retrieved quickly; the latest date is taken from the row codes, which yields the track of the most recent day, and the latest track point is then selected by filtering on the time stamp columns.
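This query can be sketched over an in-memory table as follows. The table is assumed to map row codes of the form plate|color|ISO date to non-empty dictionaries of timestamp columns; `latest_point` is a hypothetical name, and the prefix scan stands in for the sorted lookup a real store would use.

```python
def latest_point(table, plate, color):
    """Return (timestamp, track point) of the newest point of one vehicle, or None."""
    prefix = f"{plate}|{color}|"
    codes = [code for code in table if code.startswith(prefix)]
    if not codes:
        return None
    newest_row = table[max(codes)]      # ISO dates: the largest code is the latest day
    newest_ts = max(newest_row)         # the largest timestamp column is the latest point
    return newest_ts, newest_row[newest_ts]


table = {
    "Jin XX1111|2|2020-10-20": {1603156870: (102, 72), 1603156990: (102, 76)},
    "Jin XX2222|2|2020-10-20": {1603156870: (103, 72)},
    "Jin XX2222|2|2020-10-21": {1603243990: (104, 72), 1603244590: (104, 74)},
}
```

Rows are assumed to be non-empty here, which matches the rule that empty storage units are never stored.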

With reference to Fig. 4, in this embodiment another query mode is obtaining the latest track points of vehicles in a specified time interval. This requires maintaining a vehicle list. The vehicles in the list are checked in turn for tracks within the time interval: if a track exists, the vehicle is queried; if not, processing moves on to the next vehicle. The latest track point of each vehicle is found in the same way as in the previous query mode, and finally the latest track points of all vehicles are collected and returned to the querying party.
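The loop over the vehicle list can be sketched as below. For brevity the sketch scans all rows of a vehicle rather than first narrowing to the relevant dates, which is a simplification; the table shape and the name `latest_points_in_interval` are assumptions.

```python
def latest_points_in_interval(table, vehicles, start_ts, end_ts):
    """For each listed vehicle, return its newest point inside [start_ts, end_ts]."""
    result = {}
    for plate, color in vehicles:
        prefix = f"{plate}|{color}|"
        in_range = [
            (ts, tp)
            for code, row in table.items() if code.startswith(prefix)
            for ts, tp in row.items() if start_ts <= ts <= end_ts
        ]
        if in_range:                      # vehicles without a track here are skipped
            result[(plate, color)] = max(in_range, key=lambda item: item[0])
    return result


table = {
    "Jin XX1111|2|2020-10-20": {1603156870: (102, 72), 1603156990: (102, 76)},
    "Jin XX2222|2|2020-10-20": {1603156870: (103, 72)},
    "Jin XX2222|2|2020-10-21": {1603243990: (104, 72), 1603244590: (104, 74)},
}
vehicles = [("Jin XX1111", 2), ("Jin XX2222", 2), ("Jin XX3333", 1)]
found = latest_points_in_interval(table, vehicles, 1603156000, 1603157000)
```

Vehicles with no track in the interval simply do not appear in the result, mirroring the "move on to the next vehicle" step.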

With reference to Fig. 5, in this embodiment a further query mode is obtaining the track of a specified vehicle in a specified time interval. From the license plate number, license plate color (code) and time interval, row codes are formed and the vehicle's row data for the specified date range (the tracks of those dates) is retrieved quickly; the rows are then filtered by time stamp column, and the track points that meet the conditions are returned to the querying party.
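A track query that spans several days can be sketched by forming one row code per day and filtering each row by timestamp. Deriving the dates in UTC is an assumption of this example (a real system would use its own time zone), and `trajectory` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def trajectory(table, plate, color, start_ts, end_ts):
    """Collect one vehicle's track points with start_ts <= timestamp <= end_ts.

    table maps row code -> {timestamp: track point}. Dates are derived in
    UTC here, which is an assumption; a real system would fix its own zone.
    """
    day = datetime.fromtimestamp(start_ts, timezone.utc).date()
    last_day = datetime.fromtimestamp(end_ts, timezone.utc).date()
    points = []
    while day <= last_day:
        row = table.get(f"{plate}|{color}|{day.isoformat()}", {})
        points.extend((ts, tp) for ts, tp in sorted(row.items()) if start_ts <= ts <= end_ts)
        day += timedelta(days=1)
    return points


table = {
    "Jin XX2222|2|2020-10-20": {1603156870: (103, 72)},
    "Jin XX2222|2|2020-10-21": {1603243990: (104, 72), 1603244590: (104, 74)},
}
track = trajectory(table, "Jin XX2222", 2, 1603156000, 1603244000)
```

Because days are visited in order and each row is sorted by timestamp, the points come back in chronological order without a final sort.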

According to the invention, through the dynamic expansion of rows and columns, the large volume of redundant data is separated into frequently changing data and infrequently changing data; the frequently changing data is stored in the storage units, and the infrequently changing data forms the row codes and columns. Dynamic column expansion suits the nature of positioning data: because data sources deliver data irregularly, positioning data is stored for a moment only if data exists for that moment, and nothing is stored otherwise. Using time stamps as columns reduces data redundancy, allows dynamic expansion and ordered storage, and saves a large amount of disk space.

The invention stores data in lexicographic (dictionary) order of the row codes, so the date information in the row codes is stored in chronological order. The row code serves as the basis for data queries and uniquely identifies one row of data. Sorted storage of the row codes is well suited to track queries. A track is composed of the track points of a continuous period of time; storing row codes in order ensures that track points are stored chronologically, and the time stamp columns are likewise stored chronologically. Query efficiency improves accordingly: the data of a given vehicle for a given period can be located quickly and then filtered by time stamp column to return the positioning data of the specified interval.

The memory-based cache stack reads and writes data far faster than files do. When data is searched for, the cache stack is queried first, and the disk files are queried only on a miss. File-based storage is also equipped with corresponding file merging and file splitting mechanisms, so that files are optimized and managed automatically and stay neither too small nor too large, keeping storage in a stable and controllable state.

Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims

1. A mass data processing method, characterized in that the granularity and data fields of the data to be stored are determined, a storage table is organized and generated from dynamically expandable columns, the intersection of one column and one row of the storage table serves as a storage unit, and the storage unit is stored as the finest-grained data;

when data is stored, the data to be stored is first held in memory as a cache, and when the defined cache stack exceeds a threshold, the data held in memory is written to a file and flushed to disk;

when data is searched for in the underlying file storage, the index area of a file is consulted first to determine whether the file contains the data; if it does not, the file is skipped directly, and if it does, the data is located according to the index area;

the type of the data is vehicle positioning data, the data granularity is one track point, and the storage unit is composed of one track point;

the track points stored in the storage table use time stamps as column names, and the columns of the storage table expand dynamically over time;

the data is stored in a distributed storage architecture, and the data is finally stored in a disk in a file form; performing metadata management and maintenance on the distributed storage architecture, and determining a service node and a file directory in which data falls through the metadata management;

scanning the sizes of the storage files in the disk at regular time according to preset time, and triggering a merging mechanism to merge a plurality of storage files when preset merging conditions are met; scanning the size of a storage file in a disk at regular time according to preset time, and triggering a splitting mechanism to split the storage file when a preset splitting condition is reached;

a mark bit is maintained in the storage unit to indicate whether the data is valid; when the merging mechanism is triggered, data marked invalid is identified and is not merged;

the query modes for vehicle positioning data comprise: obtaining the latest track point of a specified vehicle, obtaining the latest track points of vehicles in a specified time interval, and obtaining the track of a specified vehicle in a specified time interval;

when data are searched in the bottom file storage, a row of data are positioned through row coding, screening is carried out according to columns, and then the searched data are returned.

2. The mass data processing method according to claim 1, wherein when two time stamps used as columns are the same, the previously stored data is overwritten by the subsequently stored data.