CN113742330A - Multidimensional transportation data fusion and data quality detection method - Google Patents

Info

Publication number
CN113742330A
Authority
CN
China
Prior art keywords
data
fusion
longitude
latitude
travel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111097329.0A
Other languages
Chinese (zh)
Other versions
CN113742330B (en)
Inventor
罗建平
陈欢
戴宇聪
杨森彬
尹杰丽
李志武
陈招帆
喻莲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Jiaoxin Investment Technology Co Ltd
Original Assignee
Guangzhou Jiaoxin Investment Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Jiaoxin Investment Technology Co Ltd
Priority to CN202111097329.0A
Publication of CN113742330A
Application granted
Publication of CN113742330B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multidimensional transportation data fusion and data quality detection method comprising the following steps. S1, data collection: original travel data for different types of vehicles is acquired from different platforms and systems; this step uses a data acquisition component, a data storage component and a data preprocessing component. S2, data fusion: fusion is divided into three levels, namely data-level fusion, feature-level fusion and decision-level fusion. S3, fused data quality detection. Using multidimensional traffic travel data, the invention supports more objective and accurate analysis of passenger travel patterns, prediction of bus section passenger flow, recommendation of taxi passenger-carrying routes, and sharing of fused multidimensional traffic travel data. It also provides a double closed-loop method for detecting the quality of fused data, which safeguards data quality as far as possible. The method addresses the problem that, in the traffic field, the data sources of different vehicles are mutually independent, which causes large errors when travel characteristics are judged and reduces the accuracy of data analysis and algorithm development.

Description

Multidimensional transportation data fusion and data quality detection method
Technical Field
The invention relates to the technical field of data fusion, in particular to a multidimensional transportation data fusion and data quality detection method.
Background
With the rapid development of the internet and big data technology, data volumes keep growing, and mining information from such volumes faces new challenges. How to extract useful information from massive data has become an important research topic. Multidimensional, multi-source data fusion provides a more objective and comprehensive information source for mining massive data, and it is important for industry analysis, prediction and similar applications. Take the capture of an individual's trip chain as an example: a traveler may use several vehicles in one trip, so if the data sources of different vehicles, such as buses and taxis, are independent of one another, large errors arise when the trip chain is determined, which reduces the accuracy of data analysis and algorithm development. Multidimensional, multi-source data fusion is therefore also very important for the various application scenarios of the transportation industry.
At the present stage, data fusion has been studied and applied in various fields and industries. For the transportation industry, however, research on and application of an integrated method for fusing multidimensional traffic trip chain data and detecting its quality is still largely lacking.
In addition, because the data sources of individual sectors are independent of one another, data cannot be shared for fused-data pattern analysis or artificial intelligence applications. As a result, the patterns derived from analysis differ considerably from actual conditions, and AI models exhibit large errors.
Disclosure of Invention
The invention aims to provide a multidimensional traffic travel data fusion and data quality detection method that safeguards data quality as far as possible; supports more objective and accurate analysis of passenger travel patterns, prediction of bus section passenger flow, recommendation of taxi passenger-carrying routes and sharing of fused multidimensional traffic travel data; greatly improves working efficiency compared with the prior art; and solves the problem that the data sources of multiple transportation means in the traffic field are mutually independent, which causes large errors when travel characteristics are judged and reduces the accuracy of data analysis and algorithm development.
To achieve this purpose, the invention provides the following technical solution:
A multidimensional transportation data fusion and data quality detection method comprises the following steps:
S1: data collection: original travel data for different types of vehicles is acquired from different platforms and systems; this step uses a data acquisition component, a data storage component and a data preprocessing component;
S2: data fusion: fusion is divided into three levels, namely data-level fusion, feature-level fusion and decision-level fusion. To obtain the multidimensional transportation fusion data table, data-level fusion and feature-level fusion are adopted. Data-level fusion comprises extracting the specific start and end positions, longitude and latitude, time, vehicle information, user id, date and vehicle type from the travel data of online ride-hailing vehicles, taxis and shared bicycles. Feature-level fusion comprises extracting the trip features of buses, subways and private cars and the spatial features of the start and end points of all vehicles;
S3: fused data quality detection.
Furthermore, in S1, the data collection component obtains bus, subway, taxi, online ride-hailing, shared bicycle and private car travel data from different platforms or systems. The bus and subway card-swiping data, the taxi, online ride-hailing and shared bicycle order data, and the private car checkpoint passage data are all structured data, and this structured data serves as the multidimensional transportation travel data source.
Further, the data storage component in S1 stores the multidimensional travel data acquired from a plurality of different systems in a distributed manner on the big data platform.
Furthermore, the data preprocessing component in S1 checks the original data for missing values, further processes the checked data according to its quality, and uploads the data quality results to the multi-source data providing systems so that the source data providers can improve data quality. The specific method comprises the following steps:
S101: missing value processing: records in which more than 80% of the fields are missing are removed directly;
S102: abnormal value processing: records whose trip distance exceeds the normal range are rejected, and records whose trip duration exceeds the normal range are rejected;
S103: data specification processing: non-standard time representations are normalized, dates are converted to a unified format, the longitude and latitude representation type is standardized, and the payment amount field is converted to a unified unit.
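Steps S101 and S102 can be illustrated with a minimal Python sketch. The field names (`distance_km`, `duration_hours`) and the numeric bounds for the "normal range" are assumptions made for illustration only; the text does not specify them.

```python
# Hypothetical thresholds: the text only says "normal range", so the two
# upper bounds below are illustrative assumptions, not values from the patent.
MAX_MISSING_RATIO = 0.8     # S101: drop records with more than 80% of fields missing
MAX_DISTANCE_KM = 200.0     # assumed upper bound for a plausible trip distance
MAX_DURATION_HOURS = 12.0   # assumed upper bound for a plausible trip duration


def missing_ratio(record: dict) -> float:
    """Fraction of fields whose value is None or an empty string."""
    if not record:
        return 1.0
    missing = sum(1 for v in record.values() if v is None or v == "")
    return missing / len(record)


def keep_record(record: dict) -> bool:
    """Return True if the record survives S101 (missing values) and S102 (outliers)."""
    if missing_ratio(record) > MAX_MISSING_RATIO:
        return False
    distance = record.get("distance_km")
    duration = record.get("duration_hours")
    if isinstance(distance, (int, float)) and distance > MAX_DISTANCE_KM:
        return False
    if isinstance(duration, (int, float)) and duration > MAX_DURATION_HOURS:
        return False
    return True


records = [
    {"user_id": "u1", "distance_km": 5.2, "duration_hours": 0.4, "fee": 18.0, "type": "taxi"},
    {"user_id": "u2", "distance_km": 9000.0, "duration_hours": 1.0, "fee": 30.0, "type": "taxi"},
    {"user_id": None, "distance_km": None, "duration_hours": None, "fee": None, "type": None},
    {"user_id": "u4", "distance_km": 12.0, "duration_hours": 15.0, "fee": 60.0, "type": "taxi"},
]
cleaned = [r for r in records if keep_record(r)]
```

In this sketch only the first record is kept: the second fails the distance bound, the third is entirely missing, and the fourth fails the duration bound.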
Furthermore, the spatial feature extraction method used for data fusion in S2, as improved on the basis of geographic gridding, comprises the following steps:
S201: spatial gridding of the target area: a target area is selected according to service requirements and its space is divided into a plurality of grids using a spatial gridding technique, specifically geohash grid coding. In view of the error requirement, the target area is divided at the granularity of the 8-bit geohash (a geohash code eight characters long), whose spatial error is 19 meters;
S202: home area information of each grid: the home area of each 8-bit geohash grid is calculated in turn. For one 8-bit geohash in the target area, the method is as follows:
S2021: find the center of the 8-bit geohash and express it as a longitude and latitude;
S2022: with this point as the center, screen out all candidate areas within a square area with a side length of 10 kilometers, as follows:
(1) calculate the longitude and latitude of the northwest and southeast corners of the square area, denoted (lng_w, lat_n) and (lng_e, lat_s) respectively;
(2) find the longitude and latitude of the northwest and southeast corners of the fence of every traffic cell, denoted (lng_w_i, lat_n_i) and (lng_e_i, lat_s_i);
(3) screen out the candidate traffic cells that satisfy the following condition:
[Formula image BDA0003269555470000031 is not reproduced in this text; it states the screening condition in terms of (lng_w, lat_n), (lng_e, lat_s), (lng_w_i, lat_n_i) and (lng_e_i, lat_s_i).]
This screening quickly narrows the set of candidate areas;
S2023: traverse all candidate areas obtained in S2022 and, using the fence of each candidate area and the longitude and latitude of the center of the 8-bit geohash, determine the area to which the point belongs with a point-in-fence (point-in-polygon) algorithm;
S203: information on the station nearest to each grid: the nearest station of each 8-bit geohash grid is found in turn, as follows:
S2031: find the center of the 8-bit geohash and express it as a longitude and latitude (lng, lat);
S2032: with this point as the center, screen out all candidate stations within a square area with a side length of 4 kilometers, as follows:
(1) calculate the longitude and latitude of the northwest and southeast corners of the square area, denoted (lng_w, lat_n) and (lng_e, lat_s) respectively;
(2) denote the longitude and latitude of each station as (lng_w_i, lat_n_i), (lng_e_i, lat_s_i);
(3) screen out all bus stops within the square area, that is, those satisfying the following condition:
[Formula image BDA0003269555470000041 is not reproduced in this text; it states the condition for a station to lie within the square area.]
S2033: using a longitude-latitude distance formula, calculate in turn the distance from each candidate station to the center (lng, lat) of the 8-bit geohash, and find the nearest station and its distance;
By the above steps, the nearest station and its distance can be obtained in turn for every spatial grid in the target area;
S204: the basic spatial feature information obtained in S201 to S203 by geographic gridding is joined with the 8-bit geohash of the start and end points of each trip record, so that the spatial features of the start and end points of each trip record can be extracted quickly.
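The lookups in S202 and S203 can be sketched in Python as follows. Because the two screening formulas are not reproduced in this text, the sketch assumes a bounding-box intersection test for traffic cells and a simple containment test for stations. It also assumes ray casting for the point-in-fence test and the haversine formula for the longitude-latitude distance; the text names neither. All function names and sample coordinates are hypothetical.

```python
import math


def square_bounds(lng, lat, side_km):
    """Approximate NW and SE corners of a square of the given side centred on (lng, lat)."""
    half = side_km / 2.0
    dlat = half / 111.32                                   # about 111.32 km per degree of latitude
    dlng = half / (111.32 * math.cos(math.radians(lat)))
    return (lng - dlng, lat + dlat), (lng + dlng, lat - dlat)   # (lng_w, lat_n), (lng_e, lat_s)


def fence_bbox(fence):
    """NW and SE corners of the bounding box of a fence given as [(lng, lat), ...]."""
    lngs = [p[0] for p in fence]
    lats = [p[1] for p in fence]
    return (min(lngs), max(lats)), (max(lngs), min(lats))


def bbox_overlaps(square, box):
    """Assumed screening condition: the two axis-aligned boxes intersect."""
    (lng_w, lat_n), (lng_e, lat_s) = square
    (lng_w_i, lat_n_i), (lng_e_i, lat_s_i) = box
    return lng_w_i <= lng_e and lng_e_i >= lng_w and lat_s_i <= lat_n and lat_n_i >= lat_s


def point_in_fence(lng, lat, fence):
    """Ray-casting point-in-polygon test."""
    inside = False
    n = len(fence)
    for i in range(n):
        x1, y1 = fence[i]
        x2, y2 = fence[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lng < x_cross:
                inside = not inside
    return inside


def haversine_km(lng1, lat1, lng2, lat2):
    """Great-circle distance in kilometres between two WGS84 points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0088 * math.asin(math.sqrt(a))


def home_area(lng, lat, areas, side_km=10.0):
    """S202: pre-screen by bounding box, then confirm with the point-in-fence test."""
    square = square_bounds(lng, lat, side_km)
    for name, fence in areas.items():
        if bbox_overlaps(square, fence_bbox(fence)) and point_in_fence(lng, lat, fence):
            return name
    return None


def nearest_station(lng, lat, stations, side_km=4.0):
    """S203: keep stations inside the square, then pick the closest one."""
    (lng_w, lat_n), (lng_e, lat_s) = square_bounds(lng, lat, side_km)
    best = None
    for name, (s_lng, s_lat) in stations.items():
        if lng_w <= s_lng <= lng_e and lat_s <= s_lat <= lat_n:
            d = haversine_km(lng, lat, s_lng, s_lat)
            if best is None or d < best[1]:
                best = (name, d)
    return best


areas = {
    "cell_A": [(113.20, 23.10), (113.30, 23.10), (113.30, 23.20), (113.20, 23.20)],
    "cell_B": [(113.30, 23.10), (113.40, 23.10), (113.40, 23.20), (113.30, 23.20)],
}
stations = {"stop_1": (113.251, 23.151), "stop_2": (113.26, 23.16), "stop_far": (113.40, 23.30)}
center = (113.25, 23.15)   # stands in for the center of one 8-bit geohash grid
cell = home_area(*center, areas)
stop = nearest_station(*center, stations)
```

Running these two lookups once per grid, rather than once per trip record, is what allows S204 to reduce per-record feature extraction to a key lookup.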
Further, in order to reduce the time complexity of feature-level fusion, the extraction of spatial features is optimized as follows:
(1) unify the coordinate type of the trip start and end points; the unified standard is WGS84 longitude and latitude;
(2) encode the trip start and end points as 8-bit geohashes according to their longitude and latitude;
(3) generate the basic information tables corresponding to all 8-bit geohashes in the target area;
(4) join the data-level fusion data with the basic information for all 8-bit geohashes in the target area, using the 8-bit geohash of the trip start or end point and the 8-bit geohash of the basic information data as the join key, to obtain the traffic cell, street, administrative jurisdiction and important activity place in which each trip start or end point is located, together with the fields describing the nearest bus and subway stations;
(5) complete the fusion of the feature-level fields corresponding to the trip start and end points.
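Steps (2) to (4) can be sketched as a geohash encoding followed by a dictionary lookup. The encoder below is the standard public geohash algorithm with eight output characters; the field names (`traffic_cell`, `nearest_station`) and the helper `fuse_spatial_features` are hypothetical, and the in-memory dictionary stands in for the basic information tables.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # geohash alphabet (no a, i, l, o)


def geohash_encode(lng, lat, precision=8):
    """Standard geohash encoding: interleave longitude and latitude bits, 5 bits per character."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars = []
    bit_count, value = 0, 0
    use_lng = True                      # the first bit refines longitude
    while len(chars) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                value = (value << 1) | 1
                lng_lo = mid
            else:
                value <<= 1
                lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lng = not use_lng
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[value])
            bit_count, value = 0, 0
    return "".join(chars)


# Hypothetical feature fields; the patent's basic information tables hold more.
FEATURE_FIELDS = ("traffic_cell", "nearest_station")


def fuse_spatial_features(trip, base_info):
    """Attach start_/end_ spatial features to a trip by looking up each point's geohash."""
    fused = dict(trip)
    for prefix in ("start", "end"):
        key = geohash_encode(trip[f"{prefix}_lng"], trip[f"{prefix}_lat"])
        fused[f"{prefix}_geohash"] = key
        info = base_info.get(key, {})
        for field in FEATURE_FIELDS:
            fused[f"{prefix}_{field}"] = info.get(field)   # None when the grid is unknown
    return fused


base_info = {
    geohash_encode(113.25, 23.15): {"traffic_cell": "cell_A", "nearest_station": "stop_1"},
}
trip = {"start_lng": 113.25, "start_lat": 23.15, "end_lng": 113.40, "end_lat": 23.30}
fused = fuse_spatial_features(trip, base_info)
```

With the basic information keyed by geohash, each trip record costs one encoding and one lookup per point, independent of the number of traffic cells or stations.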
Furthermore, the criteria for judging data quality in S3 are completeness, consistency, accuracy and timeliness. Completeness refers to whether data information is missing; consistency refers to whether the data conforms to a unified specification; accuracy refers to whether the recorded information is abnormal or contains errors introduced during calculation; timeliness refers to the interval between the time the data is produced and the time it becomes available for viewing. Based on these four criteria, a double closed-loop detection assembly is formed, consisting of a source feedback detection component and a fusion result detection component. The source feedback detection component is the primary component and the fusion result detection component is the secondary component; the fusion result detection component performs fused data quality detection only after the data has passed detection by the source feedback detection component.
Furthermore, the source feedback detection component detects, at the data source, the quality of the data provided by the multi-source systems and feeds the result back to the source data providers. It mainly checks for missing and abnormal data and checks whether the data arrives on the big data platform within the specified time period. The detection result is fed back as a report to the source system or platform that provided the original data, so that the source party can improve data quality promptly according to the report's suggestions;
The fusion result detection component compares the fused data with the cleaned data to detect the accuracy of the fused data. Accuracy is detected in three respects: comparison of data volume before and after fusion, common statistics, and data distribution. Comparing the data volume before and after fusion is row-based detection, in which a single row of data is the minimum unit of detection. Detection by common statistics and by data distribution is column-based detection; the distance field, for example, is also checked along the data distribution dimension.
Compared with the prior art, the invention has the beneficial effects that:
1. The multidimensional traffic travel data fusion and data quality detection method provided by the invention solves the problem that the data sources of multiple vehicles in the traffic field are mutually independent, which causes large errors when travel characteristics are judged and reduces the accuracy of data analysis and algorithm development. It also provides a new approach and solution for multidimensional traffic travel data fusion.
2. With the method provided by the invention, multidimensional traffic travel data can be used for more objective and accurate analysis of passenger travel patterns, prediction of bus section passenger flow, recommendation of taxi passenger-carrying routes and sharing of fused multidimensional traffic travel data. A double closed-loop method for detecting the quality of fused data is also provided, which safeguards data quality as far as possible.
Drawings
FIG. 1 is a flow diagram of a data preprocessing component of the present invention;
FIG. 2 is a block diagram of the overall concept of data fusion according to the present invention;
FIG. 3 is a diagram of a data fusion process of the present invention;
FIG. 4 is a flow chart of data quality detection according to the present invention;
FIG. 5 is a flow diagram of a source feedback detection assembly of the present invention;
FIG. 6 is a flow diagram of a fusion result detection assembly of the present invention;
FIG. 7 is a diagram of the screening of home traffic cells based on the 8-bit geohash according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In an embodiment of the invention, a multidimensional transportation data fusion and data quality detection method comprises the following steps:
Step one: data collection: original travel data for different types of vehicles is acquired from different platforms and systems; this step uses a data acquisition component, a data storage component and a data preprocessing component.
Specifically, data collection is the basis and precondition of data fusion. Data from different service lines, systems and types is gathered into a data middle platform, which enables unified management of the data and data fusion across fields and platforms. During collection, the data collection component records each data source in detail; the recorded fields are listed in Table 1:
Table 1: Data source record
[Table image BDA0003269555470000071 is not reproduced in this text.]
The bus, subway, taxi, online ride-hailing, shared bicycle and private car travel data obtained from different platforms or systems is all structured data. The multidimensional transportation travel data source comprises bus and subway card-swiping data; taxi, online ride-hailing and shared bicycle order data; and private car checkpoint passage data.
The data storage component in this embodiment stores the multidimensional travel data acquired from a plurality of different systems in a distributed manner on the big data platform.
The data preprocessing component in this embodiment checks the original data for missing values, further processes the checked data according to its quality, and uploads the data quality results to the multi-source data providing systems so that the source data providers can improve data quality, as shown in FIG. 1.
The data preprocessing process comprises missing value processing, abnormal value processing and data specification processing, specifically as follows:
Missing value processing: records in which more than 80% of the fields are missing are removed directly.
Abnormal value processing:
Abnormal condition 1: records whose trip distance exceeds the normal range are rejected, for example a trip whose start point lies within the country and whose end point lies abroad;
Abnormal condition 2: records whose trip duration exceeds the normal range are rejected, for example a taxi trip lasting more than 12 hours.
Data specification processing:
Analysis of the original data showed that longitude and latitude are in the wrong order in some records, so the two values need to be exchanged. Non-standard time representations are normalized: for example, time information in the original databases appears in several forms, such as "2021-07-02 10:00:01" or "20210702100001", and is uniformly converted to the format "xxxx-xx-xx xx:xx:xx". Dates are uniformly converted to the format "xxxx-xx-xx". The longitude and latitude representation type is standardized; this embodiment uses WGS84. The payment amount field is converted to a unified unit. Other fields are likewise processed according to the specification.
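The data specification step can be sketched in Python as follows. The two input time formats are the ones named in the text. The rule used to detect swapped coordinates (a latitude outside ±90 degrees whose partner value is a valid latitude) is an assumption; it only works in regions where the longitude magnitude exceeds 90 degrees, and the text does not say how misordered records are identified.

```python
from datetime import datetime

# Input formats named in the text; real sources may contain others.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y%m%d%H%M%S")


def normalize_time(raw: str) -> str:
    """Convert a known time representation to 'YYYY-MM-DD HH:MM:SS'."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
    raise ValueError(f"unrecognized time format: {raw!r}")


def normalize_date(raw: str) -> str:
    """Date part of the normalized timestamp, 'YYYY-MM-DD'."""
    return normalize_time(raw)[:10]


def fix_lng_lat(lng: float, lat: float):
    """Assumed heuristic: swap the pair when 'lat' is impossible but 'lng' is a valid latitude."""
    if abs(lat) > 90 and abs(lng) <= 90:
        return lat, lng
    return lng, lat


print(normalize_time("20210702100001"))
print(fix_lng_lat(23.13, 113.26))
```

A record with an unrecognized time format raises an error in this sketch rather than being silently passed through, so that it can be reported back to the data provider.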
Step two: data fusion: as shown in FIG. 2, based on the technical principle of data fusion, the fusion process is divided into data-level fusion and feature-level fusion. Data-level fusion comprises extracting features such as the specific start and end positions, longitude and latitude, time, vehicle information, user id, date and vehicle type from the travel data of online ride-hailing vehicles, taxis and shared bicycles. Feature-level fusion comprises extracting the trip features of buses, subways and private cars and the spatial features of the start and end points of all vehicles.
Specifically, the features are defined as follows. The purpose of multidimensional traffic travel data fusion in this embodiment is to integrate passengers' trip records across different transportation means, so that their travel patterns can be analyzed and judged more objectively and comprehensively. This embodiment therefore integrates the travel data of six commonly used passenger transportation means, such as buses, subways, taxis, online ride-hailing vehicles, shared bicycles, private cars and passenger buses, and combines a number of useful fields generated during travel as important information for analyzing and judging passengers' travel patterns.
This embodiment designs the multidimensional traffic travel data fusion features and table structure under three basic guiding ideas: complete field definitions, reasonable specifications, and distributed storage of massive data. Taking trip start and end points as the main object of study, and combining trip characteristics such as travel distance and consumption during the trip, multi-source data is fused and 46 fields in total are extracted: 18 fields related to the trip start point, 18 fields related to the trip end point, and 10 fields related to characteristics of the trip itself, as shown in Table 2:
Table 2: Feature design of the large multidimensional traffic travel data fusion table
[Table images BDA0003269555470000081, BDA0003269555470000091 and BDA0003269555470000101 are not reproduced in this text.]
The multidimensional transportation data fusion process in this embodiment is as follows. The large fused table defined in Table 2 has a clear data lineage, shown in FIG. 3. The large multidimensional transportation travel data fusion table is generated from the trip OD (origin-destination) data of the various vehicles. The feature-level fusion fields in that OD data depend on three basic information tables keyed by regional geographic grid code (the invention uses the 8-bit geohash), or on the feature fusion extraction method for the corresponding fields. The generation process and principle of the three basic information tables, and of the feature fusion extraction for the corresponding fields, are those of generating basic spatial feature data by geographic gridding. The data-level fusion fields in the trip OD data are derived from the historical order tables of the various vehicles.
In addition, the distributed storage of massive data in this embodiment works as follows. The generated data not only has complete, standardized and reasonable field definitions and a clear data lineage, but is also stored in a distributed manner. The large fused table generated by the method of this embodiment covers many kinds of traffic travel data, has many fields and produces a huge volume of data every day, so a distributed storage mode is needed. In the storage design, the table is stored under two-level partitioning: the date is the primary partition (the pdate field in Table 2) and the vehicle type is the secondary partition (the type field in Table 2), and the data is stored across multiple nodes. Two-level partitioning makes it possible both to store massive data and to read and write it quickly.
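The two-level partitioning can be pictured with a small Python sketch. The `pdate` and `type` field names come from the text; the `key=value` directory layout is a Hive-style convention assumed here for illustration, and the root path is hypothetical. The text does not name the storage engine.

```python
from collections import defaultdict


def partition_key(record):
    """Primary partition: date (pdate). Secondary partition: vehicle type (type)."""
    return (record["pdate"], record["type"])


def partition_path(root, record):
    """Assumed Hive-style directory layout for the two partition levels."""
    pdate, vehicle_type = partition_key(record)
    return f"{root}/pdate={pdate}/type={vehicle_type}"


def group_by_partition(records):
    """Group fused records by their (pdate, type) partition."""
    parts = defaultdict(list)
    for r in records:
        parts[partition_key(r)].append(r)
    return dict(parts)


rows = [
    {"pdate": "2021-07-02", "type": "taxi", "distance_km": 5.2},
    {"pdate": "2021-07-02", "type": "bus", "distance_km": 3.1},
    {"pdate": "2021-07-02", "type": "taxi", "distance_km": 8.0},
    {"pdate": "2021-07-03", "type": "taxi", "distance_km": 2.4},
]
parts = group_by_partition(rows)
path = partition_path("/warehouse/fusion", rows[0])   # hypothetical root path
```

A query restricted to one date and one vehicle type then only needs to read a single partition instead of scanning the whole table.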
Step three: fused data quality detection: as shown in FIG. 4, data quality is evaluated mainly in four respects: completeness, consistency, accuracy and timeliness. Completeness refers to whether data information is missing; consistency refers to whether the data conforms to a unified specification; accuracy refers to whether the recorded information is abnormal or contains errors introduced during calculation; timeliness refers to the interval between the time the data is produced and the time it becomes available for viewing. In this embodiment, data quality is detected in these respects by a double closed-loop detection assembly consisting of a source feedback detection component and a fusion result detection component. The source feedback detection component is the primary component and the fusion result detection component is the secondary component; the fusion result detection component performs fused data quality detection only after the data has passed detection by the source feedback detection component.
As shown in FIG. 5, in this embodiment the source feedback detection component detects, at the data source, the quality of the data provided by the multi-source systems and feeds the result back to the source data providers. It mainly checks for missing and abnormal data and checks whether the data arrives on the big data platform within the specified time period. The detection result is fed back as a report to the source system or platform that provided the original data, so that the source party can improve data quality promptly according to the report's suggestions.
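The gating between the two components can be sketched as follows. The report fields, the pass rule (no missing required fields and on-time arrival) and the function names are assumptions for illustration; the text does not define the thresholds or the report format.

```python
from datetime import datetime


def source_feedback_report(records, required_fields, arrived_at, deadline):
    """Primary component: check missing values and timeliness, and build a report."""
    missing = {
        f: sum(1 for r in records if r.get(f) in (None, ""))
        for f in required_fields
    }
    report = {
        "total_records": len(records),
        "missing_counts": missing,
        "on_time": arrived_at <= deadline,
    }
    # Assumed pass rule: data present, on time, and no required field missing.
    report["passed"] = (
        len(records) > 0 and report["on_time"] and all(c == 0 for c in missing.values())
    )
    return report


def run_double_loop(records, required_fields, arrived_at, deadline, fusion_check):
    """Run the secondary (fusion result) check only if the primary check passes."""
    report = source_feedback_report(records, required_fields, arrived_at, deadline)
    if not report["passed"]:
        return report, None          # the report goes back to the source provider
    return report, fusion_check(records)


good = [{"user_id": "u1", "start_time": "2021-07-02 10:00:01"}]
bad = [{"user_id": None, "start_time": "2021-07-02 10:00:01"}]
arrived = datetime(2021, 7, 2, 1, 30)
deadline = datetime(2021, 7, 2, 2, 0)

ok_report, ok_result = run_double_loop(good, ["user_id", "start_time"], arrived, deadline, len)
bad_report, bad_result = run_double_loop(bad, ["user_id", "start_time"], arrived, deadline, len)
```

Here `len` stands in for the fusion result check; in practice it would be replaced by the row-count and column-statistics comparisons of the secondary component.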
As shown in FIG. 6, in this embodiment the fusion result detection component compares the fused data with the cleaned data to detect the accuracy of the fused data. Accuracy is detected in three respects: comparison of data volume before and after fusion, common statistics, and data distribution. During fusion of the multidimensional traffic travel data, the number of records after fusion should equal the number of records after cleaning, so the quality of the fused data is checked by comparing the record counts before and after fusion. This is row-based detection, in which a single row of data is the minimum unit of detection.
Detection by common statistics and by data distribution is column-based detection. For example, with the fused travel distance column as the detection target, the maximum, minimum, mean and variance of the travel distance field are calculated for both the fused data and the data before fusion, and the four statistics before and after fusion are compared to judge the quality of the fused data. The distance field can be checked along the data distribution dimension in a similar manner. The other fields of the fused data are detected in a similar way.
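The row-based and column-based checks can be sketched with the Python standard library. The use of population variance and the comparison tolerance are assumptions; the text does not say which variance definition or tolerance is used.

```python
import math
from statistics import fmean, pvariance


def row_count_matches(before, after):
    """Row-based check: record counts before and after fusion must agree."""
    return len(before) == len(after)


def column_stats(values):
    """Maximum, minimum, mean and (population) variance of one column."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": fmean(values),
        "variance": pvariance(values),
    }


def compare_column(before, after, rel_tol=1e-9):
    """Column-based check: compare the four statistics before and after fusion."""
    sb, sa = column_stats(before), column_stats(after)
    return {
        name: math.isclose(sb[name], sa[name], rel_tol=rel_tol, abs_tol=1e-12)
        for name in sb
    }


distance_before = [1.0, 2.0, 3.5, 10.0]
distance_after = [10.0, 3.5, 2.0, 1.0]        # same values, different order
distance_broken = [1.0, 2.0, 3.5, 100.0]      # one value corrupted during fusion

same = compare_column(distance_before, distance_after)
broken = compare_column(distance_before, distance_broken)
```

In this sketch `same` reports agreement on all four statistics, while `broken` flags the maximum, mean and variance but not the minimum, which points to a corrupted large value.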
In the above embodiment, to better explain the invention, a method for generating basic spatial feature data by geographic gridding is also provided. The method optimizes the extraction of spatial features for trip start and end points by using geographic gridding to generate the basic spatial feature information. Compared with the traditional method, the improved method effectively reduces the time complexity of feature extraction and increases the speed of data fusion.
The traditional extraction method comprises the following steps:
For extracting the traffic cell, street, administrative jurisdiction and important activity place to which a start or stop point belongs, the traditional spatial feature extraction method traverses each trip record in sequence and, for each record, finds the area containing the point from the longitude and latitude of the start and stop points. Taking the extraction of the traffic cell to which the travel starting point belongs as an example, the traditional feature extraction process is as follows:
(1) traverse each travel record and, for each record, find the traffic cell to which the travel starting point belongs according to step (2);
(2) traverse all traffic cells; for each one, take its fence and the starting-point longitude and latitude of the travel record, and use a point-in-fence algorithm to judge whether the point lies in that traffic cell; if all traffic cells have been traversed and no matching traffic cell is found, the starting-point traffic cell feature of the travel record is empty.
Similarly, the traffic cell to which the end point of each travel record belongs can be obtained by this method.
Similarly, the streets, administrative jurisdictions and important activity places to which the start and stop points of each travel record belong can be obtained by this method.
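The traditional traversal above can be sketched as follows. This is a minimal sketch under stated assumptions: fences are lists of (lng, lat) vertices, the point-in-fence test is the ray method mentioned later in the text, and the names `point_in_fence`, `traditional_cell_lookup`, `o_lng` and `o_lat` are hypothetical.

```python
def point_in_fence(lng, lat, fence):
    """Ray method: count how often a ray cast east from the point crosses the fence."""
    inside = False
    n = len(fence)
    for i in range(n):
        x1, y1 = fence[i]
        x2, y2 = fence[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lng < x_cross:
                inside = not inside
    return inside

def traditional_cell_lookup(trips, cells):
    """For every trip record, scan every traffic cell fence until one matches."""
    result = []
    for trip in trips:
        found = None  # stays empty if no traffic cell contains the start point
        for cell_id, fence in cells.items():
            if point_in_fence(trip["o_lng"], trip["o_lat"], fence):
                found = cell_id
                break
        result.append(found)
    return result

# Hypothetical rectangular traffic cells and trip start points.
cells = {
    "A1": [(113.0, 23.0), (113.1, 23.0), (113.1, 23.1), (113.0, 23.1)],
    "A2": [(113.1, 23.0), (113.2, 23.0), (113.2, 23.1), (113.1, 23.1)],
}
trips = [
    {"o_lng": 113.05, "o_lat": 23.05},
    {"o_lng": 113.15, "o_lat": 23.05},
    {"o_lng": 114.0, "o_lat": 24.0},
]
origin_cells = traditional_cell_lookup(trips, cells)
```

The nested loops make the cost grow with the number of trip records times the number of traffic cells times the number of fence edges, which is the cost the improved method sets out to avoid.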
For the features of the bus stop and subway station closest to the start and stop points, and the corresponding distances, the traditional method likewise traverses each travel record in sequence, computes the distance from the start or stop point of each record to the bus stops and subway stations using longitude and latitude, and returns the nearest station and its distance. The solving process is similar to that used for the home area.
The traditional method for extracting the spatial features of trip start and stop points therefore has high algorithmic time complexity, which slows data fusion and makes the method time-consuming and laborious for massive multidimensional traffic travel data. An optimization method is needed that reduces the time complexity of the data fusion process and increases the data fusion speed.
The improved extraction method of the invention, a spatial feature extraction method based on the geographic gridding technology, comprises the following steps:
(1) Spatial gridding of the target area: a target area is selected according to business requirements; for example, if multidimensional traffic data fusion for the whole of Guangzhou is to be studied, the target area is Guangzhou. The target area space is then divided into a plurality of grids by a spatial gridding technology. In this embodiment, the geohash grid coding technology is adopted and, in view of the error requirement, the target area is divided into grids at a granularity of 8-bit geohash, whose spatial error is about 19 meters. The geohash precision errors for different numbers of bits are shown in Table 3:
TABLE 3 error corresponding to different bit numbers geohash
Figure BDA0003269555470000141
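To make the gridding step concrete, the following sketch encodes a longitude and latitude into a geohash string. It assumes that "8-bit geohash" in the text refers to the standard 8-character base-32 geohash; the function name `geohash_encode` is hypothetical, and a production system would more likely use an existing geohash library.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lng, lat, length=8):
    """Encode WGS84 longitude/latitude as a base-32 geohash of the given length."""
    lng_lo, lng_hi = -180.0, 180.0
    lat_lo, lat_hi = -90.0, 90.0
    chars = []
    bits, bit_count = 0, 0
    use_lng = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < length:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits = (bits << 1) | 1
                lng_lo = mid
            else:
                bits <<= 1
                lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lng = not use_lng
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

# Approximate coordinates in central Guangzhou, used only as an illustration.
guangzhou_hash = geohash_encode(113.2644, 23.1291)
```

An 8-character geohash uses 20 latitude bits, so each cell spans 180 / 2^20 degrees of latitude, roughly 19 meters, which is consistent with the error quoted above.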
(2) Home-area information for each grid (large, medium and small zones, key areas): the home area of each 8-bit geohash grid is computed in turn. Taking the home traffic cell of one 8-bit geohash in the target area as an example, the solution for each 8-bit geohash grid is as follows:
First, find the center of the 8-bit geohash and express it by longitude and latitude;
Second, screen out all candidate traffic cells that fall within a square area centered on that center with a side length of 10 kilometers. The screening is performed as follows:
Determine the longitude and latitude of the northwest and southeast corners of the square area, expressed as (lng_w, lat_n), (lng_e, lat_s);
Find the longitude and latitude of the northwest and southeast corners of the fence area of every traffic cell, expressed as (lng_w_i, lat_n_i), (lng_e_i, lat_s_i);
Screen out the traffic cells that may meet the requirement, where the condition to be met is:
Figure BDA0003269555470000142
For example, the square in fig. 7 represents the square area centered on the 8-bit geohash, A1, A2, A3 and A4 represent four traffic cells, and the candidate traffic cells A2, A3 and A4 that meet the requirement can be quickly screened out by the above algorithm.
Third, traverse all candidate traffic cells obtained in the second step and, from each candidate's fence and the longitude and latitude of the 8-bit geohash center, use the point-in-fence algorithm to determine the traffic cell to which the center belongs. If no matching traffic cell is found after all candidate traffic cells have been traversed, the 8-bit geohash has no corresponding traffic cell.
Following the first to third steps above, the traffic cells to which all spatial grids (8-bit geohash) in the target area belong can be obtained in turn.
Similarly, the streets, administrative jurisdictions and important activity places to which all spatial grids (8-bit geohash) in the target area belong can be obtained by this method.
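Step (2) can be sketched as a bounding-box screen followed by the point-in-fence test. The screening condition in the source is given only as a figure, so the rectangle-overlap test below is an assumption, as are the flat-earth conversion from kilometers to degrees and all function names.

```python
import math

def point_in_fence(lng, lat, fence):
    """Ray method point-in-polygon test on a list of (lng, lat) vertices."""
    inside = False
    n = len(fence)
    for i in range(n):
        x1, y1 = fence[i]
        x2, y2 = fence[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):
            if lng < x1 + (lat - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside

def square_around(lng, lat, side_km):
    """Corners (lng_w, lat_n, lng_e, lat_s) of a square centered on the point."""
    half_lat = (side_km / 2) / 111.32  # assumed: about 111.32 km per degree of latitude
    half_lng = half_lat / math.cos(math.radians(lat))
    return lng - half_lng, lat + half_lat, lng + half_lng, lat - half_lat

def fence_box(fence):
    """Northwest and southeast corners of a fence's bounding box."""
    lngs = [p[0] for p in fence]
    lats = [p[1] for p in fence]
    return min(lngs), max(lats), max(lngs), min(lats)

def boxes_overlap(a, b):
    """Assumed screening condition: the two rectangles intersect."""
    a_w, a_n, a_e, a_s = a
    b_w, b_n, b_e, b_s = b
    return a_w <= b_e and b_w <= a_e and a_s <= b_n and b_s <= a_n

def candidate_cells(lng, lat, cells, side_km=10.0):
    """Second step: keep only cells whose bounding box touches the square."""
    square = square_around(lng, lat, side_km)
    return [cid for cid, f in cells.items() if boxes_overlap(square, fence_box(f))]

def cell_for_grid_center(lng, lat, cells, side_km=10.0):
    """Third step: run the point-in-fence test on the candidates only."""
    for cid in candidate_cells(lng, lat, cells, side_km):
        if point_in_fence(lng, lat, cells[cid]):
            return cid
    return None  # the grid has no corresponding traffic cell

# Hypothetical rectangular cells; A1 is far from the grid center.
cells = {
    "A1": [(114.0, 24.0), (114.1, 24.0), (114.1, 24.1), (114.0, 24.1)],
    "A2": [(113.1, 23.0), (113.18, 23.0), (113.18, 23.1), (113.1, 23.1)],
    "A3": [(113.19, 23.0), (113.3, 23.0), (113.3, 23.1), (113.19, 23.1)],
}
home_cell = cell_for_grid_center(113.15, 23.05, cells)
```

The rectangle comparison costs a few arithmetic operations per cell, so the expensive fence test runs only on the handful of cells near the grid center.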
(3) Nearest bus stop and distance information for each grid: the nearest bus stop and its distance are computed for each 8-bit geohash grid in turn. Taking one 8-bit geohash in the target area as an example, the solution for each 8-bit geohash grid is as follows:
First, find the center of the 8-bit geohash and express it by longitude and latitude as (lng, lat);
Second, screen out all candidate bus stops that fall within a square area centered on that center with a side length of 4 kilometers. The screening is performed as follows:
Determine the longitude and latitude of the northwest and southeast corners of the square area, expressed as (lng_w, lat_n), (lng_e, lat_s);
The longitude and latitude of each bus stop are expressed as (lng_w_i, lat_n_i), (lng_e_i, lat_s_i);
Screen out all bus stops in the square area, where the condition to be met is:
Figure BDA0003269555470000151
Third, compute the distance from each candidate bus stop to the center (lng, lat) of the 8-bit geohash using a longitude and latitude distance formula, and find the bus stop closest to the center together with its distance.
Following the first to third steps above, the nearest bus stop and distance information for all spatial grids (8-bit geohash) in the target area can be obtained in turn.
(4) Nearest subway station and distance information for each grid:
Similarly, the nearest subway station and distance information for all spatial grids (8-bit geohash) in the target area can be obtained by the method in (3).
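Steps (3) and (4) can be sketched as a square screen followed by a distance computation. The text only refers to "a longitude and latitude distance formula"; the haversine formula used here is an assumption, as are the station records, the kilometers-to-degrees conversion and the function names.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an assumed constant

def haversine_m(lng1, lat1, lng2, lat2):
    """Great-circle distance in meters between two WGS84 points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def nearest_station(lng, lat, stations, side_km=4.0):
    """Screen stations inside the square, then return the closest one and its distance."""
    half_lat = (side_km / 2) / 111.32  # assumed: about 111.32 km per degree of latitude
    half_lng = half_lat / math.cos(math.radians(lat))
    candidates = [
        s for s in stations
        if abs(s["lng"] - lng) <= half_lng and abs(s["lat"] - lat) <= half_lat
    ]
    if not candidates:
        return None, None  # no station within the square
    best = min(candidates, key=lambda s: haversine_m(lng, lat, s["lng"], s["lat"]))
    return best["name"], haversine_m(lng, lat, best["lng"], best["lat"])

# Hypothetical bus stops; S3 lies far outside the 4 km square.
stations = [
    {"name": "S1", "lng": 113.2650, "lat": 23.1300},
    {"name": "S2", "lng": 113.2700, "lat": 23.1350},
    {"name": "S3", "lng": 113.5000, "lat": 23.5000},
]
name, distance_m = nearest_station(113.2644, 23.1291, stations)
```

The same function serves bus stops and subway stations; only the station list changes.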
(5) The spatial feature basic information obtained in steps (2) to (4) based on the geographic gridding technology is joined with the 8-bit geohash corresponding to the start and stop points of each trip record, so that the spatial features of the start and stop points of each trip record are extracted quickly.
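Step (5) amounts to a key lookup on the geohash. The sketch below uses a plain dictionary as the precomputed grid table; the field names (`o_geohash`, `d_geohash`, `traffic_cell`, `nearest_bus_stop`, `bus_stop_distance_m`) and the geohash values are hypothetical.

```python
FEATURE_FIELDS = ("traffic_cell", "nearest_bus_stop", "bus_stop_distance_m")

def attach_spatial_features(trips, grid_table):
    """Look up the precomputed features of each trip's start and stop geohash."""
    fused = []
    for trip in trips:
        row = dict(trip)
        for prefix in ("o", "d"):  # o = start point, d = stop point
            features = grid_table.get(trip[prefix + "_geohash"], {})
            for name in FEATURE_FIELDS:
                # Missing grids yield empty (None) features, as in the text.
                row[prefix + "_" + name] = features.get(name)
        fused.append(row)
    return fused

# Hypothetical precomputed grid table keyed by 8-character geohash.
grid_table = {
    "ws0e9c4k": {"traffic_cell": "A2", "nearest_bus_stop": "S1", "bus_stop_distance_m": 117.0},
    "ws0e9c4m": {"traffic_cell": "A3", "nearest_bus_stop": "S2", "bus_stop_distance_m": 64.5},
}
trips = [
    {"trip_id": 1, "o_geohash": "ws0e9c4k", "d_geohash": "ws0e9c4m"},
    {"trip_id": 2, "o_geohash": "ws0e9c4m", "d_geohash": "zzzzzzzz"},
]
fused_trips = attach_spatial_features(trips, grid_table)
```

Because each lookup is a constant-time dictionary access, the per-trip cost no longer depends on the number of traffic cells or stations.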
In summary, the traditional extraction method and the improved method compare as follows:
Assume that the number of regions is n_1, the (mean) number of edges of a regional fence is h, the number of bus (or subway) stations is g_1, the number of trip records is m, and the number of 8-bit geohash grids is s; in the improved method, the number of candidate home-area fences after screening is n_2 (n_2 << n_1) and the number of candidate bus (or subway) stations is g_2 (g_2 << g_1).
Given that the ray method for determining whether a point is in a fence has time complexity O(h), the time complexity of the conventional extraction method and that of the spatial feature extraction method based on geographic gridding are compared in Table 4:
TABLE 4 time complexity contrast before and after optimization
Figure BDA0003269555470000161
From the above table, the time complexity after the improvement is greatly reduced, and the data fusion speed is effectively improved.
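Table 4 is an image that is not reproduced in this text, so the following is only a rough estimate derived from the loop structure described above, using the symbols defined earlier; it may differ from the entries of the table. The O(m) terms assume that the final join on the geohash key is a constant-time lookup per trip record, and the s * n_1 and s * g_1 terms account for the inexpensive rectangle screening.

```latex
\begin{aligned}
T_{\text{traditional, region}}  &= O(m \cdot n_1 \cdot h) \\
T_{\text{traditional, station}} &= O(m \cdot g_1) \\
T_{\text{improved, region}}     &= O\bigl(s \cdot (n_1 + n_2 \cdot h)\bigr) + O(m) \\
T_{\text{improved, station}}    &= O\bigl(s \cdot (g_1 + g_2)\bigr) + O(m)
\end{aligned}
```

Under these assumptions the grid-side work is done once for the target area and can be reused for every batch of trip records, so only the O(m) join recurs as new data arrive.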
In addition, in the second step of the embodiment of the invention, data level fusion and feature level fusion are adopted for the data fusion process. Data level fusion mainly comprises simple operations such as splicing existing data features of the original data; feature level fusion comprises travel feature extraction and start and stop point spatial feature extraction.
Data level fusion: data level fusion is performed on the cleaned data, for example fusing fields such as the specific start and stop positions, longitude and latitude, time, vehicle information, user id, date and vehicle type of the multidimensional traffic travel data.
Travel feature extraction: the invention defines the travel features of a means of transportation as the geographic information related to the start and end points of one complete trip, comprising specific position, longitude and latitude, geographic grid code (8-bit geohash), date, departure time and arrival time.
Specifically, travel features of buses and subways are extracted as follows:
The invention treats bus and subway trips as the same type of travel mode. Data analysis and previous research show that, when passengers complete one full trip by public transport, combined modes such as bus-to-bus, subway-to-subway, bus-to-subway and subway-to-bus transfers account for a large proportion. An existing bus-subway trip chain derivation algorithm is applied to the bus and subway card swiping records obtained by data level fusion to deduce each passenger's complete trip chain; the trip_num field records which trip of the day the chain is for that passenger, and o_time or d_time is used to check the order of the trip chains.
Travel features of private cars are extracted as follows:
<1> private car gate passing records are grouped by license plate number and sorted by time to generate a gate sensing sequence;
<2> single trips are identified according to a travel time threshold between consecutive gate sensing pairs, and the vehicle's gate sensing sequence is divided into several private car trip subsequences, each recording one private car trip;
<3> travel features are extracted from each private car trip subsequence, namely the geographic features of the start and stop points, the travel date, the departure time at the start point and the arrival time at the end point of one complete private car trip.
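Steps <1> to <3> can be sketched as follows. The 1800-second threshold, the record fields (`plate`, `gate`, `time` as seconds) and the function names are assumptions for illustration; the source does not state the threshold value.

```python
from collections import defaultdict

def summarize(plate, subsequence):
    """Step <3>: start/stop gate and departure/arrival time of one trip."""
    return {
        "plate": plate,
        "o_gate": subsequence[0]["gate"],
        "d_gate": subsequence[-1]["gate"],
        "o_time": subsequence[0]["time"],
        "d_time": subsequence[-1]["time"],
    }

def split_private_car_trips(records, gap_threshold_s=1800):
    """Group by plate, sort by time, and cut the sequence where the gap is too long."""
    by_plate = defaultdict(list)
    for rec in records:  # step <1>: group by license plate
        by_plate[rec["plate"]].append(rec)
    trips = []
    for plate, seq in by_plate.items():
        seq.sort(key=lambda r: r["time"])  # step <1>: order by time
        current = [seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            # Step <2>: a long gap between consecutive gates ends the current trip.
            if cur["time"] - prev["time"] > gap_threshold_s:
                trips.append(summarize(plate, current))
                current = []
            current.append(cur)
        trips.append(summarize(plate, current))
    return trips

# Hypothetical gate passing records (time in seconds since midnight), deliberately unsorted.
records = [
    {"plate": "A12345", "gate": "G3", "time": 1200},
    {"plate": "A12345", "gate": "G1", "time": 0},
    {"plate": "B67890", "gate": "G9", "time": 50},
    {"plate": "A12345", "gate": "G4", "time": 10000},
    {"plate": "A12345", "gate": "G2", "time": 600},
    {"plate": "A12345", "gate": "G5", "time": 10500},
]
car_trips = split_private_car_trips(records)
```

A vehicle seen at only one gate yields a trip whose start and stop gates coincide; whether such records should be kept is a design choice left open here.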
Travel features of taxis, online ride-hailing vehicles and shared bicycles are extracted as follows:
For taxis, online ride-hailing vehicles and shared bicycles, one order represents one complete trip, so the travel features of these three travel modes are easier to obtain than those of private cars. Travel features such as start and stop points, travel date, departure time at the start point and arrival time at the end point can be extracted from the cleaned original data provided by the source system.
Extraction of travel start and stop point spatial features by the geographic grid search technology:
Features related to important spatial positions of the travel start and stop points and to nearby buses and subways are extracted: according to the design of the multidimensional transportation travel data fusion large table, feature-level data are fused for the traffic cell, street, jurisdiction and important activity place in which each travel start or stop point lies, and for the nearest bus stop and subway station.
In order to reduce the time complexity of the feature level fusion process, the feature extraction algorithm is optimized in this embodiment as follows:
<1> unify the longitude and latitude type of the travel start and stop points; the unified specification is WGS84 longitude and latitude;
<2> encode the travel start and stop points as 8-bit geohash according to their longitude and latitude;
<3> generate the basic information tables corresponding to all 8-bit geohashes in the target area:
Basic table 1: traffic cells, streets, jurisdictions and key areas corresponding to all 8-bit geohashes in the target area
Basic table 2: nearest bus stop information and distance for all 8-bit geohashes in the target area
Basic table 3: nearest subway station information and distance for all 8-bit geohashes in the target area
The data obtained by data level fusion are aggregated with each of the three basic information tables for all 8-bit geohashes in the target area, using the 8-bit geohash of the travel start and stop points and the 8-bit geohash of the basic information table as the aggregation (join) keys, to obtain the traffic cell, street, jurisdiction and important activity place in which each travel start or stop point lies, and the nearest bus stop and subway station fields.
And finally, completing the fusion of the characteristic level fields corresponding to the trip start and stop points.
The above description covers only the preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims (8)

1. A multidimensional transportation data fusion and data quality detection method is characterized by comprising the following steps:
S1: data collection: acquiring original travel data of different types of vehicles from different platforms and systems, the data collection comprising a data acquisition component, a data storage component and a data preprocessing component;
S2: data fusion: data fusion comprises three levels, namely data level fusion, feature level fusion and decision level fusion; to obtain a multidimensional transportation fusion data table, data level fusion and feature level fusion are adopted; the data level fusion comprises extracting the specific start and stop points, longitude and latitude, time, vehicle information, user id, date and vehicle type from the travel data of online ride-hailing vehicles, taxis and shared bicycles; the feature level fusion comprises extracting the travel features of buses, subways and private cars and extracting the start and stop point spatial features of all vehicles;
S3: fused data quality detection.
2. The method for multi-dimensional transportation travel data fusion and data quality detection according to claim 1, characterized by: in step S1, the travel data of buses, subways, taxis, online ride-hailing vehicles, shared bicycles and private cars obtained from different platforms or systems through the data acquisition component, namely bus and subway card swiping data, taxi, online ride-hailing and shared bicycle order data, and private car gate passing data, are all structured data, and the structured data are used as the multidimensional transportation travel data source.
3. The method for multi-dimensional transportation travel data fusion and data quality detection according to claim 1, characterized by: the multidimensional travel data acquired by the data storage component in the S1 from a plurality of different systems are stored in a large data platform in a distributed mode.
4. The method for multi-dimensional transportation travel data fusion and data quality detection according to claim 1, characterized by: the data preprocessing component in the S1 is used for carrying out missing condition inspection on original data, further processing the verified data according to the data quality condition, uploading the data quality to a multi-source data providing system, and improving the data quality by a source data providing party; the specific method comprises the following steps:
S101: missing value processing: directly removing records in which more than 80% of the fields are missing;
S102: abnormal value processing: removing records whose mileage exceeds the normal range and records whose trip duration exceeds the normal range;
S103: data specification processing: comprising handling of non-standard time representations, unification of dates, standardization of the longitude and latitude representation type, and unification of the units of the payment amount field.
5. The method for multi-dimensional transportation travel data fusion and data quality detection according to claim 1, wherein the spatial feature extraction method after the data fusion is improved based on the geography gridding technology in S2 comprises the following steps:
S201: spatial gridding of a target area: selecting a target area according to service requirements, dividing the space of the target area into a plurality of grids by adopting a space gridding technology, specifically adopting a geohash grid coding technology and, in view of the error requirement, dividing the target area into a plurality of grids at a granularity of 8-bit geohash, wherein the space error of the 8-bit geohash is 19 meters;
S202: grid home area information: sequentially calculating the home area of each 8-bit geohash grid, wherein the solving method for one 8-bit geohash in the target area space comprises the following steps:
S2021: finding the center of the 8-bit geohash and expressing the center by longitude and latitude;
S2022: screening out all candidate areas in a square area centered on the center with a side length of 10 kilometers, wherein the specific screening mode is as follows:
(1) calculating the longitude and latitude of the northwest point and the southeast point of the square area, which are respectively expressed as (lng_w, lat_n), (lng_e, lat_s);
(2) finding the longitude and latitude of the northwest point and the southeast point of the fence area of all the traffic cells, which are expressed as (lng_w_i, lat_n_i), (lng_e_i, lat_s_i);
(3) Screening out possible traffic districts meeting the requirements, wherein the conditions meeting the requirements are as follows:
Figure FDA0003269555460000021
the candidate areas meeting the requirements can be quickly screened out by the algorithm;
S2023: traversing all the candidate areas obtained in step S2022, and judging the area to which the longitude and latitude belong by using a point-in-fence algorithm according to the fences of the candidate areas and the longitude and latitude of the 8-bit geohash center to be solved;
S203: region information closest to the grid: sequentially solving the nearest area information of each 8-bit geohash grid, wherein the solving method comprises the following steps:
S2031: finding the center of the 8-bit geohash and representing the center (lng, lat) by longitude and latitude;
S2032: screening out all candidate areas in a square area centered on the center with a side length of 4 kilometers, wherein the specific screening mode is as follows:
(1) calculating the longitude and latitude of the northwest point and the southeast point of the square area, which are respectively expressed as (lng_w, lat_n), (lng_e, lat_s);
(2) each site longitude and latitude is represented as (lng_w_i, lat_n_i), (lng_e_i, lat_s_i);
(3) All bus stops in the square area are screened out, and the conditions meeting the requirements are as follows:
Figure FDA0003269555460000031
S2033: sequentially solving the distances from all the candidate sites to the center (lng, lat) of the 8-bit geohash by using a longitude and latitude distance calculation formula, and finding out the nearest site information and its distance;
according to the above steps, the nearest site and distance information for all the space grids of the target area can be sequentially obtained;
S204: aggregating the spatial feature basic information obtained based on the geographic gridding technology in S201 to S203 with the 8-bit geohash corresponding to the start point and the stop point of each trip record, so that the spatial features of the start point and the stop point of each trip record can be rapidly extracted.
6. The method for multi-dimensional transportation travel data fusion and data quality detection according to claim 5, wherein in order to reduce the time complexity in the feature level fusion process, the feature extraction algorithm is optimized as follows when extracting spatial features:
(1) unifying the longitude and latitude types of the starting and stopping points of the trip, wherein the unified specification is wgs84 type longitude and latitude;
(2) 8-bit geohash coding is carried out on the trip starting and stopping points according to the longitude and latitude;
(3) generating a relevant basic information table corresponding to all 8-bit geohashes in the target area;
(4) respectively aggregating data information corresponding to the data-level fusion data and all 8-bit geohashes in the target area, and respectively taking the 8-bit geohashes of the trip start and stop points and the 8-bit geohashes of the basic information data as aggregation keys to obtain traffic districts, streets, jurisdictions, important activity places where the traffic districts and the subway stations are located, and bus and subway station condition fields closest to the trip start and stop points;
(5) and finishing the fusion of the characteristic level fields corresponding to the trip starting and stopping points.
7. The method for multi-dimensional transportation travel data fusion and data quality detection according to claim 1, wherein the data quality determination criteria in S3 are integrity, consistency, accuracy and timeliness; wherein integrity refers to whether data information is missing; consistency refers to whether data conform to a uniform specification; accuracy refers to whether the information in a data record is abnormal or contains errors introduced during calculation; timeliness refers to the time interval from when data are produced to when they can be viewed. Based on the four judgment standards of integrity, consistency, accuracy and timeliness, a double-closed-loop detection assembly is formed from a source feedback detection component and a fusion result detection component; the source feedback detection component is the primary component and the fusion result detection component is the secondary component, and only when the data quality passes the source feedback detection component is the fusion result detection component used to detect the quality of the fused data.
8. The method for multi-dimensional transportation travel data fusion and data quality detection according to claim 7, wherein the purpose of the source feedback detection component is to detect the quality of data provided by the multi-source system from a data source and feed the data back to a source data provider; the method mainly detects the missing condition and abnormal condition of data and whether the data is generated to a big data platform according to a specified time period, and feeds back the detection result to a source system or platform providing original data in a report form, so that a source party can improve the data quality in time according to a report suggestion;
the fusion result detection component is used for comparing the fused data with the cleaned data, detecting the accuracy of the fused data from three aspects, namely comparison of the data quantity before and after fusion, common statistics and data distribution; the quality of the fused data is detected by comparing the data quantity before and after fusion, which is a row-based detection with a single row of data as the minimum unit; the detection of the two dimensions of common statistics and data distribution is column-based; and the distance field is then checked from the dimension of the data distribution.
CN202111097329.0A 2021-09-18 2021-09-18 Multidimensional transportation data fusion and data quality detection method Active CN113742330B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111097329.0A CN113742330B (en) 2021-09-18 2021-09-18 Multidimensional transportation data fusion and data quality detection method

Publications (2)

Publication Number Publication Date
CN113742330A true CN113742330A (en) 2021-12-03
CN113742330B CN113742330B (en) 2023-02-28

Family

ID=78739894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111097329.0A Active CN113742330B (en) 2021-09-18 2021-09-18 Multidimensional transportation data fusion and data quality detection method

Country Status (1)

Country Link
CN (1) CN113742330B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101694706A (en) * 2009-09-28 2010-04-14 深圳先进技术研究院 Modeling method of characteristics of population space-time dynamic moving based on multisource data fusion
WO2018023331A1 (en) * 2016-08-01 2018-02-08 中国科学院深圳先进技术研究院 System and method for real-time evaluation of service index of regular public buses
CN108010316A (en) * 2017-11-15 2018-05-08 上海电科智能系统股份有限公司 A kind of road traffic multisource data fusion processing method based on road net model

Also Published As

Publication number Publication date
CN113742330B (en) 2023-02-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant