WO2022105372A1 - Query method, apparatus, electronic device and storage medium for spatiotemporal correlated data - Google Patents

Query method, apparatus, electronic device and storage medium for spatiotemporal correlated data

Info

Publication number
WO2022105372A1
Authority
WO
WIPO (PCT)
Prior art keywords
query
data
index
spatiotemporal
attribute
Prior art date
Application number
PCT/CN2021/116775
Other languages
English (en)
French (fr)
Inventor
刘钧文
李瑞远
Original Assignee
京东城市(北京)数字科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东城市(北京)数字科技有限公司
Publication of WO2022105372A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G06F16/244 Grouping and aggregation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24554 Unary operations; Data partitioning operations
    • G06F16/24556 Aggregation; Duplicate elimination

Definitions

  • the present disclosure relates to the field of computer technology, and in particular, to a query method, apparatus, electronic device and storage medium for spatiotemporal correlated data.
  • the existing association query scheme performs a secondary spatiotemporal query over the full data set, and is usually divided into three steps.
  • the execution process and defects of the three steps are as follows:
  • the spatiotemporal query boxes may overlap. As a result, the second step, which runs a secondary query against the data set, repeats a large number of operations; this adds unnecessary data scanning and network transmission, and ultimately leads to poor association query performance.
  • the data set is computed in full, which occupies a large amount of memory. In a distributed environment, performance also degrades because a large amount of data is transmitted over the network.
  • the purpose of the present disclosure is to provide a query method, apparatus, electronic device and storage medium for spatiotemporal correlated data, which overcome, at least to some extent, the problem of low query efficiency in the related art.
  • a method for querying spatiotemporal correlated data, comprising: receiving a query request, and performing a first indexing according to the query request to generate query conditions, where the query conditions include at least one of a time query condition, a spatial query condition and an object attribute query condition; determining a query range in a preset spatiotemporal attribute index model according to the query conditions, and performing a second indexing according to the query range; performing deduplication processing on the result set of the second indexing to obtain a deduplicated retrieval set; and determining a spatiotemporal correlation query result according to the retrieval set and the query conditions.
  • receiving the query request and performing the first indexing according to the query request includes: receiving the query request and determining the time data, spatial data and object attribute data of the user included in the query request; and generating a filling curve of the spatial data, and deriving the time data and the object attribute data from the filling curve of the spatial data to obtain a spatiotemporal index code.
  • the query of the spatiotemporal correlated data further includes: writing a hash value of the identifier of the user into the spatiotemporal index code.
  • the query of the spatiotemporal correlated data further includes: writing the spatiotemporal index code into the database to be indexed in the form of a key value.
  • determining a query range in a preset spatiotemporal attribute index model according to the query conditions, and performing the second indexing according to the query range includes: determining the query range according to a value in the query conditions; and performing a spatial data query through the geocoding algorithm and the query range, and storing the query result as a result set of the HashSet class.
  • performing deduplication processing on the result set of the second indexing to obtain a deduplicated retrieval set includes: performing index key value segmentation on the result set of the second indexing to obtain the attribute code of the result set; performing data grouping according to the attribute code to obtain grouped data; performing deduplication statistics on the spatiotemporal index codes of the grouped data through the HyperLogLog algorithm to obtain the cardinality of the retrieval set; and determining association relationship data according to the retrieval set, and generating the corresponding retrieval set.
  • performing deduplication statistics on the spatiotemporal index codes of the grouped data by using the HyperLogLog algorithm includes: determining a bit value of the spatiotemporal index code, and performing bucket averaging processing on the bit value to determine a harmonic mean; performing deviation correction on the grouped data according to the harmonic mean; and performing deduplication processing on the result of the deviation correction.
  • an apparatus for querying spatiotemporal correlated data, comprising: a first indexing module, configured to receive a query request and perform a first indexing according to the query request to generate query conditions, where the query conditions include at least one of a time query condition, a spatial query condition and an object attribute query condition; a second indexing module, configured to determine a query range in a preset spatiotemporal attribute index model according to the query conditions, and perform a second indexing according to the query range; a deduplication module, configured to perform deduplication processing on the result set of the second indexing to obtain a deduplicated retrieval set; and a determination module, configured to determine a spatiotemporal correlation query result according to the retrieval set and the query conditions.
  • an electronic device, comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute, by executing the executable instructions, any one of the foregoing query methods for spatiotemporal correlated data.
  • a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements any one of the above-mentioned query methods for spatiotemporal correlated data.
  • the geohash code contains information in two dimensions, longitude and latitude.
  • a new code can also be formed by cross-coding, and this code can serve as the key in a key-value database.
  • on the one hand, this ensures sufficient hashing and spreads the data fully across the code table; on the other hand, it also improves query performance.
  • a deduplicated retrieval set is obtained, and fast deduplication statistics yield the cardinality of the retrieval set, which reduces the redundant data in the query records and thereby reduces the data interaction pressure and computational pressure.
  • FIG. 1A shows a schematic diagram of an index model of a query scheme for spatiotemporally correlated data in an embodiment of the present disclosure
  • FIG. 1B shows a schematic diagram of an index range of a query scheme for spatiotemporally correlated data in an embodiment of the present disclosure
  • FIG. 1C shows a schematic diagram of an indexing process of a query scheme for spatiotemporally correlated data in an embodiment of the present disclosure
  • FIG. 1D shows a schematic diagram of a spatiotemporal cross-index model of a query scheme of spatiotemporal associated data in an embodiment of the present disclosure
  • FIG. 1E shows a schematic diagram of a spatiotemporal cross-index model of another query scheme of spatiotemporal associated data in an embodiment of the present disclosure
  • FIG. 1F shows a schematic diagram of a storage structure of a query scheme for spatiotemporal correlated data in an embodiment of the present disclosure
  • FIG. 2 shows a flowchart of a method for querying spatiotemporal correlated data in an embodiment of the present disclosure
  • FIG. 3 shows a flowchart of another method for querying spatiotemporal correlated data in an embodiment of the present disclosure
  • FIG. 4 shows a flowchart of another method for querying spatiotemporal correlated data in an embodiment of the present disclosure
  • FIG. 5 shows a flowchart of another method for querying spatiotemporal correlated data in an embodiment of the present disclosure
  • FIG. 6 shows a flowchart of another method for querying spatiotemporal correlated data in an embodiment of the present disclosure
  • FIG. 7 shows a flowchart of another method for querying spatiotemporal correlated data in an embodiment of the present disclosure
  • FIG. 8 shows a flowchart of another method for querying spatiotemporal correlated data in an embodiment of the present disclosure
  • FIG. 9 shows a schematic diagram of an apparatus for querying spatiotemporal correlated data in an embodiment of the present disclosure
  • FIG. 10 shows a schematic diagram of an electronic device in an embodiment of the present disclosure.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments can be embodied in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
  • the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
  • FIG. 1A, FIG. 1B, FIG. 1C, FIG. 1D, FIG. 1E and FIG. 1F show schematic diagrams of a query architecture for spatiotemporally correlated data in an embodiment of the present disclosure.
  • the basic idea of a quadtree 102 index is to recursively divide the geographic space into a tree structure with different levels. It divides a space of known extent into four equal subspaces, and so on recursively, and the division stops when the tree reaches a certain depth or certain requirements are met.
  • the structure of the quadtree 102 is relatively simple, and when the spatial data objects are evenly distributed, spatial data insertion and query are relatively efficient. Therefore, the quadtree 102 is one of the commonly used spatial indexes in GIS (Geographic Information System).
  • the quadtree 102 is relatively efficient for area queries. However, if the spatial objects are unevenly distributed, then as geospatial objects continue to be inserted, the quadtree 102 keeps deepening and becomes severely unbalanced, so the depth of each query increases greatly and query efficiency drops sharply.
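  • The recursive four-way division and the depth problem described above can be illustrated with a minimal sketch. The class name `QuadTree`, the leaf capacity of 4 and the depth limit of 8 are assumptions chosen for illustration; they are not taken from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class QuadTree:
    # Bounding box of this node: [x0, x1) x [y0, y1)
    x0: float
    y0: float
    x1: float
    y1: float
    capacity: int = 4    # assumed: max points in a leaf before it splits
    max_depth: int = 8   # assumed: stop condition on tree depth
    depth: int = 0
    points: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def insert(self, x, y):
        """Insert a point; return False if it lies outside this node."""
        if not (self.x0 <= x < self.x1 and self.y0 <= y < self.y1):
            return False
        if self.children:
            return any(child.insert(x, y) for child in self.children)
        self.points.append((x, y))
        if len(self.points) > self.capacity and self.depth < self.max_depth:
            self._split()
        return True

    def _split(self):
        """Divide this node into four equal subspaces and push points down."""
        mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        boxes = [(self.x0, self.y0, mx, my), (mx, self.y0, self.x1, my),
                 (self.x0, my, mx, self.y1), (mx, my, self.x1, self.y1)]
        self.children = [
            QuadTree(*box, self.capacity, self.max_depth, self.depth + 1)
            for box in boxes
        ]
        pending, self.points = self.points, []
        for px, py in pending:
            for child in self.children:
                if child.insert(px, py):
                    break

    def query(self, qx0, qy0, qx1, qy1):
        """Return all points inside the half-open box [qx0, qx1) x [qy0, qy1)."""
        if qx1 <= self.x0 or qx0 >= self.x1 or qy1 <= self.y0 or qy0 >= self.y1:
            return []  # no overlap with this node, prune the subtree
        found = [(x, y) for x, y in self.points
                 if qx0 <= x < qx1 and qy0 <= y < qy1]
        for child in self.children:
            found.extend(child.query(qx0, qy0, qx1, qy1))
        return found

    def height(self):
        """Number of levels below and including this node."""
        return 1 + max((c.height() for c in self.children), default=0)
```

  • Inserting evenly spread points keeps the tree shallow, while inserting a tight cluster drives the same structure down to its depth limit, which is the imbalance the text refers to.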
  • the time index 104 correspondingly generates a spatial query range and a time query range according to the spatial location information and time information in the query record.
  • the time index data is stored using a unified naming scheme, such as "date-hour-minute-second", for example, "2020-01-01 00:00:00", "2020-01-01 00:00:01", "2020-01-01 00:00:02", "2020-01-01 00:00:03", "2020-01-01 00:00:04", "2020-01-01 00:00:05", ..., "2020-01-04 00:00:00", etc.
  • the prefix tree 106 is a special form of an N-ary tree. Typically, a prefix tree 106 is used to store character strings. Each node of the prefix tree 106 represents a character string (prefix). Each node has multiple child nodes, and the paths to different child nodes carry different characters. The string represented by a child node is composed of the string of the parent node itself plus the characters on the path to the child node.
  • the first-level node of the prefix tree 106 is "B"
  • the second-level nodes are "u", "i" and "o"
  • the third-level nodes are "k", "i", "h", "g", "e", "f", "b", "c", "a"
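  • A minimal sketch of such a prefix tree follows. The class name `PrefixTree` and the sample strings in the usage note are hypothetical; they only reuse the characters named above and are not the contents of the figure.

```python
class PrefixTree:
    """Each node represents the prefix spelled by the path from the root."""

    def __init__(self):
        self.children = {}     # character on the outgoing path -> child node
        self.terminal = False  # True if a stored string ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, PrefixTree())
        node.terminal = True

    def with_prefix(self, prefix):
        """Return every stored string that starts with the given prefix."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.terminal:
                found.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(found)
```

  • For example, after inserting the hypothetical strings "Buk", "Bui", "Bih", "Big" and "Boa", a lookup for the prefix "Bu" only walks the "B" and "u" path and returns the two strings stored below it.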
  • the spatiotemporal query boxes may overlap; that is, when the second step performs a secondary query on the data set, there will be a large number of repeated query ranges, such as the first query range 108A, the second query range 108B, the third query range 108C, and the fourth query range 108D.
  • a primary query condition is generated according to the index information of the client, and a primary query result is generated according to the primary query condition. The client then generates multiple secondary query conditions according to the primary query result and sends them to the network, for example, secondary query condition 2_1, secondary query condition 2_2, secondary query condition 2_3, ..., secondary query condition 2_n. Correspondingly, the index determines secondary query result 2_1, secondary query result 2_2, secondary query result 2_3, ..., secondary query result 2_n in the database, and sends the aggregated query result to the network.
  • the Z3 index 110 is a binary coding similar to geohash: an indexing mechanism that cross-codes longitude, latitude, and time. On the basis of this encoding, the hash value of the person's ID (identity document number) is added at the end, for example, the Z3 index models 1101, 1102, 1103, 1104, ..., 1110n, etc. shown in FIG. 1E.
  • the index condition is "user Bob coordinates (119.25 35.45) 2020-06-16 00:23:45", which is divided into the user ID "user Bob", the geographic coordinates (119.25 35.45), and the time "June 16, 2020 00:23:45"; as shown in FIG. 1E, it points to the Z3 index model 1103.
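  • The cross-coding of longitude, latitude and time can be sketched as bit interleaving. This is an illustrative approximation only: the function names, the 10 bits per dimension and the choice of a one-day time window are assumptions, and the sketch does not reproduce the exact bit layout of any particular Z3 implementation.

```python
def quantize(value, lo, hi, bits):
    """Map a value in [lo, hi] to an integer cell index in [0, 2**bits - 1]."""
    cells = 1 << bits
    index = int((value - lo) / (hi - lo) * cells)
    return min(max(index, 0), cells - 1)


def interleave3(x, y, t, bits):
    """Cross-code three integers bit by bit, most significant bits first."""
    code = 0
    for i in range(bits - 1, -1, -1):
        for v in (x, y, t):
            code = (code << 1) | ((v >> i) & 1)
    return code


def z3_like_code(lon, lat, seconds_in_window, window_seconds, bits=10):
    """Build a 3 * bits wide code from longitude, latitude and time offset."""
    x = quantize(lon, -180.0, 180.0, bits)
    y = quantize(lat, -90.0, 90.0, bits)
    t = quantize(seconds_in_window, 0, window_seconds, bits)
    return interleave3(x, y, t, bits)
```

  • With these assumptions, the record at (119.25 35.45) and 00:23:45 (1425 seconds into a one-day window of 86400 seconds) maps to a single integer code, and a record that falls into the same longitude, latitude and time cells maps to the same code.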
  • Key is the address information used by the magnetic head to look up data, such as "Key01"
  • Value is the data body stored in binary, such as "value01". This ensures that as long as the query address is known, the specific data can be obtained quickly; this process is independent of the data volume and depends only on disk I/O (input/output) performance.
  • the query of the spatial range is controlled based on the Geohash (spatial index address coding) grid.
  • the result will be a collection of Long (long integer) values, which can be stored in a HashSet (hash set).
  • the HyperLogLog algorithm is a statistics algorithm commonly used in big data scenarios. It estimates the cardinality of very large data sets with few resources and has very good space complexity: as the cardinality of the data set grows, the storage it occupies does not grow accordingly. Although there is some error, for very large data sets the error can be kept below 1% when higher-accuracy correction parameters are configured, and the pressure on memory is minimal. For example, counting the data of 100 million users requires only about 12 KB of memory.
  • each key value is a binary code composed of an attribute code and a space-time cross-index code.
  • the length of the latter is fixed, so the preceding attribute code can be obtained with bit operations.
  • the preceding attribute code, that is, the user ID, is used for grouping.
  • the second step is to use the HyperLogLog algorithm to deduplicate the binary codes of the space-time cross-index within each group.
  • the space-time codes need to be averaged in buckets, and the stored index values are divided into m buckets.
  • the bucket is chosen according to the value of the first few bits of the hash value, and the harmonic mean over the m buckets is then computed.
  • user-defined parameters are used to correct the deviation, and finally the deduplicated statistical value of the association information between each associated user and the patient is obtained.
  • the formula of the HyperLogLog algorithm is as follows:
  • where constant is a parameter that corrects the result,
  • R_j represents the maximum number of leading zeros of the data in the j-th bucket, plus 1, and
  • m is a positive integer greater than 1.
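  • For reference, the standard HyperLogLog estimator, written with the symbols defined above (E for the estimate, constant for the correction parameter, m for the number of buckets, R_j for the value recorded in the j-th bucket), has the following form. That the disclosure uses exactly this notation is an assumption.

```latex
E = \mathrm{constant} \cdot m^{2} \cdot \left( \sum_{j=1}^{m} 2^{-R_j} \right)^{-1}
```

  • In this form, m divided by the sum is the harmonic mean of the values 2^{R_j}; multiplying it by m and by the correction constant gives E.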
  • the above HyperLogLog algorithm uses a hash function; in the scenario of epidemic correlation analysis, the hashed value is the spatiotemporal attribute index.
  • a hash value is obtained for each element in the data stream; then, for each hash value, the last P bits determine the bucket number. In the epidemic correlation analysis, buckets can be divided according to the hash value of the attribute value.
  • the harmonic mean of the values in all the buckets is calculated.
  • the harmonic mean is obtained from the reciprocals of all the values, and is finally multiplied by m to get the final result E.
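  • The bucket, leading-zero and harmonic-mean steps can be sketched as follows. This is a simplified version of the standard algorithm under stated assumptions: SHA-256 stands in for the hash function, the first p bits of a 64-bit hash select the bucket (the text mentions both the first few bits and the last P bits), the constant is the usual value for m of at least 128, and the small-range correction comes from the standard algorithm rather than from the text.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                  # number of buckets
        self.registers = [0] * self.m    # R_j for each bucket j
        # Standard correction constant, valid for m >= 128
        self.constant = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        width = 64 - self.p
        bucket = h >> width                           # first p bits
        rest = h & ((1 << width) - 1)                 # remaining bits
        rank = width - rest.bit_length() + 1          # leading zeros + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def count(self):
        # constant * m^2 / sum(2^-R_j): harmonic mean of 2^R_j scaled by m
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.constant * self.m * self.m / z
        # Small-range correction (linear counting) while buckets are empty
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)
```

  • Adding the same element repeatedly never changes a register, so duplicates are absorbed without storing the elements themselves; with p = 12 the sketch keeps 4096 small registers regardless of how many elements are added.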
  • the correlation coefficient between the patient and each associated user is calculated based on the statistical values of the association information.
  • the correlation coefficient here is mainly a value calculated from the time interval and spatial distance of the results found.
  • FIG. 2 shows a flowchart of a method for querying spatiotemporal correlated data in an embodiment of the present disclosure.
  • the methods provided by the embodiments of the present disclosure may be executed by any electronic device with computing processing capabilities, such as a server or a terminal, but not limited thereto.
  • in the following description, the terminal is taken as the executing entity by way of example.
  • a method for querying spatiotemporal correlation data performed by a terminal includes the following steps:
  • Step S202: receive a query request, and perform a first indexing according to the query request to generate query conditions, where the query conditions include at least one of a time query condition, a spatial query condition and an object attribute query condition.
  • the time window of an index is determined by the time query condition, for example one hour, one day, one week, one month or one year, but is not limited to these.
  • the geographic area range is determined by the spatial query condition, for example a building, a community, a block, a district, a city or a province, but is not limited to these.
  • the unique identifier of an individual object is determined by the object attribute query condition, for example a fingerprint, voiceprint, name or ID, but is not limited to these.
  • Step S204: determine a query range in a preset spatiotemporal attribute index model according to the query conditions, and perform a second indexing according to the query range.
  • Step S206: perform deduplication processing on the result set of the second indexing to obtain a deduplicated retrieval set.
  • a Geohash-grid-based query can be used to control the spatial range, and the query result is a set of long integers. Because the query result is long integer data, it can be deduplicated on storage by using a hash set.
  • Step S208: determine a spatiotemporal correlation query result according to the retrieval set and the query conditions.
  • the query is performed through the deduplicated set to ensure that the query addresses do not overlap.
  • the final statistics are performed on the result set of the spatiotemporal cross model, which further simplifies the retrieval results and improves retrieval efficiency and reliability.
  • receiving a query request, and performing the first indexing according to the query request includes:
  • Step S302: receive a query request and determine the user's time data, spatial data and object attribute data contained in the query request.
  • Step S304: generate a filling curve of the spatial data, and derive the time data and the object attribute data from the filling curve of the spatial data to obtain a spatiotemporal index code.
  • the inventor found that the traditional Geohash spatial index has no time dimension, so at the bottom layer of the database the spatial index, time index and attribute index are organized separately, which means the data may be redundant. Moreover, during data scanning, three scans are performed separately, which on the one hand takes up a lot of storage space and on the other hand increases query time.
  • the spatiotemporal attribute index model of the present disclosure has changed from the previous two-dimensional index (longitude, latitude) to a four-dimensional index (longitude, latitude, time, attribute).
  • the traditional geohash value itself is a string of 0/1 bits obtained by cross-encoding.
  • the time can be converted into a string of 0/1 bits by using Unix time encoding, and the attribute value can also be converted into a string of 0/1 bits by using a hash algorithm.
  • following the same property (the geohash code contains information in two dimensions, longitude and latitude), these can also be cross-coded to form a new code.
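  • The two conversions described above can be sketched as follows. The function names, the 32-bit widths, the use of UTC for the Unix timestamp and the use of SHA-256 as the hash algorithm are assumptions for illustration only.

```python
import calendar
import hashlib


def time_to_bits(year, month, day, hour, minute, second, width=32):
    """Encode a UTC date and time as a fixed-width string of 0/1 bits."""
    timestamp = calendar.timegm((year, month, day, hour, minute, second, 0, 0, 0))
    return format(timestamp, f"0{width}b")


def attribute_to_bits(value, width=32):
    """Hash an attribute value and keep the leading bits as a 0/1 string."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    number = int.from_bytes(digest, "big") >> (256 - width)
    return format(number, f"0{width}b")
```

  • Both outputs are fixed-width bit strings, so they can be combined with the geohash bits by position, either interleaved or concatenated.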
  • the query of the spatiotemporal correlated data further includes:
  • Step S402: write the hash value of the user's identifier into the spatiotemporal index code.
  • the encoding of the attribute value is not added to the cross-encoding of the spatiotemporal dimensions, but is placed before the spatiotemporal index code as a prefix, so that the user can filter queries by the attribute; that is, business scenarios that query by this attribute are also supported.
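  • Placing the attribute code in front of a fixed-width spatiotemporal code means that all keys of one attribute value form a contiguous range in a sorted key space. A minimal sketch, assuming integer keys, a hypothetical 30-bit spatiotemporal code width `ST_BITS` and a sorted Python list standing in for the key-value store:

```python
import bisect

ST_BITS = 30  # assumed fixed width of the spatiotemporal cross-code


def make_key(attr_code, st_code):
    """Concatenate the attribute prefix and the spatiotemporal code."""
    return (attr_code << ST_BITS) | st_code


def scan_by_attribute(sorted_keys, attr_code):
    """Return all keys whose prefix equals attr_code, using two binary searches."""
    low = bisect.bisect_left(sorted_keys, attr_code << ST_BITS)
    high = bisect.bisect_left(sorted_keys, (attr_code + 1) << ST_BITS)
    return sorted_keys[low:high]
```

  • Filtering by attribute then touches only the slice between the two search positions instead of scanning every key.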
  • the query of the spatiotemporal correlated data further includes:
  • Step S502: write the spatiotemporal index code into the database to be indexed in the form of a key value.
  • the code can serve as the key in the key-value database.
  • on the one hand, this ensures sufficient hashing and spreads the data fully across the code table; on the other hand, it also improves query performance.
  • determining the query range in the preset spatiotemporal attribute index model according to the query conditions, and performing the second indexing according to the query range includes:
  • Step S6042: determine the query range according to the value in the query conditions.
  • Step S6044: query the spatial data through the geocoding algorithm and the query range, and store the query result as a result set of the HashSet class.
  • the spatial data query is performed through the geocoding algorithm and the query range; that is, the query is performed using the deduplicated set, which ensures that the query addresses are pairwise distinct and that data is not scanned repeatedly.
  • the query conditions generated from the grid yield direct query results and improve indexing efficiency.
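  • The effect of collecting grid cells in a hash set can be sketched as follows. A uniform one-degree grid with integer cell ids stands in for the Geohash grid here; the function names and the grid itself are assumptions for illustration.

```python
import math


def cells_for_box(lon0, lat0, lon1, lat1, cell=1.0):
    """Return the integer ids of the grid cells touched by a query box."""
    cols = math.ceil(360 / cell) + 1
    x0 = math.floor((lon0 + 180) / cell)
    x1 = math.floor((lon1 + 180) / cell)
    y0 = math.floor((lat0 + 90) / cell)
    y1 = math.floor((lat1 + 90) / cell)
    return [y * cols + x for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]


def dedup_cells(boxes):
    """Collect the cells of all query boxes in a set so each is scanned once."""
    seen = set()
    for box in boxes:
        seen.update(cells_for_box(*box))
    return seen
```

  • Two overlapping boxes that each touch 9 cells and share 4 of them yield 14 distinct cells to scan instead of 18.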
  • performing deduplication processing on the result set of the second indexing to obtain a deduplicated retrieval set includes:
  • Step S7062: perform index key value segmentation on the result set of the second indexing to obtain the attribute codes of the result set.
  • index key value segmentation is performed on the result set of the second indexing. As can be seen from the foregoing query conditions, each key value is a binary code composed of an attribute code and a space-time cross-index code, and the length of the latter is fixed, so the preceding attribute code can be obtained with bit operations. Grouping can therefore be performed according to the preceding attribute code, that is, the user ID.
  • Step S7064: perform data grouping according to the attribute code to obtain grouped data.
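  • Steps S7062 and S7064 can be sketched with a shift and a mask. The 30-bit width `ST_BITS` of the space-time code and the function names are assumptions; the text only states that this length is fixed.

```python
from collections import defaultdict

ST_BITS = 30  # assumed fixed width of the space-time cross-index code


def split_key(key):
    """Split a key into (attribute code, space-time code) with bit operations."""
    return key >> ST_BITS, key & ((1 << ST_BITS) - 1)


def group_by_attribute(keys):
    """Group the space-time codes of a result set by their attribute code."""
    groups = defaultdict(set)
    for key in keys:
        attr_code, st_code = split_key(key)
        groups[attr_code].add(st_code)
    return dict(groups)
```

  • The sketch uses exact sets per group for clarity; in the scheme described here, the per-group cardinality is instead estimated with HyperLogLog.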
  • Step S7066: perform deduplication statistics on the spatiotemporal index codes of the grouped data through the HyperLogLog algorithm, so as to obtain the cardinality of the retrieval set.
  • the traditional spatial indexing method is often organized using geohash.
  • this is a typical static index with a fixed number of grid cells, so what matters more is filtering by spatial range. For this reason, the HyperLogLog algorithm is rarely used in spatiotemporal data management scenarios.
  • the technical solution of the present disclosure uses the above-mentioned spatiotemporal attribute index model to explore associated groups in an epidemic. This model is not a static index, and the number of elements is variable. Therefore, the HyperLogLog algorithm is needed to perform fast deduplication statistics, obtain the cardinality of the simplified data set, and improve the efficiency of secondary indexing.
  • Step S7068: determine association relationship data according to the retrieval set, and generate a corresponding retrieval set.
  • some retrieval sets (that is, data sets) will be obtained; there may be many of these data sets, and each data set may contain a lot of data.
  • if the data in each data set is loaded into memory for counting before the correlation coefficient is calculated, then, because the calculation of the correlation coefficient requires the cardinality of each data set, this causes a lot of computational overhead and latency.
  • the use of the HyperLogLog algorithm can quickly complete the cardinality statistics.
  • performing deduplication statistics on the spatiotemporal index coding of the grouped data by using the HyperLogLog algorithm includes:
  • Step S80662: determine the bit value of the spatiotemporal index code, and perform bucket averaging processing on the bit value to determine the harmonic mean.
  • Step S80664: perform deviation correction on the grouped data according to the harmonic mean.
  • Step S80666: perform deduplication processing on the result of the deviation correction.
  • the HyperLogLog algorithm is used to deduplicate the binary codes of the space-time cross-index in each group.
  • the space-time codes need to be averaged in buckets, and the stored index values are divided into m buckets. The bucket is chosen according to the value of the first few bits of the hash value, and the harmonic mean over the m buckets is then computed.
  • user-defined parameters are used to correct the deviation, and finally the deduplicated statistical value of the association information between each associated user and the patient is obtained.
  • the formula of the HyperLogLog algorithm is as follows:
  • where constant is a parameter that corrects the result,
  • R_j represents the maximum number of leading zeros of the data in the j-th bucket, plus 1, and
  • m is a positive integer greater than 1.
  • the correlation coefficient between the patient and each associated user is calculated based on the statistical values of the association information, where the correlation coefficient is mainly a value calculated from the time interval and spatial distance of the results found.
  • the query apparatus 900 for spatiotemporal correlated data will be described below.
  • the apparatus 900 for querying spatiotemporal correlated data shown in FIG. 9 is only an example, and should not impose any limitations on the functions and scope of use of the embodiments of the present disclosure.
  • the apparatus 900 for querying spatiotemporal correlated data is represented in the form of a hardware module.
  • the components of the apparatus 900 for querying spatiotemporally correlated data may include, but are not limited to: a first indexing module 902, a second indexing module 904, a deduplication module 906 and a determination module 908.
  • the first indexing module 902 is configured to receive a query request, and perform a first indexing according to the query request to generate query conditions, where the query conditions include at least one of a time query condition, a spatial query condition and an object attribute query condition.
  • the second indexing module 904 is configured to determine the query range in the preset spatiotemporal attribute index model according to the query conditions, and perform a second indexing according to the query range.
  • the deduplication module 906 is configured to perform deduplication processing on the result set of the second indexing to obtain a deduplicated retrieval set.
  • the determination module 908 is configured to determine a spatiotemporal correlation query result according to the retrieval set and the query conditions.
  • the electronic device 1000 according to this embodiment of the present disclosure is described below with reference to FIG. 10.
  • the electronic device 1000 shown in FIG. 10 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
  • electronic device 1000 takes the form of a general-purpose computing device.
  • Components of the electronic device 1000 may include, but are not limited to, the above-mentioned at least one processing unit 1010, the above-mentioned at least one storage unit 1020, and a bus 1030 connecting different system components (including the storage unit 1020 and the processing unit 1010).
  • the storage unit stores program codes, which can be executed by the processing unit 1010, so that the processing unit 1010 performs the steps according to various exemplary embodiments of the present disclosure described in the above-mentioned "Exemplary Methods" section of this specification.
  • the processing unit 1010 may perform the steps shown in FIG. 2 to FIG. 8 , and other steps defined in the query method for spatiotemporal correlated data of the present disclosure.
  • the storage unit 1020 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 10201 and/or a cache storage unit 10202 , and may further include a read only storage unit (ROM) 10203 .
  • the storage unit 1020 may also include a program/utility 10204 having a set (at least one) of program modules 10205, including, but not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
  • the bus 1030 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
  • the electronic device 1000 may also communicate with one or more external devices 1040 (e.g., keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 1000 to communicate with one or more other computing devices. Such communication may occur through the input/output (I/O) interface 1050. Also, the electronic device 1000 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 1060. As shown in FIG. 10, the network adapter 1060 communicates with other modules of the electronic device 1000 through the bus 1030.
  • the exemplary embodiments described herein may be implemented by software, or may be implemented by software combined with necessary hardware. Therefore, the technical solutions according to the embodiments of the present disclosure may be embodied in the form of software products, and the software products may be stored in a non-volatile storage medium (which may be CD-ROM, U disk, mobile hard disk, etc.) or on the network , including several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to an embodiment of the present disclosure.
  • in an exemplary embodiment of the present disclosure, a computer-readable storage medium is also provided, on which a program product capable of implementing the above-described method of this specification is stored.
  • in some possible implementations, various aspects of the present disclosure can also be implemented in the form of a program product that includes program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps according to various exemplary embodiments of the present disclosure described in the "Exemplary Methods" section of this specification.
  • a program product for implementing the above method according to an embodiment of the present disclosure may adopt a portable compact disc read only memory (CD-ROM) and include program codes, and may run on a terminal device, such as a personal computer.
  • the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer readable signal medium may include a propagated data signal in baseband or as part of a carrier wave with readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a readable signal medium can also be any readable medium, other than a readable storage medium, that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a readable medium may be transmitted using any suitable medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user computing device, partly on the user device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server.
  • the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
  • although several modules or units of the apparatus for performing actions are mentioned in the above detailed description, this division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided into multiple modules or units to be embodied.
  • the exemplary embodiments described herein may be implemented by software, or may be implemented by software combined with necessary hardware. Therefore, the technical solutions according to the embodiments of the present disclosure may be embodied in the form of software products, and the software products may be stored in a non-volatile storage medium (which may be CD-ROM, U disk, mobile hard disk, etc.) or on the network , including several instructions to cause a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to an embodiment of the present disclosure.
  • with the attributes of these three different dimensions (the geohash code contains information of two dimensions, longitude and latitude), a new code can likewise be formed by cross-coding, and this code can serve as the key in a key-value database; on the one hand this ensures sufficient hashing and distributes the data fully across the encoding table, and on the other hand it improves query performance.
  • the result set of the second index is deduplicated to obtain the deduplicated retrieval set, and fast deduplicated counting yields the cardinality of the retrieval set, which reduces redundant data in the query records and thereby reduces data interaction pressure and computational pressure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Remote Sensing (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

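The present disclosure provides a method, apparatus, electronic device and storage medium for querying spatiotemporally correlated data, and relates to the field of computer technology. The method includes: receiving a query request, and performing a first index according to the query request to generate query conditions, where the query conditions include at least one of a time query condition, a spatial query condition, and an object attribute query condition; determining a query range in a preset spatiotemporal attribute index model according to the query conditions, and performing a second index according to the query range; deduplicating the result set of the second index to obtain a deduplicated retrieval set; and determining a spatiotemporal correlation query result according to the retrieval set and the query conditions. The technical solution of the present disclosure improves the indexing efficiency of spatiotemporally correlated data and reduces queries on duplicate data, thereby reducing data transmission pressure and data interaction pressure.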

Description

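Method, apparatus, electronic device and storage medium for querying spatiotemporally correlated data
The present disclosure claims priority to Chinese patent application No. 202011285394.1, filed on November 17, 2020 and entitled "Method, apparatus, electronic device and storage medium for querying spatiotemporally correlated data", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a method, apparatus, electronic device and storage medium for querying spatiotemporally correlated data.
Background
Existing correlation query schemes perform a secondary spatiotemporal query over the full data volume, usually in three steps. The execution and the shortcomings of the three steps are as follows:
(1) First, for the first-stage query over the time range, the spatial range and the attributes, the spatial index, the time index and the attribute index are executed separately at the bottom layer of the database, which produces a large amount of redundant query data.
(2) In the second step, a spatial query range and a time query range are generated from the spatial position information and time information in the large volume of query data produced by the first step. However, because human activity is irregular, the spatiotemporal query boxes may overlap. That is, the secondary query over the data set in the second step involves a large number of repeated operations, which adds unnecessary data scanning and network transmission and ultimately degrades the performance of the correlation query.
(3) In the third step, the full data set is computed, which occupies a large amount of memory; likewise, in a distributed environment, transmitting a large amount of data over the network degrades performance.
It should be noted that the information disclosed in the Background section above is only intended to enhance understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.
Summary
An object of the present disclosure is to provide a method, apparatus, electronic device and storage medium for querying spatiotemporally correlated data that overcome, at least to some extent, the problem of low query efficiency in the related art.
Other features and advantages of the present disclosure will become apparent from the following detailed description, or will be learned in part through practice of the present disclosure.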
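According to one aspect of the present disclosure, a method for querying spatiotemporally correlated data is provided, including: receiving a query request, and performing a first index according to the query request to generate query conditions, where the query conditions include at least one of a time query condition, a spatial query condition, and an object attribute query condition; determining a query range in a preset spatiotemporal attribute index model according to the query conditions, and performing a second index according to the query range; deduplicating the result set of the second index to obtain a deduplicated retrieval set; and determining a spatiotemporal correlation query result according to the retrieval set and the query conditions.
In an embodiment of the present disclosure, receiving the query request and performing the first index according to the query request includes: receiving the query request and determining the user's time data, spatial data and object attribute data contained in the query request; and generating a filling curve of the spatial data, and deriving the time data and the object attribute data on the filling curve of the spatial data to obtain a spatiotemporal index code.
In an embodiment of the present disclosure, the query of spatiotemporally correlated data further includes: writing the hash value of the user's identifier into the spatiotemporal index code.
In an embodiment of the present disclosure, the query of spatiotemporally correlated data further includes: writing the spatiotemporal index code, in key-value form, into the database to be indexed.
In an embodiment of the present disclosure, determining the query range in the preset spatiotemporal attribute index model according to the query conditions and performing the second index according to the query range includes: determining the query range according to the values in the query conditions; and querying the spatial data using a geocoding algorithm and the query range, and storing the query results as a result set of the hashset class.
In an embodiment of the present disclosure, deduplicating the result set of the second index to obtain the deduplicated retrieval set includes: splitting the index keys of the result set of the second index to obtain the attribute codes of the result set; grouping the data according to the attribute codes to obtain grouped data; performing deduplicated counting on the spatiotemporal index codes of the grouped data using the HyperLogLog algorithm to obtain the cardinality of the retrieval set; and determining association relationship data according to the retrieval set, and generating a corresponding retrieval set.
In an embodiment of the present disclosure, performing deduplicated counting on the spatiotemporal index codes of the grouped data using the HyperLogLog algorithm includes: determining the bit values of the spatiotemporal index code, and performing bucketed averaging on the bit values to determine a harmonic mean; performing bias correction on the grouped data according to the harmonic mean; and deduplicating the result of the bias correction.
According to another aspect of the present disclosure, an apparatus for querying spatiotemporally correlated data is provided, including: a first indexing module, configured to receive a query request and perform a first index according to the query request to generate query conditions, where the query conditions include at least one of a time query condition, a spatial query condition, and an object attribute query condition; a second indexing module, configured to determine a query range in a preset spatiotemporal attribute index model according to the query conditions, and perform a second index according to the query range; a deduplication module, configured to deduplicate the result set of the second index to obtain a deduplicated retrieval set; and a determination module, configured to determine a spatiotemporal correlation query result according to the retrieval set and the query conditions.
According to a further aspect of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing instructions executable by the processor; where the processor is configured to perform any one of the above methods for querying spatiotemporally correlated data by executing the executable instructions.
According to yet another aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements any one of the above methods for querying spatiotemporally correlated data.
In the query scheme for spatiotemporally correlated data provided by the embodiments of the present disclosure, the attributes of these three different dimensions (the geohash code contains information of two dimensions, longitude and latitude) can likewise be cross-coded to form a new code, and this code can serve as the key in a key-value database. On the one hand this ensures sufficient hashing and distributes the data fully across the encoding table; on the other hand it improves query performance.
Further, the result set of the second index is deduplicated to obtain the deduplicated retrieval set, and fast deduplicated counting yields the cardinality of the retrieval set, which reduces redundant data in the query records and thereby reduces data interaction pressure and computational pressure.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present disclosure.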
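Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure. Obviously, the drawings described below are only some embodiments of the present disclosure, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1A is a schematic diagram of the index model of a query scheme for spatiotemporally correlated data in an embodiment of the present disclosure;
FIG. 1B is a schematic diagram of the index range of a query scheme for spatiotemporally correlated data in an embodiment of the present disclosure;
FIG. 1C is a schematic diagram of the indexing process of a query scheme for spatiotemporally correlated data in an embodiment of the present disclosure;
FIG. 1D is a schematic diagram of the spatiotemporal cross-index model of a query scheme for spatiotemporally correlated data in an embodiment of the present disclosure;
FIG. 1E shows the spatiotemporal cross-index model of another query scheme for spatiotemporally correlated data in an embodiment of the present disclosure;
FIG. 1F is a schematic diagram of the storage structure of a query scheme for spatiotemporally correlated data in an embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for querying spatiotemporally correlated data in an embodiment of the present disclosure;
FIG. 3 is a flowchart of another method for querying spatiotemporally correlated data in an embodiment of the present disclosure;
FIG. 4 is a flowchart of another method for querying spatiotemporally correlated data in an embodiment of the present disclosure;
FIG. 5 is a flowchart of another method for querying spatiotemporally correlated data in an embodiment of the present disclosure;
FIG. 6 is a flowchart of another method for querying spatiotemporally correlated data in an embodiment of the present disclosure;
FIG. 7 is a flowchart of another method for querying spatiotemporally correlated data in an embodiment of the present disclosure;
FIG. 8 is a flowchart of another method for querying spatiotemporally correlated data in an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of an apparatus for querying spatiotemporally correlated data in an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of an electronic device in an embodiment of the present disclosure.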
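Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments can, however, be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures or characteristics may be combined in any suitable manner in one or more embodiments.
In addition, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and their repeated description will therefore be omitted. Some of the block diagrams shown in the drawings are functional entities that do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In the scheme provided by the present disclosure, the attributes of these three different dimensions (the geohash code contains information of two dimensions, longitude and latitude) can likewise be cross-coded to form a new code, and this code can serve as the key in a key-value database. On the one hand this ensures sufficient hashing and distributes the data fully across the encoding table; on the other hand it improves query performance.
Further, the result set of the second index is deduplicated to obtain the deduplicated retrieval set, and fast deduplicated counting yields the cardinality of the retrieval set, which reduces redundant data in the query records and thereby reduces data interaction pressure and computational pressure.
The scheme provided by the embodiments of the present disclosure involves technologies such as spatiotemporal cross-indexing and big data deduplication, and is described through the following embodiments.
FIGS. 1A, 1B, 1C, 1D, 1E and 1F are schematic diagrams of an architecture for querying spatiotemporally correlated data in an embodiment of the present disclosure.
As shown in FIG. 1A, the principle of the query architecture for spatiotemporally correlated data and of the quadtree 102 index mechanism it interacts with, in an embodiment of the present disclosure, is as follows:
When the quadtree 102 index mechanism is used to organize spatial data, the query results based on the quadtree 102 index, the time index and the prefix tree index must still be transmitted to the server at query time for unified data processing.
The basic idea of the quadtree 102 index is to recursively divide geographic space into a tree structure with different levels. It divides a space of known extent into four equal subspaces and recurses in this way until the tree reaches a certain depth or some other requirement is met. The structure of the quadtree 102 is relatively simple, and when spatial data objects are distributed fairly evenly it offers relatively high insertion and query efficiency, so the quadtree 102 is one of the spatial indexes commonly used in GIS (Geographic Information System). In a conventional quadtree 102, geospatial objects are stored in the leaf nodes; the intermediate nodes and the root node store no geospatial objects.
The quadtree 102 is relatively efficient for region queries. However, if spatial objects are unevenly distributed, the levels of the quadtree 102 keep deepening as geospatial objects are inserted, forming a severely unbalanced quadtree 102; the depth of each query then increases greatly, and query efficiency drops sharply.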
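The quadtree behaviour described above can be illustrated with a minimal sketch. It is not taken from the disclosure: the class name `QuadTree`, the leaf capacity of 4 and the depth limit of 16 are assumptions chosen for illustration. Points are kept only in leaves, a leaf splits into four equal quadrants when it overflows, and `height()` makes the imbalance caused by clustered data visible.

```python
from dataclasses import dataclass, field


@dataclass
class QuadTree:
    """Point quadtree over the box [x0, x1) x [y0, y1); points live only in leaves."""
    x0: float
    y0: float
    x1: float
    y1: float
    capacity: int = 4     # max points per leaf before it splits (assumed value)
    max_depth: int = 16   # splitting stops at this depth (assumed value)
    depth: int = 0
    points: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def insert(self, x, y):
        """Insert a point; return False if it lies outside this node's box."""
        if not (self.x0 <= x < self.x1 and self.y0 <= y < self.y1):
            return False
        if self.children:
            return any(c.insert(x, y) for c in self.children)
        self.points.append((x, y))
        if len(self.points) > self.capacity and self.depth < self.max_depth:
            self._split()
        return True

    def _split(self):
        """Divide this leaf into four equal quadrants and push its points down."""
        mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        d = self.depth + 1
        self.children = [
            QuadTree(self.x0, self.y0, mx, my, self.capacity, self.max_depth, d),
            QuadTree(mx, self.y0, self.x1, my, self.capacity, self.max_depth, d),
            QuadTree(self.x0, my, mx, self.y1, self.capacity, self.max_depth, d),
            QuadTree(mx, my, self.x1, self.y1, self.capacity, self.max_depth, d),
        ]
        pts, self.points = self.points, []
        for px, py in pts:
            any(c.insert(px, py) for c in self.children)

    def query(self, qx0, qy0, qx1, qy1):
        """Return the points inside the query box [qx0, qx1) x [qy0, qy1)."""
        if qx1 <= self.x0 or qx0 >= self.x1 or qy1 <= self.y0 or qy0 >= self.y1:
            return []
        if self.children:
            return [p for c in self.children for p in c.query(qx0, qy0, qx1, qy1)]
        return [(x, y) for x, y in self.points if qx0 <= x < qx1 and qy0 <= y < qy1]

    def height(self):
        """Number of levels from this node down to its deepest leaf."""
        return 1 + max((c.height() for c in self.children), default=0)
```

With evenly spread points the tree stays shallow, whereas a tight cluster of points drives one branch down to the depth limit, which is the imbalance the paragraph above warns about.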
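As shown in FIG. 1A, the time index 104 generates a corresponding spatial query range and time query range from the spatial position information and time information in the query records.
The time index data is stored under a unified naming scheme, for example "date-hour-minute-second", such as "2020-01-01 00:00:00", "2020-01-01 00:00:01", "2020-01-01 00:00:02", "2020-01-01 00:00:03", "2020-01-01 00:00:04", "2020-01-01 00:00:05", ..., "2020-01-04 00:00:00".
As shown in FIG. 1A, the prefix tree 106 is a special form of N-ary tree. Generally, a prefix tree 106 is used to store strings. Each node of the prefix tree 106 represents a string (a prefix). Each node has multiple child nodes, and the paths to different child nodes carry different characters. The string represented by a child node consists of the original string of its parent node plus the character on the path leading to that child node.
For example, the first-level node of the prefix tree 106 is "B", the second-level nodes are "u", "i" and "o", and the third-level nodes are "k", "i", "h", "g", "e", "f", "b", "c" and "a", from which a user ID "Bob" can be obtained.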
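A minimal sketch of such a prefix tree follows. The class and method names (`Trie`, `TrieNode`, `with_prefix`) are illustrative assumptions, not names from the disclosure. Each node keeps one child per next character, so walking the characters "B", "o", "b" from the root reaches the node for the user ID "Bob".

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # next character -> child node
        self.terminal = False   # True if a stored ID ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Store a string by walking or creating one node per character."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True

    def with_prefix(self, prefix):
        """Return all stored strings that start with prefix, sorted."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        out, stack = [], [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.terminal:
                out.append(text)
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        return sorted(out)
```

For instance, after inserting "Bob", "Bik" and "Bug", a lookup with the prefix "Bo" returns only "Bob".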
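As shown in FIG. 1B, because human activity is irregular (for example, the location reports generated by a patient may be very close together), the spatiotemporal query boxes may overlap. That is, when the data set is queried a second time in the second step, a large number of repeated query ranges appear, for example a first query range 108A, a second query range 108B, a third query range 108C and a fourth query range 108D.
As shown in FIG. 1C, a primary query condition is generated from the client's index information, and a primary query result is generated from the primary query condition. The client generates multiple secondary query conditions from the primary query result and sends them to the network side, for example secondary query condition 2_1, secondary query condition 2_2, secondary query condition 2_3, ..., secondary query condition 2_n. Correspondingly, the index determines secondary query result 2_1, secondary query result 2_2, secondary query result 2_3, ..., secondary query result 2_n in the database, and the aggregated query result is sent to the network side.
Based on the index mechanisms shown in FIGS. 1A, 1B and 1C, the inventors propose an improved way of storing and organizing the data:
As shown in FIG. 1D, for index construction, because spatiotemporal correlation queries place more emphasis on inference over space and time, a Z3 filling curve that incorporates the time attribute is derived from the traditional Z space-filling curve. The Z3 index 110 is a geohash-like binary code, an index mechanism that cross-codes longitude, latitude and time. On top of this code, the hash value of the person's ID (identity document number) is appended at the end, for example the Z3 index models 1101, 1102, 1103, 1104, ..., 1110n shown in FIG. 1E.
For example, the index condition is "user Bob, coordinates (119.25 35.45), 2020-06-16 00:23:45", which is split into the user ID "user Bob", the geographic coordinates (119.25 35.45) and the time "00:23:45 on June 16, 2020"; as shown in FIG. 1E, it points to Z3 index model 1103.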
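The cross-coding of longitude, latitude and time can be sketched as bit interleaving. The following is only an illustration under stated assumptions: 21 bits per dimension (so the code fits in a 63-bit long), a one-week time period per curve, and UTC timestamps are all choices made here, and the function names are hypothetical. The disclosure does not specify these parameters.

```python
from datetime import datetime, timezone

BITS = 21                # bits per dimension (assumed); 3 * 21 = 63 bits
WEEK = 7 * 24 * 3600     # time span covered by one curve, in seconds (assumed)


def quantize(value, lo, hi, bits=BITS):
    """Map value in [lo, hi] to an integer cell index in [0, 2**bits - 1]."""
    cells = 1 << bits
    idx = int((value - lo) / (hi - lo) * cells)
    return min(max(idx, 0), cells - 1)


def interleave3(a, b, c, bits=BITS):
    """Interleave bits, most significant first: a_n b_n c_n ... a_0 b_0 c_0."""
    code = 0
    for i in range(bits - 1, -1, -1):
        code = (code << 1) | ((a >> i) & 1)
        code = (code << 1) | ((b >> i) & 1)
        code = (code << 1) | ((c >> i) & 1)
    return code


def deinterleave3(code, bits=BITS):
    """Inverse of interleave3: recover the three per-dimension integers."""
    a = b = c = 0
    for i in range(bits):
        c |= ((code >> (3 * i)) & 1) << i
        b |= ((code >> (3 * i + 1)) & 1) << i
        a |= ((code >> (3 * i + 2)) & 1) << i
    return a, b, c


def z3_encode(lon, lat, ts):
    """Return (period number, interleaved code) for lon, lat and a Unix timestamp."""
    period, offset = divmod(int(ts), WEEK)
    x = quantize(lon, -180.0, 180.0)
    y = quantize(lat, -90.0, 90.0)
    t = quantize(offset, 0, WEEK)
    return period, interleave3(x, y, t)


# The example record from the text, with the timestamp interpreted as UTC (assumption).
ts = datetime(2020, 6, 16, 0, 23, 45, tzinfo=timezone.utc).timestamp()
period, code = z3_encode(119.25, 35.45, ts)
```

Because the high-order bits of all three dimensions come first, records that are close in both space and time tend to share a long common key prefix, which is what lets a single range scan cover a spatiotemporal box.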
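As shown in FIGS. 1E and 1F, because correlation queries target spatiotemporal big data scenarios, large data volumes must be supported; and because the data is often structured or semi-structured, a more flexible data model design is needed. NoSQL (non-relational) databases are better suited to these problems, which traditional databases cannot support well. Therefore, in the present invention the data is stored and organized on a commonly used NoSQL database 112, whose basic data structure is stored on disk in Key-Value form.
As shown in FIG. 1F, the Key is the address information that the disk head uses to locate the data, for example "Key01", and the Value is the data itself stored in binary form, for example "value01". This ensures that, as long as the query address is known, the specific data can be retrieved quickly; the process is independent of the data volume and depends only on disk I/O (Input/Output) performance.
Construction of the secondary query conditions according to an embodiment of the present disclosure:
(1) After the spatiotemporal records of a person are obtained, when the second query is constructed, the spatiotemporal query conditions converted from the first data set need to be converted into spatiotemporal index ranges in order to avoid overlapping queries.
(2) At this point, the spatial range of the query is controlled by Geohash (spatial index address code) cells. The result is a collection of Long (long integer) values; because they are long integers, a HashSet can be used to deduplicate them when they are stored.
(3) Querying with the deduplicated set guarantees that the query addresses are pairwise distinct and that data is not scanned repeatedly. Moreover, because the query conditions are generated from geohash cells, the query results are the required results and need no further confirmation.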
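The three steps above can be sketched as follows: each first-pass record is expanded into the integer ids of the grid cells its query box touches, and the ids are collected in a set so that overlapping boxes contribute each cell only once. The grid resolution (16 bits per axis), the search radius in degrees and the function names are assumptions for illustration; a Python `set` of integers stands in for the HashSet of Long values mentioned in the text.

```python
def quantize(value, lo, hi, bits):
    """Map value in [lo, hi] to an integer cell index in [0, 2**bits - 1]."""
    n = 1 << bits
    return min(max(int((value - lo) / (hi - lo) * n), 0), n - 1)


def cell_id(x, y, bits):
    """Interleave the cell coordinates into one integer, geohash style (lon bit first)."""
    code = 0
    for i in range(bits - 1, -1, -1):
        code = (code << 2) | (((x >> i) & 1) << 1) | ((y >> i) & 1)
    return code


def cells_for_box(lon, lat, radius_deg, bits=16):
    """Ids of every grid cell touched by the box centred on (lon, lat)."""
    x_lo = quantize(lon - radius_deg, -180.0, 180.0, bits)
    x_hi = quantize(lon + radius_deg, -180.0, 180.0, bits)
    y_lo = quantize(lat - radius_deg, -90.0, 90.0, bits)
    y_hi = quantize(lat + radius_deg, -90.0, 90.0, bits)
    return {cell_id(x, y, bits)
            for x in range(x_lo, x_hi + 1)
            for y in range(y_lo, y_hi + 1)}


def secondary_query_cells(records, radius_deg=0.01, bits=16):
    """Union of cell ids over all first-pass records; the set removes overlaps."""
    cells = set()
    for lon, lat in records:
        cells |= cells_for_box(lon, lat, radius_deg, bits)
    return cells
```

Two reports a few hundred metres apart produce boxes that share most of their cells, so the merged set is noticeably smaller than the two boxes counted separately, and each cell is scanned once.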
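Final correlation statistics according to an embodiment of the present disclosure:
When traditional schemes compute statistics, they lack a deduplicating counting algorithm for large data volumes, so they still have to collect the full data, aggregate the results of the different query conditions by person ID, and compute the spatiotemporal correlation in each case. This may involve a large amount of data, and such large-scale aggregation may put great pressure on the system. To solve this problem, the HyperLogLog algorithm (an optimized algorithm for deduplicated counting based on logarithmic functions) is introduced here, which can perform aggregate statistics over data sets of this scale.
HyperLogLog is a data statistics algorithm commonly used in big data scenarios; it computes the cardinality of large data sets using few resources. It has excellent space complexity: its storage footprint does not grow as the cardinality of the data set grows. Although it introduces some error, for large data sets, with correction parameters configured for higher precision, the error can be kept below 1%, and the pressure on memory is negligible. For example, counting the data of 100 million users requires only about 12 KB of memory.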
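The 12 KB and sub-1% figures are consistent with one commonly used HyperLogLog configuration. The disclosure does not state its parameters, so the following arithmetic assumes m = 2^14 buckets of 6 bits each and uses the well-known standard-error estimate of roughly 1.04/√m:

```latex
\begin{aligned}
m &= 2^{14} = 16384 \ \text{buckets}, \qquad \text{6 bits per bucket (assumed)}\\
\text{memory} &= 16384 \times 6\ \text{bits} = 98304\ \text{bits} = 12288\ \text{bytes} = 12\ \text{KiB}\\
\text{standard error} &\approx \frac{1.04}{\sqrt{m}} = \frac{1.04}{128} \approx 0.81\%
\end{aligned}
```

The memory depends only on m and the bucket width, not on how many elements are counted, which is why the footprint stays the same for one thousand or one hundred million users.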
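For example, after the secondary spatiotemporal query finishes, 10 billion records are obtained. These records may come from different users, and statistics are needed for each individual user; because the times and places at which a user generates location reports may be very close together, a large amount of duplicate data again appears. These duplicates need to be removed during counting.
First, the index keys in the data need to be split. From the query conditions described above, each key is a binary code composed of the attribute followed by the spatiotemporal cross-index. The length of the latter is fixed, so bit operations can be used to extract the attribute code in front, and the data can then be grouped by that attribute code, that is, by user ID.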
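The key splitting described above amounts to a shift and a mask. In this sketch the spatiotemporal part is assumed to be 63 bits long and the attribute code to sit in the higher bits; both the width and the helper names are illustrative assumptions.

```python
from collections import defaultdict

ST_BITS = 63  # fixed length of the spatiotemporal code in bits (assumed)


def make_key(attr_code, st_code):
    """Compose a key: attribute code in the high bits, spatiotemporal code in the low bits."""
    return (attr_code << ST_BITS) | st_code


def split_key(key):
    """Recover (attribute code, spatiotemporal code) with a shift and a mask."""
    return key >> ST_BITS, key & ((1 << ST_BITS) - 1)


def group_by_attribute(keys):
    """Group the spatiotemporal codes of a result set by their attribute code."""
    groups = defaultdict(list)
    for key in keys:
        attr, st = split_key(key)
        groups[attr].append(st)
    return dict(groups)
```

Because the spatiotemporal part has a fixed width, no delimiter or length field is needed to find where the attribute code ends.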
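Second, the HyperLogLog algorithm is applied to the binary codes of the spatiotemporal cross-index within each group to perform deduplicated counting. In this step, the spatiotemporal codes are bucketed and averaged: the stored index values are distributed into m buckets, the bucket being chosen by the value of the first few bits of the hash value, and the harmonic mean over the m buckets is computed. Finally, a user-defined parameter is used to correct the bias, yielding the deduplicated count of association information between each associated user and the patient. The formula of the HyperLogLog algorithm is as follows:
Figure PCTCN2021116775-appb-000001
where constant is a parameter that corrects the result, R_j is the maximum number of leading zeros plus 1 among the data in the j-th bucket, and m is a positive integer greater than 1.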
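The formula itself is an image that is not reproduced in this text. For reference, the standard HyperLogLog estimator uses exactly the symbols defined above; assuming the figure shows this standard form, it reads:

```latex
E \;=\; \mathrm{constant}\cdot m^{2}\cdot\Bigl(\sum_{j=1}^{m} 2^{-R_j}\Bigr)^{-1}
```

Equivalently, E is constant · m times the harmonic mean of the values 2^{R_j} over the m buckets.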
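The HyperLogLog algorithm above uses a hash function (in the epidemic correlation analysis scenario, the hash value of the spatiotemporal attribute index) to compute a hash value for each element in the data stream. For each hash value, the last P bits determine the bucket number; in epidemic correlation analysis, the buckets can be assigned according to the hash of the attribute value. After all elements have been processed, the harmonic mean of the values in all buckets is computed (the harmonic mean is obtained from the reciprocals of all the values), and it is finally multiplied by m to obtain the final result E.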
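The procedure can be sketched in a few lines. This is a generic HyperLogLog, not the disclosure's implementation: the hash function (the first 8 bytes of SHA-1), the precision p = 12, the bias constant 0.7213 / (1 + 1.079 / m) and the small-range correction are standard choices assumed here. The last p bits of the hash select the bucket, as in the paragraph above.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                      # number of buckets
        self.registers = [0] * self.m        # R_j for each bucket
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        """Hash the item, pick a bucket from the last p bits, keep the max rank."""
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        j = h & (self.m - 1)                         # last p bits -> bucket index
        w = h >> self.p                              # remaining 64 - p bits
        rank = (64 - self.p) - w.bit_length() + 1    # leading zeros + 1
        self.registers[j] = max(self.registers[j], rank)

    def count(self):
        """Estimate the number of distinct items added so far."""
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return int(round(estimate))
```

Adding the same element again never changes a register, so duplicates are ignored without being stored; with p = 12 the expected relative error is roughly 1.04/√4096, about 1.6%.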
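Finally, based on these association statistics, the correlation coefficient between the patient and each associated user is computed; the correlation coefficient here is a value calculated mainly from the time interval and the spatial distance of the retrieved results.
The steps of the method for querying spatiotemporally correlated data in this example embodiment are described in more detail below with reference to the drawings and embodiments.
FIG. 2 is a flowchart of a method for querying spatiotemporally correlated data in an embodiment of the present disclosure. The method provided by the embodiments of the present disclosure can be executed by any electronic device with computing capability, such as a server or a terminal, but is not limited thereto. In the following example, the terminal is used as the executing entity.
As shown in FIG. 2, the terminal executes the method for querying spatiotemporally correlated data, which includes the following steps:
Step S202: receive a query request, and perform a first index according to the query request to generate query conditions, where the query conditions include at least one of a time query condition, a spatial query condition, and an object attribute query condition.
In the above embodiment, the time query condition determines the time window of an index, such as one hour, one day, one week, one month or one year, but is not limited thereto.
In addition, the spatial query condition determines the geographic extent, such as a building, a residential community, a block, a district, a city or a province, but is not limited thereto.
Finally, the object attribute condition determines the unique identifier of an individual object, such as a fingerprint, a voiceprint, a name or an ID, but is not limited thereto.
Step S204: determine a query range in a preset spatiotemporal attribute index model according to the query conditions, and perform a second index according to the query range.
Step S206: deduplicate the result set of the second index to obtain a deduplicated retrieval set.
In the above embodiment, converting the query conditions into query ranges reduces overlapping queries and simplifies the query logic and query conditions. For example, the spatial range of the query is controlled by Geohash cells, and the query result is a collection of long integer values; because the query results are long integers, a hash set can be used to deduplicate them when they are stored.
Step S208: determine a spatiotemporal correlation query result according to the retrieval set and the query conditions.
In the above embodiment, querying with the deduplicated set ensures that the query addresses do not overlap. In addition, computing the final statistics over the result set of the spatiotemporal cross-index model further reduces the retrieval results and improves retrieval efficiency and reliability.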
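To make the flow of steps S202 to S208 concrete, here is a toy in-memory sketch. Everything in it is a simplification assumed for illustration: records are `(user_id, lon, lat, ts)` tuples, a coarse `(lon, lat, hour)` cell stands in for the spatiotemporal index code, a dictionary stands in for the key-value store, and the "correlation" returned is just the number of distinct cells a user shares with the patient.

```python
from collections import defaultdict


def cell_of(lon, lat, ts, size=0.01, window=3600):
    """Coarse spatiotemporal cell; a hypothetical stand-in for the index code."""
    return (int(lon // size), int(lat // size), int(ts // window))


def query_contacts(records, patient_id, t_start, t_end):
    # S202: first index, the patient's own records inside the time window.
    first = [r for r in records if r[0] == patient_id and t_start <= r[3] <= t_end]

    # S204: convert them into a deduplicated set of cells, then look up each cell once.
    cells = {cell_of(lon, lat, ts) for _, lon, lat, ts in first}
    index = defaultdict(list)
    for r in records:
        index[cell_of(r[1], r[2], r[3])].append(r)
    hits = [r for c in cells for r in index.get(c, [])]

    # S206: group by user and deduplicate the cells each user shares with the patient.
    by_user = defaultdict(set)
    for uid, lon, lat, ts in hits:
        if uid != patient_id:
            by_user[uid].add(cell_of(lon, lat, ts))

    # S208: report the number of distinct shared cells per user.
    return {uid: len(shared) for uid, shared in by_user.items()}
```

In a real deployment the exact sets in S206 would be replaced by a cardinality sketch such as HyperLogLog, and the cell lookup by range scans over the key-value store.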
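On the basis of the method steps shown in FIG. 2, as shown in FIG. 3, receiving the query request and performing the first index according to the query request includes:
Step S302: receive the query request and determine the user's time data, spatial data and object attribute data contained in the query request.
Step S304: generate a filling curve of the spatial data, and derive the time data and the object attribute data on the filling curve of the spatial data to obtain a spatiotemporal index code.
In the above embodiment, the inventors found that the traditional Geohash spatial index has no time dimension, so at the bottom layer of the database the spatial index, the time index and the attribute index are organized separately. In other words, the data may be stored redundantly in several copies, and three separate scans are performed during data scanning, which on the one hand occupies a lot of storage space and on the other hand increases query time.
The spatiotemporal attribute index model of the present disclosure changes the previous two-dimensional index (longitude, latitude) into a four-dimensional index (longitude, latitude, time, attribute). The traditional geohash value is itself a string of 0s and 1s obtained by cross-coding; time can be converted into a string of 0s and 1s using Unix time encoding; and the attribute value can likewise be converted into a string of 0s and 1s using a hash algorithm. The attributes of these three different dimensions (the geohash code contains information of two dimensions, longitude and latitude) can likewise be cross-coded to form a new code.
On the basis of the method steps shown in FIG. 2, as shown in FIG. 4, the query of spatiotemporally correlated data further includes:
Step S402: write the hash value of the user's identifier into the spatiotemporal index code.
In the above embodiment, in the spatiotemporal cross-index model of the present disclosure, the code of the attribute value is not added to the cross-coding of the spatiotemporal dimensions; instead it is placed as a prefix in front of the spatiotemporal attribute index, so that users can query and filter by attribute, which also accommodates business scenarios that require such attribute queries.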
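Placing the attribute code in front of the spatiotemporal code means that all keys of one user are contiguous in a sorted key space, so filtering by attribute becomes a prefix scan. The sketch below illustrates this with byte-string keys; the 4-byte SHA-1 prefix, the 8-byte big-endian spatiotemporal code and the function names are assumptions, and a sorted Python list stands in for an ordered key-value store.

```python
import bisect
import hashlib
import struct


def attr_prefix(user_id):
    """4-byte hash of the user identifier (hash function and width are assumptions)."""
    return hashlib.sha1(user_id.encode("utf-8")).digest()[:4]


def row_key(user_id, st_code):
    """Attribute prefix followed by the 8-byte big-endian spatiotemporal code."""
    return attr_prefix(user_id) + struct.pack(">Q", st_code)


def scan_prefix(sorted_keys, prefix):
    """Return all keys that start with prefix from a lexicographically sorted list."""
    start = bisect.bisect_left(sorted_keys, prefix)
    out = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        out.append(key)
    return out


def st_code_of(key):
    """Extract the spatiotemporal code from the last 8 bytes of a key."""
    return struct.unpack(">Q", key[-8:])[0]
```

Big-endian packing keeps the byte order of the keys aligned with the numeric order of the codes, so within one user's prefix the records remain sorted along the spatiotemporal curve.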
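On the basis of the method steps shown in FIG. 2, as shown in FIG. 5, the query of spatiotemporally correlated data further includes:
Step S502: write the spatiotemporal index code, in key-value form, into the database to be indexed.
In the above embodiment, by writing the spatiotemporal index code into the database to be indexed in key-value form, this code can serve as the key in a key-value database. On the one hand this ensures sufficient hashing and distributes the data fully across the encoding table; on the other hand it improves query performance.
On the basis of the method steps shown in FIG. 2, as shown in FIG. 6, determining the query range in the preset spatiotemporal attribute index model according to the query conditions and performing the second index according to the query range includes:
Step S6042: determine the query range according to the values in the query conditions.
Step S6044: query the spatial data using a geocoding algorithm and the query range, and store the query results as a result set of the hashset class.
In the above embodiment, the spatial data is queried using the geocoding algorithm and the query range, that is, the deduplicated set is used for querying. This guarantees that the query addresses are pairwise distinct and that data is not scanned repeatedly; moreover, because the query conditions are generated from geohash cells, the query results are obtained directly, which improves indexing efficiency.
On the basis of the method steps shown in FIG. 2, as shown in FIG. 7, deduplicating the result set of the second index to obtain the deduplicated retrieval set includes:
Step S7062: split the index keys of the result set of the second index to obtain the attribute codes of the result set.
In the above embodiment, the index keys of the result set of the second index are split. From the query conditions described above, each key is a binary code composed of the attribute followed by the spatiotemporal cross-index. The length of the latter is fixed, so bit operations can be used to extract the attribute code in front, and the data can then be grouped by that attribute code, that is, by user ID.
Step S7064: group the data according to the attribute codes to obtain grouped data.
Step S7066: perform deduplicated counting on the spatiotemporal index codes of the grouped data using the HyperLogLog algorithm to obtain the cardinality of the retrieval set.
In the above embodiment, traditional spatial indexing is usually organized with geohash, which is a typical static index with a fixed number of cells, so what it mostly needs is filtering by spatial range; HyperLogLog is therefore rarely used in spatiotemporal data management. The technical solution of the present disclosure, which targets the investigation of associated groups of people during an epidemic, relies on the spatiotemporal attribute index model described above. This model is not a static index and its number of elements is variable, so HyperLogLog is needed for fast deduplicated counting to obtain the cardinality of the reduced data set, which improves the efficiency of the secondary index.
Step S7068: determine association relationship data according to the retrieval set, and generate a corresponding retrieval set.
In the above embodiment, after the secondary index finishes, a number of retrieval sets (i.e. data sets) are obtained. There may be many such data sets, each containing a large amount of data. Computing the correlation coefficient requires the cardinality of each data set; loading every record of every data set into memory in order to count it would cause large computational overhead and latency. The HyperLogLog algorithm completes this cardinality counting quickly.
On the basis of the method steps shown in FIG. 2 and FIG. 7, as shown in FIG. 8, performing deduplicated counting on the spatiotemporal index codes of the grouped data using the HyperLogLog algorithm includes:
Step S80662: determine the bit values of the spatiotemporal index code, and perform bucketed averaging on the bit values to determine a harmonic mean.
Step S80664: perform bias correction on the grouped data according to the harmonic mean.
Step S80666: deduplicate the result of the bias correction.
In the above embodiment, the HyperLogLog algorithm is applied to the binary codes of the spatiotemporal cross-index within each group to perform deduplicated counting. In this step, the spatiotemporal codes are bucketed and averaged: the stored index values are distributed into m buckets, the bucket being chosen by the value of the first few bits of the hash value, and the harmonic mean over the m buckets is computed. Finally, a user-defined parameter is used to correct the bias, yielding the deduplicated count of association information between each associated user and the patient. The formula of the HyperLogLog algorithm is as follows:
Figure PCTCN2021116775-appb-000002
where constant is a parameter that corrects the result, R_j is the maximum number of leading zeros plus 1 among the data in the j-th bucket, and m is a positive integer greater than 1.
Finally, based on these association statistics, the correlation coefficient between the patient and each associated user is computed; the correlation coefficient here is a value determined mainly from the time interval and the spatial distance of the retrieved results.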
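The apparatus 900 for querying spatiotemporally correlated data according to this embodiment of the present disclosure is described below with reference to FIG. 9. The apparatus 900 shown in FIG. 9 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
The apparatus 900 for querying spatiotemporally correlated data takes the form of hardware modules. The components of the apparatus 900 may include, but are not limited to: a first indexing module 902, a second indexing module 904, a deduplication module 906 and a determination module 908.
The first indexing module 902 is configured to receive a query request, and perform a first index according to the query request to generate query conditions, where the query conditions include at least one of a time query condition, a spatial query condition, and an object attribute query condition.
The second indexing module 904 is configured to determine a query range in a preset spatiotemporal attribute index model according to the query conditions, and perform a second index according to the query range.
The deduplication module 906 is configured to deduplicate the result set of the second index to obtain a deduplicated retrieval set.
The determination module 908 is configured to determine a spatiotemporal correlation query result according to the retrieval set and the query conditions.
The electronic device 1000 according to this embodiment of the present disclosure is described below with reference to FIG. 10. The electronic device 1000 shown in FIG. 10 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 10, the electronic device 1000 takes the form of a general-purpose computing device. The components of the electronic device 1000 may include, but are not limited to: the above-mentioned at least one processing unit 1010, the above-mentioned at least one storage unit 1020, and a bus 1030 connecting different system components (including the storage unit 1020 and the processing unit 1010).
The storage unit stores program code that can be executed by the processing unit 1010, so that the processing unit 1010 performs the steps according to various exemplary embodiments of the present disclosure described in the "Exemplary Methods" section of this specification. For example, the processing unit 1010 may perform the steps shown in FIG. 2 to FIG. 8, as well as other steps defined in the method for querying spatiotemporally correlated data of the present disclosure.
The storage unit 1020 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 10201 and/or a cache storage unit 10202, and may further include a read-only storage unit (ROM) 10203.
The storage unit 1020 may also include a program/utility 10204 having a set (at least one) of program modules 10205, including, but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The bus 1030 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
The electronic device 1000 may also communicate with one or more external devices 1040 (e.g., keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 1000 to communicate with one or more other computing devices. Such communication may occur through the input/output (I/O) interface 1050. Also, the electronic device 1000 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 1060. As shown in FIG. 10, the network adapter 1060 communicates with other modules of the electronic device 1000 through the bus 1030. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.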
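From the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described herein may be implemented in software, or in software combined with the necessary hardware. Therefore, the technical solutions according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive or a removable hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, a server, a terminal device or a network device) to execute the method according to an embodiment of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium is also provided, on which a program product capable of implementing the above-described method of this specification is stored. In some possible implementations, various aspects of the present disclosure can also be implemented in the form of a program product that includes program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps according to various exemplary embodiments of the present disclosure described in the "Exemplary Methods" section of this specification.
A program product for implementing the above method according to an embodiment of the present disclosure may adopt a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate or transport a program for use by or in connection with an instruction execution system, apparatus or device.
Program code embodied on a readable medium may be transmitted using any suitable medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user computing device, partly on the user device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
It should be noted that although several modules or units of the apparatus for performing actions are mentioned in the above detailed description, this division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided into multiple modules or units.
Furthermore, although the steps of the method of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all of the illustrated steps must be performed, to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be broken down into multiple steps.
From the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described herein may be implemented in software, or in software combined with the necessary hardware. Therefore, the technical solutions according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive or a removable hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, a server, a mobile terminal or a network device) to execute the method according to an embodiment of the present disclosure.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present disclosure is intended to cover any variations, uses or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the appended claims.
Industrial Applicability
In the scheme provided by the present disclosure, the attributes of these three different dimensions (the geohash code contains information of two dimensions, longitude and latitude) can likewise be cross-coded to form a new code, and this code can serve as the key in a key-value database. On the one hand this ensures sufficient hashing and distributes the data fully across the encoding table; on the other hand it improves query performance. Further, the result set of the second index is deduplicated to obtain the deduplicated retrieval set, and fast deduplicated counting yields the cardinality of the retrieval set, which reduces redundant data in the query records and thereby reduces data interaction pressure and computational pressure.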

Claims (10)

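  1. A method for querying spatiotemporally correlated data, characterized by comprising:
    receiving a query request, and performing a first index according to the query request to generate query conditions, the query conditions comprising at least one of a time query condition, a spatial query condition, and an object attribute query condition;
    determining a query range in a preset spatiotemporal attribute index model according to the query conditions, and performing a second index according to the query range;
    deduplicating the result set of the second index to obtain a deduplicated retrieval set;
    determining a spatiotemporal correlation query result according to the retrieval set and the query conditions.
  2. The method for querying spatiotemporally correlated data according to claim 1, characterized in that, before receiving the query request, the method comprises:
    receiving the query request and determining the user's time data, spatial data and object attribute data contained in the query request;
    generating a filling curve of the spatial data, and deriving the time data and the object attribute data on the filling curve of the spatial data to obtain a spatiotemporal index code.
  3. The method for querying spatiotemporally correlated data according to claim 2, characterized by further comprising:
    writing the hash value of the user's identifier into the spatiotemporal index code.
  4. The method for querying spatiotemporally correlated data according to claim 2, characterized by further comprising:
    writing the spatiotemporal index code, in key-value form, into the database to be indexed.
  5. The method for querying spatiotemporally correlated data according to any one of claims 1-4, characterized in that determining the query range in the preset spatiotemporal attribute index model according to the query conditions and performing the second index according to the query range comprises:
    determining the query range according to the values in the query conditions;
    querying the spatial data using a geocoding algorithm and the query range, and storing the query results as a result set of the hashset class.
  6. The method for querying spatiotemporally correlated data according to any one of claims 1-4, characterized in that deduplicating the result set of the second index to obtain the deduplicated retrieval set comprises:
    splitting the index keys of the result set of the second index to obtain the attribute codes of the result set;
    grouping the data according to the attribute codes to obtain grouped data;
    performing deduplicated counting on the spatiotemporal index codes of the grouped data using the HyperLogLog algorithm to obtain the cardinality of the retrieval set;
    determining association relationship data according to the retrieval set, and generating a corresponding retrieval set.
  7. The method for querying spatiotemporally correlated data according to any one of claims 1-4, characterized in that performing deduplicated counting on the spatiotemporal index codes of the grouped data using the HyperLogLog algorithm comprises:
    determining the bit values of the spatiotemporal index code, and performing bucketed averaging on the bit values to determine a harmonic mean;
    performing bias correction on the grouped data according to the harmonic mean;
    deduplicating the result of the bias correction.
  8. An apparatus for querying spatiotemporally correlated data, characterized in that:
    a first indexing module is configured to receive a query request, and perform a first index according to the query request to generate query conditions, the query conditions comprising at least one of a time query condition, a spatial query condition, and an object attribute query condition;
    a second indexing module is configured to determine a query range in a preset spatiotemporal attribute index model according to the query conditions, and perform a second index according to the query range;
    a deduplication module is configured to deduplicate the result set of the second index to obtain a deduplicated retrieval set;
    a determination module is configured to determine a spatiotemporal correlation query result according to the retrieval set and the query conditions.
  9. An electronic device, characterized by comprising:
    a processor; and
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to perform the method for querying spatiotemporally correlated data according to any one of claims 1-7 by executing the executable instructions.
  10. A computer-readable storage medium on which a computer program is stored, characterized in that
    the computer program, when executed by a processor, implements the method for querying spatiotemporally correlated data according to any one of claims 1-7.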
PCT/CN2021/116775 2020-11-17 2021-09-06 时空关联数据的查询方法、装置、电子设备和存储介质 WO2022105372A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011285394.1A CN113806458A (zh) 2020-11-17 2020-11-17 时空关联数据的查询方法、装置、电子设备和存储介质
CN202011285394.1 2020-11-17

Publications (1)

Publication Number Publication Date
WO2022105372A1 true WO2022105372A1 (zh) 2022-05-27

Family

ID=78943489

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/116775 WO2022105372A1 (zh) 2020-11-17 2021-09-06 时空关联数据的查询方法、装置、电子设备和存储介质

Country Status (2)

Country Link
CN (1) CN113806458A (zh)
WO (1) WO2022105372A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114722427A (zh) * 2022-06-07 2022-07-08 腾讯科技(深圳)有限公司 联邦学习中的隐私去重方法、装置、设备及存储介质
CN116188232A (zh) * 2023-04-19 2023-05-30 北京数牍科技有限公司 一种名单查询方法、装置、设备、介质及产品
CN117909301A (zh) * 2024-03-19 2024-04-19 上海合见工业软件集团有限公司 基于索引的对象查询方法、装置、设备及介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116756139B (zh) * 2023-05-12 2024-04-23 中国自然资源航空物探遥感中心 一种数据索引方法、系统、存储介质和电子设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105426491A (zh) * 2015-11-23 2016-03-23 武汉大学 一种时空地理大数据的检索方法及系统
CN109165215A (zh) * 2018-07-27 2019-01-08 苏州视锐信息科技有限公司 一种云环境下时空索引的构建方法、装置及电子设备
CN110347680A (zh) * 2019-06-21 2019-10-18 北京航空航天大学 一种面向云际环境的时空数据索引方法
CN111782742A (zh) * 2020-06-06 2020-10-16 中国科学院电子学研究所苏州研究院 一种面向大规模地理空间数据的存储和检索方法及其系统

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114722427A (zh) * 2022-06-07 2022-07-08 腾讯科技(深圳)有限公司 联邦学习中的隐私去重方法、装置、设备及存储介质
CN116188232A (zh) * 2023-04-19 2023-05-30 北京数牍科技有限公司 一种名单查询方法、装置、设备、介质及产品
CN117909301A (zh) * 2024-03-19 2024-04-19 上海合见工业软件集团有限公司 基于索引的对象查询方法、装置、设备及介质
CN117909301B (zh) * 2024-03-19 2024-06-07 上海合见工业软件集团有限公司 基于索引的对象查询方法、装置、设备及介质

Also Published As

Publication number Publication date
CN113806458A (zh) 2021-12-17

Similar Documents

Publication Publication Date Title
WO2022105372A1 (zh) 时空关联数据的查询方法、装置、电子设备和存储介质
US9189520B2 (en) Methods and systems for one dimensional heterogeneous histograms
US9519687B2 (en) Minimizing index maintenance costs for database storage regions using hybrid zone maps and indices
US11734234B1 (en) Data architecture for supporting multiple search models
WO2015096582A1 (zh) 一种时空数据的索引建立方法、查询方法、装置及设备
US8700605B1 (en) Estimating rows returned by recursive queries using fanout
US10936551B1 (en) Aggregating alternate data stream metrics for file systems
US10936538B1 (en) Fair sampling of alternate data stream metrics for file systems
US10311045B2 (en) Aggregation/evaluation of heterogenic time series data
Li et al. Pyro: A {Spatial-Temporal}{Big-Data} Storage System
WO2018097846A1 (en) Edge store designs for graph databases
US20190197175A1 (en) Progressive optimization for implicit cast predicates
Zhao et al. Multiple nested schema of HBase for migration from SQL
CN116126942B (zh) 一种多维空间气象网格数据分布式存储查询方法
US20230385353A1 (en) Spatial search using key-value store
CN110720097A (zh) 图数据库中元组和边的功能性等价
US11520763B2 (en) Automated optimization for in-memory data structures of column store databases
He et al. Spatial query processing for location based application on Hbase
de Bernardo Roca New data structures and algorithms for the efficient management of large spatial datasets
Carter et al. Nanosecond indexing of graph data with hash maps and VLists
Xie et al. Silverback: Scalable association mining for temporal data in columnar probabilistic databases
KR102233944B1 (ko) 데이터베이스 관리를 위한 컴퓨터 프로그램
Zeng et al. PA‐LBF: Prefix‐Based and Adaptive Learned Bloom Filter for Spatial Data
Mathew et al. Novel research framework on SN's NoSQL databases for efficient query processing
CN115795180B (zh) 一种基于社交网络分析用户社交关系的轻量级方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893523

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 070923)

122 Ep: pct application non-entry in european phase

Ref document number: 21893523

Country of ref document: EP

Kind code of ref document: A1