CN117331919A - Database joint query method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN117331919A
Authority
CN
China
Prior art keywords
result set
query
original
target
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311206063.8A
Other languages
Chinese (zh)
Other versions
CN117331919B (en)
Inventor
罗拉全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Primitive Data Beijing Information Technology Co ltd
Original Assignee
Primitive Data Beijing Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Primitive Data Beijing Information Technology Co ltd
Priority to CN202311206063.8A (patent CN117331919B)
Priority claimed from CN202311206063.8A (patent CN117331919B)
Publication of CN117331919A
Application granted
Publication of CN117331919B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G06F16/2445 Data retrieval commands; View definitions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24554 Unary operations; Data partitioning operations
    • G06F16/24556 Aggregation; Duplicate elimination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/248 Presentation of query results
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application provide a database joint query method and device, electronic equipment and a storage medium, and belong to the technical field of databases. The method comprises the following steps: acquiring an original query result set of a target item, wherein the original query result set comprises query content data of a target object acquired according to a preset query type; performing content repetition detection according to the number of objects of the target object and the query content data to obtain the content repetition rate of the original query result set; comparing the content repetition rate with a preset repetition threshold to obtain a comparison result; performing deduplication processing on the original query result set according to the comparison result to obtain a target query result set; merging the target query result sets to obtain an original merged result set; and performing deduplication processing on the original merged result set to obtain a target merged result set. The embodiments of the application can improve the overall execution efficiency of the joint query.

Description

Database joint query method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of database technologies, and in particular, to a method and apparatus for joint query of a database, an electronic device, and a storage medium.
Background
In the prior art, the joint query is a common type of query operation in database usage. A joint query is typically used to merge the result sets of two or more SELECT statements. If the merged result set contains duplicate records, the operator filters them out by default, that is, a deduplication operation is performed on the final result set.
In the deduplication process of a joint query, whether a hash aggregation method or a sorting method is adopted, all data must be scanned in memory, and if the merged result set is large, the size of the hash table or sorting structure may exceed the available memory limit. To avoid exceeding the memory limit, prior art schemes generally spill the data to disk in groups: the data is written into a plurality of files, and the files are processed one at a time so that memory remains sufficient. However, spilling data to disk is a relatively time-consuming operation and reduces overall execution efficiency. Therefore, how to provide a database joint query method that avoids the loss of overall execution efficiency caused by disk spills when the merged results of multiple SELECT statements are too large is a technical problem to be solved.
Disclosure of Invention
The main purpose of the embodiments of the present application is to provide a method, an apparatus, an electronic device, and a storage medium for joint query of a database, which aim to improve the overall execution efficiency of joint query.
To achieve the above object, a first aspect of an embodiment of the present application provides a method for joint query of databases, where the method includes:
acquiring an original query result set of a target item; the original query result set comprises query content data of a target object acquired according to a preset query type;
performing content repetition detection according to the number of objects of the original query result set and the query content data to obtain the content repetition rate of the original query result set;
comparing the content repetition rate with a preset repetition threshold to obtain a comparison result;
performing deduplication processing on the original query result set according to the comparison result to obtain a target query result set;
merging the target query result sets to obtain an original merged result set;
and performing deduplication processing on the original merged result set to obtain a target merged result set.
In some embodiments, the performing deduplication processing on the original query result set according to the comparison result to obtain a target query result set includes:
If the comparison result shows that the content repetition rate is larger than the repetition threshold, performing de-duplication processing on the original query result set to obtain a target query result set;
and if the comparison result shows that the content repetition rate is smaller than or equal to the repetition threshold, taking the original query result set as the target query result set.
In some embodiments, the detecting content repetition according to the number of objects of the original query result set and the query content data to obtain a content repetition rate of the original query result set includes:
taking the query content data of the same target object as a data group, and acquiring the number of the data groups meeting preset conditions to obtain the group number; wherein the preset condition includes that the query content data is not all empty and the query content data is not repeated;
and calculating the repetition rate according to the group number and the object number to obtain the content repetition rate of the original query result set.
In some embodiments, the calculating the repetition rate according to the group number and the object number to obtain the content repetition rate of the original query result set includes:
Calculating the ratio of the group number to the object number to obtain a unique data proportion;
and calculating a difference value according to the preset quantity and the unique data proportion to obtain the content repetition rate.
In some embodiments, the repetition threshold includes a first threshold or a second threshold, and before the comparing the content repetition rate with a preset repetition threshold to obtain a comparison result, the method further includes constructing the repetition threshold, specifically including:
obtaining duplication elimination requirement data;
if the de-duplication requirement data represents that de-duplication operation is performed, constructing the first threshold; wherein the first threshold is a negative number;
if the de-duplication requirement data indicates that the de-duplication operation is not performed, constructing the second threshold; wherein the second threshold is a number greater than 1.
In some embodiments, the target item includes target object data, and the obtaining the original query result set of the target item includes:
determining a target object from a preset object database according to the target object data;
and extracting the content according to the preset query type and the target object to obtain the original query result set.
In some embodiments, the merging the target query result set to obtain an original merged result set includes:
merging the preset query types of the same type to obtain a common query type;
merging the target query result sets according to the common query type to obtain the original merged result set.
To achieve the above object, a second aspect of the embodiments of the present application proposes a database joint query apparatus, including:
the data acquisition module is used for acquiring an original query result set of the target item; the original query result set comprises query content data of a target object acquired according to a preset query type;
the repetition rate calculation module is used for carrying out content repetition detection according to the number of objects of the original query result set and the query content data to obtain the content repetition rate of the original query result set;
the repetition rate comparison module is used for comparing the content repetition rate with a preset repetition threshold value to obtain a comparison result;
the first deduplication module is used for performing deduplication processing on the original query result set according to the comparison result to obtain a target query result set;
the merging module is used for merging the target query result set to obtain an original merged result set;
the second deduplication module is used for performing deduplication processing on the original merged result set to obtain a target merged result set.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, which includes a memory and a processor, the memory storing a computer program, the processor implementing the method according to the first aspect when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method of the first aspect.
According to the database joint query method, the device, the electronic equipment and the storage medium, the content repetition rate of the original query result set of each target item in the joint query is detected, and if the content repetition rate exceeds the preset repetition threshold, the original query result set is deduplicated to obtain the target query result set. The target query result sets are then merged to obtain an original merged result set, which is finally deduplicated to obtain the target merged result set. By adaptively selecting which original query result sets to deduplicate, the method filters out a large amount of invalid data in advance, so that less data remains for the final deduplication, the probability of spilling data to disk is reduced, and the overall execution efficiency is improved.
Drawings
FIG. 1 is a flowchart of a database joint query method provided in an embodiment of the present application;
FIG. 2 is a flowchart of step S101 in FIG. 1;
FIG. 3 is a flowchart of step S102 in FIG. 1;
FIG. 4 is a flowchart of step S303 in FIG. 3;
FIG. 5 is a flowchart before step S103 in FIG. 1;
FIG. 6 is a flowchart of step S104 in FIG. 1;
FIG. 7 is a flowchart of step S105 in FIG. 1;
FIG. 8 is a flowchart of one embodiment of the present application;
FIG. 9 is a schematic structural diagram of a database joint query device provided in an embodiment of the present application;
FIG. 10 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several terms used in this application are explained:
A database is a collection that organizes and stores large amounts of data. It is a structured data storage system that can be used to store, manage and retrieve data. Databases provide an efficient way to organize and access data to meet different types of application and business requirements.
Structured Query Language (SQL) is a database query and programming language used to access data and to query, update and manage relational database systems. SQL statements are the statements through which a database is operated.
A joint query is an SQL statement that, when executed in a database, combines the result sets of multiple similar SELECT queries. The keywords used are UNION or UNION ALL. When a UNION joint query statement is executed, deduplication is performed by default on the merged results. However, in the deduplication process, whether a hash aggregation method or a sorting method is adopted, all data must be scanned in memory, and if the merged result set is relatively large, the size of the hash table or sorting structure may exceed the available memory limit. To avoid exceeding the memory limit, prior art schemes generally spill the data to disk in groups: the data is written into a plurality of files, and the files are processed one at a time so that memory remains sufficient. However, spilling data to disk is a relatively time-consuming operation and reduces overall execution efficiency.
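The difference between UNION and UNION ALL described above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than the database of the application, and the tables `course` and `course1` and their rows are hypothetical sample data chosen only for illustration.

```python
import sqlite3

# In-memory database with two hypothetical tables that share one row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE course  (course_number INTEGER, course_name TEXT);
    CREATE TABLE course1 (course_number INTEGER, course_name TEXT);
    INSERT INTO course  VALUES (1, 'Databases'), (2, 'Networks');
    INSERT INTO course1 VALUES (2, 'Networks'), (3, 'Compilers');
""")

query = (
    "SELECT course_number, course_name FROM course "
    "{op} "
    "SELECT course_number, course_name FROM course1"
)

# UNION removes duplicate rows from the merged result set.
union_rows = conn.execute(query.format(op="UNION")).fetchall()
# UNION ALL keeps every row, including the duplicate (2, 'Networks').
union_all_rows = conn.execute(query.format(op="UNION ALL")).fetchall()

print(len(union_rows))      # 3
print(len(union_all_rows))  # 4
conn.close()
```

The extra work that UNION performs compared with UNION ALL is the deduplication pass over the merged rows, which is the step whose memory use the application aims to reduce.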
Based on the above, the embodiment of the application provides a method, a device, an electronic device and a storage medium for joint query of a database, which aim to improve the overall execution efficiency of joint query.
The method, the device, the electronic equipment and the storage medium for the database joint query provided by the embodiment of the application are specifically described through the following embodiments, and the method for the database joint query in the embodiment of the application is described first.
The embodiment of the application provides a database joint query method, and relates to the technical field of databases. The database joint query method provided by the embodiment of the application can be applied to a terminal, to a server side, or to software running in the terminal or the server side. In some embodiments, the terminal may be a smart phone, tablet, notebook, desktop computer, etc.; the server side may be configured as an independent physical server, as a server cluster or distributed system formed by a plurality of physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data and artificial intelligence platforms; the software may be an application or the like that implements the database joint query method, but is not limited to the above.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is an optional flowchart of a database joint query method provided in an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S106.
Step S101, obtaining an original query result set of a target item; the original query result set comprises query content data of a target object acquired according to a preset query type;
Step S102, performing content repetition detection according to the number of objects of the original query result set and the query content data to obtain the content repetition rate of the original query result set;
Step S103, comparing the content repetition rate with a preset repetition threshold to obtain a comparison result;
Step S104, performing deduplication processing on the original query result set according to the comparison result to obtain a target query result set;
Step S105, merging the target query result sets to obtain an original merged result set;
Step S106, performing deduplication processing on the original merged result set to obtain a target merged result set.
In steps S101 to S106 illustrated in the embodiments of the present application, the content repetition rate of the original query result set of each target item in the joint query is detected, and if the content repetition rate exceeds the preset repetition threshold, the original query result set is deduplicated to obtain the target query result set. The target query result sets are then merged to obtain an original merged result set, which is finally deduplicated to obtain the target merged result set. By adaptively selecting which original query result sets to deduplicate, the method filters out a large amount of invalid data in advance, so that less data remains for the final deduplication, the probability of spilling data to disk is reduced, and the overall execution efficiency is improved.
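The flow of steps S101 to S106 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function names `repetition_rate`, `dedup` and `joint_query` are hypothetical, each sub-query result set is modelled as an in-memory list of tuples with `None` standing for NULL, and the repetition rate is computed by scanning the rows, whereas the description obtains the row count and group count from database statistics.

```python
from typing import Hashable, Iterable, List, Sequence, Tuple

Row = Tuple[Hashable, ...]


def repetition_rate(rows: Sequence[Row]) -> float:
    """Return 1 minus the share of rows that are distinct and not entirely NULL."""
    if not rows:
        return 0.0  # assumption: an empty result set has nothing to deduplicate
    groups = {row for row in rows if any(value is not None for value in row)}
    return 1.0 - len(groups) / len(rows)


def dedup(rows: Iterable[Row]) -> List[Row]:
    """Remove duplicate rows while keeping first-seen order."""
    return list(dict.fromkeys(rows))


def joint_query(result_sets: Sequence[Sequence[Row]], threshold: float) -> List[Row]:
    merged: List[Row] = []
    for rows in result_sets:              # S101: one original result set per sub-query
        rate = repetition_rate(rows)      # S102: content repetition rate
        if rate > threshold:              # S103 and S104: deduplicate only when worthwhile
            rows = dedup(rows)
        merged.extend(rows)               # S105: merge the target result sets
    return dedup(merged)                  # S106: final deduplication


first = [(1, "x"), (1, "x"), (1, "x"), (2, "y")]   # repetition rate 0.5
second = [(2, "y"), (3, "z")]                      # repetition rate 0.0
print(joint_query([first, second], threshold=0.3))
```

In this example only the first result set exceeds the threshold and is deduplicated before merging, so the final deduplication sees four rows instead of six.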
In step S101 of some embodiments, an original query result set of a target item in a joint query is obtained. The original query result set comprises query content data of the target object acquired according to a preset query type. The target item may be a sub-query in the joint query, and the original query result set may be the set of query data of that sub-query. The preset query type may be a field (or "column") designated by the query, the target object may be object data in the table queried by the sub-query, and the query content data may be the specific data of the designated field. For example, consider the following SQL statement:
SELECT course_number, course_name FROM course
UNION
SELECT course_number, course_name FROM course1;
In the above joint query SQL statement, the preset query types may be course_number and course_name, the target item may be "SELECT course_number, course_name FROM course", the original query result set may be the query result of that SELECT statement, the target object may be a data object in the course table, and the query content data may be the course_number and course_name values obtained by "SELECT course_number, course_name FROM course".
Referring to fig. 2, in some embodiments, the target item includes target object data, and step S101 may include, but is not limited to, steps S201 to S202:
step S201, determining a target object from a preset object database according to target object data;
step S202, extracting contents according to a preset query type and a target object to obtain an original query result set.
In step S201 of some embodiments, the target object data may be a table that the user wants to specify. And determining the target object from a preset object database through the target object data. Wherein the object database is a user-specified database.
In step S202 of some embodiments, according to a preset query type proposed by a user, data is extracted from a target object, so as to obtain an original query result set.
For example, to obtain student name information data in a student database, there will be the following SQL statement:
USE database_student;
SELECT Sno,Sname FROM student;
In the above SQL statements, the preset object database may be the database_student database, the target object data may be the student table, the preset query types may be Sno and Sname, and the target object may be the object data in the student table. The original query result set may be the query data returned after executing the SQL statement "SELECT Sno,Sname FROM student".
In step S102 of some embodiments, content repetition detection is performed according to the number of objects of the original query result set and the query content data, so as to obtain the content repetition rate of the original query result set.
Referring to fig. 3, in some embodiments, step S102 may include, but is not limited to, steps S301 to S302:
Step S301, taking the query content data of the same target object as a data group, and obtaining the number of data groups meeting preset conditions to obtain the group number; the preset conditions include that the query content data is not all empty and that the query content data is not repeated;
step S302, calculating the repetition rate according to the number of groups and the number of objects to obtain the content repetition rate of the original query result set.
In step S301 of some embodiments, the number of objects of the target object may be the total number of rows of a table in the database. Some SQL dialects allow this to be queried directly, for example with SHOW TABLE STATUS or the COUNT function; in the PostgreSQL database, the total number of rows can be obtained from the reltuples field of the system table pg_class.
In step S302 of some embodiments, a data group refers to the query content data of the same target object obtained according to the preset query types, and may be a row of a table in the database. After the number of objects of the target object is obtained, the query content data of the same target object is taken as a data group, and the number of data groups meeting the preset conditions is counted to obtain the group number. The preset conditions may be that the query content data is not all empty and that the query content data is not repeated. "Not all empty" means that the query content of a data group cannot be entirely empty, although individual values may be empty (null). "Not repeated" means that the query content data of a data group, taken as a whole, cannot be identical to the query content data of another data group. For example, a Customer table has three attributes, name, age and telephone, which are the preset query types, as shown in Table 1 below:
name      age   telephone
Alice     29    147-225
Jack      18    123-999
Linda     31    149-333
Caroline  26    Null
Null      Null  Null
Jack      18    123-999
Table 1
The number of data groups satisfying the preset conditions in Table 1 is 4. The query content data of the data group in row 5 is entirely Null, so it does not satisfy the preset conditions. The data group in row 6 is identical to the data group in row 2 and is therefore repeated data, so it does not satisfy the preset conditions either; rows 1 to 4 are the 4 data groups counted.
It should be noted that, in some databases, the number of data groups satisfying the above preset conditions can be obtained by querying the system tables of the database. For example, the PostgreSQL database has a system table pg_statistic_ext, in which the data item stxndistinct holds the number of distinct values of multi-column combinations. For multi-column statistics created on the table student (sno, sname, ssex), covering {sno, sname}, {sno, ssex}, {sname, ssex} and {sno, sname, ssex}, the value held in stxndistinct is {"1,2":7, "1,3":7, "2,3":7, "1,2,3":7}, where "1,2,3":7 indicates that the number of rows of {sno, sname, ssex} after deduplication is 7. Other databases require some processing to obtain the group number, so the manner of obtaining the group number is not limited here.
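The counting rule for data groups can be checked against Table 1 with a short sketch. The function name `count_groups` is hypothetical, and the rows are held in memory with `None` standing for Null; a database would normally take this figure from its statistics, as noted above.

```python
from typing import Optional, Sequence, Tuple

Row = Tuple[Optional[object], ...]

# Rows of Table 1; None stands for SQL Null.
TABLE_1 = [
    ("Alice", 29, "147-225"),
    ("Jack", 18, "123-999"),
    ("Linda", 31, "149-333"),
    ("Caroline", 26, None),
    (None, None, None),
    ("Jack", 18, "123-999"),
]


def count_groups(rows: Sequence[Row]) -> int:
    """Count rows that are not entirely Null, counting identical rows once."""
    distinct = set()
    for row in rows:
        if all(value is None for value in row):
            continue  # entirely empty data group: excluded
        distinct.add(row)  # a repeated data group collapses into one entry
    return len(distinct)


print(count_groups(TABLE_1))  # 4
```

The row for Caroline is still counted because only one of its values is Null.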
In step S303 of some embodiments, a repetition rate calculation is performed according to the number of groups and the number of objects to obtain a content repetition rate of the original query result set.
Referring to fig. 4, in some embodiments, step S303 may include, but is not limited to, steps S401 to S402:
step S401, calculating the ratio of the group number to the object number to obtain a unique data proportion;
step S402, performing difference calculation according to the preset quantity and the unique data proportion to obtain the content repetition rate.
In step S401 of some embodiments, the number of groups is divided by the number of objects to obtain their ratio, i.e., the unique data ratio.
In step S402 of some embodiments, the content repetition rate is obtained by subtracting the unique data proportion from the preset number (i.e. 1). As follows from the explanation of step S302, the content repetition rate does not only represent the proportion of repeated data in the original query result set; it also includes the proportion of entirely empty (null) data in the original query result set.
Through steps S301 to S303, the content repetition rate of the original query result set can be obtained. The content repetition rate gives an approximate measure of the invalid and repeated data in the original query result set, which makes it convenient to determine whether to perform a deduplication operation on the original query result set.
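As a worked illustration of steps S401 and S402, the sketch below applies the calculation to the figures of Table 1 (4 data groups, 6 rows). The function name `content_repetition_rate` and the handling of an empty table are assumptions made for this example.

```python
def content_repetition_rate(group_count: int, object_count: int) -> float:
    """Repetition rate = 1 - (group number / object number)."""
    if object_count <= 0:
        return 0.0  # assumption: an empty table is treated as having no repetition
    unique_ratio = group_count / object_count  # S401: unique data proportion
    return 1.0 - unique_ratio                  # S402: difference from the preset number 1


rate = content_repetition_rate(group_count=4, object_count=6)
print(round(rate, 3))  # 0.333
```

For Table 1, one third of the rows are either repeated or entirely empty, so the rate is about 0.333.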
Referring to fig. 5, prior to step S103 of some embodiments, the database joint query method further includes constructing the repetition threshold, including but not limited to steps S501 to S503:
step S501, obtaining duplication elimination requirement data;
step S502, if the de-duplication requirement data indicates that de-duplication operation is performed, a first threshold is constructed; wherein the first threshold is a negative number;
step S503, if the de-duplication requirement data indicates that the de-duplication operation is not performed, a second threshold is constructed; wherein the second threshold is a number greater than 1.
In step S501 of some embodiments, the deduplication requirement data represents the user's requirement for customizing the repetition threshold.
In step S502 of some embodiments, if the deduplication requirement data given by the user indicates that the deduplication operation must be performed on the original query result set, a first threshold, which is a negative number, is constructed and set as the repetition threshold.
In step S503 of some embodiments, if the deduplication requirement data given by the user indicates that the deduplication operation does not need to be performed on the original query result set, a second threshold, which is a number greater than 1, is constructed and set as the repetition threshold.
Through steps S501 to S503, the deduplication requirement data shows whether the user requires the deduplication operation on the original query result set to be forced or skipped. For some special scenarios, this allows the execution of the deduplication operation in a joint query to be adjusted flexibly through the user-set repetition threshold. It will be appreciated that the repetition threshold may be stored in the database environment. For example, in a PostgreSQL database, a newly added GUC (Grand Unified Configuration) parameter may be stored in the postgresql.conf file, and the parameters in that file are read into global variables after the database system starts. It will be appreciated that in other databases the repetition threshold may likewise be set through an environment parameter of the database.
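Steps S501 to S503 can be sketched as a small helper. The names `build_repetition_threshold`, `force_dedup` and `default`, and the concrete values -1.0, 2.0 and 0.5, are illustrative assumptions; the text only requires the first threshold to be negative and the second to be greater than 1.

```python
from typing import Optional


def build_repetition_threshold(force_dedup: Optional[bool],
                               default: float = 0.5) -> float:
    """Map deduplication requirement data to a repetition threshold.

    force_dedup=True  -> first threshold (negative): always deduplicate.
    force_dedup=False -> second threshold (> 1): never deduplicate.
    force_dedup=None  -> hypothetical user-chosen value within [0, 1].
    """
    if force_dedup is True:
        return -1.0  # any negative number satisfies step S502
    if force_dedup is False:
        return 2.0   # any number greater than 1 satisfies step S503
    return default


always = build_repetition_threshold(True)
never = build_repetition_threshold(False)
adaptive = build_repetition_threshold(None, default=0.3)
```

In a real database the resulting value would be held in a configuration parameter rather than returned from a function; the sketch only shows how the two special thresholds are chosen.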
In step S103 of some embodiments, the content repetition rate calculated in S102 is compared with a preset repetition threshold, so as to obtain a comparison result.
In step S104 of some embodiments, it is determined whether the original query result set needs to be subjected to deduplication processing according to the obtained comparison result, so as to obtain a target query result set.
Referring to fig. 6, in some embodiments, step S104 may include, but is not limited to, steps S601 to S602:
step S601, if the comparison result indicates that the content repetition rate is greater than the repetition threshold, performing deduplication processing on the original query result set to obtain the target query result set;
step S602, if the comparison result indicates that the content repetition rate is less than or equal to the repetition threshold, taking the original query result set as the target query result set.
In step S601 of some embodiments, if the comparison result indicates that the content repetition rate is greater than the repetition threshold, the original query result set contains more invalid repeated data than the user expects, so deduplication processing is performed on the original query result set to obtain the target query result set.
In step S602 of some embodiments, if the comparison result indicates that the content repetition rate is less than or equal to the repetition threshold, the original query result set is directly used as the target query result set without performing deduplication processing on the original query result set.
It should be noted that, under normal conditions, the calculated content repetition rate lies in the range [0, 1]. Therefore, in steps S501 to S503 above, if the deduplication requirement data indicates that the deduplication operation must be performed on the original query result set, a first threshold, which is a negative number, is constructed. When the first threshold is set as the repetition threshold and compared with the content repetition rate, the comparison result necessarily indicates that the content repetition rate is greater than the repetition threshold, because the repetition threshold is negative, so the original query result set is always deduplicated. If the deduplication requirement data indicates that the deduplication operation does not need to be performed on the original query result set, a second threshold, which is a value greater than 1, is constructed. When the second threshold is set as the repetition threshold and compared with the content repetition rate, the comparison result necessarily indicates that the content repetition rate is smaller than the repetition threshold, because the repetition threshold is greater than 1, so the original query result set is never deduplicated. Under normal conditions, the user can instead set the repetition threshold within [0, 1] according to actual requirements, so that whether the deduplication operation is performed depends on the content repetition rate of the original query result set.
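The effect of the two special thresholds on the comparison in steps S601 and S602 can be checked with a short sketch. The function name `should_deduplicate` and the sample values are assumptions chosen for illustration.

```python
def should_deduplicate(repetition_rate: float, threshold: float) -> bool:
    """Step S601/S602: deduplicate only if the rate exceeds the threshold."""
    return repetition_rate > threshold


# Content repetition rates always fall within [0, 1].
rates = [0.0, 0.3, 1.0]

# A negative threshold forces deduplication for every possible rate.
forced = [should_deduplicate(r, -1.0) for r in rates]

# A threshold greater than 1 skips deduplication for every possible rate.
skipped = [should_deduplicate(r, 2.0) for r in rates]

# A threshold inside [0, 1] makes the decision depend on the data.
adaptive = [should_deduplicate(r, 0.3) for r in rates]
```

With the threshold at 0.3, a rate of exactly 0.3 does not trigger deduplication, because step S602 covers the "less than or equal to" case.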
It will be appreciated that there are many ways to perform the deduplication processing, for example hash aggregation (HashAgg) or sort aggregation (SortAgg). The deduplication aggregation methods used by different databases are not exactly the same, so the specific deduplication method is not limited herein.
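The difference between the two families of deduplication can be illustrated in a few lines. This is a simplified in-memory sketch, not how any particular database engine implements its aggregation operators; the function names `dedup_hash` and `dedup_sort` are hypothetical, and the sort-based variant assumes that rows are mutually comparable (no null values mixed with other types).

```python
from itertools import groupby
from typing import Hashable, List, Sequence, Tuple

Row = Tuple[Hashable, ...]


def dedup_hash(rows: Sequence[Row]) -> List[Row]:
    """Hash-based deduplication: keep the first occurrence of each row."""
    seen = set()
    result = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


def dedup_sort(rows: Sequence[Row]) -> List[Row]:
    """Sort-based deduplication: sort, then collapse runs of equal rows."""
    return [key for key, _ in groupby(sorted(rows))]


rows = [(2, "b"), (1, "a"), (2, "b"), (1, "a"), (3, "c")]
by_hash = dedup_hash(rows)
by_sort = dedup_sort(rows)
```

Both functions return the same set of rows. The hash-based version preserves input order and needs memory proportional to the number of distinct rows, while the sort-based version returns rows in sorted order.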
Through steps S601 to S602, the comparison result shows whether the content repetition rate of the original query result set exceeds the preset repetition threshold, which determines whether the deduplication operation is performed on the original query result set to obtain the target query result set. In the joint query, the deduplication operation is performed on the original query result set of a target item when its content repetition rate exceeds the repetition threshold, so that a large amount of invalid data is filtered out in advance, the amount of data handled by the final deduplication is reduced, the probability of data spilling to disk is reduced, and the overall execution efficiency is improved.
In step S105 of some embodiments, the target query result sets obtained for the plurality of target items in step S104 are merged to obtain an original merged result set.
Referring to fig. 7, in some embodiments, step S105 may include, but is not limited to, steps S701 to S702:
Step S701, merging preset query types with the same type to obtain a common query type;
step S702, merging the target query result sets according to the common query type to obtain an original merged result;
In step S701 of some embodiments, the preset query types that are of the same type are merged to obtain a common query type that can be used by all of the target query result sets. It is understood that "of the same type" means that, in the joint query, the names of the preset query types may differ, but the number of preset query types must be the same, and the data types specified by the preset query types and the order of the preset query types must also be the same. In database terms, the number of queried fields must be consistent, and the types and order of the fields must be the same, where "the same field type" means that the data types are identical or can be automatically converted into the same data type.
In step S702 of some embodiments, after the common query type is obtained, the data of the target query result set is aligned to the common query type and combined, so as to obtain an original combined result.
For example, in the following SQL statement:
SELECT emp_id AS "number", emp_name AS "name", dept_id AS "department or class" FROM emp UNION
SELECT stu_id AS "number", stu_name AS "name", class_id AS "department or class" FROM student;
In the above example, the names of the preset query types of the two target items differ: the preset query types of the first target item are emp_id, emp_name and dept_id, and those of the second target item are stu_id, stu_name and class_id. However, emp_id and stu_id have the same data type and both represent a number, and in both target items the preset query types for number, name, and department or class are set in the same order, so the common query type can be number, name, and department or class. The target query result sets obtained by processing the two target items can therefore be combined according to the common query type to obtain the original combined result set.
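The UNION example can be tried out with Python's built-in sqlite3 module. The table definitions, column types and sample rows below are made up for illustration and are not taken from the application; the column aliases are quoted because one of them contains spaces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp (emp_id INTEGER, emp_name TEXT, dept_id INTEGER);
CREATE TABLE student (stu_id INTEGER, stu_name TEXT, class_id INTEGER);
INSERT INTO emp VALUES (1, 'Li', 10), (1, 'Li', 10), (2, 'Wang', 20);
INSERT INTO student VALUES (1, 'Li', 10), (3, 'Zhao', 30);
""")

# Both branches select three columns with matching types and order,
# so they share one common set of output columns.
cursor = conn.execute("""
SELECT emp_id AS "number", emp_name AS "name",
       dept_id AS "department or class" FROM emp
UNION
SELECT stu_id AS "number", stu_name AS "name",
       class_id AS "department or class" FROM student
""")
columns = [description[0] for description in cursor.description]
merged = sorted(cursor.fetchall())
conn.close()
```

The result columns take the common names, and UNION removes the rows that are duplicated within emp as well as the row that appears in both tables, leaving three rows.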
In step S106 of some embodiments, a final deduplication process is performed on the original combined result set, resulting in a target combined result set. Wherein the target set of merged results represents the final result of the federated query.
Referring to fig. 8, in a specific embodiment of a joint query, content repetition detection is performed on the original query result set of target item 1 to obtain a content repetition rate, which is compared with the repetition threshold. If the content repetition rate exceeds the repetition threshold, a first deduplication process is performed to obtain the target query result set; if it does not, the deduplication process is skipped and the original query result set is used as the target query result set. The target query result set of target item 1 is then combined with the target query result sets obtained in the same way for the other target items up to target item n, to obtain an original combined result set, and a final deduplication process is performed to obtain the target combined result set, thereby completing one joint query.
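The overall flow of fig. 8 can be sketched end to end as follows. All names (`repetition_rate`, `dedup`, `joint_query`) and the in-memory list representation are assumptions made for illustration; a database engine would perform these steps inside its executor rather than on Python lists.

```python
from typing import Hashable, List, Optional, Sequence, Tuple

Row = Tuple[Optional[Hashable], ...]


def repetition_rate(rows: Sequence[Row]) -> float:
    """1 - (distinct non-all-null groups / number of objects)."""
    if not rows:
        return 0.0  # assumption: nothing to deduplicate in an empty set
    groups = {row for row in rows if any(v is not None for v in row)}
    return 1.0 - len(groups) / len(rows)


def dedup(rows: Sequence[Row]) -> List[Row]:
    """Remove duplicate rows, keeping the first occurrence of each."""
    return list(dict.fromkeys(rows))


def joint_query(result_sets: Sequence[Sequence[Row]],
                threshold: float) -> List[Row]:
    """Pre-deduplicate each result set if worthwhile, merge, then deduplicate."""
    merged: List[Row] = []
    for rows in result_sets:
        if repetition_rate(rows) > threshold:
            merged.extend(dedup(rows))   # first deduplication
        else:
            merged.extend(rows)          # deduplication skipped
    return dedup(merged)                 # final deduplication


item_1 = [(1, "a")] * 4 + [(2, "b")]     # rate 0.6: deduplicated early
item_n = [(2, "b"), (3, "c")]            # rate 0.0: passed through as is
result = joint_query([item_1, item_n], threshold=0.5)
```

In this example the early deduplication shrinks item 1 from five rows to two, so the final deduplication processes four rows instead of seven.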
In the embodiments of the application, in the joint query, the deduplication operation is performed on the original query result set of a target item when its content repetition rate exceeds the repetition threshold, so that a large amount of invalid data is filtered out in advance, the amount of data handled by the final deduplication is reduced, the probability of data spilling to disk is reduced, and the overall execution efficiency is improved.
Referring to fig. 9, an embodiment of the present application further provides a database joint query device, which may implement the database joint query method, where the device includes:
The data acquisition module is used for acquiring an original query result set of the target item; the original query result set comprises query content data of a target object acquired according to a preset query type;
the repetition rate calculation module is used for carrying out content repetition detection according to the number of objects of the original query result set and the query content data to obtain the content repetition rate of the original query result set;
the repetition rate comparison module is used for comparing the content repetition rate with a preset repetition threshold value to obtain a comparison result;
the first deduplication module is used for performing deduplication processing on the original query result set according to the comparison result to obtain a target query result set;
the merging module is used for merging the target query result set to obtain an original merged result set;
the second deduplication module is used for performing deduplication processing on the original merged result set to obtain a target merged result set.
The specific implementation of the database joint query device is basically the same as that of the embodiments of the database joint query method described above and is not repeated here.
The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the database joint query method when executing the computer program. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.
Referring to fig. 10, fig. 10 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 1001 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 1002 may be implemented as a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 1002 may store an operating system and other application programs. When the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program code is stored in the memory 1002 and is invoked by the processor 1001 to execute the database joint query method of the embodiments of the present application;
an input/output interface 1003 for implementing information input and output;
the communication interface 1004 is configured to implement communication interaction between the present device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.), or may implement communication in a wireless manner (e.g. mobile network, WIFI, bluetooth, etc.);
A bus 1005 for transferring information between the various components of the device (e.g., the processor 1001, memory 1002, input/output interface 1003, and communication interface 1004);
wherein the processor 1001, the memory 1002, the input/output interface 1003, and the communication interface 1004 realize communication connection between each other inside the device through the bus 1005.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the database joint query method when being executed by a processor.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
According to the database joint query method and device, the electronic equipment and the storage medium provided by the embodiments of the application, content repetition detection is performed on the original query result set of each target item in the joint query, and if the content repetition rate exceeds the preset repetition threshold, the original query result set is deduplicated to obtain the target query result set. The target query result sets are merged to obtain an original merged result set, which is finally deduplicated to obtain the target merged result set. By adaptively choosing whether to deduplicate the original query result set of each target item, the application can filter out a large amount of invalid data in advance, which reduces the amount of data handled by the final deduplication, reduces the probability of data spilling to disk, and improves the overall execution efficiency.
The embodiments described herein are intended to describe the technical solutions of the embodiments of the present application more clearly and do not limit those technical solutions. Those skilled in the art will appreciate that, as technology evolves and new application scenarios emerge, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not constitute limitations of the embodiments of the present application, and may include more or fewer steps than shown, or may combine certain steps, or different steps.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A is present, only B is present, or both A and B are present, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following" or a similar expression means any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b or c may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may each be single or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division into units is merely a division by logical function, and other divisions are possible in an actual implementation; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices or units, and may be electrical, mechanical or in another form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the technical solution of the present application may be embodied in the form of a software product that is stored in a storage medium and includes multiple instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes various media capable of storing a program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Preferred embodiments of the present application are described above with reference to the accompanying drawings; this description does not limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method for joint query of databases, the method comprising:
acquiring an original query result set of target items; the original query result set comprises query content data of a target object acquired according to a preset query type;
performing content repetition detection according to the number of objects of the original query result set and the query content data to obtain the content repetition rate of the original query result set;
comparing the content repetition rate with a preset repetition threshold value to obtain a comparison result;
performing de-duplication treatment on the original query result set according to the comparison result to obtain a target query result set;
combining the target query result set to obtain an original combined result set;
and performing de-duplication treatment on the original combined result set to obtain a target combined result set.
2. The method of claim 1, wherein performing a deduplication process on the original query result set according to the comparison result to obtain a target query result set, comprises:
if the comparison result shows that the content repetition rate is larger than the repetition threshold, performing de-duplication processing on the original query result set to obtain a target query result set;
and if the comparison result shows that the content repetition rate is smaller than or equal to the repetition threshold, taking the original query result set as the target query result set.
3. The method according to claim 1, wherein the performing content repetition detection according to the number of objects of the original query result set and the query content data to obtain a content repetition rate of the original query result set includes:
taking the query content data of the same target object as a data group, and acquiring the number of the data groups meeting preset conditions to obtain the group number; wherein the preset condition includes that the query content data is not all empty and the query content data is not repeated;
and calculating the repetition rate according to the group number and the object number to obtain the content repetition rate of the original query result set.
4. The method of claim 3, wherein said calculating a repetition rate based on said group number and said object number to obtain a content repetition rate of said original query result set comprises:
calculating the ratio of the group number to the object number to obtain a unique data proportion;
and calculating a difference value according to the preset quantity and the unique data proportion to obtain the content repetition rate.
5. The method according to claim 1, wherein the repetition threshold comprises a first threshold or a second threshold, and wherein before the comparing the content repetition rate with a preset repetition threshold, the method further comprises constructing the repetition threshold, specifically comprising:
obtaining de-duplication requirement data;
if the de-duplication requirement data represents that de-duplication operation is performed, constructing the first threshold; wherein the first threshold is a negative number;
if the de-duplication requirement data indicates that the de-duplication operation is not performed, constructing the second threshold; wherein the second threshold is a number greater than 1.
6. The method of claim 1, wherein the target item comprises target object data, and the obtaining the original query result set of the target item comprises:
Determining a target object from a preset object database according to the target object data;
and extracting the content according to the preset query type and the target object to obtain the original query result set.
7. The method of claim 1, wherein the merging the target query result set to obtain an original merged result set comprises:
combining the preset query types with the same type to obtain a common query type;
and merging the target query result set according to the common query type to obtain the original merged result set.
8. A database federated query apparatus, the apparatus comprising:
the data acquisition module is used for acquiring an original query result set of the target item; the original query result set comprises query content data of a target object acquired according to a preset query type;
the repetition rate calculation module is used for carrying out content repetition detection according to the number of objects of the original query result set and the query content data to obtain the content repetition rate of the original query result set;
the repetition rate comparison module is used for comparing the content repetition rate with a preset repetition threshold value to obtain a comparison result;
The first deduplication module is used for performing deduplication processing on the original query result set according to the comparison result to obtain a target query result set;
the merging module is used for merging the target query result set to obtain an original merged result set;
and the second deduplication module is used for performing deduplication processing on the original merged result set to obtain a target merged result set.
9. An electronic device comprising a memory storing a computer program and a processor implementing the database joint query method of any one of claims 1 to 7 when the computer program is executed by the processor.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the database joint query method of any one of claims 1 to 7.
CN202311206063.8A 2023-09-18 Database joint query method and device, electronic equipment and storage medium Active CN117331919B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311206063.8A CN117331919B (en) 2023-09-18 Database joint query method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117331919A true CN117331919A (en) 2024-01-02
CN117331919B CN117331919B (en) 2024-06-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant