CN111708809A - Associated query method, device and equipment based on data tilt and storage medium - Google Patents

Publication number
CN111708809A
Authority
CN
China
Prior art keywords
data
data set
query request
association
tilt
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010581205.9A
Other languages
Chinese (zh)
Other versions
CN111708809B (en)
Inventor
李慎刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN202010581205.9A
Publication of CN111708809A
Application granted
Publication of CN111708809B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of big data and discloses an association query method, apparatus, device and storage medium based on data skew, which are used to reduce the probability that an association query fails. The association query method based on data skew comprises the following steps: reading first table data and second table data and counting their data volumes; obtaining, according to the first table data and the second table data, either a first non-skewed data set, a first skewed data set, a second non-skewed data set and a second skewed data set, or a third non-skewed data set, a third skewed data set, a fourth non-skewed data set and a fourth skewed data set; and determining a first target data set or a second target data set based on the skewed data sets, the non-skewed data sets and the association query request, and transmitting the first target data set or the second target data set to the target terminal.

Description

Associated query method, device and equipment based on data tilt and storage medium
Technical Field
The invention relates to the field of big data, and in particular to an association query method, apparatus, device and storage medium based on data skew.
Background
At present, large-scale in-memory computing platforms are widely used to process large amounts of data from different applications and data sources. One example is the Spark computing platform, a fast, general-purpose computing engine designed for large-scale data processing that supports a variety of operations, including SQL queries, text processing and machine learning. The basic principle of the Spark computing engine is to divide the data into small time slices and process these small portions of data in a manner similar to batch processing.
In the prior art, when SQL queries such as left joins, full joins and equi-joins are performed on two tables with Spark and data skew occurs, the hot data has to be scattered and redistributed to other nodes according to the specific distribution of the data. Analyzing and handling the skew in this way takes a large amount of time, which results in low association query efficiency and a high failure rate.
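To make the skew problem concrete, the following minimal sketch (plain Python, not Spark code) routes rows to partitions by join key and counts the rows each partition receives. The routing rule (key modulo partition count), the function name `partition_sizes` and the sample data are assumptions made for illustration only.

```python
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count rows per partition when each row is routed by key modulo partition count."""
    sizes = Counter(key % num_partitions for key in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]

# Hypothetical data: key 7 is a "hot" key carrying 90 of the 100 rows.
keys = [7] * 90 + list(range(10))
sizes = partition_sizes(keys, 4)
print(sizes)  # [3, 3, 2, 92]
```

One partition ends up with 92 of the 100 rows, so the node that owns it does almost all of the join work while the others sit idle.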
Disclosure of Invention
The main purpose of the invention is to solve the problems of low query efficiency and high query failure rate caused by data skew when an association query is performed.
A first aspect of the invention provides an association query method based on data skew, comprising: acquiring an association query request of a target terminal, reading first table data and second table data based on the association query request, and counting the data volume of the first table data and the data volume of the second table data to obtain a first data volume and a second data volume, wherein the association query request is an equi-join query request, a left join query request or a full join query request; when at least one of the first data volume and the second data volume is greater than a skew threshold, determining whether the first data volume is greater than the second data volume or the second data volume is greater than the first data volume; if the first data volume is greater than the second data volume, obtaining a first non-skewed data set, a first skewed data set, a second non-skewed data set and a second skewed data set according to the first table data and the second table data; determining a first target data set according to the first non-skewed data set, the first skewed data set, the second non-skewed data set, the second skewed data set and the association query request, and transmitting the first target data set to the target terminal, wherein the first target data set is a first equi-join data set, a first left join data set or a first full join data set; if the second data volume is greater than the first data volume, obtaining a third non-skewed data set, a third skewed data set, a fourth non-skewed data set and a fourth skewed data set according to the first table data and the second table data; and determining a second target data set according to the third non-skewed data set, the third skewed data set, the fourth non-skewed data set, the fourth skewed data set and the association query request, and transmitting the second target data set to the target terminal, wherein the second target data set is a second equi-join data set, a second left join data set or a second full join data set.
Optionally, in a first implementation of the first aspect of the invention, acquiring the association query request of the target terminal, reading the first table data and the second table data based on the association query request, and counting their data volumes to obtain the first data volume and the second data volume comprises: acquiring the association query request of the target terminal, reading the first table data and the second table data based on the association query request, dividing the first table data into a plurality of first column data, and dividing the second table data into a plurality of second column data; performing data processing on the plurality of first column data to obtain a plurality of first sub-table data, and counting the data volume of each first sub-table data to obtain a plurality of first sub-table data volumes; performing data processing on the plurality of second column data to obtain a plurality of second sub-table data, and counting the data volume of each second sub-table data to obtain a plurality of second sub-table data volumes; summing the plurality of first sub-table data volumes to obtain the first data volume; and summing the plurality of second sub-table data volumes to obtain the second data volume.
Optionally, in a second implementation of the first aspect of the invention, obtaining the first non-skewed data set, the first skewed data set, the second non-skewed data set and the second skewed data set according to the first table data and the second table data if the first data volume is greater than the second data volume comprises: if the first data volume is greater than the second data volume, processing the first table data into first marked data, and left-joining the second table data with the first marked data to obtain a first result set, wherein the first result set comprises a plurality of first small data identifiers; extracting from the first result set the data whose first small data identifier is null, and re-adding a non-null first small data identifier to that data, to obtain the first non-skewed data set; extracting from the first result set the data whose first small data identifier is not null to obtain the first skewed data set; adding a first big data identifier to the second table data to obtain first table identifier data, and left-joining the first table identifier data with the first marked data to obtain a second result set, wherein the second result set comprises a plurality of first small table column data; extracting from the second result set the data whose first small table column data is null, and deleting the corresponding first small table column data, to obtain the second non-skewed data set; and extracting from the second result set the data whose first small table column data is not null, and deleting the corresponding first small table column data, to obtain the second skewed data set.
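The split into skewed and non-skewed data sets can be sketched in plain Python as follows. The text does not spell out how the marked data is derived, so this sketch assumes that it holds the join keys occurring at least `min_count` times in the larger table; `build_marked_data`, `left_join_marker`, `split_by_marker`, `min_count` and the sample rows are all hypothetical names and values. Re-adding identifiers and deleting helper columns are omitted for brevity.

```python
from collections import Counter

def build_marked_data(rows, min_count):
    """Map each key occurring at least min_count times to a marker value of 1 (assumed rule)."""
    counts = Counter(key for key, _ in rows)
    return {key: 1 for key, n in counts.items() if n >= min_count}

def left_join_marker(rows, marked):
    """Left-join rows against the marked keys; unmatched rows get a None (null) marker."""
    return [(key, value, marked.get(key)) for key, value in rows]

def split_by_marker(joined):
    """Rows with a null marker are non-skewed; rows with a non-null marker are skewed."""
    non_skewed = [(k, v) for k, v, m in joined if m is None]
    skewed = [(k, v) for k, v, m in joined if m is not None]
    return non_skewed, skewed

# Hypothetical tables of (join key, payload) rows; key 1 is the hot key.
big = [(1, "b1"), (1, "b2"), (1, "b3"), (2, "b4"), (3, "b5")]
small = [(1, "s1"), (2, "s2"), (4, "s3")]

marked = build_marked_data(big, min_count=3)
small_non_skewed, small_skewed = split_by_marker(left_join_marker(small, marked))
big_non_skewed, big_skewed = split_by_marker(left_join_marker(big, marked))
print(small_skewed)    # [(1, 's1')]
print(big_non_skewed)  # [(2, 'b4'), (3, 'b5')]
```

Because both tables are split against the same marked keys, every join key lands on the same side (skewed or non-skewed) in both tables, which is what allows the two halves to be joined separately later.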
Optionally, in a third implementation of the first aspect of the invention, determining the first target data set according to the first non-skewed data set, the first skewed data set, the second non-skewed data set, the second skewed data set and the association query request, and transmitting the first target data set to the target terminal, comprises: full-joining the first non-skewed data set with the second non-skewed data set to obtain a first initial full data set, and full-joining the first skewed data set with the second skewed data set to obtain a second initial full data set; merging the first initial full data set and the second initial full data set to obtain a first full data set; when the association query request is the equi-join query request, extracting from the first full data set the data whose first small data identifier is not null and whose first big data identifier is not null, and deleting the corresponding first small data identifiers and first big data identifiers, to obtain the first equi-join data set; when the association query request is the left join query request, extracting from the first full data set the data whose first big data identifier is not null, and deleting the corresponding first small data identifiers and first big data identifiers, to obtain the first left join data set; and when the association query request is the full join query request, taking the first full data set and deleting the first small data identifiers and the first big data identifiers to obtain the first full join data set.
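A minimal plain-Python sketch of this step, under stated assumptions: rows are `(join key, payload)` tuples, a missing side of a joined row is `None` and stands in for a null big or small data identifier, and the big table is taken to be the left side of the request. The names `full_join`, `select`, `split` and `hot` are hypothetical.

```python
from collections import Counter

def full_join(big_rows, small_rows):
    """Full outer join on the key (row[0]); a missing side is None,
    standing in for a null big/small data identifier."""
    small_by_key = {}
    for s in small_rows:
        small_by_key.setdefault(s[0], []).append(s)
    out, seen = [], set()
    for b in big_rows:
        seen.add(b[0])
        partners = small_by_key.get(b[0])
        if partners:
            out.extend((b, s) for s in partners)
        else:
            out.append((b, None))
    for key, partners in small_by_key.items():
        if key not in seen:
            out.extend((None, s) for s in partners)
    return out

def select(full_rows, kind):
    """Derive the requested join type from the merged full data set."""
    if kind == "equi":   # both identifiers non-null
        return [(b, s) for b, s in full_rows if b is not None and s is not None]
    if kind == "left":   # big (left-side) identifier non-null
        return [(b, s) for b, s in full_rows if b is not None]
    if kind == "full":   # everything
        return list(full_rows)
    raise ValueError(kind)

big = [(1, "b1"), (1, "b2"), (1, "b3"), (2, "b4"), (3, "b5")]
small = [(1, "s1"), (2, "s2"), (4, "s3")]
hot = {1}  # assumed set of skewed keys

def split(rows):
    """Return (non-skewed rows, skewed rows) with respect to the hot keys."""
    return [r for r in rows if r[0] not in hot], [r for r in rows if r[0] in hot]

big_non, big_skew = split(big)
small_non, small_skew = split(small)
merged = full_join(big_non, small_non) + full_join(big_skew, small_skew)
print(len(select(merged, "equi")), len(select(merged, "left")), len(select(merged, "full")))
```

In this sketch the merged result contains the same rows as a single full join over the unsplit tables, so all three join types can be read off one merged data set by filtering on which side is present.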
Optionally, in a fourth implementation of the first aspect of the invention, obtaining the third non-skewed data set, the third skewed data set, the fourth non-skewed data set and the fourth skewed data set according to the first table data and the second table data if the second data volume is greater than the first data volume comprises: if the second data volume is greater than the first data volume, processing the second table data into second marked data, and left-joining the first table data with the second marked data to obtain a third result set, wherein the third result set comprises a plurality of second small data identifiers; extracting from the third result set the data whose second small data identifier is null, and re-adding a non-null second small data identifier to that data, to obtain the third non-skewed data set; extracting from the third result set the data whose second small data identifier is not null to obtain the third skewed data set; adding a second big data identifier to the first table data to obtain second table identifier data, and left-joining the second table identifier data with the second marked data to obtain a fourth result set, wherein the fourth result set comprises a plurality of second small table column data; extracting from the fourth result set the data whose second small table column data is null, and deleting the corresponding second small table column data, to obtain the fourth non-skewed data set; and extracting from the fourth result set the data whose second small table column data is not null, and deleting the corresponding second small table column data, to obtain the fourth skewed data set.
Optionally, in a fifth implementation of the first aspect of the invention, determining the second target data set according to the third non-skewed data set, the third skewed data set, the fourth non-skewed data set, the fourth skewed data set and the association query request, and transmitting the second target data set to the target terminal, comprises: full-joining the third non-skewed data set with the fourth non-skewed data set to obtain a third initial full data set, and full-joining the third skewed data set with the fourth skewed data set to obtain a fourth initial full data set; merging the third initial full data set and the fourth initial full data set to obtain a second full data set; when the association query request is the equi-join query request, extracting from the second full data set the data whose second small data identifier is not null and whose second big data identifier is not null, and deleting the corresponding second small data identifiers and second big data identifiers, to obtain the second equi-join data set; when the association query request is the left join query request, extracting from the second full data set the data whose second big data identifier is not null, and deleting the corresponding second small data identifiers and second big data identifiers, to obtain the second left join data set; and when the association query request is the full join query request, taking the second full data set and deleting the second small data identifiers and the second big data identifiers to obtain the second full join data set.
Optionally, in a sixth implementation of the first aspect of the invention, after determining the second target data set according to the third non-skewed data set, the third skewed data set, the fourth non-skewed data set, the fourth skewed data set and the association query request, and transmitting the second target data set to the target terminal, the association query method based on data skew further comprises: when the first data volume is less than or equal to the skew threshold and the second data volume is less than or equal to the skew threshold, performing the corresponding join on the first table data and the second table data according to the association query request to obtain a third target data set, wherein the third target data set is a third equi-join data set, a third left join data set or a third full join data set.
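Taken together, the method and this sixth implementation describe a three-way branch on the two data volumes. A minimal sketch of that branching follows; the function name `choose_branch` and the returned labels are hypothetical, and the case of two equal volumes above the threshold is not covered by the text, so the sketch only flags it.

```python
def choose_branch(first_volume, second_volume, skew_threshold):
    """Pick the processing branch from the two data volumes and the skew threshold."""
    if first_volume <= skew_threshold and second_volume <= skew_threshold:
        return "direct_join"        # neither table exceeds the threshold
    if first_volume > second_volume:
        return "first_is_larger"    # build the first to second (non-)skewed data sets
    if second_volume > first_volume:
        return "second_is_larger"   # build the third to fourth (non-)skewed data sets
    return "unspecified"            # equal volumes: not addressed in the text

print(choose_branch(10, 5, 100))
print(choose_branch(200, 50, 100))
print(choose_branch(50, 200, 100))
```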
A second aspect of the invention provides an association query apparatus based on data skew, comprising: a data acquisition module, configured to acquire an association query request of a target terminal, read first table data and second table data based on the association query request, and count the data volume of the first table data and the data volume of the second table data to obtain a first data volume and a second data volume, wherein the association query request is an equi-join query request, a left join query request or a full join query request; a judging module, configured to determine, when at least one of the first data volume and the second data volume is greater than a skew threshold, whether the first data volume is greater than the second data volume or the second data volume is greater than the first data volume; a first data set extraction module, configured to obtain a first non-skewed data set, a first skewed data set, a second non-skewed data set and a second skewed data set according to the first table data and the second table data if the first data volume is greater than the second data volume; a first association module, configured to determine a first target data set according to the first non-skewed data set, the first skewed data set, the second non-skewed data set, the second skewed data set and the association query request, and transmit the first target data set to the target terminal, wherein the first target data set is a first equi-join data set, a first left join data set or a first full join data set; a second data set extraction module, configured to obtain a third non-skewed data set, a third skewed data set, a fourth non-skewed data set and a fourth skewed data set according to the first table data and the second table data if the second data volume is greater than the first data volume; and a second association module, configured to determine a second target data set according to the third non-skewed data set, the third skewed data set, the fourth non-skewed data set, the fourth skewed data set and the association query request, and transmit the second target data set to the target terminal, wherein the second target data set is a second equi-join data set, a second left join data set or a second full join data set.
Optionally, in a first implementation of the second aspect of the invention, the data acquisition module is specifically configured to: acquire an association query request of a target terminal, read first table data and second table data based on the association query request, divide the first table data into a plurality of first column data, and divide the second table data into a plurality of second column data; perform data processing on the plurality of first column data to obtain a plurality of first sub-table data, and count the data volume of each first sub-table data to obtain a plurality of first sub-table data volumes; perform data processing on the plurality of second column data to obtain a plurality of second sub-table data, and count the data volume of each second sub-table data to obtain a plurality of second sub-table data volumes; sum the plurality of first sub-table data volumes to obtain the first data volume; and sum the plurality of second sub-table data volumes to obtain the second data volume.
Optionally, in a second implementation of the second aspect of the invention, the first data set extraction module is specifically configured to: if the first data volume is greater than the second data volume, process the first table data into first marked data, and left-join the second table data with the first marked data to obtain a first result set, wherein the first result set comprises a plurality of first small data identifiers; extract from the first result set the data whose first small data identifier is null, and re-add a non-null first small data identifier to that data, to obtain the first non-skewed data set; extract from the first result set the data whose first small data identifier is not null to obtain the first skewed data set; add a first big data identifier to the second table data to obtain first table identifier data, and left-join the first table identifier data with the first marked data to obtain a second result set, wherein the second result set comprises a plurality of first small table column data; extract from the second result set the data whose first small table column data is null, and delete the corresponding first small table column data, to obtain the second non-skewed data set; and extract from the second result set the data whose first small table column data is not null, and delete the corresponding first small table column data, to obtain the second skewed data set.
Optionally, in a third implementation of the second aspect of the invention, the first association module is specifically configured to: full-join the first non-skewed data set with the second non-skewed data set to obtain a first initial full data set, and full-join the first skewed data set with the second skewed data set to obtain a second initial full data set; merge the first initial full data set and the second initial full data set to obtain a first full data set; when the association query request is the equi-join query request, extract from the first full data set the data whose first small data identifier is not null and whose first big data identifier is not null, and delete the corresponding first small data identifiers and first big data identifiers, to obtain the first equi-join data set; when the association query request is the left join query request, extract from the first full data set the data whose first big data identifier is not null, and delete the corresponding first small data identifiers and first big data identifiers, to obtain the first left join data set; and when the association query request is the full join query request, take the first full data set and delete the first small data identifiers and the first big data identifiers to obtain the first full join data set.
Optionally, in a fourth implementation of the second aspect of the invention, the second data set extraction module is specifically configured to: if the second data volume is greater than the first data volume, process the second table data into second marked data, and left-join the first table data with the second marked data to obtain a third result set, wherein the third result set comprises a plurality of second small data identifiers; extract from the third result set the data whose second small data identifier is null, and re-add a non-null second small data identifier to that data, to obtain the third non-skewed data set; extract from the third result set the data whose second small data identifier is not null to obtain the third skewed data set; add a second big data identifier to the first table data to obtain second table identifier data, and left-join the second table identifier data with the second marked data to obtain a fourth result set, wherein the fourth result set comprises a plurality of second small table column data; extract from the fourth result set the data whose second small table column data is null, and delete the corresponding second small table column data, to obtain the fourth non-skewed data set; and extract from the fourth result set the data whose second small table column data is not null, and delete the corresponding second small table column data, to obtain the fourth skewed data set.
Optionally, in a fifth implementation of the second aspect of the invention, the second association module is specifically configured to: full-join the third non-skewed data set with the fourth non-skewed data set to obtain a third initial full data set, and full-join the third skewed data set with the fourth skewed data set to obtain a fourth initial full data set; merge the third initial full data set and the fourth initial full data set to obtain a second full data set; when the association query request is the equi-join query request, extract from the second full data set the data whose second small data identifier is not null and whose second big data identifier is not null, and delete the corresponding second small data identifiers and second big data identifiers, to obtain the second equi-join data set; when the association query request is the left join query request, extract from the second full data set the data whose second big data identifier is not null, and delete the corresponding second small data identifiers and second big data identifiers, to obtain the second left join data set; and when the association query request is the full join query request, take the second full data set and delete the second small data identifiers and the second big data identifiers to obtain the second full join data set.
Optionally, in a sixth implementation of the second aspect of the invention, the association query apparatus based on data skew further comprises: a third association module, configured to perform the corresponding join on the first table data and the second table data according to the association query request to obtain a third target data set when the first data volume is less than or equal to the skew threshold and the second data volume is less than or equal to the skew threshold, wherein the third target data set is a third equi-join data set, a third left join data set or a third full join data set.
A third aspect of the invention provides an association query device based on data skew, comprising: a memory having instructions stored therein and at least one processor, the memory and the at least one processor being interconnected by a line; the at least one processor invokes the instructions in the memory to cause the association query device based on data skew to perform the association query method based on data skew described above.
A fourth aspect of the invention provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to execute the association query method based on data skew described above.
In the technical solution provided by the invention, an association query request of a target terminal is acquired, first table data and second table data are read based on the association query request, and the data volume of the first table data and the data volume of the second table data are counted to obtain a first data volume and a second data volume, wherein the association query request is an equi-join query request, a left join query request or a full join query request; when at least one of the first data volume and the second data volume is greater than a skew threshold, it is determined whether the first data volume is greater than the second data volume or the second data volume is greater than the first data volume; if the first data volume is greater than the second data volume, a first non-skewed data set, a first skewed data set, a second non-skewed data set and a second skewed data set are obtained according to the first table data and the second table data; a first target data set is determined according to the first non-skewed data set, the first skewed data set, the second non-skewed data set, the second skewed data set and the association query request, and is transmitted to the target terminal, wherein the first target data set is a first equi-join data set, a first left join data set or a first full join data set; if the second data volume is greater than the first data volume, a third non-skewed data set, a third skewed data set, a fourth non-skewed data set and a fourth skewed data set are obtained according to the first table data and the second table data; and a second target data set is determined according to the third non-skewed data set, the third skewed data set, the fourth non-skewed data set, the fourth skewed data set and the association query request, and is transmitted to the target terminal, wherein the second target data set is a second equi-join data set, a second left join data set or a second full join data set. In the embodiments of the invention, the skewed data sets, the non-skewed data sets and the data identifiers are extracted from the first table data and the second table data according to the first data volume and the second data volume, and the target data set is obtained from the skewed data sets, the non-skewed data sets and the data identifiers, which improves the efficiency of association queries and reduces the probability that an association query fails.
Drawings
FIG. 1 is a diagram of an embodiment of an association query method based on data skew according to an embodiment of the present invention;
FIG. 2 is a diagram of another embodiment of an association query method based on data skew according to an embodiment of the present invention;
FIG. 3 is a diagram of an embodiment of an association query apparatus based on data skew according to an embodiment of the present invention;
FIG. 4 is a diagram of another embodiment of an association query apparatus based on data skew according to an embodiment of the present invention;
FIG. 5 is a diagram of an embodiment of an association query device based on data skew according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides an association query method, apparatus, device and storage medium based on data skew.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a specific flow of the embodiment of the present invention is described below. Referring to FIG. 1, an embodiment of the association query method based on data skew in the embodiment of the present invention includes:
101. Acquiring an association query request of a target terminal, reading first table data and second table data based on the association query request, and counting the data volume of the first table data and the data volume of the second table data to obtain a first data volume and a second data volume, wherein the association query request is an equivalent connection query request, a left association query request or a full connection query request;
The server acquires an equivalent connection query request, a left association query request or a full connection query request from the target terminal, reads the first table data and the second table data according to that request, and counts a first data volume corresponding to the first table data and a second data volume corresponding to the second table data.
The association query request of the target terminal is a left association (left join) request, an equivalent connection (inner join) request or a full connection (full join) request. A left association query takes the left one of the two specified tables as the main table, i.e. the first table, and retains all data of the first table together with the data of the second table that satisfies the join condition; an equivalent connection retains only the data whose join fields are equal in the first table and the second table; a full connection retains the union of the left-associated data and the right-associated data.
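The three request types can be illustrated with a minimal Python sketch. The helper name `join`, the list-of-dict table representation and the use of `None` for null are assumptions made only for illustration; the sample rows are the ones this description uses in Tables 1 and 2.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; `how` is 'inner', 'left' or 'full'."""
    left_cols = {c for row in left for c in row}
    right_cols = {c for row in right for c in row}
    # Index the right-hand rows by join key.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    out, matched = [], set()
    for lrow in left:
        partners = index.get(lrow[key], [])
        if partners:
            matched.add(lrow[key])
            for rrow in partners:
                out.append({**rrow, **lrow})
        elif how in ("left", "full"):
            # Keep the unmatched left row, with nulls for the right columns.
            out.append({**dict.fromkeys(right_cols), **lrow})
    if how == "full":
        for rrow in right:
            if rrow[key] not in matched:
                # Keep the unmatched right row, with nulls for the left columns.
                out.append({**dict.fromkeys(left_cols), **rrow})
    return out


table_a = [
    {"user_id": "Zhangsan", "enterprise_id": "E1"},
    {"user_id": "Lisi", "enterprise_id": "E1"},
    {"user_id": "Wangwu", "enterprise_id": "E1"},
]
table_b = [
    {"user_id": "Zhangsan", "Age": 18},
    {"user_id": "Lisi", "Age": 19},
    {"user_id": "Zhaoliu", "Age": 20},
]

for how in ("inner", "left", "full"):
    print(how, [row["user_id"] for row in join(table_a, table_b, "user_id", how)])
```

The inner join keeps only the users present in both tables, the left join additionally keeps the unmatched rows of the first table, and the full join also keeps the unmatched rows of the second table.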
The server acquires the first table data and the second table data that need to be joined, and counts the data volume of the first table data and the data volume of the second table data by number of bytes or number of records to obtain the first data volume and the second data volume.
For example, the first table data and the second table data obtained by parsing the association query request are table A data and table B data, respectively, as shown in Tables 1 and 2 below:
Table 1: Table A data, i.e. the first table data
user_id   enterprise_id
Zhangsan  E1
Lisi      E1
Wangwu    E1
Table 2: Table B data, i.e. the second table data
user_id   Age
Zhangsan  18
Lisi      19
Zhaoliu   20
In the above tables, user_id is the name of a person, enterprise_id is the name of an enterprise, and Age is the age. The server separately counts the data of the user_id column and the data of the enterprise_id column in the table A data to obtain the number of bytes or the number of records of the two sub-tables, and adds the byte counts (or the record counts) of the two sub-tables to obtain the data volume of table A; likewise, the server separately counts the data of the user_id column and the data of the Age column in the table B data to obtain the number of bytes or the number of records of the two sub-tables, and adds the byte counts (or the record counts) of the two sub-tables to obtain the data volume of table B.
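The per-column counting described above can be sketched as follows. This is only an illustration: the function names `column_volume` and `table_volume` are hypothetical, and measuring bytes as the UTF-8 length of each value's string form is an assumption, since the description does not say how bytes are measured.

```python
def column_volume(rows, column, unit="records"):
    """Data volume of one column (one sub-table), in records or in bytes."""
    values = [row[column] for row in rows]
    if unit == "records":
        return len(values)
    if unit == "bytes":
        # Assumption: a value's size is the UTF-8 length of its string form.
        return sum(len(str(value).encode("utf-8")) for value in values)
    raise ValueError(f"unknown unit: {unit}")


def table_volume(rows, columns, unit="records"):
    """Add up the volumes of the per-column sub-tables."""
    return sum(column_volume(rows, column, unit) for column in columns)


table_a = [
    {"user_id": "Zhangsan", "enterprise_id": "E1"},
    {"user_id": "Lisi", "enterprise_id": "E1"},
    {"user_id": "Wangwu", "enterprise_id": "E1"},
]
table_b = [
    {"user_id": "Zhangsan", "Age": 18},
    {"user_id": "Lisi", "Age": 19},
    {"user_id": "Zhaoliu", "Age": 20},
]

first_volume = table_volume(table_a, ["user_id", "enterprise_id"], unit="bytes")
second_volume = table_volume(table_b, ["user_id", "Age"], unit="bytes")
print(first_volume, second_volume)
```

Whichever unit is chosen, the same unit has to be used for both tables and for the skew threshold so that the later comparisons are meaningful.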
It is to be understood that the execution subject of the present invention may be an association query apparatus based on data skew, or may be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as the execution subject.
102. When at least one of the first data volume and the second data volume is larger than the skew threshold, judging whether the first data volume is larger than the second data volume or whether the second data volume is larger than the first data volume;
When either of the first data volume and the second data volume is larger than the skew threshold, the server judges whether the first data volume is larger than the second data volume or the second data volume is larger than the first data volume. If the first data volume is larger than the second data volume, the server processes the first table data and then performs the corresponding data association on the first table data, the second table data and the processed first table data; if the second data volume is larger than the first data volume, the server processes the second table data and then performs the corresponding data association on the first table data, the second table data and the processed second table data. When neither the first data volume nor the second data volume is larger than the skew threshold, the server first broadcasts the table data with the smaller data volume to each node of the association thread, and then directly performs left association, equivalent connection or full connection on the first table data and the second table data according to the association query request.
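The branch selection of this step can be sketched as a small Python function. The function name `choose_branch` and the returned labels are hypothetical. The sample values are the ones used in the examples of this description, and all three arguments must be in the same unit (bytes or records).

```python
def choose_branch(first_volume, second_volume, skew_threshold):
    """Pick the processing branch from the two data volumes and the skew threshold."""
    if first_volume <= skew_threshold and second_volume <= skew_threshold:
        # Neither table exceeds the threshold: broadcast the smaller table, then join directly.
        return "broadcast_join"
    if first_volume > second_volume:
        return "process_first_table"
    if second_volume > first_volume:
        return "process_second_table"
    # Equal volumes above the threshold are not covered by the description.
    return "unspecified"


print(choose_branch(12, 11, 10))       # volumes in kw records
print(choose_branch(210, 220, 200))    # volumes in M bytes
print(choose_branch(150, 170, 200))    # both volumes below the threshold
```

The explicit `"unspecified"` result makes the uncovered case visible instead of silently folding it into one of the two branches.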
103. If the first data volume is larger than the second data volume, obtaining a first non-skewed data set, a first skewed data set, a second non-skewed data set and a second skewed data set according to the first table data and the second table data;
If the server determines that the first data volume is larger than the second data volume, the server extracts a first non-skewed data set, a first skewed data set, a second non-skewed data set and a second skewed data set based on the first table data and the second table data.
For example, assume that the skew threshold is measured in the number of records, the skew threshold is 10kw, the first data volume is 12kw, and the second data volume is 11kw. The first data volume is therefore larger than the second data volume, so the server processes the first table data and then obtains the first non-skewed data set, the first skewed data set, the second non-skewed data set and the second skewed data set according to the first table data, the second table data and the processed first table data.
The skew threshold may be measured in the number of bytes or in the number of records.
104. Determining a first target data set according to the first non-skewed data set, the first skewed data set, the second non-skewed data set, the second skewed data set and the association query request, and transmitting the first target data set to the target terminal, wherein the first target data set is a first equivalent connection data set, a first left association data set or a first full connection data set;
The server determines a first equivalent connection data set, a first left association data set or a first full connection data set according to the association query request, the first non-skewed data set, the first skewed data set, the second non-skewed data set and the second skewed data set, and transmits the resulting data set to the target terminal.
The server fully connects the first non-skewed data set and the second non-skewed data set to obtain a first transition data set, fully connects the first skewed data set and the second skewed data set to obtain a second transition data set, and merges the first transition data set and the second transition data set to obtain a first complete transition data set. The server then performs the association corresponding to the association query request on the first transition data set and the second transition data set to obtain the first target data set. If the association query request is an equivalent connection query request, the server retains part of the data in the first complete transition data set and deletes the rest to obtain a first equivalent connection data set; if the association query request is a left association query request, the server retains part of the data in the first complete transition data set and deletes the rest to obtain a first left association data set; if the association query request is a full connection query request, the server retains the data part of the first complete transition data set and deletes the identifier part to obtain a first full connection data set. For which data is retained and which is deleted, refer to step 204.
105. If the second data volume is larger than the first data volume, obtaining a third non-skewed data set, a third skewed data set, a fourth non-skewed data set and a fourth skewed data set according to the first table data and the second table data;
If the server determines that the second data volume is larger than the first data volume, the server extracts a third non-skewed data set, a third skewed data set, a fourth non-skewed data set and a fourth skewed data set based on the first table data and the second table data.
For example, assume that the skew threshold is measured in bytes, the skew threshold is 200M, the first data volume is 210M, and the second data volume is 220M. The second data volume is therefore larger than the first data volume, so the server processes the second table data and then obtains the third non-skewed data set, the third skewed data set, the fourth non-skewed data set and the fourth skewed data set from the first table data, the second table data and the processed second table data.
The skew threshold may be measured in the number of bytes or in the number of records.
106. Determining a second target data set according to the third non-skewed data set, the third skewed data set, the fourth non-skewed data set, the fourth skewed data set and the association query request, and transmitting the second target data set to the target terminal, wherein the second target data set is a second equivalent connection data set, a second left association data set or a second full connection data set.
The server determines a second equivalent connection data set, a second left association data set or a second full connection data set according to the association query request, the third non-skewed data set, the third skewed data set, the fourth non-skewed data set and the fourth skewed data set, and transmits the resulting data set to the target terminal.
The server fully connects the third non-skewed data set and the fourth non-skewed data set to obtain a third transition data set, fully connects the third skewed data set and the fourth skewed data set to obtain a fourth transition data set, and merges the third transition data set and the fourth transition data set to obtain a second complete transition data set. The server then performs the association corresponding to the association query request on the third transition data set and the fourth transition data set to obtain the second target data set. If the association query request is an equivalent connection query request, the server retains part of the data in the second complete transition data set and deletes the rest to obtain a second equivalent connection data set; if the association query request is a left association query request, the server retains part of the data in the second complete transition data set and deletes the rest to obtain a second left association data set; if the association query request is a full connection query request, the server retains the data part of the second complete transition data set and deletes the identifier part to obtain a second full connection data set. For which data is retained and which is deleted in the second complete transition data set for each of the three request types, refer to the description of step 206.
It should be noted that the present invention also relates to blockchain technology, and the first target data set and the second target data set may be stored in a blockchain.
In the embodiment of the invention, a plurality of skewed data sets, a plurality of non-skewed data sets and data identifiers are extracted from the first table data and the second table data according to the first data volume and the second data volume, and the target data set is obtained based on the skewed data sets, the non-skewed data sets and the data identifiers, which improves the efficiency of associated data queries and reduces the probability that such queries fail.
Referring to FIG. 2, another embodiment of the association query method based on data skew according to the embodiment of the present invention includes:
201. Acquiring an association query request of a target terminal, reading first table data and second table data based on the association query request, and counting the data volume of the first table data and the data volume of the second table data to obtain a first data volume and a second data volume, wherein the association query request is an equivalent connection query request, a left association query request or a full connection query request;
The server acquires an equivalent connection query request, a left association query request or a full connection query request from the target terminal, reads the first table data and the second table data according to that request, and counts a first data volume corresponding to the first table data and a second data volume corresponding to the second table data.
Specifically, the server divides the first table data into a plurality of first column data using a group by function, each of which is a complete column, and performs data processing on each first column data to obtain a plurality of independent first sub-table data; the server counts the data volume of each independent first sub-table data to obtain a plurality of independent first sub-table data volumes, and adds these data volumes to obtain the first data volume. Likewise, the server divides the second table data into a plurality of second column data using a group by function, each of which is a complete column, and performs data processing on each second column data to obtain a plurality of independent second sub-table data; the server counts the data volume of each independent second sub-table data to obtain a plurality of independent second sub-table data volumes, and adds these data volumes to obtain the second data volume.
202. When at least one of the first data volume and the second data volume is larger than the skew threshold, judging whether the first data volume is larger than the second data volume or whether the second data volume is larger than the first data volume;
When either of the first data volume and the second data volume is larger than the skew threshold, the server judges whether the first data volume is larger than the second data volume or the second data volume is larger than the first data volume. If the first data volume is larger than the second data volume, the server processes the first table data and then performs the corresponding data association on the first table data, the second table data and the processed first table data; if the second data volume is larger than the first data volume, the server processes the second table data and then performs the corresponding data association on the first table data, the second table data and the processed second table data. When neither the first data volume nor the second data volume is larger than the skew threshold, the server first broadcasts the table data with the smaller data volume to each node of the association thread, and then directly performs left association, equivalent connection or full connection on the first table data and the second table data according to the association query request.
203. If the first data volume is larger than the second data volume, obtaining a first non-skewed data set, a first skewed data set, a second non-skewed data set and a second skewed data set according to the first table data and the second table data;
If the server determines that the first data volume is larger than the second data volume, the server extracts a first non-skewed data set, a first skewed data set, a second non-skewed data set and a second skewed data set based on the first table data and the second table data.
Specifically, if the server determines that the first data volume is larger than the second data volume, the server processes the first table data into first tag data and left-associates the second table data with the first tag data to obtain a first result set that includes a plurality of first small data identifiers; the server extracts from the first result set the data whose first small data identifier is null and re-assigns a non-null first small data identifier to it, to obtain the first non-skewed data set; the server extracts from the first result set the data whose first small data identifier is not null, to obtain the first skewed data set; the server then adds a first big data identifier to the second table data to obtain first table identifier data, and left-associates the first table identifier data with the first tag data to obtain a second result set that includes a plurality of first small table column data; the server extracts from the second result set the data whose first small table column data is null and deletes the corresponding first small table column data, to obtain the second non-skewed data set; and the server extracts from the second result set the data whose first small table column data is not null and deletes the corresponding first small table column data, to obtain the second skewed data set.
204. Determining a first target data set according to the first non-skewed data set, the first skewed data set, the second non-skewed data set, the second skewed data set and the association query request, and transmitting the first target data set to the target terminal, wherein the first target data set is a first equivalent connection data set, a first left association data set or a first full connection data set;
The server determines a first equivalent connection data set, a first left association data set or a first full connection data set according to the association query request, the first non-skewed data set, the first skewed data set, the second non-skewed data set and the second skewed data set, and transmits the resulting data set to the target terminal.
Specifically, the server fully connects the first non-skewed data set with the second non-skewed data set to obtain a first initial full data set, and fully connects the first skewed data set with the second skewed data set to obtain a second initial full data set; the server then takes the union of the first initial full data set and the second initial full data set to obtain a first full data set. When the association query request is an equivalent connection query request, the server extracts from the first full data set the data whose first small data identifier and first big data identifier are both not null, and deletes the corresponding first small data identifiers and first big data identifiers to obtain a first equivalent connection data set; when the association query request is a left association query request, the server extracts from the first full data set the data whose first big data identifier is not null, and deletes the corresponding first small data identifiers and first big data identifiers to obtain a first left association data set; when the association query request is a full connection query request, the server takes the first full data set and deletes the first small data identifiers and the first big data identifiers in it to obtain a first full connection data set.
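The identifier-based selection described in this step can be sketched in Python as follows. The function name `select_target`, the column names `smaller_mark` and `bigger_mark`, the request labels and the sample rows are all hypothetical; `None` stands for a null value.

```python
def select_target(full_set, request, small_mark="smaller_mark", big_mark="bigger_mark"):
    """Derive the target data set from a full data set that still carries both identifiers."""
    if request == "inner":      # equivalent connection: both identifiers are not null
        kept = [r for r in full_set if r[small_mark] is not None and r[big_mark] is not None]
    elif request == "left":     # left association: big data identifier is not null
        kept = [r for r in full_set if r[big_mark] is not None]
    elif request == "full":     # full connection: keep every row
        kept = list(full_set)
    else:
        raise ValueError(f"unknown request type: {request}")
    # The identifiers are helper columns only, so they are dropped from the result.
    return [{k: v for k, v in r.items() if k not in (small_mark, big_mark)} for r in kept]


# Hypothetical full data set: u1 matched on both sides, u2 only in the table that
# carries the big data identifier, u3 only in the other table.
full_set = [
    {"user_id": "u1", "value": 1, "smaller_mark": 1, "bigger_mark": 1},
    {"user_id": "u2", "value": None, "smaller_mark": None, "bigger_mark": 1},
    {"user_id": "u3", "value": 3, "smaller_mark": 1, "bigger_mark": None},
]

for request in ("inner", "left", "full"):
    print(request, [row["user_id"] for row in select_target(full_set, request)])
```

Because the full data set is computed once and the request type only changes the filter, the same intermediate result can serve all three request types.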
205. If the second data volume is larger than the first data volume, obtaining a third non-skewed data set, a third skewed data set, a fourth non-skewed data set and a fourth skewed data set according to the first table data and the second table data;
If the server determines that the second data volume is larger than the first data volume, the server extracts a third non-skewed data set, a third skewed data set, a fourth non-skewed data set and a fourth skewed data set based on the first table data and the second table data.
Specifically, if the server determines that the second data volume is larger than the first data volume, the server processes the second table data into second tag data and left-associates the first table data with the second tag data to obtain a third result set that includes a plurality of second small data identifiers; the server extracts from the third result set the data whose second small data identifier is null and re-assigns a non-null second small data identifier to it, to obtain the third non-skewed data set; the server extracts from the third result set the data whose second small data identifier is not null, to obtain the third skewed data set; the server adds a second big data identifier to the first table data to obtain second table identifier data, and left-associates the second table identifier data with the second tag data to obtain a fourth result set that includes a plurality of second small table column data; the server extracts from the fourth result set the data whose second small table column data is null and deletes the corresponding second small table column data, to obtain the fourth non-skewed data set; and the server extracts from the fourth result set the data whose second small table column data is not null and deletes the corresponding second small table column data, to obtain the fourth skewed data set.
For ease of understanding, step 205 is described in detail below with reference to a practical application:
After the first table data is broadcast, the second tag data is obtained, as shown in Table 3 below:
Table 3: Second tag data
user_id   enterprise_id  1-smaller_mark
Zhangsan  E1             1
Wangwu    E1             1
The server left-associates the first table data with the second tag data to obtain a third result set, as shown in Table 4 below:
Table 4: Third result set
user_id   Age  enterprise_id  1-smaller_mark
Zhangsan  18   E1             1
Lisi      19   Null           Null
Zhaoliu   20   Null           Null
The column 1-smaller_mark holds the plurality of second small data identifiers. The server extracts from the third result set the data whose second small data identifier is Null and re-assigns a non-Null second small data identifier to it; the resulting third non-skewed data set is shown in Table 5 below:
Table 5: Third non-skewed data set
user_id   Age  enterprise_id  1-smaller_mark
Lisi      19   Null           1
Zhaoliu   20   Null           1
The server extracts from the third result set the data whose second small data identifier is not Null to obtain the third skewed data set, as shown in Table 6 below:
Table 6: Third skewed data set
user_id   Age  enterprise_id  1-smaller_mark
Zhangsan  18   E1             1
The server adds a second big data identifier to the first table data to obtain second table identifier data, and left-associates the second table identifier data with the second tag data to obtain a fourth result set, as shown in Table 7 below:
Table 7: Fourth result set
user_id   enterprise_id  bigger_mark  2-smaller_mark
Zhangsan  E1             1            1
Lisi      E1             1            Null
Wangwu    E1             1            1
The column bigger_mark holds the plurality of second big data identifiers, and the column 2-smaller_mark holds the plurality of second small table column data. The server extracts from the fourth result set the data whose second small table column data is Null and deletes the corresponding second small table column data, to obtain the fourth non-skewed data set shown in Table 8 below:
Table 8: Fourth non-skewed data set
user_id   enterprise_id  bigger_mark
Lisi      E1             1
The server extracts from the fourth result set the data whose second small table column data is not Null and deletes the corresponding second small table column data, to obtain the fourth skewed data set shown in Table 9 below:
Table 9: Fourth skewed data set
user_id   enterprise_id  bigger_mark
Zhangsan  E1             1
Wangwu    E1             1
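The derivation of Tables 4 to 9 can be reproduced with a short Python sketch. It follows the tables as printed (Table 4 carries the Age column, so the rows of Table 2 are the left side of the first join). The helper names `left_join` and `drop`, the list-of-dict representation, and the column name `smaller_mark` (standing in for 1-smaller_mark and 2-smaller_mark) are assumptions for illustration; `None` stands for Null.

```python
def left_join(left, right, key, right_columns):
    """Left join on `key`, copying `right_columns` from the matching right row (None if absent)."""
    right_by_key = {row[key]: row for row in right}  # assumes unique keys, as in the tables
    out = []
    for lrow in left:
        rrow = right_by_key.get(lrow[key], {})
        out.append({**lrow, **{c: rrow.get(c) for c in right_columns}})
    return out


def drop(rows, column):
    """Remove one helper column from every row."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


table_a = [{"user_id": u, "enterprise_id": "E1"} for u in ("Zhangsan", "Lisi", "Wangwu")]
table_b = [{"user_id": "Zhangsan", "Age": 18}, {"user_id": "Lisi", "Age": 19},
           {"user_id": "Zhaoliu", "Age": 20}]
# Table 3: tag data carrying the small data identifier.
tag_data = [{"user_id": u, "enterprise_id": "E1", "smaller_mark": 1} for u in ("Zhangsan", "Wangwu")]

# Table 4: third result set.
third_result = left_join(table_b, tag_data, "user_id", ["enterprise_id", "smaller_mark"])
# Table 5: null identifier, then the identifier is set to a non-null value again.
third_non_skewed = [{**r, "smaller_mark": 1} for r in third_result if r["smaller_mark"] is None]
# Table 6: non-null identifier.
third_skewed = [r for r in third_result if r["smaller_mark"] is not None]

# Table 7: add the big data identifier, then left-join the tag data's identifier column.
marked_a = [{**r, "bigger_mark": 1} for r in table_a]
fourth_result = left_join(marked_a, tag_data, "user_id", ["smaller_mark"])
# Tables 8 and 9: split on the joined column, then delete that column.
fourth_non_skewed = drop([r for r in fourth_result if r["smaller_mark"] is None], "smaller_mark")
fourth_skewed = drop([r for r in fourth_result if r["smaller_mark"] is not None], "smaller_mark")

print([r["user_id"] for r in third_non_skewed], [r["user_id"] for r in third_skewed])
print([r["user_id"] for r in fourth_non_skewed], [r["user_id"] for r in fourth_skewed])
```

Each of the four resulting sets keeps one identifier column, which is what later allows the request type to be resolved by a simple null check.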
206. Determining a second target data set according to the third non-skewed data set, the third skewed data set, the fourth non-skewed data set, the fourth skewed data set and the association query request, and transmitting the second target data set to the target terminal, wherein the second target data set is a second equivalent connection data set, a second left association data set or a second full connection data set;
The server determines a second equivalent connection data set, a second left association data set or a second full connection data set according to the association query request, the third non-skewed data set, the third skewed data set, the fourth non-skewed data set and the fourth skewed data set, and transmits the resulting data set to the target terminal.
Specifically, the server fully connects the third non-skewed data set with the fourth non-skewed data set to obtain a third initial full data set, and fully connects the third skewed data set with the fourth skewed data set to obtain a fourth initial full data set; the server takes the union of the third initial full data set and the fourth initial full data set to obtain a second full data set. When the association query request is an equivalent connection query request, the server extracts from the second full data set the data whose second small data identifier and second big data identifier are both not null, and deletes the corresponding second small data identifiers and second big data identifiers to obtain a second equivalent connection data set; when the association query request is a left association query request, the server extracts from the second full data set the data whose second big data identifier is not null, and deletes the corresponding second small data identifiers and second big data identifiers to obtain a second left association data set; when the association query request is a full connection query request, the server takes the second full data set and deletes the second small data identifiers and the second big data identifiers in it to obtain a second full connection data set.
For ease of understanding, step 206 is described in detail below with reference to a practical application:
The third initial full data set obtained after the server fully connects the third non-skewed data set and the fourth non-skewed data set is shown in Table 10 below:
Table 10: Third initial full data set
user_id   enterprise_id  Age   bigger_mark  1-smaller_mark
Lisi      E1             19    1            1
Zhaoliu   Null           20    Null         1
The fourth initial full data set obtained after the server fully connects the third skewed data set and the fourth skewed data set is shown in Table 11 below:
Table 11: Fourth initial full data set
user_id   enterprise_id  Age   bigger_mark  1-smaller_mark
Zhangsan  E1             18    1            1
Wangwu    E1             Null  1            Null
The server merges the third initial full data set with the fourth initial full data set; the resulting second full data set is shown in Table 12 below:
Table 12: Second full data set
user_id   enterprise_id  Age   bigger_mark  1-smaller_mark
Zhangsan  E1             18    1            1
Lisi      E1             19    1            1
Wangwu    E1             Null  1            Null
Zhaoliu   Null           20    Null         1
When the association query request is an equivalent connection query request, the second equivalent connection data set extracted by the server is shown in Table 13 below:
Table 13: Second equivalent connection data set
user_id   enterprise_id  Age
Zhangsan  E1             18
Lisi      E1             19
When the association query request is a left association query request, the second left association data set extracted by the server is shown in Table 14 below:
Table 14: Second left association data set
user_id   enterprise_id  Age
Zhangsan  E1             18
Lisi      E1             19
Wangwu    E1             Null
When the association query request is a full connection query request, the second full connection data set extracted by the server is shown in Table 15 below:
Table 15: Second full connection data set
user_id   enterprise_id  Age
Zhangsan  E1             18
Lisi      E1             19
Wangwu    E1             Null
Zhaoliu   Null           20
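The assembly of the second full data set from the four sets of Tables 5, 6, 8 and 9, and the three result tables derived from it, can be sketched in Python. The helper `full_join` is hypothetical, and it assumes that a column present on both sides takes the first non-null value. The column name `smaller_mark` stands in for 1-smaller_mark, and `None` stands for Null.

```python
def full_join(left, right, key):
    """Full outer join on `key`; shared columns take the first non-null value (assumption)."""
    columns = []
    for row in left + right:
        for c in row:
            if c not in columns:
                columns.append(c)
    right_by_key = {row[key]: row for row in right}  # assumes unique keys, as in the tables
    left_keys, out = set(), []
    for lrow in left:
        left_keys.add(lrow[key])
        rrow = right_by_key.get(lrow[key], {})
        out.append({c: lrow.get(c) if lrow.get(c) is not None else rrow.get(c) for c in columns})
    for rrow in right:
        if rrow[key] not in left_keys:
            out.append({c: rrow.get(c) for c in columns})
    return out


third_non_skewed = [  # Table 5
    {"user_id": "Lisi", "Age": 19, "enterprise_id": None, "smaller_mark": 1},
    {"user_id": "Zhaoliu", "Age": 20, "enterprise_id": None, "smaller_mark": 1},
]
third_skewed = [{"user_id": "Zhangsan", "Age": 18, "enterprise_id": "E1", "smaller_mark": 1}]  # Table 6
fourth_non_skewed = [{"user_id": "Lisi", "enterprise_id": "E1", "bigger_mark": 1}]  # Table 8
fourth_skewed = [  # Table 9
    {"user_id": "Zhangsan", "enterprise_id": "E1", "bigger_mark": 1},
    {"user_id": "Wangwu", "enterprise_id": "E1", "bigger_mark": 1},
]

# Join non-skewed with non-skewed and skewed with skewed, then take the union.
second_full = (full_join(third_non_skewed, fourth_non_skewed, "user_id")
               + full_join(third_skewed, fourth_skewed, "user_id"))


def strip_marks(rows):
    """Drop both identifier columns from the result rows."""
    return [{k: v for k, v in r.items() if k not in ("smaller_mark", "bigger_mark")} for r in rows]


inner_result = strip_marks([r for r in second_full
                            if r["smaller_mark"] is not None and r["bigger_mark"] is not None])
left_result = strip_marks([r for r in second_full if r["bigger_mark"] is not None])
full_result = strip_marks(second_full)

for name, rows in (("inner", inner_result), ("left", left_result), ("full", full_result)):
    print(name, sorted(r["user_id"] for r in rows))
```

Row order in this sketch follows the order of the two partial joins rather than the order printed in the tables, so the comparison with Tables 13 to 15 is by content.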
207. And when the first data volume is less than or equal to the tilt threshold and the second data volume is less than or equal to the tilt threshold, performing corresponding connection on the first table data and the second table data according to the association query request to obtain a third target data set, wherein the third target data set is a third equal-value connection data set, a third left association data set or a third full connection data set.
When the first data volume is smaller than or equal to the inclination threshold and the second data volume is smaller than or equal to the inclination threshold, broadcasting the data with smaller data volume in the first table data and the second table data to each node of the associated thread, and then directly performing left association, full connection or equivalent connection on the first table data and the second table data by the server according to the associated query request.
For example, if the tilt threshold is 200M, the first data volume is 150M and the second data volume is 170M, neither data volume exceeds the tilt threshold. The server therefore first broadcasts the first table data, which has the smaller data volume, to each node of the association thread, and then performs the corresponding data association on the first table data and the second table data according to the association query request.
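The branching on the tilt threshold can be summarized in a minimal Python sketch. The function names, the returned labels and the use of plain numbers in units of M are assumptions made for illustration; the text does not say what happens when both data volumes are equal and above the threshold, so the sketch raises an error in that case.

```python
def choose_join_strategy(first_volume: float, second_volume: float,
                         tilt_threshold: float) -> str:
    """Return a hypothetical label for the branch the server would take."""
    if first_volume <= tilt_threshold and second_volume <= tilt_threshold:
        # Neither table exceeds the threshold: broadcast the smaller one and join directly.
        return "direct_join"
    if first_volume > second_volume:
        # First table is larger: first to fourth... tilted/non-tilted sets of the first branch.
        return "split_first_larger"
    if second_volume > first_volume:
        # Second table is larger: tilted/non-tilted sets of the second branch.
        return "split_second_larger"
    # Equal volumes above the threshold are not covered by the description.
    raise ValueError("equal data volumes above the tilt threshold are not specified")


def broadcast_side(first_volume: float, second_volume: float) -> str:
    """Pick the table with the smaller data volume for broadcasting."""
    return "first" if first_volume <= second_volume else "second"


# The example from the text: threshold 200M, first table 150M, second table 170M.
strategy = choose_join_strategy(150, 170, 200)
side = broadcast_side(150, 170)
```

For the values in the example, the sketch selects the direct join and broadcasts the first table data.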
In the embodiment of the invention, a plurality of tilted data sets, a plurality of non-tilted data sets and data identifiers are extracted from the first table data and the second table data according to the first data volume and the second data volume, and the target data set is obtained based on the tilted data sets, the non-tilted data sets and the data identifiers, which improves the efficiency of the association query and reduces the probability that the association query fails.
The association query method based on data tilt in the embodiment of the present invention is described above; the association query device based on data tilt in the embodiment of the present invention is described below. Referring to fig. 3, an embodiment of the association query device based on data tilt in the embodiment of the present invention includes:
the data acquisition module 301 is configured to acquire an association query request of a target terminal, read first table data and second table data based on the association query request, and count a data amount of the first table data and a data amount of the second table data to obtain a first data amount and a second data amount, where the association query request is an equivalent connection query request, a left association query request, or a full connection query request;
a determining module 302, configured to determine whether the first data amount is larger than the second data amount or the second data amount is larger than the first data amount when at least one of the first data amount and the second data amount is larger than the tilt threshold;
a first data set extraction module 303, configured to obtain a first non-tilted data set, a first tilted data set, a second non-tilted data set, and a second tilted data set according to the first table data and the second table data if the first data amount is greater than the second data amount;
a first association module 304, configured to determine a first target data set according to the first non-skewed data set, the first skewed data set, the second non-skewed data set, the second skewed data set, and the association query request, and transmit the first target data set to the target terminal, where the first target data set is a first equi-value connection data set, a first left association data set, or a first fully-connected data set;
a second data set extracting module 305, configured to obtain a third non-tilted data set, a third tilted data set, a fourth non-tilted data set, and a fourth tilted data set according to the first table data and the second table data if the second data amount is greater than the first data amount;
a second association module 306, configured to determine a second target data set according to the third non-tilt data set, the third tilt data set, the fourth non-tilt data set, the fourth tilt data set, and the association query request, and transmit the second target data set to the target terminal, where the second target data set is a second equal-valued join data set, a second left-associated data set, or a second fully-joined data set.
In the embodiment of the invention, a plurality of tilted data sets, a plurality of non-tilted data sets and data identifiers are extracted from the first table data and the second table data according to the first data volume and the second data volume, and the target data set is obtained based on the tilted data sets, the non-tilted data sets and the data identifiers, which improves the efficiency of the association query and reduces the probability that the association query fails.
Referring to fig. 4, another embodiment of the association query apparatus based on data skew according to the embodiment of the present invention includes:
the data acquisition module 301 is configured to acquire an association query request of a target terminal, read first table data and second table data based on the association query request, and count a data amount of the first table data and a data amount of the second table data to obtain a first data amount and a second data amount, where the association query request is an equivalent connection query request, a left association query request, or a full connection query request;
a determining module 302, configured to determine whether the first data amount is larger than the second data amount or the second data amount is larger than the first data amount when at least one of the first data amount and the second data amount is larger than the tilt threshold;
a first data set extraction module 303, configured to obtain a first non-tilted data set, a first tilted data set, a second non-tilted data set, and a second tilted data set according to the first table data and the second table data if the first data amount is greater than the second data amount;
a first association module 304, configured to determine a first target data set according to the first non-skewed data set, the first skewed data set, the second non-skewed data set, the second skewed data set, and the association query request, and transmit the first target data set to the target terminal, where the first target data set is a first equi-value connection data set, a first left association data set, or a first fully-connected data set;
a second data set extracting module 305, configured to obtain a third non-tilted data set, a third tilted data set, a fourth non-tilted data set, and a fourth tilted data set according to the first table data and the second table data if the second data amount is greater than the first data amount;
a second association module 306, configured to determine a second target data set according to the third non-tilt data set, the third tilt data set, the fourth non-tilt data set, the fourth tilt data set, and the association query request, and transmit the second target data set to the target terminal, where the second target data set is a second equal-valued join data set, a second left-associated data set, or a second fully-joined data set.
Optionally, the data obtaining module 301 may be further specifically configured to:
acquiring an association query request of a target terminal, reading first table data and second table data based on the association query request, dividing the first table data into a plurality of first-column data, and dividing the second table data into a plurality of second-column data;
performing data processing on the plurality of first-column data to obtain a plurality of first sub-table data, and counting the data volume of the plurality of first sub-table data to obtain a plurality of first sub-table data volumes;
performing data processing on the plurality of second-column data to obtain a plurality of second sub-table data, and counting the data volume of the plurality of second sub-table data to obtain a plurality of second sub-table data volumes;
adding up the plurality of first sub-table data volumes to obtain the first data volume;
and adding up the plurality of second sub-table data volumes to obtain the second data volume.
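The counting steps above can be sketched in Python as follows. This is a minimal illustration under assumptions that are not in the original text: each column is treated as one sub-table, the data volume of a value is taken to be the length of its UTF-8 string form in bytes, Null values (`None`) contribute nothing, and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[object]]


def split_into_columns(rows: Sequence[Row]) -> Dict[str, List[Optional[object]]]:
    """Divide table data into per-column data (one list of values per column)."""
    columns: Dict[str, List[Optional[object]]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sub_table_volume(values: Sequence[Optional[object]]) -> int:
    """Assumed size measure: UTF-8 byte length of each non-null value."""
    return sum(len(str(v).encode("utf-8")) for v in values if v is not None)


def total_data_volume(rows: Sequence[Row]) -> int:
    """Count each sub-table's data volume and add the results together."""
    columns = split_into_columns(rows)
    sub_volumes = [sub_table_volume(values) for values in columns.values()]
    return sum(sub_volumes)


# Two illustrative rows: "Lisi" and "Wangwu" take 4 + 6 bytes, "19" takes 2 bytes.
sample_rows: List[Row] = [
    {"user_id": "Lisi", "age": 19},
    {"user_id": "Wangwu", "age": None},
]
sample_volume = total_data_volume(sample_rows)
```

Summing per sub-table rather than over the whole table at once mirrors the description, in which each sub-table data volume is counted separately before the totals are compared with the tilt threshold.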
Optionally, the first data set extracting module 303 may be further specifically configured to:
if the first data volume is larger than the second data volume, processing the first table data into first mark data, and performing left association on the second table data and the first mark data to obtain a first result set, wherein the first result set comprises a plurality of first small data identifiers;
extracting, from the first result set, a data set in which the first small data identifier is a null value, and adding a non-null first small data identifier to it again to obtain a first non-tilted data set;
extracting, from the first result set, a data set in which the first small data identifier is not a null value to obtain a first tilted data set;
adding a first big data identifier to the second table data to obtain first table identifier data, and performing left association on the first table identifier data and the first mark data to obtain a second result set, wherein the second result set comprises a plurality of first small table column data;
extracting, from the second result set, a data set in which the first small table column data is a null value, and deleting the corresponding first small table column data to obtain a second non-tilted data set;
and extracting, from the second result set, a data set in which the first small table column data is not a null value, and deleting the corresponding first small table column data to obtain a second tilted data set.
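One possible reading of the first part of this list (building mark data, left association with the mark data, and splitting on the identifier) is sketched below in Python. It is a simplified illustration, not the patented implementation: the assumption that the mark data consists of the distinct join keys of the first table, the join key `user_id`, the identifier column name `smaller_mark`, the sample rows and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[object]]


def build_mark_data(rows: List[Row], key: str) -> Dict[object, int]:
    """Assumed form of the mark data: each distinct join key mapped to identifier 1."""
    return {row[key]: 1 for row in rows if row[key] is not None}


def split_by_mark(rows: List[Row], mark_data: Dict[object, int], key: str,
                  mark_column: str = "smaller_mark") -> Tuple[List[Row], List[Row]]:
    """Left-associate rows with the mark data and split on the resulting identifier."""
    non_tilted: List[Row] = []
    tilted: List[Row] = []
    for row in rows:
        mark = mark_data.get(row[key])  # None when the key has no match (left association)
        if mark is None:
            # Identifier is null: add a non-null identifier again, non-tilted data set.
            non_tilted.append({**row, mark_column: 1})
        else:
            # Identifier is not null: tilted data set.
            tilted.append({**row, mark_column: mark})
    return non_tilted, tilted


# Illustrative data only.
first_table: List[Row] = [{"user_id": "Zhangsan"}, {"user_id": "Lisi"}, {"user_id": "Wangwu"}]
second_table: List[Row] = [
    {"user_id": "Zhangsan", "age": 18},
    {"user_id": "Lisi", "age": 19},
    {"user_id": "Zhaoliu", "age": 20},
]

first_mark_data = build_mark_data(first_table, "user_id")
first_non_tilted, first_tilted = split_by_mark(second_table, first_mark_data, "user_id")
```

In this reading, rows of the second table whose key does not occur in the first table end up in the non-tilted data set, and rows with a matching key end up in the tilted data set; both carry a non-null identifier afterwards.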
Optionally, the first association module 304 may be further specifically configured to:
performing full connection on the first non-tilted data set and the second non-tilted data set to obtain a first initial full data set, and performing full connection on the first tilted data set and the second tilted data set to obtain a second initial full data set;
merging the first initial full data set and the second initial full data set to obtain a first full data set;
when the association query request is the equal-value connection query request, extracting, from the first full data set, the data in which neither the first small data identifier nor the first big data identifier is a null value, and deleting the corresponding plurality of first small data identifiers and the corresponding plurality of first big data identifiers to obtain a first equal-value connection data set;
when the association query request is the left association query request, extracting, from the first full data set, the data in which the first big data identifier is not a null value, and deleting the corresponding plurality of first small data identifiers and the corresponding plurality of first big data identifiers to obtain a first left association data set;
and when the association query request is the full connection query request, taking the first full data set, and deleting the plurality of first small data identifiers and the plurality of first big data identifiers to obtain a first full connection data set.
Optionally, the second data set extracting module 305 may further specifically be configured to:
if the second data size is larger than the first data size, processing the second table data into second mark data, and performing left association on the first table data and the second mark data to obtain a third result set, wherein the third result set comprises a plurality of second small data identifiers;
extracting, from the third result set, a data set in which the second small data identifier is a null value, and adding a non-null second small data identifier to it again to obtain a third non-tilted data set;
extracting, from the third result set, a data set in which the second small data identifier is not a null value to obtain a third tilted data set;
adding a second big data identifier to the first table data to obtain second table identifier data, and performing left association on the second table identifier data and the second mark data to obtain a fourth result set, wherein the fourth result set comprises a plurality of second small table column data;
extracting, from the fourth result set, a data set in which the second small table column data is a null value, and deleting the corresponding second small table column data to obtain a fourth non-tilted data set;
and extracting, from the fourth result set, a data set in which the second small table column data is not a null value, and deleting the corresponding second small table column data to obtain a fourth tilted data set.
Optionally, the second association module 306 may be further specifically configured to:
performing full connection on the third non-tilted data set and the fourth non-tilted data set to obtain a third initial full data set, and performing full connection on the third tilted data set and the fourth tilted data set to obtain a fourth initial full data set;
merging the third initial full data set and the fourth initial full data set to obtain a second full data set;
when the association query request is the equal-value connection query request, extracting, from the second full data set, the data in which neither the second small data identifier nor the second big data identifier is a null value, and deleting the corresponding second small data identifiers and the corresponding second big data identifiers to obtain a second equal-value connection data set;
when the association query request is the left association query request, extracting, from the second full data set, the data in which the second big data identifier is not a null value, and deleting the corresponding second small data identifiers and the corresponding second big data identifiers to obtain a second left association data set;
and when the association query request is the full connection query request, taking the second full data set, and deleting the plurality of second small data identifiers and the plurality of second big data identifiers to obtain a second full connection data set.
Optionally, the data tilt-based association query apparatus further includes:
and a third association module 307, configured to, when the first data amount is less than or equal to the tilt threshold and the second data amount is less than or equal to the tilt threshold, perform corresponding connection on the first table data and the second table data according to the association query request to obtain a third target data set, where the third target data set is a third equal-value connection data set, a third left-association data set, or a third full-connection data set.
In the embodiment of the invention, a plurality of tilted data sets, a plurality of non-tilted data sets and data identifiers are extracted from the first table data and the second table data according to the first data volume and the second data volume, and the target data set is obtained based on the tilted data sets, the non-tilted data sets and the data identifiers, which improves the efficiency of the association query and reduces the probability that the association query fails.
Fig. 3 and fig. 4 describe the association query apparatus based on data tilt in the embodiment of the present invention in detail from the perspective of modular functional entities; the association query device based on data tilt in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of a data tilt-based association query device 500 according to an embodiment of the present invention. The data tilt-based association query device 500 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 510, a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 533 or data 532. The memory 520 and the storage media 530 may be transient or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations for the data tilt-based association query device 500. Further, the processor 510 may be configured to communicate with the storage medium 530 and to execute, on the data tilt-based association query device 500, the series of instruction operations in the storage medium 530.
The data tilt-based association query device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input-output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the data tilt-based association query device architecture illustrated in FIG. 5 does not constitute a limitation of data tilt-based association query devices, and may include more or fewer components than those illustrated, or some components in combination, or a different arrangement of components.
Further, the computer usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, which may also be a volatile computer-readable storage medium, having stored therein instructions, which, when executed on a computer, cause the computer to perform the steps of the data tilt-based association query method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An association query method based on data tilt, characterized in that the association query method based on data tilt comprises:
acquiring an association query request of a target terminal, reading first table data and second table data based on the association query request, and counting the data volume of the first table data and the data volume of the second table data to obtain a first data volume and a second data volume, wherein the association query request is an equivalent connection query request, a left association query request or a full connection query request;
when at least one of the first data volume and the second data volume is larger than a tilt threshold, judging whether the first data volume is larger than the second data volume or the second data volume is larger than the first data volume;
if the first data volume is larger than the second data volume, obtaining a first non-inclined data set, a first inclined data set, a second non-inclined data set and a second inclined data set according to the first table data and the second table data;
determining a first target data set according to the first non-tilt data set, the first tilt data set, the second non-tilt data set, the second tilt data set and the association query request, and transmitting the first target data set to the target terminal, wherein the first target data set is a first equivalence connection data set, a first left association data set or a first full connection data set;
if the second data volume is larger than the first data volume, obtaining a third non-inclined data set, a third inclined data set, a fourth non-inclined data set and a fourth inclined data set according to the first table data and the second table data;
determining a second target data set according to the third non-tilt data set, the third tilt data set, the fourth non-tilt data set, the fourth tilt data set, and the association query request, and transmitting the second target data set to the target terminal, where the second target data set is a second equal-value join data set, a second left-associated data set, or a second fully-joined data set.
2. The association query method based on data skew as claimed in claim 1, wherein the obtaining of the association query request of the target terminal, reading first table data and second table data based on the association query request, and performing statistics on the data size of the first table data and the data size of the second table data to obtain the first data size and the second data size, wherein the association query request is an equal-value connection query request, a left association query request, or a full-connection query request includes:
acquiring an association query request of a target terminal, reading first table data and second table data based on the association query request, dividing the first table data into a plurality of first column data, and dividing the second table data into a plurality of second column data;
performing data processing on the plurality of first column data to obtain a plurality of first sub-table data, and counting the data volume of the plurality of first sub-table data to obtain a plurality of first sub-table data volumes;
performing data processing on the plurality of second column data to obtain a plurality of second sub-table data, and counting the data volume of the plurality of second sub-table data to obtain a plurality of second sub-table data volumes;
adding up the plurality of first sub-table data volumes to obtain the first data volume;
and adding up the plurality of second sub-table data volumes to obtain the second data volume.
3. The method according to claim 1, wherein if the first data amount is larger than the second data amount, obtaining a first non-skewed data set, a first skewed data set, a second non-skewed data set, and a second skewed data set according to the first table data and the second table data comprises:
if the first data volume is larger than the second data volume, processing the first table data into first mark data, and performing left association on the second table data and the first mark data to obtain a first result set, wherein the first result set comprises a plurality of first small data identifiers;
extracting a data set with a first small data identifier as a null value from the first result set, and adding a first small data identifier which is not a null value again to obtain a first non-inclined data set;
extracting a data set with a first small data identification not being a null value in the first result set to obtain a first inclined data set;
adding a first big data identifier to the second table data to obtain first table identifier data, and performing left association on the first table identifier data and the first mark data to obtain a second result set, wherein the second result set comprises a plurality of first small table column data;
extracting, from the second result set, a data set in which the first small table column data is a null value, and deleting the corresponding first small table column data to obtain a second non-tilted data set;
and extracting, from the second result set, a data set in which the first small table column data is not a null value, and deleting the corresponding first small table column data to obtain a second tilted data set.
4. The method of claim 1, wherein the determining a first target dataset according to the first non-skewed dataset, the first skewed dataset, the second non-skewed dataset, the second skewed dataset, and the association query request, and transmitting the first target dataset to the target terminal, the first target dataset being a first equi-valued connected dataset, a first left associated dataset, or a first fully connected dataset comprises:
performing full connection on the first non-tilted data set and the second non-tilted data set to obtain a first initial full data set, and performing full connection on the first tilted data set and the second tilted data set to obtain a second initial full data set;
merging the first initial full data set and the second initial full data set to obtain a first full data set;
when the association query request is the equivalence connection query request, extracting a data set with a first small data identifier not being a null value and a data set with a first big data identifier not being a null value from the first full data set, and deleting the corresponding plurality of first small data identifiers and the corresponding plurality of first big data identifiers to obtain a first equivalence connection data set;
when the association query request is the left association query request, extracting a data set of which the first big data identifier is not a null value from the first full data set, and deleting the corresponding plurality of first small data identifiers and the corresponding plurality of first big data identifiers to obtain a first left association data set;
and when the associated query request is the full-connection query request, extracting the first full data set, and deleting the plurality of first small data identifications and the plurality of first large data identifications to obtain a first full-connection data set.
5. The method according to claim 1, wherein if the second amount of data is greater than the first amount of data, obtaining a third non-skewed data set, a third skewed data set, a fourth non-skewed data set, and a fourth skewed data set according to the first table data and the second table data comprises:
if the second data volume is larger than the first data volume, processing the second table data into second mark data, and performing left association on the first table data and the second mark data to obtain a third result set, wherein the third result set comprises a plurality of second small data identifiers;
extracting a data set with a second small data identifier as a null value from the third result set, and adding a second small data identifier which is not a null value again to obtain a third non-inclined data set;
extracting a data set with a second small data identifier not being a null value in the third result set to obtain a third inclined data set;
adding a second big data identifier to the first table data to obtain second table identifier data, and performing left association on the second table identifier data and the second mark data to obtain a fourth result set, wherein the fourth result set comprises a plurality of second small table column data;
extracting, from the fourth result set, a data set in which the second small table column data is a null value, and deleting the corresponding second small table column data to obtain a fourth non-tilted data set;
and extracting, from the fourth result set, a data set in which the second small table column data is not a null value, and deleting the corresponding second small table column data to obtain a fourth tilted data set.
6. The data tilt-based association query method of claim 1, wherein the determining a second target data set according to the third non-tilt data set, the third tilt data set, the fourth non-tilt data set, the fourth tilt data set, and the association query request, and transmitting the second target data set to the target terminal, wherein the second target data set is a second equal-valued connected data set, a second left-associated data set, or a second fully-connected data set, comprises:
performing full connection on the third non-tilted data set and the fourth non-tilted data set to obtain a third initial full data set, and performing full connection on the third tilted data set and the fourth tilted data set to obtain a fourth initial full data set;
merging the third initial full data set and the fourth initial full data set to obtain a second full data set;
when the association query request is the equal-value connection query request, extracting a data set with a second small data identifier not being a null value and a data set with a second big data identifier not being a null value from the second full data set, and deleting the corresponding second small data identifiers and the corresponding second big data identifiers to obtain a second equal-value connection data set;
when the association query request is the left association query request, extracting a data set of which a second big data identifier is not a null value from the second full data set, and deleting the corresponding second small data identifiers and the corresponding second big data identifiers to obtain a second left association data set;
and when the associated query request is the full-connection query request, extracting the second full data set, and deleting the plurality of second small data identifications and the plurality of second large data identifications to obtain a second full-connection data set.
7. The data tilt-based association query method according to any one of claims 1-6, wherein after determining a second target data set according to the third non-tilt data set, the third tilt data set, the fourth non-tilt data set, the fourth tilt data set, and the association query request, and transmitting the second target data set to the target terminal, the second target data set being a second equal-valued join data set, a second left-associated data set, or a second fully-joined data set, the data tilt-based association query method further comprises:
and when the first data volume is smaller than or equal to the tilt threshold and the second data volume is smaller than or equal to the tilt threshold, performing corresponding connection on the first table data and the second table data according to the association query request to obtain a third target data set, wherein the third target data set is a third equal-value connection data set, a third left association data set or a third full connection data set.
8. An association query device based on data tilt, characterized in that the association query device based on data tilt comprises:
a data acquisition module, configured to acquire an association query request of a target terminal, read first table data and second table data based on the association query request, and count the data volume of the first table data and the data volume of the second table data to obtain a first data volume and a second data volume, wherein the association query request is an equivalent connection query request, a left association query request or a full connection query request;
a judging module, configured to judge, when at least one of the first data volume and the second data volume is larger than a tilt threshold, whether the first data volume is larger than the second data volume or the second data volume is larger than the first data volume;
a first data set extraction module, configured to obtain a first non-tilt data set, a first tilt data set, a second non-tilt data set and a second tilt data set according to the first table data and the second table data if the first data volume is larger than the second data volume;
a first association module, configured to determine a first target data set according to the first non-tilt data set, the first tilt data set, the second non-tilt data set, the second tilt data set and the association query request, and transmit the first target data set to the target terminal, wherein the first target data set is a first equal-value connection data set, a first left association data set or a first full connection data set;
a second data set extraction module, configured to obtain a third non-tilt data set, a third tilt data set, a fourth non-tilt data set and a fourth tilt data set according to the first table data and the second table data if the second data volume is larger than the first data volume;
a second association module, configured to determine a second target data set according to the third non-tilt data set, the third tilt data set, the fourth non-tilt data set, the fourth tilt data set and the association query request, and transmit the second target data set to the target terminal, wherein the second target data set is a second equal-value connection data set, a second left association data set or a second full connection data set.
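The decision made by the judging module can be sketched as a small Python function. This is an assumed illustration rather than the claimed device: the `Branch` enum and `choose_branch` are hypothetical names, and the direct-join branch reflects the condition stated in claim 7. The claims do not say what happens when both volumes exceed the threshold and are equal, so the sketch returns `None` in that case.

```python
from enum import Enum
from typing import Optional


class Branch(Enum):
    DIRECT = "direct"                # neither table exceeds the tilt threshold
    FIRST_LARGER = "first_larger"    # handled by the first extraction/association modules
    SECOND_LARGER = "second_larger"  # handled by the second extraction/association modules


def choose_branch(first_volume: int, second_volume: int,
                  tilt_threshold: int) -> Optional[Branch]:
    """Pick the processing branch from the two data volumes."""
    if first_volume <= tilt_threshold and second_volume <= tilt_threshold:
        return Branch.DIRECT
    # At least one volume exceeds the threshold: compare the two volumes.
    if first_volume > second_volume:
        return Branch.FIRST_LARGER
    if second_volume > first_volume:
        return Branch.SECOND_LARGER
    return None  # equal volumes above the threshold: not specified in the claims
```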
9. Association query equipment based on data tilt, characterized in that the association query equipment based on data tilt comprises: a memory having instructions stored therein and at least one processor, the memory and the at least one processor being interconnected by a line;
the at least one processor invokes the instructions in the memory to cause the association query equipment based on data tilt to perform the data tilt-based association query method of any one of claims 1-7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the data tilt-based association query method according to any one of claims 1 to 7.
CN202010581205.9A 2020-06-23 2020-06-23 Associated query method, device, equipment and storage medium based on data inclination Active CN111708809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010581205.9A CN111708809B (en) 2020-06-23 2020-06-23 Associated query method, device, equipment and storage medium based on data inclination

Publications (2)

Publication Number Publication Date
CN111708809A (en) 2020-09-25
CN111708809B (en) 2024-05-03

Family

ID=72542378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010581205.9A Active CN111708809B (en) 2020-06-23 2020-06-23 Associated query method, device, equipment and storage medium based on data inclination

Country Status (1)

Country Link
CN (1) CN111708809B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095413A (en) * 2015-07-09 2015-11-25 北京京东尚科信息技术有限公司 Method and apparatus for solving data skew
CN108268586A (en) * 2017-09-22 2018-07-10 广东神马搜索科技有限公司 Across the data processing method of more tables of data, device, medium and computing device
CN111241111A (en) * 2020-02-12 2020-06-05 网易(杭州)网络有限公司 Data query method and device, data comparison method and device, medium and equipment

Also Published As

Publication number Publication date
CN111708809B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
CN108268586B (en) Data processing method, device, medium and computing equipment across multiple data tables
CN111177302B (en) Service bill processing method, device, computer equipment and storage medium
EP3767483A1 (en) Method, device, system, and server for image retrieval, and storage medium
CN104794123A (en) Method and device for establishing NoSQL database index for semi-structured data
CN106611064B (en) Data processing method and device for distributed relational database
CN110597852A (en) Data processing method, device, terminal and storage medium
CN111580965A (en) Data request processing method and system
US20140280929A1 (en) Multi-tier message correlation
CN111858659A (en) Data query method, device and equipment based on row key salt value and storage medium
CN112069048A (en) Log processing method, device and storage medium
CN114741368A (en) Log data statistical method based on artificial intelligence and related equipment
CN104881475A (en) Method and system for randomly sampling big data
Thachuk Indexing hypertext
CN116719822A (en) Method and system for storing massive structured data
CN110874365B (en) Information query method and related equipment thereof
CN111708809A (en) Associated query method, device and equipment based on data tilt and storage medium
CN117093556A (en) Log classification method, device, computer equipment and computer readable storage medium
CN117171161A (en) Data query method and device
CA2418093A1 (en) Data compiling method
CN116126864A (en) Index construction method, data query method and related equipment
CN110851437A (en) Storage method, device and equipment
CN114398373A (en) File data storage and reading method and device applied to database storage
US11501020B2 (en) Method for anonymizing personal information in big data and combining anonymized data
CN114297236A (en) Data blood relationship analysis method, terminal equipment and storage medium
CN112527776A (en) Data auditing method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant