CN112100199A - Analysis method, device, equipment and medium based on data set grouping - Google Patents


Info

Publication number
CN112100199A
Authority
CN
China
Prior art keywords
result set
determining
intermediate result
data set
input data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010995383.6A
Other languages
Chinese (zh)
Other versions
CN112100199B (en)
Inventor
张钦
朱仲颖
万伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Dameng Database Co Ltd
Original Assignee
Shanghai Dameng Database Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Dameng Database Co Ltd
Priority to CN202010995383.6A
Publication of CN112100199A
Application granted
Publication of CN112100199B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases

Abstract

The invention discloses an analysis method, apparatus, device, and medium based on data set grouping. The method comprises: after receiving a Structured Query Language (SQL) statement, if the input data set meets a preset condition, determining that the execution plan includes a join operator; removing data rows with duplicate key values from the input data set to obtain the left node of the join operator; under a matching condition, probing the right node of the join operator with the left node and determining an intermediate result set; and determining and outputting, according to the intermediate result set, the result set obtained by analyzing the input data set. With this scheme, the input data set can be grouped and then analyzed group by group, which addresses the low efficiency of existing grouping analysis methods when the result set needs only some of the data groups and enables efficient grouping analysis of the data set.

Description

Analysis method, device, equipment and medium based on data set grouping
Technical Field
The present invention relates to data processing technologies, and in particular, to an analysis method, an analysis device, an analysis apparatus, and a medium based on data set grouping.
Background
A database analysis function computes values based on groups of data rows. The analysis function first groups the data rows, then performs further operations on each group, and finally outputs the data set group by group.
In the prior art, a database management system uses a single operator to compute an analysis function; that one operator performs both the grouping and the other operations. All data is input to the operator, and the operator outputs the analysis function results for all of it.
When the result set needs only some of the data groups, the analysis function does not need to run to completion. The existing grouping analysis method is inefficient in this case and cannot perform partial grouping analysis of a data set efficiently.
Disclosure of Invention
The invention provides an analysis method, apparatus, device, and medium based on data set grouping, which address the low efficiency of existing analysis methods when the result set needs only some of the data groups and enable efficient partial grouping analysis of the data set.
In a first aspect, an embodiment of the present invention provides an analysis method based on data set grouping, where the method includes:
after receiving a Structured Query Language (SQL) statement, if the input data set meets a preset condition, determining that the execution plan includes a join operator;
removing data rows with duplicate key values from the input data set to obtain the left node of the join operator;
under a matching condition, probing the right node of the join operator with the left node, and determining an intermediate result set;
and determining and outputting, according to the intermediate result set, the result set obtained by analyzing the input data set.
In a second aspect, an embodiment of the present invention further provides an analysis apparatus based on data set grouping, where the apparatus includes a determining module, a first execution module, a second execution module, and an output module, wherein
the determining module is configured to determine, after an SQL statement is received, that the execution plan includes a join operator if the input data set meets a preset condition;
the first execution module is configured to remove data rows with duplicate key values from the input data set to obtain the left node of the join operator;
the second execution module is configured to probe the right node of the join operator with the left node under a matching condition and to determine an intermediate result set;
and the output module is configured to determine and output, according to the intermediate result set, the result set obtained by analyzing the input data set.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the analysis method based on the data set grouping according to the first aspect when executing the program.
In a fourth aspect, embodiments of the present invention also provide a storage medium containing computer-executable instructions for performing the method for data set grouping based analysis according to the first aspect when executed by a computer processor.
In the embodiments of the present invention, after an SQL statement is received, if the input data set meets a preset condition, the execution plan is determined to include a join operator; data rows with duplicate key values are removed from the input data set to obtain the left node of the join operator; under a matching condition, the right node of the join operator is probed with the left node and an intermediate result set is determined; and the result set obtained by analyzing the input data set is determined and output according to the intermediate result set. This addresses the low efficiency of existing analysis methods when the result set needs only some of the data groups and enables partial analysis of the data set.
Drawings
Fig. 1 is a flowchart of an analysis method based on data set grouping according to an embodiment of the present invention;
Fig. 2 is a flowchart of an analysis method based on data set grouping according to a second embodiment of the present invention;
Fig. 3 is a diagram of an implementation of an analysis method based on data set grouping according to a second embodiment of the present invention;
Fig. 4 is a structural diagram of an analysis apparatus based on data set grouping according to a third embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like. In addition, the embodiments and features of the embodiments in the present invention may be combined with each other without conflict.
Example one
Fig. 1 is a flowchart of an analysis method based on data set grouping according to an embodiment of the present invention, which is applicable to a situation where an existing analysis method is inefficient when a result set only needs a partial data set, and the method may be executed by a computer, and specifically includes the following steps:
Step 110, after receiving the SQL statement, if the input data set meets a preset condition, determining that the execution plan includes a join operator.
A user entering SQL statements into a database management system may trigger a group analysis of a data set. When the input SQL statement includes a TOP clause or the input data set satisfies a preset condition, the data set may be processed using a grouped analysis method.
The preset condition may include that the data set contains data groups, each composed of multiple data rows with the same key value. When the data set has many rows with duplicate key values and few data groups, it can be processed using the grouped analysis method. If the data set has many data groups, the grouped analysis method may not yield better performance, and the system's query optimizer can analyze the data set and select a plan.
SQL is a special-purpose language for database querying and programming; it can be used to access data and to query, update, and manage a relational database system.
The TOP clause in an SQL statement may be used to return only the first records of a data set.
The operators may include an AFUN operator and a nested inner join operator NLI. In this embodiment, the join operator may include the nested inner join operator NLI, which may be used to join data tables.
Step 120, removing data rows with duplicate key values from the input data set to obtain the left node of the join operator.
A data set may include a plurality of key-value-duplicated data rows, the key-value-duplicated data rows may constitute one data group, and the data set may include a plurality of data groups.
The left node may be the left child of the join operator. Specifically, the left node may be the data set obtained by removing duplicate key values from the PARTITION BY item data set.
PARTITION BY is part of an analysis function in a database. It is used to group a data set, and multiple records may be returned for one group. If PARTITION BY is not specified, the entire result set is treated as one group.
Each item of the PARTITION BY clause may come from a specific table or view. The corresponding table or view is a node in the plan tree, and a sub-plan tree rooted at a node above these nodes in the plan tree may serve as the PARTITION BY item data set.
When the PARTITION BY item data set includes multiple columns of data, the PARTITION BY item data set, from which duplicate key values are removed, may include multiple columns of data rows.
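As a rough illustration of how the left node could be formed, the following Python sketch removes duplicate key values over one or more PARTITION BY columns. The function name `distinct_partition_keys`, the dictionary row layout, and the sample rows are assumptions made for this example and do not come from the patent.

```python
from typing import Dict, Iterator, Sequence, Tuple

Row = Dict[str, object]  # assumed row representation: column name -> value


def distinct_partition_keys(
    rows: Sequence[Row], key_columns: Sequence[str]
) -> Iterator[Tuple[object, ...]]:
    """Yield each distinct PARTITION BY key once, in first-seen order."""
    seen = set()
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            yield key


# Hypothetical sample rows, used only to exercise the function.
rows = [
    {"C1": "A", "C2": 1},
    {"C1": "A", "C2": 2},
    {"C1": "B", "C2": 1},
    {"C1": "A", "C2": 1},
]
single = list(distinct_partition_keys(rows, ["C1"]))
multi = list(distinct_partition_keys(rows, ["C1", "C2"]))
```

Because the function is a generator, a consumer can start working on the first key before the remaining keys have been computed.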
Step 130, under the matching condition, probing the right node of the join operator with the left node, and determining an intermediate result set.
With the matching condition as the constraint, each data row of the left node is used to probe the right node of the join operator. If target data rows satisfying the matching condition are found in the right node, the intermediate result set is determined from the data row of the left node and the target data rows of the right node.
The intermediate result set may include all data columns of the left node and the right node, or may include some data columns, and the selection of the data columns of the intermediate result set may be specifically determined according to actual requirements and analysis functions.
The right node of the join operator may be its right child. The terms left node, right node, left child, and right child are used to distinguish the two child nodes of the join operator.
In addition, the matching condition may be determined according to an analysis function.
For example, when the statement uses the analysis function SUM(C3) OVER (PARTITION BY C1) in the following query:
SELECT C1, C3, SUM(C3) OVER (PARTITION BY C1) RK
FROM T1, T2
WHERE C1 = D1 AND C2 = D2
and a TOP clause is included in the statement, the matching condition may include: look up the records in tables T1 and T2 that satisfy C1 = D1 AND C2 = D2, and return C1, C3, and the sum of the C3 column values over records with equal C1 values.
The intermediate result set may include, for the data rows that satisfy the matching condition, C1, C3, and the sum of the C3 column values over records with equal C1 values.
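The probe for a single group can be sketched in Python as follows. This is a minimal illustration under assumptions: the function name `probe_group`, the dictionary row layout, and the sample rows are hypothetical, and the nested loops stand in for whatever join strategy the right node actually uses.

```python
def probe_group(key, t1, t2):
    """Probe the right node for one C1 value and return (C1, C3, RK) rows."""
    # Filter T1 by the key taken from the left node, then join with T2
    # on C1 = D1 AND C2 = D2.
    joined = [
        r1
        for r1 in t1
        if r1["C1"] == key
        for r2 in t2
        if r1["C1"] == r2["D1"] and r1["C2"] == r2["D2"]
    ]
    # SUM(C3) over the single group; no further partitioning is needed
    # because every joined row already shares the same C1 value.
    rk = sum(r["C3"] for r in joined)
    return [(r["C1"], r["C3"], rk) for r in joined]


# Hypothetical sample data, not taken from the patent's tables.
t1 = [
    {"C1": "A", "C2": 33, "C3": 3},
    {"C1": "A", "C2": 33, "C3": 4},
    {"C1": "A", "C2": 10, "C3": 9},
    {"C1": "B", "C2": 5, "C3": 7},
]
t2 = [{"D1": "A", "D2": 33}]

group_a = probe_group("A", t1, t2)
group_b = probe_group("B", t1, t2)
```

Here `group_a` holds the two matching rows with their shared sum, and `group_b` is empty because no T2 row matches the B group.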
Step 140, determining and outputting, according to the intermediate result set, the result set obtained by analyzing the input data set.
The number of rows in the result set can be determined according to the conditions contained in the TOP clause.
For example, when the statement begins with SELECT TOP 4 * FROM, the result set may include the first 4 rows of the intermediate result sets.
Therefore, each time a current intermediate result set is obtained, its number of rows can be determined. If the sum of the number of rows of the current intermediate result set and the number of rows of the previous intermediate result sets is greater than or equal to the number of rows required by the TOP clause, the result set is determined and output; the number of rows of the result set may be that sum.
The first embodiment of the invention provides an analysis method based on data set grouping: after an SQL statement is received, if the input data set meets a preset condition, the execution plan is determined to include a join operator; data rows with duplicate key values are removed from the input data set to obtain the left node of the join operator; under the matching condition, the right node of the join operator is probed with the left node and an intermediate result set is determined; and the result set obtained by analyzing the input data set is determined and output according to the intermediate result set. This addresses the low efficiency of existing analysis methods when the result set needs only some of the data groups and enables partial analysis of the data set.
Example two
Fig. 2 is a flowchart of an analysis method based on data set grouping according to a second embodiment of the present invention, which is embodied on the basis of the above embodiments. In this embodiment, the method may further include:
Step 210, after receiving the SQL statement, performing syntax recognition on the SQL statement to obtain the analysis function it contains.
Specifically, the analysis function in the input SQL statement may determine how the data set is processed. The processing may include probing the right node with the left node to find matches, and extracting the left node data and right node data.
The extracted data rows of the left node and the right node may include all the left node data rows and the right node data rows which satisfy the condition, or may include part of the left node data rows and the right node data rows which satisfy the condition.
The processing of the SQL statement may include parsing, semantic analysis, query optimization, generating an execution plan, and executing a plan. The analysis method based on data set grouping proposed by the present embodiment may act on the phase of generating an execution plan.
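The recognition step can be pictured with a toy Python sketch that pulls the analysis function, its PARTITION BY items, and the TOP row count out of a statement using regular expressions. This is only an assumption-laden illustration: a real database parser builds a syntax tree rather than pattern matching, and the names `ANALYTIC_RE`, `TOP_RE`, and `recognize` are invented for this example.

```python
import re

# Matches e.g. "SUM(C3) OVER (PARTITION BY C1)"; the PARTITION BY part is optional.
ANALYTIC_RE = re.compile(
    r"(?P<func>[A-Z_]+)\s*\(\s*(?P<arg>[^()]*?)\s*\)\s*OVER\s*\(\s*"
    r"(?:PARTITION\s+BY\s+(?P<keys>[^()]*?))?\s*\)",
    re.IGNORECASE,
)
# Matches e.g. "SELECT TOP 4".
TOP_RE = re.compile(r"\bSELECT\s+TOP\s+(?P<n>\d+)", re.IGNORECASE)


def recognize(sql):
    """Return a small description of the first analysis function, or None."""
    match = ANALYTIC_RE.search(sql)
    if match is None:
        return None
    top = TOP_RE.search(sql)
    keys = [k.strip() for k in (match.group("keys") or "").split(",") if k.strip()]
    return {
        "function": match.group("func").upper(),
        "argument": match.group("arg"),
        "partition_by": keys,
        "top": int(top.group("n")) if top else None,
    }


info = recognize(
    "SELECT TOP 4 * FROM (SELECT C1, C3, SUM(C3) OVER (PARTITION BY C1) RK "
    "FROM T1, T2 WHERE C1 = D1 AND C2 = D2);"
)
```

The extracted PARTITION BY items and TOP count are the two pieces of information that the later plan-generation steps rely on.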
Step 220, determining a matching condition according to the analysis function.
In particular, the analysis function may determine the matching condition. Different analysis functions may correspond to different matching conditions, and the same analysis function may also include different matching conditions. The specific matching conditions may be set according to actual requirements, and are not specifically limited herein.
Step 230, if the number of groups of the input data set is less than the preset number of groups, determining that the execution plan includes a join operator.
When the input data set has many rows with duplicate key values and few groups, the NLI operator loops only a few times, and the method of this embodiment can improve the efficiency of grouping analysis. When the input data set has few rows with duplicate key values or many groups, the method yields little efficiency gain.
Specifically, when the number of groups of the input data set is smaller than the preset number of groups, the input data set may be analyzed by the method described in this example. The preset number of groups may include four groups or six groups, etc., and is not specifically limited herein, and may be limited according to a specific analysis function.
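The group-count check can be sketched as below. It is a simplified illustration: the function name `use_grouped_plan` and the sample rows are assumptions, and the sketch counts groups by scanning the rows, whereas an optimizer would more plausibly rely on statistics it already maintains.

```python
def use_grouped_plan(rows, key_columns, preset_group_count):
    """Return True when the number of distinct key values is below the preset."""
    groups = {tuple(row[col] for col in key_columns) for row in rows}
    return len(groups) < preset_group_count


# Hypothetical rows with three groups (A, B, C) and many duplicate keys.
rows = [{"C1": c} for c in "AAAAABBBCCCCC"]
few_groups = use_grouped_plan(rows, ["C1"], preset_group_count=4)
too_many = use_grouped_plan(rows, ["C1"], preset_group_count=3)
```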
Step 240, removing data rows with duplicate key values from the input data set to obtain the left node of the join operator.
In one embodiment, step 240 specifically includes:
Obtaining a sub-execution plan containing the grouping items.
Removing the data rows with duplicate key values from the sub-execution plan to obtain the left node of the join operator.
Specifically, data rows with duplicate key values may form data groups, which implements the grouping of the input data set. Multiple data groups may constitute a data set.
According to the grouping result, removing the data rows with duplicate key values from the target data group to obtain the left node of the join operator.
Specifically, a DISTINCT operator may be used to remove the data rows with duplicate key values from the target data group.
The execution plan that uses the DISTINCT operator to remove the data rows with duplicate key values from the target data group may include:
NLI
---DISTINCT P /* removing duplicate data based on the PARTITION BY items */
------REP_L
---AFUN*(NO PARTITION BY)
------L_CHILD*
After the data rows with duplicate key values are removed, the remaining data rows form the left node.
For example, if the PARTITION BY clause is PARTITION BY A.C1, then REP_L is A; if the PARTITION BY clause is PARTITION BY B.C2, C.C2, then table B and table C need to be joined, and REP_L may be as follows.
CROSS
---B
---C
If the PARTITION BY clause is PARTITION BY A.C1, B.C2, C.C2, REP_L may be as follows.
CROSS1
---A
---CROSS2
-----B
-----C
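The way REP_L follows from the PARTITION BY items can be sketched as a small Python function that collects the tables the items refer to and nests them under CROSS nodes. The function name `build_rep_l` and the tuple encoding of plan nodes are assumptions for this example, and the sketch assumes every item is written as TABLE.COLUMN.

```python
def build_rep_l(partition_items):
    """Build a nested CROSS plan over the tables named in the PARTITION BY items."""
    tables = []
    for item in partition_items:
        table = item.split(".", 1)[0].strip().upper()
        if table not in tables:  # each table appears once, in first-seen order
            tables.append(table)
    if not tables:
        raise ValueError("at least one PARTITION BY item is required")
    # Fold from the right so the last two tables form the innermost CROSS node.
    plan = tables[-1]
    for table in reversed(tables[:-1]):
        plan = ("CROSS", table, plan)
    return plan


one_table = build_rep_l(["A.C1"])
two_tables = build_rep_l(["B.C2", "C.C2"])
three_tables = build_rep_l(["A.C1", "B.C2", "C.C2"])
```

The three results mirror the three REP_L shapes given in the examples: a single table, one CROSS node, and a CROSS node whose second child is another CROSS node.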
Step 250, under the preset matching condition, probing the right node of the join operator with the left node, and determining an intermediate result set.
In one embodiment, step 250 specifically includes:
Under the preset matching condition, searching the right node for target data rows that match the left node.
Specifically, the number of data rows that match each other in the left node and the right node is not limited. For example, one data row in the left node may match multiple data rows in the right node, and multiple data rows in the left node may match one data row in the right node. The specific matching data rows are determined by the actual data set.
Determining an intermediate result set according to the left node and the target data rows.
In particular, if the data set includes multiple columns of data, the columns of data included in the intermediate result set may be determined according to an analysis function.
When the analysis function appears as:
SELECT C1, C3, SUM(C3) OVER (PARTITION BY C1) RK
the intermediate result set may include C1, C3, and the sum of the C3 column values over records with equal C1 values; the column holding that sum may be named RK.
Of course, in practical applications, the intermediate result set may also include data columns of the right node, and the specific data columns of the intermediate result set may be determined according to the specific analysis function and requirements.
Step 260, if the left node has no corresponding intermediate result set, determining that the intermediate result set is empty.
Specifically, if no data row matching the left node can be found in the right node under the matching condition, the result set is determined to be an empty set and nothing is output.
In addition, if no matching data row can be found in the right node for one or more data rows of the left node, the intermediate result sets of those data rows are determined to be empty. The search then continues for data rows in the right node that match the other data rows of the left node, and those matches are determined as intermediate result sets.
Step 270, determining and outputting, according to the intermediate result set, the result set obtained by analyzing the input data set.
In one embodiment, step 270 specifically includes:
Determining the number of rows of the intermediate result set.
Specifically, the target data rows matching the current data row of the left node can be found in the right node, and the number of those target data rows is taken as the number of rows of the current intermediate result set.
Using multiple data rows of the left node, multiple sets of matching target data rows can be found in the right node, and their total number is taken as the number of rows of the intermediate result set.
If the number of rows of the intermediate result set is greater than or equal to the preset number of rows, determining that the result set includes the current intermediate result set and the previous intermediate result sets, and outputting the result set obtained by analyzing the input data set.
The preset number of rows can be determined according to the TOP clause.
When the number of rows of the intermediate result set equals the preset number of rows specified by the TOP clause, the intermediate result set may be output as the result set, which may include the current intermediate result set and the previous intermediate result sets. If the sum of the number of rows of the current intermediate result set and the number of rows of the previous intermediate result sets is greater than the preset number of rows, while the number of rows of the previous intermediate result sets alone is less than the preset number, the result set may likewise include the current intermediate result set and the previous intermediate result sets.
In addition, when the input data set has few groups, all intermediate result sets may be output. The preset number of rows may be set to cover all groups of the input data set, or the number of rows of the result set may be left unlimited, in which case intermediate result sets continue to be output until all data rows of the left node have been matched.
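The accumulation rule can be sketched in Python as follows: whole intermediate result sets are appended until the preset row count is reached, and empty intermediate result sets are skipped. The function name `collect_result`, the `preset_rows` parameter, and the sample group contents are assumptions for this illustration.

```python
def collect_result(group_results, preset_rows=None):
    """Concatenate per-group intermediate result sets until preset_rows is reached."""
    result = []
    for intermediate in group_results:
        if not intermediate:
            continue  # this left-node row had no match; move on to the next one
        result.extend(intermediate)
        if preset_rows is not None and len(result) >= preset_rows:
            break  # remaining groups are never requested
    return result


evaluated = []


def groups():
    """Hypothetical lazy source of intermediate result sets, one per group."""
    for name, rows in [("g1", [1, 2, 3]), ("g2", []), ("g3", [4, 5]), ("g4", [6])]:
        evaluated.append(name)
        yield rows


result = collect_result(groups(), preset_rows=4)
```

Passing a lazy iterable matters here: because the loop stops as soon as the preset row count is reached, the fourth group in the example is never evaluated. With `preset_rows=None`, every group is consumed.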
The second embodiment of the invention provides an analysis method based on data set grouping: after an SQL statement is received, syntax recognition is performed on it to obtain the analysis function it contains, and the matching condition is determined according to the analysis function; if the number of groups of the input data set is smaller than the preset number of groups, the execution plan is determined to include a join operator; data rows with duplicate key values are removed from the input data set to obtain the left node of the join operator; under the preset matching condition, the right node of the join operator is probed with the left node and an intermediate result set is determined; if the left node has no corresponding intermediate result set, the intermediate result set is determined to be empty; and the result set obtained by analyzing the input data set is determined and output according to the intermediate result set. With this scheme, the input data set can be grouped and then analyzed group by group, which addresses the low efficiency of existing grouping analysis methods when the result set needs only some of the data groups and enables efficient grouping analysis of the data set.
Fig. 3 is a flowchart of an implementation of the analysis method based on data set grouping according to the second embodiment of the present invention and shows one implementation by way of example. As shown in Fig. 3, before the analysis method based on data set grouping provided by this embodiment is performed, tables T1 and T2 may be created, where table T1 may include three columns C1, C2, and C3, and table T2 may include two columns D1 and D2:
CREATE TABLE T1(C1 CHAR,C2 INT,C3 INT);
CREATE TABLE T2(D1 CHAR,D2 INT);
Table 1 may contain the data of tables T1 and T2, as follows:
Table 1: Data of T1 and T2
[Table 1 appears as images in the source publication (BDA0002692435380000121, BDA0002692435380000131) and is not reproduced here.]
The following statement contains the analysis function SUM(C3) OVER (PARTITION BY C1) and a TOP clause. Its meaning may be: look up the records in tables T1 and T2 that satisfy C1 = D1 AND C2 = D2, and return C1, C3, and the sum of the C3 column values over records with equal C1 values.
SELECT TOP 4 * FROM
(SELECT C1, C3, SUM(C3) OVER (PARTITION BY C1) RK
FROM T1, T2
WHERE C1 = D1 AND C2 = D2);
In this embodiment, the execution plan of the above statements may be as follows:
TOP
---PRJT
------NLI TMP.C1=T1.C1
---------DISTINCT T1(TMP)
---------AFUN*(NO PARTITION BY)
------------CROSS
---------------T1(C1=TMP.C1)
---------------T2
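To make the parent and child relationships in the plan explicit, the following Python sketch encodes the plan as nested tuples and prints it in the same dashed notation. The tuple encoding and the names `PLAN` and `render` are assumptions for this illustration; the operator labels are copied from the plan as given.

```python
# Each node is (label, [children]); the first child of NLI is its left child.
PLAN = ("TOP", [
    ("PRJT", [
        ("NLI TMP.C1=T1.C1", [
            ("DISTINCT T1(TMP)", []),          # left child: distinct C1 values
            ("AFUN*(NO PARTITION BY)", [       # right child: per-group analysis
                ("CROSS", [
                    ("T1(C1=TMP.C1)", []),     # T1 filtered by the current key
                    ("T2", []),
                ]),
            ]),
        ]),
    ]),
])


def render(node, depth=0):
    """Return the plan as lines, indenting each level with three dashes."""
    label, children = node
    lines = ["---" * depth + label]
    for child in children:
        lines.extend(render(child, depth + 1))
    return lines


plan_lines = render(PLAN)
print("\n".join(plan_lines))
```

Read this way, the filter C1=TMP.C1 on T1 sits inside the right child of NLI, so each probe only touches the rows of one group.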
If the data of T1 and T2 are as shown in Table 1, the execution plan may be executed as follows:
Step 310, executing the left child DISTINCT T1(TMP) of the NLI operator.
DISTINCT removes duplicate values, and TMP is the temporary table formed by removing duplicate values of the PARTITION BY items. If the PARTITION BY item is C1, TMP is a table with only the C1 column, and its data may include (A, B, C, ...). Step 320 may be performed without first obtaining all DISTINCT values.
The left child may correspond to the left node in the foregoing embodiments.
Step 320, taking the first row of the TMP table, A, and probing the right child of the NLI operator with it.
The right child may correspond to the right node in the foregoing embodiments.
The matching condition may include: C1 = A (that is, C1 = TMP.C1, where TMP.C1 is A); the resulting rows correspond to row numbers (1, 2, 3, 4, 5) of table T1.
Step 330, joining tables T1 and T2, and determining the intermediate result set that satisfies the matching condition.
The intermediate result set satisfying C1 = D1 AND C2 = D2 has three rows, as shown in Table 2:
Table 2: Intermediate result set of group 1
T1 row number   C1   C2   C3   T2 row number   D1   D2
3               A    33   3    1               A    33
4               A    33   4    1               A    33
5               A    33   4    1               A    33
Step 340, determining the value of AFUN based on the intermediate result set and outputting it to the NLI operator, which in turn outputs the result set to its parent operator.
Specifically, execution of the first group then ends, with the result shown in Table 3.
Table 3: Grouping analysis results of group 1
C1   C3   RK
A    3    11
A    4    11
A    4    11
Step 350, determining the number of rows of the intermediate result set. If the number of rows output so far is greater than or equal to the number of rows required by the TOP clause, the intermediate result set is output as the result set; if it is less, the process returns to step 310 and continues probing the right child with the next data row of the left child.
The specific probing process may be as follows:
returning to the NLI operator and taking the next DISTINCT value from its left child, that is, the next row of the TMP table, which is B; execution of the second group begins;
probing the right child of the NLI operator with B; the scan of T1 yields the rows that satisfy the filter condition C1 = B, corresponding to row numbers (6, 7, 8) of table T1;
joining tables T1 and T2; no rows satisfy C1 = D1 AND C2 = D2, so the child of AFUN has no intermediate result set and nothing is output; execution of the second group ends;
returning to the NLI operator and taking the next DISTINCT value from its left child, that is, the next row of the TMP table, which is C; execution of the third group begins;
probing the right child of the NLI operator with C; the scan of T1 yields the rows that satisfy the filter condition C1 = C, corresponding to row numbers (9, 10, 11, 12, 13) of table T1;
joining tables T1 and T2; the intermediate result set satisfying C1 = D1 AND C2 = D2 has two rows, as shown in Table 4:
Table 4: Intermediate result set of group 3
T1 row number   C1   C2   C3   T2 row number   D1   D2
9               C    11   2    8               C    11
9               C    11   2    9               C    11
calculate the value of AFUN based on the intermediate result set and output it to the NLI operator, which passes the result set on to the upper-level operator; execution of the third group ends with the result shown in Table 5;
Table 5: Grouping analysis results of group 3
C1 C3 RK
C 2 4
C 2 4
at this point 3 + 2 = 5 rows have been output, which exceeds the 4 rows required by the TOP clause, so the grouping analysis ends and the next group does not need to be queried or computed;
a five-row result set is output.
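The group-by-group procedure walked through above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation. T1 rows 3-5 and 9 and T2 rows 1, 8 and 9 follow Tables 2 and 4; every other value, including the whole group D, is a hypothetical filler. `placeholder_afun` is a stand-in for AFUN, so the RK values it produces differ from those in Tables 3 and 5. The text only specifies the "greater than" and "less than" cases for the TOP check; treating an equal row count as sufficient is an assumption of this sketch.

```python
from collections import namedtuple

T1Row = namedtuple("T1Row", "row c1 c2 c3")
T2Row = namedtuple("T2Row", "row d1 d2")

# Rows 3-5 and 9 of T1 and rows 1, 8, 9 of T2 follow Tables 2 and 4.
# All other values (and group D) are hypothetical fillers.
T1 = [
    T1Row(3, "A", 33, 3), T1Row(4, "A", 33, 4), T1Row(5, "A", 33, 4),
    T1Row(6, "B", 50, 1), T1Row(7, "B", 51, 1), T1Row(8, "B", 52, 1),
    T1Row(9, "C", 11, 2), T1Row(10, "C", 60, 2), T1Row(11, "C", 61, 2),
    T1Row(12, "C", 62, 2), T1Row(13, "C", 63, 2),
    T1Row(14, "D", 20, 7),
]
T2 = [T2Row(1, "A", 33), T2Row(8, "C", 11), T2Row(9, "C", 11), T2Row(10, "D", 20)]


def placeholder_afun(joined_rows):
    # Stand-in for the analysis function AFUN: simply the group's row count.
    return len(joined_rows)


def grouped_top(t1, t2, top_n):
    """Analyze T1 one group at a time and stop once TOP is satisfied."""
    result = []
    # Left child of the join: distinct grouping-key values, in first-seen order.
    distinct_keys = list(dict.fromkeys(r.c1 for r in t1))
    for key in distinct_keys:
        # Right child: probe with the current key (filter C1 = key), then join.
        subset = [r for r in t1 if r.c1 == key]
        joined = [(a, b) for a in subset for b in t2
                  if a.c1 == b.d1 and a.c2 == b.d2]
        if not joined:
            continue  # no intermediate result set for this group, no output
        rk = placeholder_afun(joined)
        result.extend((a.c1, a.c3, rk) for a, _ in joined)
        if len(result) >= top_n:
            break  # enough rows for the TOP clause; later groups are skipped
    return result


rows = grouped_top(T1, T2, top_n=4)
print(len(rows))  # 3 rows from group A plus 2 rows from group C
```

With these inputs, group A contributes three rows, group B none, and group C two, after which the loop stops without touching the hypothetical group D.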
In this embodiment, after tables T1 and T2 are created, the left child of the NLI operator, DISTINCT T1 (TMP), is executed to obtain the temporary table TMP, in which duplicate key values of the PARTITION BY items have been removed. The first row of the TMP table, A, is taken and used for probe matching in the right child of the NLI operator; tables T1 and T2 are joined, and an intermediate result set satisfying the matching condition is determined. The value of AFUN is determined based on the intermediate result set and output to the NLI operator, which passes the result set on to the upper-level operator. The number of rows in the intermediate result set is then determined: if it is greater than the number of rows specified by the TOP clause, the intermediate result set is output as the result set; if it is less than the number of rows specified by the TOP clause, the process returns to step 310 and continues probing the right child with the data rows of the left child. With this scheme, the input data set is grouped and then analyzed group by group, which addresses the low efficiency of existing grouping analysis methods when the result set needs only some of the data groups and achieves efficient grouping analysis of the data set.
Example three
Fig. 4 is a structural diagram of an analysis apparatus based on data set grouping according to a third embodiment of the present invention. The apparatus is applicable to outputting a result set that satisfies the requirement of a TOP clause, thereby improving the efficiency of data set analysis. The apparatus may be implemented in software and/or hardware and is typically integrated in a computer device.
As shown in Fig. 4, the apparatus includes a determination module 410, a first execution module 420, a second execution module 430, and an output module 440, wherein:
the determination module 410 is configured to, after a Structured Query Language (SQL) statement is received, determine that the execution plan includes a join operator if the input data set satisfies a preset condition;
the first execution module 420 is configured to remove duplicate key-value data rows from the input data set to obtain the left node of the join operator;
the second execution module 430 is configured to perform probe matching on the right node of the join operator according to the left node under a preset matching condition, and to determine an intermediate result set;
and the output module 440 is configured to determine, according to the intermediate result set, the result set obtained by analyzing the input data set, and to output it.
In the analysis apparatus based on data set grouping provided by this embodiment, after an SQL statement is received, it is determined that the execution plan includes a join operator if the input data set satisfies the preset condition; duplicate key-value data rows are removed from the input data set to obtain the left node of the join operator; under a preset matching condition, probe matching is performed on the right node of the join operator according to the left node, and an intermediate result set is determined; and the result set obtained by analyzing the input data set is determined according to the intermediate result set and output. This addresses the low efficiency of existing grouping analysis methods when the result set needs only some of the data groups, and enables partial grouping analysis of the data set.
On the basis of the foregoing embodiment, the determination module 410 is specifically configured to:
if the number of groups of the input data set is less than the predetermined number of groups, determining that the execution plan includes join operators.
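As a hedged illustration of this check, the sketch below compares a group count against a threshold and returns a label for the chosen plan shape. The function name, the plan labels and the idea of passing the group count in directly are all assumptions; the text does not say how the number of groups is obtained or what plan is used otherwise.

```python
def choose_plan(group_count, group_threshold):
    """Return an illustrative label for the plan shape.

    Both labels are hypothetical names; only the 'fewer groups than the
    preset number' test comes from the description.
    """
    if group_count < group_threshold:
        return "join_operator_plan"  # plan that includes the join operator
    return "default_plan"            # whatever plan would be used otherwise


print(choose_plan(group_count=3, group_threshold=100))
```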
On the basis of the foregoing embodiment, the first executing module 420 is specifically configured to:
acquiring a sub-execution plan containing grouping items;
and removing the duplicate key-value data rows of the sub-execution plan to obtain the left node of the join operator.
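Removing duplicate key values to form the left node can be sketched as an order-preserving de-duplication over the grouping columns. The dictionary-based row layout and the name `distinct_partition_keys` are assumptions made for illustration only.

```python
def distinct_partition_keys(rows, key_columns):
    """Return each distinct grouping-key tuple once, in first-seen order."""
    seen = set()
    tmp = []
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in seen:
            seen.add(key)
            tmp.append(key)
    return tmp


sample_rows = [
    {"C1": "A", "C3": 3},
    {"C1": "A", "C3": 4},
    {"C1": "B", "C3": 1},
    {"C1": "C", "C3": 2},
]
tmp = distinct_partition_keys(sample_rows, ["C1"])
print(tmp)  # [('A',), ('B',), ('C',)]
```

Each entry of `tmp` then drives one probe of the right node, so a group is only evaluated when its key is actually reached.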
On the basis of the foregoing embodiment, the second executing module 430 is specifically configured to:
under the preset matching condition, searching a target data row matched with the left node in the right node;
and determining an intermediate result set according to the left node and the target data row.
On the basis of the foregoing embodiment, the output module 440 is specifically configured to:
determining the number of lines of the intermediate result set according to the intermediate result set;
and if the number of rows in the intermediate result set is greater than the preset number of rows, determining that the result set comprises the current intermediate result set and the previous intermediate result sets, and outputting the result set obtained by analyzing the input data set.
On the basis of the above embodiment, the apparatus further includes an acquisition module and a preset matching condition determination module, wherein:
the acquisition module is configured to, after the SQL statement is received, perform syntax recognition on the SQL statement and obtain the analysis function contained in the SQL statement;
and the preset matching condition determination module is configured to determine the preset matching condition according to the analysis function.
On the basis of the above embodiment, the apparatus further includes a third execution module, wherein:
the third execution module is configured to determine that the result set is empty if the left node has no corresponding intermediate result set.
The analysis device based on the data set grouping provided by the embodiment of the invention can execute the analysis method based on the data set grouping provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example four
Fig. 5 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention, as shown in fig. 5, the computer device includes a processor 510 and a memory 520; the number of the processors 510 in the computer device may be one or more, and one processor 510 is taken as an example in fig. 5; the processor 510 and the memory 520 in the computer device may be connected by a bus or other means, as exemplified by the bus connection in fig. 5.
The memory 520, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the analysis method based on data set grouping in the embodiments of the present invention (for example, the determination module, the first execution module, the second execution module, and the output module in the analysis apparatus based on data set grouping). By running the software programs, instructions, and modules stored in the memory 520, the processor 510 executes the various functional applications and data processing of the computer device, that is, implements the above-described analysis method based on data set grouping.
The memory 520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 520 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 520 may further include memory located remotely from processor 510, which may be connected to a computer device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Example five
An embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform an analysis method based on data set grouping, the method comprising:
after receiving a Structured Query Language (SQL) statement, determining that the execution plan includes a join operator if the input data set satisfies a preset condition;
removing duplicate key-value data rows from the input data set to obtain the left node of the join operator;
under a preset matching condition, performing probe matching on the right node of the join operator according to the left node, and determining an intermediate result set;
and determining, according to the intermediate result set, the result set obtained by analyzing the input data set, and outputting it.
Of course, the storage medium containing computer-executable instructions provided by the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the data set grouping-based analysis method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the above analysis apparatus, the included units and modules are divided merely according to functional logic, and the division is not limited to the above as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only for convenience of distinguishing them from each other and are not used to limit the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method of analysis based on grouping of data sets, comprising:
after receiving a Structured Query Language (SQL) statement, determining that an execution plan comprises a join operator if an input data set satisfies a preset condition;
removing duplicate key-value data rows from the input data set to obtain a left node of the join operator;
under a preset matching condition, performing probe matching on a right node of the join operator according to the left node, and determining an intermediate result set;
and determining and outputting a result set obtained by analyzing the input data set according to the intermediate result set.
2. The method of claim 1, wherein determining that the execution plan comprises a join operator if the input data set satisfies a preset condition comprises:
if the number of groups of the input data set is less than the predetermined number of groups, determining that the execution plan includes join operators.
3. The method of claim 1, wherein removing duplicate key-value data rows from the input data set to obtain the left node of the join operator comprises:
acquiring a sub-execution plan containing grouping items;
and removing the duplicate key-value data rows of the sub-execution plan to obtain the left node of the join operator.
4. The method of claim 1, wherein performing a probe match on a right node of the join operator according to the left node under a preset matching condition to determine an intermediate result set comprises:
under the matching condition, searching a target data row matched with the left node in the right node;
and determining an intermediate result set according to the left node and the target data row.
5. The method of claim 1, wherein determining and outputting a result set from analyzing the input data set based on the intermediate result set comprises:
determining the number of lines of the intermediate result set according to the intermediate result set;
and if the number of rows in the intermediate result set is greater than the preset number of rows, determining that the result set comprises the current intermediate result set and the previous intermediate result sets, and outputting the result set obtained by analyzing the input data set.
6. The method of claim 1, further comprising, after performing probe matching on the right node of the join operator according to the left node under the preset matching condition and determining the intermediate result set:
and if the left node does not have a corresponding intermediate result set, determining that the result set is empty.
7. The method of claim 1, further comprising, after receiving the SQL statement:
performing syntax recognition on the SQL statement to obtain an analysis function contained in the SQL statement;
and determining a matching condition according to the analysis function.
8. An analysis apparatus based on data set grouping, comprising: a determination module, a first execution module, a second execution module, and an output module, wherein,
the determination module is configured to, after a Structured Query Language (SQL) statement is received, determine that an execution plan comprises a join operator if an input data set satisfies a preset condition;
the first execution module is configured to remove duplicate key-value data rows from the input data set to obtain a left node of the join operator;
the second execution module is configured to perform probe matching on a right node of the join operator according to the left node under a preset matching condition, and to determine an intermediate result set;
and the output module is used for determining and outputting a result set obtained by analyzing the input data set according to the intermediate result set.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of data set grouping based analysis according to any of claims 1-7 when executing the program.
10. A storage medium containing computer-executable instructions for performing the data set grouping based analysis method of any one of claims 1-7 when executed by a computer processor.
CN202010995383.6A 2020-09-21 2020-09-21 Analysis method, device, equipment and medium based on data set grouping Active CN112100199B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010995383.6A CN112100199B (en) 2020-09-21 2020-09-21 Analysis method, device, equipment and medium based on data set grouping

Publications (2)

Publication Number Publication Date
CN112100199A true CN112100199A (en) 2020-12-18
CN112100199B CN112100199B (en) 2024-03-26

Family

ID=73756364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010995383.6A Active CN112100199B (en) 2020-09-21 2020-09-21 Analysis method, device, equipment and medium based on data set grouping

Country Status (1)

Country Link
CN (1) CN112100199B (en)

Also Published As

Publication number Publication date
CN112100199B (en) 2024-03-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant