CN112612814A

CN112612814A - Data stream query method and device, computer equipment and storage medium

Info

Publication number: CN112612814A
Application number: CN202011530103.0A
Authority: CN
Inventors: 王春凯; 冯键
Original assignee: China Reinsurance Group Co ltd
Current assignee: China Reinsurance Group Co ltd
Priority date: 2020-12-22
Filing date: 2020-12-22
Publication date: 2021-04-06

Abstract

The application relates to a data stream query method, a data stream query device, computer equipment and a storage medium. The method comprises the following steps: receiving a plurality of data query requests, wherein each data query request carries a query division code set which is an attribute key value set of a data stream; generating a candidate division code set according to the query division code set carried in each data query request; acquiring corresponding attribute column data of the full history data stream according to the candidate division code set, calculating a correlation value between any two attribute column data, and acquiring correlation results between different attribute column data according to a preset correlation threshold; correspondingly merging the query division code sets according to the correlation results to obtain a joint division code set, and dividing the full history data stream according to the joint division code set to obtain data stream division results; and respectively sending the data stream division results to different processing nodes for processing to obtain query results. By adopting the method, the data stream query accuracy can be improved.

Description

Data stream query method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of data query technologies, and in particular, to a data stream query method, apparatus, computer device, and storage medium.

Background

With the continuous development of internet technology, the volume of data on the internet has increased rapidly, and data query technologies have emerged in response. For query requests input by users, data from the same data source is divided according to the division code and the data attribute values carried by each query request to obtain different data processing tasks, and these tasks are then distributed to different processing nodes, each of which produces query results.

However, in the current data query approach, query division is performed only on the data within a data query window. Because the data in the query window is not comprehensive and may be unevenly distributed, the query results are inaccurate.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a data stream query method, apparatus, computer device and storage medium.

A data stream query method, the method comprising:

receiving a plurality of data query requests, wherein each data query request carries a query partition code set which is an attribute key value set of a data stream;

generating a candidate division code set according to the query division code set carried in each data query request;

acquiring corresponding attribute column data of the full history data stream according to the candidate division code set, calculating a correlation value between any two attribute column data, and acquiring correlation results between different attribute column data according to a preset correlation threshold;

correspondingly merging the query division code sets according to the correlation results to obtain a joint division code set, and dividing the full history data stream according to the joint division code set to obtain data stream division results;

and respectively sending the data stream division result to different processing nodes for processing to obtain a query result.

In one embodiment, the generating a candidate partition code set according to the query partition code set carried in each data query request includes:

acquiring a query division code set carried in each data query request;

and generating a candidate division code set according to a preset disjunctive normal form principle and the query division code set carried by each data query request.

In one embodiment, the obtaining, according to the candidate partition code set, attribute column data corresponding to a full history data stream, calculating a correlation value between any two attribute column data, and obtaining a correlation result between different attribute column data according to a preset correlation threshold includes:

determining target attributes in all dimension attribute sets of the full history data stream according to all division codes contained in the candidate division code set;

acquiring the target attribute column data according to the target attribute;

calculating a correlation value between attribute data under each timestamp in any two target attribute columns according to timestamp information in the full history data stream;

and if the correlation value is larger than a preset correlation threshold value, the attribute data between the two target attribute columns have a correlation relationship.

In one embodiment, the correspondingly merging the query division code sets according to the correlation results to obtain joint division code sets, and dividing the full history data stream according to the joint division code sets to obtain data stream division results includes:

if the correlation result between the two attribute column data is a positive correlation, merging the query division code sets on which the different data query requests corresponding to the two attribute column data base their division, to obtain a joint division code set;

and carrying out data stream division on the full history data stream according to the joint division code set to obtain a data stream division result.

In one embodiment, if the correlation value is greater than a preset correlation threshold, the attribute data between the two target attribute columns has a correlation relationship, including:

calculating a balance degree value between attribute data under each timestamp in any two target attribute columns according to a preset imbalance factor algorithm;

and if the balance degree value is smaller than a preset balance degree threshold value and the correlation value is larger than a preset correlation threshold value, the attribute data between the two target attribute columns have positive correlation.

In one embodiment, the sending the data stream partitioning result to different processing nodes respectively for processing to obtain a query result includes:

mapping each data stream division result and the division code on which it is based to a preset data division routing table;

and respectively sending the data stream division results to each processing node for processing according to the load information of the processing nodes on the data stream division results reflected in the data division routing table, and obtaining the feedback query results.

In one embodiment, the method further comprises:

receiving a plurality of data query requests of the same type sent under the next timestamp;

acquiring target attribute column data in the newly added historical data stream under the next timestamp according to a candidate division code set generated by the plurality of data query requests of the same type;

dividing the data stream of the target attribute column in the newly added historical data stream to obtain a division result of the newly added data stream;

receiving data stream division result distribution amount information of a timestamp fed back by each processing node, and calculating to obtain a task distribution balance degree value of the timestamp on each processing node according to the data stream division result distribution amount information;

and if the balance degree value is smaller than a preset balance degree threshold value, allocating processing nodes to the newly added data stream division result according to a preset incremental clustering feature tree algorithm.

A data stream querying device, the device comprising:

a receiving module, configured to receive multiple data query requests, where each data query request carries a query partition code set, and the query partition code set is an attribute key value set of a data stream;

a generating module, configured to generate a candidate partition code set according to the query partition code set carried in each data query request;

the calculation module is used for acquiring corresponding attribute column data of the full history data stream according to the candidate division code set, calculating a correlation value between any two attribute column data, and obtaining correlation results between different attribute column data according to a preset correlation threshold;

the dividing module is used for correspondingly combining the query division code sets according to the correlation results to obtain a joint division code set, and dividing the full history data stream according to the joint division code set to obtain data stream dividing results;

and the sending module is used for respectively sending the data stream division results to different processing nodes for processing to obtain query results.

A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the data stream query method described above when executing the computer program.

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the data stream query method described above.

The data stream query method, the data stream query device, the computer equipment and the storage medium receive a plurality of data query requests, each data query request carries a query partition code set, and the query partition code set is an attribute key value set of a data stream; generating a candidate division code set according to the query division code set carried in each data query request; acquiring corresponding attribute column data of the full history data stream according to the candidate division code set, calculating a correlation value between any two attribute column data, and acquiring correlation results between different attribute column data according to a preset correlation threshold; correspondingly merging the query division code sets according to the correlation results to obtain a joint division code set, and dividing the full history data stream according to the joint division code set to obtain data stream division results; and respectively sending the data stream division result to different processing nodes for processing to obtain a query result. By adopting the method, the query division can be carried out on the full history data stream, and the obtained data query result is more accurate.

Drawings

FIG. 1 is a diagram of an application environment of a data stream query method in one embodiment;

FIG. 2 is a flow diagram that illustrates a method for querying a data stream, according to one embodiment;

FIG. 3 is a flowchart illustrating the step of generating a set of candidate partition codes in one embodiment;

FIG. 4 is a flowchart illustrating the step of determining the correlation of data attributes in a data stream according to an embodiment;

FIG. 5 is a flowchart illustrating the step of jointly partitioning data streams in one embodiment;

FIG. 6 is a flow diagram illustrating the process of determining that attribute column data in a data stream has a positive correlation according to one embodiment;

FIG. 7 is a flowchart illustrating a hash partitioning step performed on the partition result in one embodiment;

FIG. 8 is a flowchart illustrating the data flow dividing step performed on the newly added historical data flow in one embodiment;

FIG. 9 is a block diagram showing the structure of a data stream query apparatus according to an embodiment;

FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The data stream query method provided by the application can be applied to the application environment shown in FIG. 1, in which a processing node 102 communicates with a master control node 104 (which may also be referred to as a computer device) over a network. The master control node 104 receives a plurality of data query requests, each data query request carrying a query division code set, the query division code set being an attribute key value set of a data stream; then generates a candidate division code set according to the query division code set carried in each data query request; acquires the corresponding attribute column data of the full history data stream according to the candidate division code set, calculates a correlation value between any two attribute column data, and obtains correlation results between different attribute column data according to a preset correlation threshold; finally, correspondingly merges the query division code sets according to the correlation results to obtain a joint division code set, divides the full history data stream according to the joint division code set to obtain data stream division results, and sends the data stream division results to different processing nodes for processing to obtain query results. The processing node 102 may be, but is not limited to, a personal computer, a notebook computer, a smart phone, or a tablet computer, and the master control node 104 may be implemented by an independent server or by a server cluster formed by a plurality of servers.

In an embodiment, as shown in fig. 2, a data stream query method is provided, which is described by taking the method as an example applied to the main control node 104 in fig. 1, and includes the following steps:

Step 201, receiving a plurality of data query requests, where each data query request carries a query partition code set, and the query partition code set is an attribute key value set of a data stream.

In implementation, a master control node receives a plurality of data query requests sent by a user, each data query request carries a query partition code set, and the query partition code set is an attribute key value set of a data stream.

Specifically, each data query request sent by a user has a different query target and therefore carries a different query partition code set. Taking a query of road network data as an example, the query partition code set carried in the first data query request Q1 is PK_i = {PK_a, PK_b, PK_c}, where PK_a indicates longitude, PK_b indicates latitude, and PK_c indicates vehicle speed. The query partition code set carried in the second data query request Q2 is PK_j = {PK_a, PK_b, PK_d}, where PK_a indicates longitude, PK_b indicates latitude, and PK_d indicates the road condition.

Step 202, generating a candidate division code set according to the query division code set carried in each data query request.

In implementation, the master control node generates a candidate partition code set according to the query partition code set carried in each data query request. Specifically, taking the road network data as an example, when two query requests (Q1 and Q2) are received, the query partition code set carried in the first data query request Q1 is PK_i = {PK_a, PK_b, PK_c} and the query partition code set carried in the second data query request Q2 is PK_j = {PK_a, PK_b, PK_d}. The master control node then generates the candidate partition code set PK = {PK_a, PK_b, PK_c, PK_d} from the two query partition code sets.
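The generation of the candidate partition code set in this example can be sketched as a set union. The following is a minimal illustration only; the function name `candidate_partition_codes` and the use of Python sets are assumptions made for this sketch and are not part of the application.

```python
from typing import Iterable, List, Set


def candidate_partition_codes(query_code_sets: Iterable[Set[str]]) -> List[str]:
    """Union the partition code sets of all queries into one candidate set."""
    merged: Set[str] = set()
    for code_set in query_code_sets:
        merged |= code_set
    # Sorting is only for a deterministic, readable result.
    return sorted(merged)


# Partition code sets of Q1 and Q2 from the road network example.
pk_i = {"PK_a", "PK_b", "PK_c"}  # longitude, latitude, vehicle speed
pk_j = {"PK_a", "PK_b", "PK_d"}  # longitude, latitude, road condition

candidates = candidate_partition_codes([pk_i, pk_j])
print(candidates)  # ['PK_a', 'PK_b', 'PK_c', 'PK_d']
```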

Step 203, acquiring corresponding attribute column data of the full history data stream according to the candidate division code set, calculating a correlation value between any two attribute column data, and obtaining correlation results between different attribute column data according to a preset correlation threshold value.

In implementation, the master control node acquires corresponding attribute column data of the full history data stream according to the candidate division code set, wherein the full history data stream refers to all history data before the time of receiving the data query request. And then, calculating a correlation value between any two attribute column data, and obtaining correlation results between different attribute column data according to a preset correlation threshold value.

Specifically, the full history data stream includes a plurality of attribute columns. For example, a certain network data stream includes the attribute columns {PK_a, PK_b, PK_c, PK_d, ..., PK_x}, and the candidate partition code set generated from the plurality of query requests is {PK_a, PK_b, PK_c, PK_d}. The master control node acquires the corresponding attribute column data in the full history data stream according to the candidate partition code set (i.e., the column data corresponding to {PK_a, PK_b, PK_c, PK_d}). Then, for the acquired attribute column data, it calculates a correlation value between every two attribute columns; for the attribute values A and B corresponding to two different attribute columns, the correlation value is calculated by a correlation formula [formula not reproduced in this text].

And then, judging the obtained correlation value according to a preset correlation threshold value to obtain correlation results among different attribute column data.
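The pairwise calculation and threshold judgment can be sketched as follows. Because the correlation formula itself is not reproduced in this text, the sketch substitutes the Pearson correlation coefficient as a stand-in; this choice, the function names, and the sample values are all assumptions for illustration only.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson coefficient, used here only as an assumed stand-in measure."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def correlated_pairs(
    columns: Dict[str, List[float]], threshold: float
) -> Dict[Tuple[str, str], float]:
    """Return every column pair whose correlation value exceeds the threshold."""
    result: Dict[Tuple[str, str], float] = {}
    for a, b in combinations(sorted(columns), 2):
        value = pearson(columns[a], columns[b])
        if value > threshold:
            result[(a, b)] = value
    return result


# Hypothetical attribute column data, aligned by timestamp.
columns = {
    "PK_a": [1.0, 2.0, 3.0, 4.0],
    "PK_b": [2.0, 4.0, 6.0, 8.0],
    "PK_c": [5.0, 1.0, 4.0, 2.0],
}
pairs = correlated_pairs(columns, threshold=0.8)
```

With these sample values only the pair (PK_a, PK_b) exceeds the threshold, so only those two columns would be considered for joint division.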

Optionally, the full history data stream may be a data stream stored in a memory or in another storage device, and when the master node divides the full history data stream, all history data streams of the same data source in the memory or in another storage device before the time when the data query request is received are called.

Step 204, correspondingly merging the query division code sets according to the correlation results to obtain a joint division code set, and dividing the full history data stream according to the joint division code set to obtain data stream division results.

In implementation, the master control node correspondingly merges the query division code sets according to the correlation results to obtain a joint division code set, and divides the full history data stream according to the joint division code set to obtain data stream division results.

Specifically, if the correlation result indicates that two attribute column data are correlated, the master control node merges the query partition codes corresponding to the two attribute column data. For example, the query partition code corresponding to the first attribute column is PK_a = {V_i1, V_i2, V_i3} and the query partition code corresponding to the second attribute column is PK_b = {V_j1, V_j2, V_j3}. If the data of the two attribute columns are correlated, the two query partition codes are merged to obtain the joint partition code set {PK_a, PK_b}, and the two attribute column data are jointly divided according to the joint partition code set to obtain a data stream division result.
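One possible reading of the joint division is that records sharing the same combination of values under the merged partition codes fall into the same division result. The sketch below illustrates that reading only; the record layout, the function name `joint_partition`, and the sample values are assumptions.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple

Record = Dict[str, Any]


def joint_partition(
    records: List[Record], joint_codes: Sequence[str]
) -> Dict[Tuple[Any, ...], List[Record]]:
    """Group records by the tuple of their values under the joint partition codes."""
    partitions: Dict[Tuple[Any, ...], List[Record]] = defaultdict(list)
    for record in records:
        key = tuple(record[code] for code in joint_codes)
        partitions[key].append(record)
    return dict(partitions)


# Hypothetical records; PK_a and PK_b are assumed to be correlated columns.
records = [
    {"PK_a": "V_i1", "PK_b": "V_j1", "PK_c": 40},
    {"PK_a": "V_i1", "PK_b": "V_j1", "PK_c": 55},
    {"PK_a": "V_i2", "PK_b": "V_j3", "PK_c": 30},
]
parts = joint_partition(records, ["PK_a", "PK_b"])
for key, group in parts.items():
    print(key, len(group))
```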

Step 205, sending the data stream division results to different processing nodes respectively for processing, so as to obtain query results.

In implementation, the main control node sends the data stream division results to different processing nodes respectively for data query processing based on the hash division method, so as to obtain query results.
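A hash division step of this kind can be sketched by hashing the key of each data stream division result and taking the remainder modulo the number of processing nodes. The application does not specify the hash function; SHA-256 and the function name `node_for_partition` are assumptions chosen so that the mapping is stable across runs.

```python
import hashlib
from typing import Tuple


def node_for_partition(partition_key: Tuple[str, ...], num_nodes: int) -> int:
    """Map a division-result key to a processing node index by stable hashing."""
    digest = hashlib.sha256(repr(partition_key).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_nodes


# Hypothetical division-result keys dispatched across 4 processing nodes.
for key in [("V_i1", "V_j1"), ("V_i2", "V_j3")]:
    print(key, "-> node", node_for_partition(key, num_nodes=4))
```

Python's built-in `hash` is avoided here because string hashing is randomized per process by default, which would route the same key to different nodes after a restart.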

In the data stream query method, a master control node receives a plurality of data query requests, each data query request carries a query division code set, and the query division code set is an attribute key value set of a data stream; generates a candidate division code set according to the query division code set carried in each data query request; acquires corresponding attribute column data of the full history data stream according to the candidate division code set, calculates a correlation value between any two attribute column data, and obtains correlation results between different attribute column data according to a preset correlation threshold; correspondingly merges the query division code sets according to the correlation results to obtain a joint division code set, and divides the full history data stream according to the joint division code set to obtain data stream division results; and sends the data stream division results to different processing nodes respectively for processing to obtain query results. By adopting the method, query division is performed on the full history data stream, and the accuracy of the data query can be improved.

In one embodiment, as shown in fig. 3, the specific processing procedure of step 202 is as follows:

Step 2021, acquiring the query partition code set carried in each data query request.

Step 2022, generating a candidate partition code set according to a preset disjunctive normal form principle and the query partition code set carried by each data query request.

In implementation, the master control node receives and extracts the query division code set carried in each data query request, and then generates a candidate division code set according to a preset disjunctive normal form principle and the extracted query division code set carried in each data query request.

In this embodiment, a candidate partition code set can be determined by merging the query partition code sets carried in the multiple data query requests, and then target attribute column data in the full history data stream can be determined according to the candidate partition code set.

In one embodiment, as shown in fig. 4, the specific processing procedure of step 203 is as follows:

Step 2031, determining the target attributes in all dimension attribute sets of the full history data stream according to each division code contained in the candidate division code set.

In implementation, the master control node determines the target attributes in all the dimension attribute sets of the full history data stream according to each division code contained in the candidate division code set. For example, the full history data stream corresponding to the road network data includes a multi-dimensional attribute data sequence such as longitude, latitude, road condition, vehicle speed, and vehicle identification, and the attributes corresponding to the division codes contained in the candidate division code set are longitude, latitude, and vehicle speed, so the master control node determines that the target attributes of the data stream for the current query task (which includes multiple query requests) are longitude, latitude, and vehicle speed.

Step 2032, according to the target attribute, obtain the target attribute column data.

In implementation, the master control node acquires the target attribute column data according to the target attribute. Specifically, the master control node obtains the data of the target attribute column in the data stream according to the determined target attribute, and then processes only the data of the target attribute column.
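Selecting only the target attribute columns can be sketched as a projection over the records of the stream. The record layout and the attribute names below are hypothetical and chosen to mirror the road network example.

```python
from typing import Any, Dict, List, Sequence


def target_columns(
    stream: List[Dict[str, Any]], target_attributes: Sequence[str]
) -> Dict[str, List[Any]]:
    """Keep only the target attribute columns, one value list per attribute."""
    return {attr: [record[attr] for record in stream] for attr in target_attributes}


# Hypothetical road network records with five attribute dimensions.
stream = [
    {"longitude": 116.40, "latitude": 39.90, "road_condition": "clear",
     "speed": 42.0, "vehicle_id": "v1"},
    {"longitude": 116.41, "latitude": 39.91, "road_condition": "jam",
     "speed": 12.5, "vehicle_id": "v2"},
]
columns = target_columns(stream, ["longitude", "latitude", "speed"])
print(sorted(columns))  # ['latitude', 'longitude', 'speed']
```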

Step 2033, according to the timestamp information in the history data stream, calculate the correlation value between the attribute data under each timestamp in any two target attribute columns.

In implementation, the master control node uses the timestamp information in the full history data stream to calculate, along the timestamp dimension, a correlation value between the attribute data under each timestamp in any two target attribute columns.

Optionally, the calculation is not limited to the correlation between the attribute data of two target attribute columns. For example, if the determined target attribute columns are {A, B, C, D}, the correlations among (A, B), (A, C), (A, D), (A, B, C), (A, B, D), ..., (A, B, C, D) may be calculated. To improve the calculation efficiency of the master control node and reduce the amount of calculation, two properties of data correlation can be used. First, when attribute columns A and B are not correlated, attribute columns A, B and C are not correlated either. Second, correlation is symmetric: if attribute column A is correlated with attribute column B, then B is also correlated with A. The calculation over the target attribute columns can therefore be simplified; for example, if attribute column A contains the fewest division codes, the correlations are calculated between attribute column A and the other attribute columns, thereby reducing the amount of calculation.
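The first property suggests that a larger combination of attribute columns only needs to be examined when every pair inside it is already correlated. The sketch below applies that idea to three-column combinations; it is one assumed way to exploit the property, and the function name and the sample pairs are hypothetical.

```python
from itertools import combinations
from typing import Iterable, List, Sequence, Tuple


def candidate_triples(
    columns: Sequence[str], correlated_pairs: Iterable[Tuple[str, str]]
) -> List[Tuple[str, ...]]:
    """Keep only the triples whose three internal pairs are all correlated."""
    # frozenset makes the lookup symmetric: (A, B) and (B, A) are the same pair.
    correlated = {frozenset(pair) for pair in correlated_pairs}
    triples = []
    for triple in combinations(sorted(columns), 3):
        if all(frozenset(p) in correlated for p in combinations(triple, 2)):
            triples.append(triple)
    return triples


# Hypothetical result of the pairwise step for target columns {A, B, C, D}.
pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]
print(candidate_triples(["A", "B", "C", "D"], pairs))  # [('A', 'B', 'C')]
```

Here (A, B, D) is skipped without any further calculation because B and D were not found to be correlated.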

Step 2034, if the correlation value is greater than the preset correlation threshold, the attribute data between the two target attribute columns has a correlation relationship.

In implementation, if the correlation value is greater than the preset correlation threshold, the attribute data of the two target attribute columns are correlated. Specifically, for example, the correlation between attribute columns A and B is calculated over the timestamps t among all target attribute columns, and when the correlation value of attribute columns A and B is greater than the preset correlation threshold [inequality not reproduced in this text], it is determined that attribute columns A and B are correlated.

In one embodiment, as shown in fig. 5, the specific processing procedure of step 204 is as follows:

Step 2041, if the correlation result between the two attribute column data is a positive correlation, merging the query division code sets on which the different data query requests corresponding to the two attribute column data base their division, to obtain a joint division code set.

In implementation, if the correlation result between the two attribute column data is a positive correlation, the master control node merges the query division code sets on which the different data query requests corresponding to the two attribute column data base their division, to obtain a joint division code set.

Step 2042, performing data stream division on the full history data stream according to the joint division code set to obtain a data stream division result.

In implementation, the main control node performs data stream division on the full history data stream according to the joint division code set to obtain a data stream division result.

In this embodiment, if the correlation result between the two attribute column data has a positive correlation, the two attribute columns are jointly divided, so that the accuracy of data stream division is improved, division tasks (merging query division codes) are merged, and the processing efficiency of the processing node is improved.

In one embodiment, as shown in fig. 6, the step 2034 includes the following specific processing procedures:

Step 601, calculating a balance degree value between attribute data under each timestamp in any two target attribute columns according to a preset imbalance factor algorithm.

In implementation, after the correlation between the attribute column data has been considered, it is further determined whether the attribute column data change in the same direction (for example, both are incremental data), that is, whether the attribute columns are relatively balanced. To this end, the master control node calculates a balance degree value between the attribute data under each timestamp in any two target attribute columns (for example, the attribute values A and B corresponding to the attribute columns) according to a preset imbalance factor algorithm (IR, Imbalance Ratio). Based on the term definitions that follow, the calculation formula can be written as:

$$IR(A, B) = \frac{|P(A) - P(B)|}{P(A) + P(B) - P(A \cup B)}$$

where |P(A) - P(B)| is the absolute value of the difference between the probabilities of attribute values A and B, and P(A) + P(B) - P(A ∪ B) is the probability of containing attribute value A or B. If the attribute columns corresponding to attribute values A and B change in the same direction, IR(A, B) is 0; otherwise, the larger the difference between attribute values A and B, the larger the value of the imbalance factor.

Step 602, if the balance degree value is smaller than the preset balance degree threshold and the correlation value is greater than the preset correlation threshold, the attribute data between the two target attribute columns has a positive correlation.

In implementation, if the balance degree value is smaller than the preset balance degree threshold and the correlation value is greater than the preset correlation threshold, the master control node determines that the attribute data of the two target attribute columns are positively correlated. Specifically, over the timestamps t contained in the full history data stream, for the attribute values corresponding to two attribute columns A and B to be considered relatively balanced, the balance degree value must be smaller than the preset balance degree threshold [inequality not reproduced in this text].

In this embodiment, on the basis of calculating the correlation between any two target attribute columns, the balance degree of the two target attribute columns is calculated, and it is determined that the two target attribute columns satisfy the positive correlation according to the balance degree threshold and the correlation value.
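The two-part judgment can be sketched as follows. The imbalance factor is computed from the terms described in this embodiment, namely the absolute difference of P(A) and P(B) divided by P(A) + P(B) - P(A ∪ B). How these probabilities are estimated from the stream, the threshold values, and all function and parameter names are assumptions for illustration.

```python
def imbalance_ratio(p_a: float, p_b: float, p_a_union_b: float) -> float:
    """IR(A, B) = |P(A) - P(B)| / (P(A) + P(B) - P(A ∪ B)), per the text's terms."""
    denominator = p_a + p_b - p_a_union_b
    if denominator == 0:
        return 0.0  # assumed convention when neither value occurs
    return abs(p_a - p_b) / denominator


def is_positively_correlated(
    correlation: float,
    balance: float,
    correlation_threshold: float,
    balance_threshold: float,
) -> bool:
    """Positive correlation needs a high correlation value and a low imbalance."""
    return balance < balance_threshold and correlation > correlation_threshold


# Hypothetical probabilities and thresholds.
balanced = imbalance_ratio(p_a=0.4, p_b=0.4, p_a_union_b=0.3)
skewed = imbalance_ratio(p_a=0.6, p_b=0.2, p_a_union_b=0.1)
print(balanced, round(skewed, 3))
print(is_positively_correlated(0.9, balanced, 0.8, 0.2))
print(is_positively_correlated(0.9, skewed, 0.8, 0.2))
```

Equal probabilities give an imbalance factor of 0, so the first pair passes the check, while the skewed pair is rejected even though its correlation value is above the threshold.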

In one embodiment, as shown in fig. 7, the specific processing procedure of step 205 is as follows:

Step 2051, mapping each data stream division result and the division code on which it is based to a preset data division routing table.

Step 2052, according to the load information of the processing nodes for the data stream division results reflected in the data division routing table, sending the data stream division results to the respective processing nodes for processing, and obtaining the query results fed back.

In implementation, the master control node maps each data stream division result and the division code on which it is based to a preset data division routing table. Then, according to the load information of the processing nodes reflected in the data division routing table (namely, the data division results to be processed by each processing node), it sends the data stream division results to the respective processing nodes for processing and obtains the query results fed back.

Specifically, the data partitioning routing table is shown in table 1 below:

TABLE 1 [table content not reproduced in this text]

In this embodiment, the master control node implements load balancing of each processing node by hash partitioning by constructing a routing table of data partitioning results, thereby improving the efficiency of data query processing.
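A routing table of this kind can be sketched as a mapping from each division result to a processing node, maintained together with the load of each node. Since the content of Table 1 is not available in this text, the table layout and the greedy least-loaded assignment policy below are assumptions, not the routing scheme of the application.

```python
from typing import Dict, List, Tuple


def build_routing_table(
    partition_sizes: Dict[str, int], node_ids: List[str]
) -> Tuple[Dict[str, str], Dict[str, int]]:
    """Assign each division result to the currently least-loaded node."""
    load = {node: 0 for node in node_ids}
    table: Dict[str, str] = {}
    # Place the largest division results first so that small ones fill the gaps.
    for key, size in sorted(partition_sizes.items(), key=lambda kv: -kv[1]):
        node = min(node_ids, key=lambda n: load[n])
        table[key] = node
        load[node] += size
    return table, load


# Hypothetical division results (key -> record count) and two processing nodes.
sizes = {"p1": 10, "p2": 3, "p3": 5, "p4": 2}
table, load = build_routing_table(sizes, ["n1", "n2"])
print(table)
print(load)  # {'n1': 10, 'n2': 10}
```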

In one embodiment, as shown in fig. 8, the data stream query method further includes:

Step 801, receiving a plurality of data query requests of the same type sent under a next timestamp.

In implementation, the master control node receives a plurality of data query requests of the same type sent under a next timestamp. Specifically, for a data stream query request that monitors the full history data stream in real time, when the data query requests received under the previous timestamp are Q1 and Q2, the same data query requests Q1 and Q2 can be received under the next timestamp.

Step 802, according to a candidate division code set generated by a plurality of data query requests of the same type, obtaining target attribute column data in a newly added historical data stream under a next timestamp.

In implementation, the master control node obtains target attribute column data in the newly added historical data stream under the next timestamp according to the candidate division code set generated by the plurality of newly received data query requests of the same type. Specifically, when a data query request is received at timestamp t1, the master control node takes the data before timestamp t1 as the full history data. When the same data query request is received at the next timestamp t2, the master control node takes the data between timestamps t1 and t2 as the newly added historical data stream, and obtains the target attribute column data in the newly added historical data stream according to the candidate division code set corresponding to the plurality of query requests.
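Selecting the newly added historical data stream can be sketched as a timestamp filter. The text does not state whether the interval boundaries are inclusive; the sketch assumes the half-open interval from t1 (inclusive) to t2 (exclusive), which is consistent with treating data before t1 as the full history. The field name `timestamp` and the function name are hypothetical.

```python
from typing import Any, Dict, List


def newly_added_stream(
    stream: List[Dict[str, Any]], t1: int, t2: int
) -> List[Dict[str, Any]]:
    """Return records with t1 <= timestamp < t2 (assumed boundary convention)."""
    return [record for record in stream if t1 <= record["timestamp"] < t2]


# Hypothetical stream with integer timestamps 1 to 5.
stream = [{"timestamp": t, "speed": 10.0 * t} for t in range(1, 6)]
added = newly_added_stream(stream, t1=3, t2=5)
print([record["timestamp"] for record in added])  # [3, 4]
```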

Step 803, performing data stream division on the target attribute column data in the newly added historical data stream to obtain a newly added data stream division result.

In implementation, the master control node performs data stream division on the target attribute column data in the newly added historical data stream according to the method in steps 201 to 204, so as to obtain a newly added data stream division result. Because the division process for the newly added historical data stream is the same as that for the original historical data stream, it is not described again in this embodiment of the application.

Step 804, receiving the data stream division result distribution amount information of the previous timestamp fed back by each processing node, and calculating a task distribution balance degree value of the previous timestamp on each processing node according to the data stream division result distribution amount information.

In implementation, while handling the data query requests of the next timestamp, the master control node receives from each processing node feedback on the processing status of the data stream division results of the previous timestamp, including the distribution amount information of those data stream division results on each processing node. The master control node then calculates the task distribution balance degree value of the previous timestamp on each processing node according to the distribution amount information of the data stream division results.

Optionally, the master control node may also obtain, according to the content reflected in the data dividing routing table, allocation amount information of the data stream dividing result corresponding to the last timestamp.

Specifically, for example, suppose there are 4 processing nodes, namely processing node 1, processing node 2, processing node 3, and processing node 4, and the allocation amounts of data stream division results are as follows: processing node 1 is allocated 10 data stream division results (which may also be referred to as data processing unit tasks), processing node 2 is allocated 3, processing node 3 is allocated 5, and processing node 4 is allocated 0. The master control node selects the processing node with the minimum allocation amount and the processing node with the maximum allocation amount, calculates the ratio of the minimum allocation amount to the maximum allocation amount (0/10), and then determines the balance degree of the data stream division results by comparing this ratio with a preset threshold.
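The ratio in this example can be computed with a few lines of code. The sketch below is illustrative only: the function names, the threshold value 0.5, and the choice to treat an all-zero allocation as balanced are assumptions, not details given in the application.

```python
def balance_degree(allocations):
    """Ratio of the smallest to the largest per-node allocation amount."""
    if not allocations:
        raise ValueError("at least one processing node is required")
    largest = max(allocations)
    if largest == 0:
        return 1.0  # assumption: no tasks anywhere is treated as balanced
    return min(allocations) / largest


def needs_rebalancing(allocations, threshold):
    """True when the balance degree falls below the preset threshold."""
    return balance_degree(allocations) < threshold


# Allocation amounts of processing nodes 1 to 4 from the example above.
example = [10, 3, 5, 0]
unbalanced = needs_rebalancing(example, threshold=0.5)  # 0.5 is a hypothetical threshold
```

For the four nodes in the example the ratio is 0/10, so any positive threshold marks the allocation as unbalanced.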

Step 805, if the balance degree value is smaller than a preset balance degree threshold value, allocating processing nodes to the division result of the newly added data stream according to a preset incremental clustering feature tree algorithm.

In implementation, if the processing task amount balance degree of each processing node is smaller than a preset balance degree threshold, the main control node performs processing node allocation on the division result of the newly added data stream according to a preset incremental clustering feature tree algorithm.

Specifically, as in the above example of step 804, the ratio of the minimum allocation amount to the maximum allocation amount is 0/10, indicating that the load of the processing nodes is unbalanced when the data stream division results are allocated. To improve the data stream processing speed, the newly added data stream division results are therefore allocated using an incremental clustering feature tree algorithm (ICF-tree). Because the number of newly added data stream division results and the number of processing nodes are known, the number of clusters is set to k, and a k-means algorithm is used to cluster the ICF-tree. The clustering process is as follows:

Step one: as the timestamp advances, when newly added historical data stream data is obtained and the corresponding newly added data stream division result is obtained, traverse downwards from the root node of the clustering feature tree constructed from the original full history data stream division results, determine the leaf node closest to the newly added data stream division result, and then determine the target ICF node, contained in that leaf node, that is closest to the newly added data stream division result.

Step two: after the newly added data stream division result is added to the target ICF node, if the radius of the hypersphere corresponding to the target ICF node is still smaller than a preset threshold T (the threshold on the number of triples contained in the target ICF node), update all ICF triples on the path upwards accordingly and end the incremental clustering process; otherwise, execute step three.

Step three: if the radius of the hypersphere corresponding to the target ICF node is larger than the preset threshold T, create a new ICF node and add the triple corresponding to the newly added data stream division result to the new ICF node. If the number of ICF nodes in the current leaf node is smaller than a preset threshold L, put the new ICF node into the current leaf node, update all ICF triples on the path upwards, and end the incremental clustering process; otherwise, execute step four.

Step four: if the number of ICF nodes in the current leaf node is larger than the preset threshold L, split the current leaf node into two new leaf nodes, and take the two ICF nodes whose hyperspheres are farthest apart among all ICF nodes in the original leaf node as the first ICF nodes of the two new leaf nodes respectively. Assign each of the other ICF nodes, according to the distance principle, to the new leaf node whose first ICF node is closer. Finally, check upwards whether the parent nodes on the path also need to be split; if no parent node needs to be split, end the incremental clustering process. Otherwise, go to step three (the parent node is split in the same way as the leaf node).
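Steps two to four can be sketched for a single leaf node as follows. This is a simplified illustration, not the application's implementation: it assumes the usual clustering feature triple (count, linear sum, sum of squares), treats T as a radius threshold and L as the maximum number of entries per leaf, and omits the tree traversal of step one, the upward update of ancestor triples, parent splitting, and the final k-means pass. The names `CF`, `dist` and `insert` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class CF:
    """Clustering feature triple summarising a group of points."""
    n: int      # number of points
    ls: list    # per-dimension linear sum
    ss: float   # sum of squared norms

    @classmethod
    def of(cls, point):
        return cls(1, list(point), sum(x * x for x in point))

    def merged(self, other):
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        # R^2 = SS/N - ||centroid||^2, clamped against rounding error
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


def dist(a, b):
    """Euclidean distance between the centroids of two entries."""
    return math.dist(a.centroid(), b.centroid())


def insert(leaf, point, T, L):
    """Insert a point into a leaf (list of CF); return the resulting leaves."""
    new = CF.of(point)
    if leaf:
        target = min(range(len(leaf)), key=lambda i: dist(leaf[i], new))
        candidate = leaf[target].merged(new)
        if candidate.radius() < T:       # step two: absorb into the nearest entry
            leaf[target] = candidate
            return [leaf]
    leaf.append(new)                     # step three: open a new entry
    if len(leaf) <= L:
        return [leaf]
    # Step four: split, seeding the two new leaves with the farthest pair.
    pairs = [(i, j) for i in range(len(leaf)) for j in range(i + 1, len(leaf))]
    i, j = max(pairs, key=lambda p: dist(leaf[p[0]], leaf[p[1]]))
    left, right = [leaf[i]], [leaf[j]]
    for k, cf in enumerate(leaf):
        if k not in (i, j):
            nearer = left if dist(cf, leaf[i]) <= dist(cf, leaf[j]) else right
            nearer.append(cf)
    return [left, right]
```

Storing only the triple keeps each insertion cheap, because the radius after a tentative merge can be computed without revisiting the individual points.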

In this embodiment, on the basis of hash-based allocation of the data stream division results, the allocation information of the data stream division results in the data division routing table is obtained and the load balance degree of the processing nodes is judged. When the load is unbalanced, the newly added data stream division results are allocated by means of the incremental clustering feature tree. This avoids the situation in which some processing nodes have no data query tasks (no data stream division results allocated) while other processing nodes have too many data query tasks, so that the load of the processing nodes is balanced and the data stream query efficiency is improved.

Optionally, when the master control node allocates the data stream division results to different processing nodes according to the content of the data division routing table, data may be copied and transmitted between processing nodes to ensure the correctness of the query result of each query request. Specifically, for example, the data division routing table records that dividing the data stream for the first query request Q1 yields the data stream division result {V_a1, V_a2} in processing node 1 and {V_a3, V_a4, V_a5} in processing node 2, while dividing the data stream for the second query request Q2 yields {V_a1, V_a2, V_a3} in processing node 1 and {V_a4, V_a5} in processing node 2. With the data stream divided according to the first query request Q1, the data V_a3 in processing node 2 is copied and sent to processing node 1 to ensure the accuracy of the data query of the second query request Q2 in processing node 1.
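The copy decision in this example amounts to a set difference between what a node already holds and what a query requires there. The following sketch is illustrative; the function name `copies_needed` and the dictionary layout are assumptions, and it presumes that every required item is currently held by some node.

```python
def copies_needed(current, required):
    """List (item, source_node, destination_node) copies.

    current:  dict mapping node -> set of items placed there (e.g. by Q1's division).
    required: dict mapping node -> set of items another query (e.g. Q2) needs there.
    Assumes every required item is held by some node in `current`.
    """
    owner = {item: node for node, items in current.items() for item in items}
    transfers = []
    for node in sorted(required):
        missing = required[node] - current.get(node, set())
        for item in sorted(missing):
            transfers.append((item, owner[item], node))
    return transfers


q1 = {1: {"V_a1", "V_a2"}, 2: {"V_a3", "V_a4", "V_a5"}}
q2 = {1: {"V_a1", "V_a2", "V_a3"}, 2: {"V_a4", "V_a5"}}
plan = copies_needed(q1, q2)  # V_a3 is copied from processing node 2 to processing node 1
```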

It should be understood that although the steps in the flow charts of fig. 2-8 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2-8 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 9, there is provided a data stream query apparatus 900, including: a receiving module 910, a generating module 920, a calculating module 930, a dividing module 940 and a sending module 950, wherein:

the receiving module 910 is configured to receive a plurality of data query requests, where each data query request carries a query partition code set, and the query partition code set is an attribute key value set of a data stream.

The generating module 920 is configured to generate a candidate partition code set according to the query partition code set carried in each data query request.

The calculating module 930 is configured to obtain attribute column data corresponding to the full history data stream according to the candidate partition code set, calculate a correlation value between any two attribute column data, and obtain a correlation result between different attribute column data according to a preset correlation threshold.

The dividing module 940 is configured to merge the query division code sets according to the correlation result to obtain a joint division code set, and to divide the full history data stream according to the joint division code set to obtain a data stream division result.

The sending module 950 is configured to send the data stream division result to different processing nodes respectively for processing, so as to obtain a query result.

In one embodiment, the generating module 920 is specifically configured to obtain a query partition code set carried in each data query request;

In an embodiment, the calculating module 930 is specifically configured to determine, according to each partition code included in the candidate partition code set, a target attribute in all the dimensional attribute sets of the full history data stream;

acquiring target attribute column data according to the target attribute;

In an embodiment, the dividing module 940 is specifically configured to, if the correlation result between two attribute column data is a positive correlation, merge the query division code sets on which the different data query requests corresponding to the two attribute column data base their division, to obtain a joint division code set;

In an embodiment, the calculating module 930 is specifically configured to calculate, according to a preset imbalance factor algorithm, a degree of balance value between attribute data under each timestamp in any two target attribute columns;

if the balance degree value is smaller than the preset balance degree threshold value and the correlation value is larger than the preset correlation threshold value, the attribute data between the two target attribute columns have positive correlation.

In an embodiment, the sending module 950 is specifically configured to map the data stream partition result and the partition code on which the data stream partition result is based into a preset data partition routing table respectively;

and respectively sending the data stream division results to each processing node for processing according to the load information of the data stream division results of the processing nodes reflected in the data division routing table, and obtaining the fed-back query results.

In one embodiment, the data stream query device 900 further comprises:

the receiving module is used for receiving a plurality of data query requests of the same type sent under the next timestamp;

the acquisition module is used for acquiring target attribute column data in a newly-added historical data stream under a next timestamp according to a candidate division code set generated by a plurality of data query requests of the same type;

the dividing module is used for dividing the data stream of the target attribute column data in the newly-added historical data stream to obtain a newly-added data stream dividing result;

the calculation module is used for receiving the data stream division result distribution amount information of the previous timestamp fed back by each processing node and calculating a task distribution balance degree value of the previous timestamp on each processing node according to the data stream division result distribution amount information;

and the distribution module is used for distributing the processing nodes to the division result of the newly added data stream according to a preset incremental clustering feature tree algorithm if the balance degree value is smaller than a preset balance degree threshold value.

For the specific definition of the data stream query device, reference may be made to the above definition of the data stream query method, which is not described herein again. The modules in the data stream query device can be implemented in whole or in part by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

In one embodiment, a computer device (master node) is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing the full history data stream and the data of each data stream division result. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a data stream query method.

Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, there is provided a computer device (master node) comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the following steps when executing the computer program:

receiving a plurality of data query requests, wherein each data query request carries a query division code set which is an attribute key value set of a data stream;

merging the query division code sets according to the correlation result to obtain a joint division code set, and dividing the full history data stream according to the joint division code set to obtain a data stream division result;

and respectively sending the data stream division results to different processing nodes for processing to obtain query results.

In one embodiment, the processor, when executing the computer program, further performs the steps of:

acquiring a query division code set carried in each data query request;

determining target attributes in all dimension attribute sets of the full history data stream according to all division codes contained in the candidate division code sets;

acquiring target attribute column data according to the target attribute;

if the correlation result between the two attribute column data is positive correlation, combining the query division code sets of different data query request division bases corresponding to the two attribute column data to obtain a combined division code set;

mapping the data stream division result and the division code on which the data stream division result is based into a preset data division routing table respectively;

acquiring target attribute column data in a newly added historical data stream under a next timestamp according to a candidate division code set generated by a plurality of data query requests of the same type;

receiving data stream division result distribution amount information of the previous timestamp fed back by each processing node, and calculating a task distribution balance degree value of the previous timestamp on each processing node according to the data stream division result distribution amount information;

and if the balance degree value is smaller than a preset balance degree threshold value, processing node allocation is carried out on the division result of the newly added data stream according to a preset incremental clustering feature tree algorithm.

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

In one embodiment, the computer program when executed by the processor further performs the steps of:

acquiring a query division code set carried in each data query request;

acquiring target attribute column data according to the target attribute;

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for querying a data stream, the method comprising:

2. The method of claim 1, wherein the generating a candidate set of partitions from the set of query partitions carried in each of the data query requests comprises:

acquiring a query division code set carried in each data query request;

3. The method according to claim 1, wherein the obtaining of corresponding attribute column data of a full history data stream according to the candidate partition code set, calculating a correlation value between any two attribute column data, and obtaining a correlation result between different attribute column data according to a preset correlation threshold value comprises:

acquiring the target attribute column data according to the target attribute;

4. The method according to claim 1, wherein the correspondingly merging the query partition code sets according to the correlation results to obtain a joint partition code set, and partitioning the full history data stream according to the joint partition code set to obtain data stream partitioning results comprises:

5. The method of claim 3, wherein if the correlation value is greater than a preset correlation threshold, the attribute data between the two target attribute columns has a correlation relationship, including:

6. The method of claim 1, wherein the sending the data stream partitioning result to different processing nodes for processing to obtain a query result comprises:

7. The method of claim 1, further comprising:

8. A data stream querying device, the device comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.