CN113254493A - Data grouping statistical method and system for distributed database - Google Patents

Data grouping statistical method and system for distributed database

Info

Publication number
CN113254493A
Authority
CN
China
Prior art keywords
data
grouping
fragmentation
target
grouped
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010752066.1A
Other languages
Chinese (zh)
Inventor
熊志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Hanyun Technology Co ltd
Original Assignee
Shenzhen Hanyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Hanyun Technology Co ltd
Priority to CN202010752066.1A
Publication of CN113254493A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computing Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application belongs to the technical field of data analysis, and in particular relates to a data grouping statistical method and system for a distributed database. In the method, a server and N data nodes jointly perform grouping statistics on a data set to be grouped that is distributed across the N data nodes: the server obtains the data fragmentation type of the data set to be grouped; the N data nodes perform one round or two rounds of grouping statistics according to that data fragmentation type; and the server aggregates the grouping statistics results of the N data nodes. Because the grouping is performed on the data nodes, the server does not need to group large volumes of data itself or pull the full data set to be grouped from the N data nodes, which shortens the response time of the server and improves its grouping statistics efficiency.

Description

Data grouping statistical method and system for distributed database
Technical Field
The application belongs to the technical field of data analysis, and particularly relates to a data grouping statistical method and system for a distributed database.
Background
In a distributed database, the data set to be grouped is stored in a scattered manner across a plurality of data nodes (DNs). To perform grouping statistics on data scattered over multiple DNs, the server has to read the qualifying data from all the DNs and then perform the grouping statistics itself. When the data set to be grouped is large, however, grouping statistics on the server has a long response time and low efficiency.
Disclosure of Invention
Embodiments of the present application provide a data grouping statistical method and system for a distributed database, which address the low grouping statistics efficiency of the server when a large data set is grouped and counted in the prior art.
In a first aspect, an embodiment of the present application provides a data grouping statistics method for a distributed database. The method is applied to a data grouping statistics system for the distributed database, the system includes a server and N data nodes, N is an integer greater than 1, a data set to be grouped is distributed in the N data nodes, and the data grouping statistics method includes:
the server obtains the data fragmentation type of the data set to be grouped;
the N data nodes perform one round or two rounds of grouping statistics on their respective local data sets according to the data fragmentation type of the data set to be grouped, to obtain N target grouping statistical result sets, wherein the local data set of a data node is the part of the data set to be grouped that is distributed in that data node, and each data node corresponds to one target grouping statistical result set;
and the server aggregates the N target grouping statistical result sets to determine a target statistical result.
In a second aspect, an embodiment of the present application provides a data grouping statistics system for a distributed database. The system includes a server and N data nodes, where N is an integer greater than 1, and a data set to be grouped is distributed in the N data nodes;
the server is configured to obtain the data fragmentation type of the data set to be grouped;
the N data nodes are configured to perform one round or two rounds of grouping statistics on their respective local data sets according to the data fragmentation type of the data set to be grouped, to obtain N target grouping statistical result sets, where the local data set of a data node is the part of the data set to be grouped that is distributed in that data node, and each data node corresponds to one target grouping statistical result set;
and the server is further configured to aggregate the N target grouping statistical result sets and determine a target statistical result.
Compared with the prior art, the embodiments of the present application have the following advantages. The server and the N data nodes jointly perform grouping statistics on the data set to be grouped that is distributed in the N data nodes: the server obtains the data fragmentation type of the data set to be grouped, the N data nodes perform one round or two rounds of grouping statistics according to that type, and the server aggregates the grouping statistics results of the N data nodes to obtain the final statistical result. Because the grouping statistics are performed by the N data nodes, the server does not need to do the grouping itself when handling a large amount of data, and does not need to pull the full data set to be grouped from the N data nodes. This shortens the response time of the server and improves its grouping statistics efficiency.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of a data grouping statistical method for a distributed database according to a first embodiment of the present application;
fig. 2 is a schematic flowchart of data grouping statistics for hash-fragmented data according to the first embodiment of the present application;
fig. 3 is a schematic flowchart of a data grouping statistical method for a distributed database according to a second embodiment of the present application;
fig. 4 is a schematic flowchart of data grouping statistics for non-hash-fragmented data with a large result volume according to the second embodiment of the present application;
fig. 5 is a schematic flowchart of data grouping statistics for non-hash-fragmented data with a small result volume according to the second embodiment of the present application;
fig. 6 is a schematic flowchart of a merge-like grouping statistical method according to the second embodiment of the present application;
fig. 7 is a schematic diagram of the network architecture of a data grouping statistics system for a distributed database according to a third embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The data grouping statistical method for the distributed database provided by the embodiment of the present application can be applied to terminal devices such as a desktop computer, a notebook computer, an ultra-mobile personal computer (UMPC), and the like, and the embodiment of the present application does not limit the specific types of the terminal devices.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
In order to explain the technical means of the present application, the following description will be given by way of specific examples.
Referring to fig. 1, which is a schematic flowchart of a data grouping statistics method for a distributed database according to an embodiment of the present application, the data grouping statistics method may be used in a data grouping statistics system for a distributed database, where the data grouping statistics system includes a server and N data nodes, where N is an integer greater than 1, and as shown in fig. 1, the data grouping statistics method may include the following steps:
Step S101: the server obtains the data fragmentation type of the data set to be grouped.
As shown in fig. 2, the part of the data set to be grouped that is stored on DN1 (data node 1), i.e. the raw data of DN1, has the fields name (Zhang San, Li Si, etc.), class (Class 1, Class 2), course (Chinese, mathematics) and score (90, 96, etc.).
The data set to be grouped may be distributed in the N data nodes in the form of a data table. The attributes of the data table include information such as the data fragmentation type, so the server can obtain the data fragmentation type of the data set to be grouped by reading the attributes of the corresponding data table.
Optionally, the server is a scheduling server, and the scheduling server is connected to the N data nodes.
Step S102: the N data nodes perform one round or two rounds of grouping statistics on their respective local data sets according to the data fragmentation type of the data set to be grouped, to obtain N target grouping statistical result sets.
The local data set of a data node is the part of the data set to be grouped that is distributed in that data node. Each of the N data nodes performs grouping statistics on its own local data set to obtain its own target grouping statistical result set, so the N data nodes yield N target grouping statistical result sets, one per data node.
Optionally, the step in which the N data nodes perform one round or two rounds of grouping statistics on their respective local data sets according to the data fragmentation type of the data set to be grouped, to obtain N target grouping statistical result sets, includes:
when the data fragmentation type of the data set to be grouped is hash fragmentation and the fragmentation field set of the hash fragmentation of the data set to be grouped is a subset of the target fragmentation field set, the N data nodes perform grouping statistics once on their respective local data sets to obtain the N target grouping statistical result sets.
After obtaining the data fragmentation type of the data set to be grouped, the server detects whether that type is hash fragmentation. If it is, the server further detects whether the fragmentation field set of the hash fragmentation of the data set to be grouped is a subset of the target fragmentation field set.
The target fragmentation field set consists of at least one of the M field sets of the data set to be grouped. For example, if the field sets are name, class, course and score, the target fragmentation field set may be class and course, and its subsets include class alone and course alone. If the data set to be grouped is hash-fragmented on class, its fragmentation field set is a subset of the target fragmentation field set. If the data set to be grouped is hash-fragmented on name, its fragmentation field set is not a subset of the target fragmentation field set.
When the data fragmentation type of the data set to be grouped is hash fragmentation and the fragmentation field set of the hash fragmentation is a subset of the target fragmentation field set, the server sends first result information to the N data nodes. After receiving the first result information, the N data nodes perform grouping statistics on their respective local data sets to obtain the N target grouping statistical result sets. The first result information is the information indicating that the data fragmentation type of the data set to be grouped is hash fragmentation and that the fragmentation field set of the hash fragmentation is a subset of the target fragmentation field set.
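The check that selects the single-round path can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `needs_single_round`, the string labels `"hash"` and `"range"`, and the use of Python sets for field sets are all hypothetical and do not come from the application.

```python
def needs_single_round(fragment_type, fragment_fields, target_fields):
    """Return True when one round of local grouping statistics is enough.

    Assumption: fragmentation types are plain strings and field sets are
    Python sets of field names; both are illustrative choices.
    """
    return (
        fragment_type == "hash"
        and len(fragment_fields) > 0
        and fragment_fields <= target_fields  # subset test
    )


target_fields = {"class", "course"}
print(needs_single_round("hash", {"class"}, target_fields))   # hash-fragmented on class
print(needs_single_round("hash", {"name"}, target_fields))    # hash-fragmented on name
print(needs_single_round("range", {"class"}, target_fields))  # hypothetical non-hash type
```

The reason the subset test matters is that, when the fragmentation fields are all grouping fields, every row of a given group is already on the same data node, so the per-node results never need to be combined across nodes.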
Fig. 2 is a schematic flowchart of data grouping statistics when the data set to be grouped is hash-fragmented and the fragmentation field set of the hash fragmentation is a subset of the target fragmentation field set. Here the fragmentation field set of the hash fragmentation is class. Each data node performs grouping statistics on its own local data set (the raw data in fig. 2) by "class" and "course", and each result row contains "class, course, total score and average score". d1, d2 and d3 denote the target grouping statistical result sets of DN1 (data node 1), DN2 (data node 2) and DN3 (data node 3) respectively, and gathering all the target grouping statistical result sets together yields the target statistical result.
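A minimal Python sketch of this single-round flow follows. The rows, the student names beyond those mentioned in the text, and the function name `local_group_stats` are made up for illustration; the sketch assumes each node holds rows of the form (name, class, course, score) and that the data is already fragmented by class.

```python
from collections import defaultdict


def local_group_stats(rows):
    """Group (name, class, course, score) rows by (class, course).

    Returns {(class, course): (total score, average score)}.
    """
    acc = defaultdict(lambda: [0, 0])  # key -> [total, count]
    for _name, cls, course, score in rows:
        acc[(cls, course)][0] += score
        acc[(cls, course)][1] += 1
    return {key: (total, total / count) for key, (total, count) in sorted(acc.items())}


# Hypothetical raw data: fragmented by class, so Class 1 lives only on DN1.
dn1_rows = [
    ("Zhang San", "Class 1", "Chinese", 90),
    ("Li Si", "Class 1", "Chinese", 96),
    ("Zhang San", "Class 1", "mathematics", 88),
]
dn2_rows = [("Wang Wu", "Class 2", "Chinese", 80)]

d1 = local_group_stats(dn1_rows)
d2 = local_group_stats(dn2_rows)

# Server side: groups are disjoint across nodes, so a plain union suffices.
target_result = {**d1, **d2}
print(target_result)
```

Because no group spans two nodes in this case, each node can compute the final average itself and the server only concatenates the result sets.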
Optionally, the data grouping statistics method further includes:
the server detects whether the data fragmentation type of the data set to be grouped is hash fragmentation;
if the data fragmentation type of the data set to be grouped is hash fragmentation, the server detects whether the fragmentation field set of the hash fragmentation of the data set to be grouped is a subset of the target fragmentation field set, obtains a first detection result, and sends the first detection result to the N data nodes;
if the data fragmentation type of the data set to be grouped is not hash fragmentation, the server obtains a second detection result and sends the second detection result to the N data nodes;
the first detection result indicates either that the data fragmentation type of the data set to be grouped is hash fragmentation and the fragmentation field set of the hash fragmentation is a subset of the target fragmentation field set, or that the data fragmentation type of the data set to be grouped is hash fragmentation and the fragmentation field set of the hash fragmentation is not a subset of the target fragmentation field set; the second detection result indicates that the data fragmentation type of the data set to be grouped is not hash fragmentation.
Each data node executes the corresponding steps according to the information in the first detection result or the second detection result.
Step S103: the server aggregates the N target grouping statistical result sets and determines a target statistical result.
After the N data nodes obtain their respective target grouping statistical result sets, they send those result sets to the server, and the server aggregates the N target grouping statistical result sets, for example by integrating them, to obtain the target statistical result. In one embodiment, the server sends an instruction for acquiring the target grouping statistical result set to each data node and waits for the data node to finish its grouping statistics; once the target grouping statistical result set is available, the data node responds to the instruction by returning the result set to the server.
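The request-and-wait interaction can be sketched as a simple fan-out. This is an illustrative sketch only: the application does not specify a transport or concurrency model, so the thread pool, the function name `collect_target_result_sets`, and the `request_result_set` callable (standing in for whatever call the server makes to a data node) are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_target_result_sets(data_nodes, request_result_set):
    """Ask every data node for its result set and wait for all replies.

    `request_result_set` is a hypothetical callable that blocks until the
    given node has finished its grouping statistics and returns its rows.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(data_nodes))) as pool:
        # map() preserves the order of `data_nodes` in the returned list.
        return list(pool.map(request_result_set, data_nodes))


# Stand-in for real data nodes: a dict from node name to a finished result set.
fake_cluster = {
    "DN1": [("Class 1", "Chinese", 186, 93.0)],
    "DN2": [("Class 2", "Chinese", 80, 80.0)],
    "DN3": [],
}
result_sets = collect_target_result_sets(list(fake_cluster), fake_cluster.__getitem__)

# Integrate the N result sets into one target statistical result.
target_result = [row for result_set in result_sets for row in result_set]
print(target_result)
```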
In this embodiment, the grouping statistics on the data set to be grouped are performed by the N data nodes, so the server does not need to do the grouping itself when handling a large amount of data and does not need to pull the full data set to be grouped from the N data nodes. This shortens the response time of the server and improves its grouping statistics efficiency.
Referring to fig. 3, which is a schematic flowchart of a data grouping statistical method for a distributed database according to a second embodiment of the present application, as shown in fig. 3, the data grouping statistical method may include the following steps:
Step S301: the server obtains the data fragmentation type of the data set to be grouped.
Step S302: when the data fragmentation type of the data set to be grouped is not hash fragmentation, or the fragmentation field set of the hash fragmentation of the data set to be grouped is not a subset of the target fragmentation field set, the N data nodes perform a first round of grouping statistics on their respective local data sets to obtain N initial grouping statistical result sets.
After obtaining the data fragmentation type of the data set to be grouped, the server detects whether that type is hash fragmentation. If it is, the server further detects whether the fragmentation field set of the hash fragmentation of the data set to be grouped is a subset of the target fragmentation field set.
When the data fragmentation type of the data set to be grouped is hash fragmentation but the fragmentation field set of the hash fragmentation is not a subset of the target fragmentation field set, or when the data fragmentation type is not hash fragmentation, the server sends second result information to the N data nodes. After receiving the second result information, the N data nodes perform grouping statistics on their respective local data sets to obtain N initial grouping statistical result sets. The second result information is the information indicating either of those two cases.
Fig. 4 is a schematic flowchart of data grouping statistics when the data set to be grouped is not hash-fragmented. Each data node performs grouping statistics on its own local data set (the raw data in fig. 4) by "class" and "course", and each result row contains "class, course, total score and number of people". d1, d2 and d3 denote the initial grouping statistical result sets of DN1 (data node 1), DN2 (data node 2) and DN3 (data node 3) respectively.
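The first round can be sketched as a partial aggregation that keeps the total and the count instead of the average. The function name `partial_group_stats` and the sample rows are hypothetical; the row layout (class, course, total score, number of people) follows the description above.

```python
from collections import defaultdict


def partial_group_stats(rows):
    """First-round grouping on one data node.

    Takes (name, class, course, score) rows and returns sorted
    (class, course, total score, number of people) rows.
    """
    acc = defaultdict(lambda: [0, 0])  # key -> [total, count]
    for _name, cls, course, score in rows:
        entry = acc[(cls, course)]
        entry[0] += score
        entry[1] += 1
    return [
        (cls, course, total, count)
        for (cls, course), (total, count) in sorted(acc.items())
    ]


# Hypothetical local data of one node; a group may also have rows on other nodes.
dn_rows = [
    ("Zhang San", "Class 1", "Chinese", 94),
    ("Li Si", "Class 2", "mathematics", 96),
    ("Wang Wu", "Class 2", "mathematics", 70),
]
initial_result_set = partial_group_stats(dn_rows)
print(initial_result_set)
```

Carrying the count rather than a per-node average is what makes a later combination step possible: two partial rows for the same group can be added field by field, whereas two averages cannot be combined without knowing how many rows each one covers.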
Step S303: when the total number of rows of the N initial grouping statistical result sets is greater than a row number threshold, the N data nodes hash-fragment their respective initial grouping statistical result sets according to the target fragmentation field set to obtain their respective hash fragmentation result sets.
After the N data nodes obtain their respective initial grouping statistical result sets, whether those result sets need to be hash-fragmented is determined from the total number of rows of the N initial grouping statistical result sets. The larger the total number of rows, the larger the data volume of the initial grouping statistical result sets. To maintain statistical efficiency, when the total number of rows is greater than the row number threshold, each data node hash-fragments its initial grouping statistical result set to obtain a hash fragmentation result set, and then performs grouping statistics on its own hash fragmentation result set to obtain its target grouping statistical result set.
Optionally, the step in which the N data nodes hash-fragment their respective initial grouping statistical result sets according to the target fragmentation field set to obtain their respective hash fragmentation result sets includes:
the N data nodes distribute each row of data in their respective initial grouping statistical result sets to the corresponding data node by hash fragmentation according to the target fragmentation field set, to obtain their respective hash fragmentation result sets.
Each row of data in an initial grouping statistical result set may be one initial grouping statistics result. As shown in fig. 4, "Class 1 Chinese 94 1" in the initial grouping statistical result set d1 of DN1 is one row of data; d1 contains 6 rows of data, and each row is distributed to the corresponding data node according to the target fragmentation field set.
As shown in fig. 4, if the total number of rows of the initial grouping statistical result sets d1, d2 and d3 is greater than the row number threshold, each data node hash-fragments its initial grouping statistical result set. In fig. 4, DN1 distributes "Class 4 Chinese 90 1" and "Class 4 mathematics 76 1" to DN2, and in the end the hash fragmentation result sets on the 3 data nodes are hash-fragmented by the grouping field "class".
Optionally, the step in which the N data nodes distribute each row of data in their respective initial grouping statistical result sets to the corresponding data node by hash fragmentation according to the target fragmentation field set, to obtain their respective hash fragmentation result sets, includes:
the N data nodes obtain the hash value of the target data in each row of data of their respective initial grouping statistical result sets, calculate the remainder of that hash value, and distribute each row of data to the data node among the N data nodes that corresponds to the remainder, to obtain their respective hash fragmentation result sets, wherein the target data is the data in each row whose field set is the target fragmentation field set.
A data node reads one row of data, calculates the hash value of the target data in that row, takes the remainder of the hash value, and determines from the remainder which data node the row is distributed to. Target data with similar hash values may be distributed to the same data node as required; for example, the data of "Class 1" and "Class 2" may be distributed to one data node. The target data is the data in the row whose field set is the target fragmentation field set. For example, if the row is "Class 1 Chinese 94 1" and the target fragmentation field set is class, the target data of the row is "Class 1"; the hash value of "Class 1" is calculated, it corresponds to data node 1, and the row "Class 1 Chinese 94 1" is therefore distributed to data node 1. As shown in fig. 4, the hash values of "Class 1" and "Class 2" both correspond to DN1, so DN2 distributes the three rows "Class 1 mathematics 98 1", "Class 2 Chinese 80 1" and "Class 2 mathematics 86 1" to DN1. Together with the rows "Class 1 Chinese 94 1", "Class 2 Chinese" and "Class 2 mathematics 96 1" that remain on DN1, they form the hash fragmentation result set of DN1.
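The hash-and-remainder routing can be sketched as follows. Several choices here are assumptions rather than details from the application: the divisor is taken to be the number of data nodes, `zlib.crc32` stands in for the unspecified hash function, and the function name `route_rows` and the zero-based node indexes are hypothetical.

```python
import zlib


def route_rows(rows, target_indexes, n_nodes):
    """Split one node's initial result rows into one outbox per data node.

    `target_indexes` selects the columns that make up the target data,
    e.g. [0] for "class". Assumption: node = crc32(target data) % n_nodes.
    """
    outboxes = [[] for _ in range(n_nodes)]
    for row in rows:
        # Join the target columns with a separator that cannot occur in them.
        key = "\x1f".join(str(row[i]) for i in target_indexes)
        node = zlib.crc32(key.encode("utf-8")) % n_nodes
        outboxes[node].append(row)
    return outboxes


rows = [
    ("Class 1", "Chinese", 94, 1),
    ("Class 1", "mathematics", 98, 1),
    ("Class 4", "Chinese", 90, 1),
    ("Class 4", "mathematics", 76, 1),
]
outboxes = route_rows(rows, target_indexes=[0], n_nodes=3)
for node_index, outbox in enumerate(outboxes):
    print(node_index, outbox)
```

A deterministic hash such as CRC32 is used in this sketch instead of Python's built-in `hash`, because `hash` on strings is randomized per process and every node must route the same key to the same destination. Which node a given class lands on depends entirely on the hash function chosen, so the placement printed here will not necessarily match the placement drawn in fig. 4.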
Step S304: the N data nodes perform a second round of grouping statistics on their respective hash fragmentation result sets to obtain N target grouping statistical result sets.
As shown in fig. 4, DN1 performs grouping statistics on its hash fragmentation result set to obtain ds1. ds1 is the target grouping statistical result set of DN1, ds2 is the target grouping statistical result set of DN2, and ds3 is the target grouping statistical result set of DN3. The server gathers ds1, ds2 and ds3 together to obtain the target statistical result.
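The second round can be sketched as a combination of partial rows that share a key. The function name `second_round_stats` and the sample values are illustrative; the sketch assumes each incoming row has the form (class, course, total score, number of people) and that the output row carries the total and the average.

```python
from collections import defaultdict


def second_round_stats(partial_rows):
    """Combine (class, course, total, count) rows that share a key.

    Returns sorted (class, course, total score, average score) rows.
    """
    acc = defaultdict(lambda: [0, 0])  # key -> [total, count]
    for cls, course, total, count in partial_rows:
        acc[(cls, course)][0] += total
        acc[(cls, course)][1] += count
    return [
        (cls, course, total, total / count)
        for (cls, course), (total, count) in sorted(acc.items())
    ]


# Illustrative hash fragmentation result set of one node after redistribution.
hash_fragment_result_set = [
    ("Class 1", "Chinese", 94, 1),
    ("Class 2", "mathematics", 96, 1),
    ("Class 1", "mathematics", 98, 1),
    ("Class 2", "Chinese", 80, 1),
    ("Class 2", "mathematics", 86, 1),
]
ds = second_round_stats(hash_fragment_result_set)
print(ds)
```

In this example the two "Class 2 mathematics" rows, which came from different nodes, are added to a total of 182 over 2 people before the average is taken.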
Step S305: the server aggregates the N target grouping statistical result sets to determine a target statistical result.
The contents of step S301 and step S305 are the same as those of step S101 and step S103 in the first embodiment, and reference may be made to the description of step S101 and step S103, which is not repeated herein.
Optionally, the data grouping statistics method further includes:
when the total number of rows is less than or equal to the row number threshold, the server performs grouping statistics on the N initial grouping statistical result sets to obtain the target statistical result.
After the N data nodes obtain their respective initial grouping statistical result sets, whether those result sets need to be hash-fragmented is determined from the total number of rows of the N initial grouping statistical result sets. When the total number of rows is small, the data volume of the initial grouping statistical result sets is not large, so the server can perform grouping statistics on them directly. This preserves grouping statistics performance and saves communication time between the server and the data nodes.
Fig. 5 is a schematic flowchart of data grouping statistics when the data set to be grouped is not hash-fragmented. Each data node performs grouping statistics on its own local data set (the raw data in fig. 5) by "class" and "course", and each result row contains "class, course, total score and number of people". d1, d2 and d3 denote the initial grouping statistical result sets of DN1 (data node 1), DN2 (data node 2) and DN3 (data node 3) respectively. If the total number of rows of d1, d2 and d3 is less than or equal to the row number threshold, the server performs grouping statistics on d1, d2 and d3 directly to obtain the target statistical result.
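The choice between the two second-stage paths reduces to one comparison, sketched below. The function name `choose_second_stage`, the returned labels and the threshold value are hypothetical; the application does not say how the row number threshold is chosen or which component evaluates it.

```python
def choose_second_stage(initial_result_sets, row_threshold):
    """Pick the second stage from the total row count of the initial result sets."""
    total_rows = sum(len(result_set) for result_set in initial_result_sets)
    if total_rows > row_threshold:
        # Large intermediate result: re-fragment by hash and regroup on the data nodes.
        return "hash_fragment_on_data_nodes"
    # Small intermediate result: the server groups the rows directly.
    return "group_on_server"


d1 = [("Class 1", "Chinese", 94, 1), ("Class 2", "mathematics", 96, 1)]
d2 = [("Class 1", "mathematics", 98, 1)]
d3 = []
print(choose_second_stage([d1, d2, d3], row_threshold=2))
print(choose_second_stage([d1, d2, d3], row_threshold=3))
```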
Optionally, the performing packet statistics on the N initial packet statistical result sets to obtain the target statistical result includes:
sorting each row of data in the N initial grouping statistical result sets according to the ascending order of the fragment field sets;
acquiring N first-line data of N initial grouping statistical result sets;
grouping the data with the minimum fragment field set of the N first-row data into the same group and carrying out statistics to obtain a group statistical result;
removing the data with the minimum fragment field set from the corresponding initial grouping statistical result set, and taking the row that follows the removed data in that result set as its new first-row data;
and traversing each row of data in the N initial grouping statistical result sets to obtain all grouping statistical results, namely the target statistical results.
The server performs grouping statistics on d1, d2 and d3 using a merge-like grouping statistics algorithm. As shown in Fig. 6, the data in each initial grouping statistical result set is sorted in ascending order of "class" and "course", and the data is then processed with an approach similar to a merge:
(1) Step one: read the first-row data of the data sets d1, d2 and d3 respectively, and find the minimum data when comparing by "class" and "course".
If only "class" were compared, "Class One" would be the minimum data; if only "course" were compared, "Chinese" would be the minimum data. When both "class" and "course" are compared, as shown in Fig. 6, the minimum data is the row "Class One, Chinese, 94, 1" of d1, which is treated as one group for statistics: the total score is 94 and the number of students is 1, so the average score is 94/1 = 94. The statistical result of the group, "Class One, Chinese, 94, 94", is output to the result set, and the first-row data is then removed from d1.
(2) Step two: read the first-row data of the data sets d1, d2 and d3 respectively, and find the minimum data when comparing by "class" and "course".
When both "class" and "course" are compared, as shown in Fig. 6, the minimum data is the row "Class One, Math, 98, 1" of d2, which is treated as one group for statistics: the total score is 98 and the number of students is 1, so the average score is 98/1 = 98. The statistical result of the group, "Class One, Math, 98, 98", is output to the result set, and the first-row data is then removed from d2.
(3) Step three: read the first-row data of the data sets d1, d2 and d3 respectively, and find the minimum data when comparing by "class" and "course".
When both "class" and "course" are compared, as shown in Fig. 6, the minimum data is the row "Class Two, Chinese, 90, 1" of d1 together with the row "Class Two, Chinese, 80, 1" of d2, and the two rows are treated as one group for statistics: the total score is 90 + 80 = 170 and the number of students is 1 + 1 = 2, so the average score is 170/2 = 85. The statistical result of the group, "Class Two, Chinese, 170, 85", is output to the result set, and the first-row data is then removed from d1 and d2 respectively.
These steps are repeated on the first-row data of d1, d2 and d3 until all data in d1, d2 and d3 has been processed; the statistical results output for all removed first-row data together form the target statistical result.
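The merge-like procedure walked through in steps (1) to (3) can be sketched as a k-way merge followed by grouping of adjacent equal keys. This is an illustrative sketch only: `merge_group_stats` is a hypothetical name, the first two rows of d1 and d2 follow the values quoted in the steps above, and the single row in d3 is an assumption, since the text does not list the contents of d3.

```python
import heapq
from itertools import groupby

def merge_group_stats(*result_sets):
    """Merge sorted partial results into final (class, course, total, average).

    Each input row is (class, course, total score, number of students) and
    each input list must already be sorted by (class, course).
    """
    def key(row):
        return (row[0], row[1])

    # heapq.merge repeatedly yields the smallest first row among the inputs,
    # which mirrors "find the minimum first-row data, then remove it".
    merged = heapq.merge(*result_sets, key=key)
    out = []
    for (cls, course), rows in groupby(merged, key=key):
        rows = list(rows)
        total = sum(r[2] for r in rows)
        count = sum(r[3] for r in rows)
        out.append((cls, course, total, total / count))
    return out

d1 = [("Class One", "Chinese", 94, 1), ("Class Two", "Chinese", 90, 1)]
d2 = [("Class One", "Math", 98, 1), ("Class Two", "Chinese", 80, 1)]
d3 = [("Class Two", "Math", 85, 1)]  # hypothetical row, not from Fig. 6

result = merge_group_stats(d1, d2, d3)
```

Because every input is sorted on the same key, rows belonging to one group always surface next to each other in the merged stream, so the server needs to hold only the current group in memory.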
Optionally, the data grouping statistical method further includes:
and the server acquires the total row number of the N initial grouping statistical result sets and judges whether the total row number is greater than a row number threshold.
After the N data nodes obtain their respective initial grouping statistical result sets, each data node may send the row number of its own initial grouping statistical result set to the server, and the server sums these to obtain the total row number of the N initial grouping statistical result sets. In one embodiment, after a data node obtains its initial grouping statistical result set, it sends a grouping statistics completion instruction to the server; on receiving that instruction, the server sends a row number acquisition instruction to the data node; and the data node computes the row number of its own initial grouping statistical result set according to the row number acquisition instruction and returns it to the server.
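The server-side decision that follows from the reported row numbers can be sketched as below. The function names, the node labels, the reported counts, and the threshold value are all assumptions made for illustration; the source does not specify a concrete threshold.

```python
def total_rows(row_counts):
    """Sum the row numbers reported by the data nodes."""
    return sum(row_counts)

def server_should_merge_directly(row_counts, row_threshold):
    """True when the total row number is <= the threshold, i.e. the server
    can group the initial result sets itself instead of asking the data
    nodes to hash-fragment them."""
    return total_rows(row_counts) <= row_threshold

# Hypothetical row numbers reported by three data nodes.
reported = {"DN1": 2, "DN2": 2, "DN3": 1}
ROW_THRESHOLD = 1000  # hypothetical value

direct = server_should_merge_directly(reported.values(), ROW_THRESHOLD)
```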
In this embodiment, grouping statistics on a data set to be grouped that is distributed over the N data nodes in a non-hash manner can be carried out by the N data nodes themselves. When a large amount of data is involved, the server does not need to do the grouping, so it avoids pulling the whole data set to be grouped from the N data nodes; this shortens the server's response time and improves grouping statistics efficiency.
Corresponding to the data grouping statistical method in the foregoing embodiments, Fig. 7 shows a schematic network architecture diagram of a data grouping statistical system of a distributed database provided in the third embodiment of the present application. For convenience of explanation, only the parts related to the embodiments of the present application are shown.
Referring to Fig. 7, the data grouping statistical system includes a server and N data nodes, wherein N is an integer greater than 1, and a data set to be grouped is distributed in the N data nodes;
the server is used for acquiring the data fragment type of the data set to be grouped;
the N data nodes are used for carrying out one-time grouping statistics or two-time grouping statistics on respective local data sets according to the data fragment type of the data set to be grouped to obtain N target grouping statistical result sets, wherein the local data set of a data node is the portion of the data set to be grouped that is distributed on that data node, and each data node corresponds to one target grouping statistical result set;
and the server is used for collecting the N target grouping statistical result sets and determining a target statistical result.
Optionally, the N data nodes are specifically configured to:
and when the data fragment type of the data set to be grouped is the Hash fragment and the fragment field set of the Hash fragment of the data set to be grouped belongs to the subset of the target fragment field set, carrying out grouping statistics on respective local data sets to obtain N target grouping statistical result sets.
Optionally, the N data nodes are specifically configured to:
when the data fragmentation type of the data set to be grouped is not hash fragmentation or the fragmentation field set of the hash fragmentation of the data set to be grouped does not belong to the subset of the target fragmentation field set, carrying out grouping statistics on respective local data sets to obtain N initial grouping statistical result sets;
when the total row number of the N initial grouping statistical result sets is larger than the row number threshold value, carrying out Hash fragmentation on the respective initial grouping statistical result sets according to the target fragmentation field set to obtain respective Hash fragmentation result sets;
and carrying out grouping statistics on the respective Hash fragmentation result sets to obtain N target grouping statistical result sets.
Optionally, the server is further configured to:
and when the total row number is less than or equal to the row number threshold, performing grouping statistics on the N initial grouping statistical result sets to obtain a target statistical result.
Optionally, when the total number of rows is less than or equal to the number of rows threshold, the server is specifically configured to:
sorting each row of data in the N initial grouping statistical result sets according to the ascending order of the fragment field sets;
acquiring N first-line data of N initial grouping statistical result sets;
grouping the data with the minimum fragment field set of the N first-row data into the same group and carrying out statistics to obtain a group statistical result;
removing the data with the minimum fragment field set from the corresponding initial grouping statistical result set, and taking the row that follows the removed data in that result set as its new first-row data;
and traversing each row of data in the N initial grouping statistical result sets to obtain all grouping statistical results, namely the target statistical results.
Optionally, the N data nodes are specifically configured to:
and distributing each row of data in the respective initial grouping statistical result set to the corresponding data node by utilizing the Hash fragmentation according to the target fragmentation field set to obtain the respective Hash fragmentation result set.
Optionally, the N data nodes are specifically configured to:
and obtaining the hash value of the target data in each line of data of the respective initial grouping statistical result set, calculating the remainder of the hash value of the target data in each line of data of the respective initial grouping statistical result set, and distributing each line of data to the data nodes corresponding to the remainder in the N data nodes to obtain the respective hash fragmentation result set, wherein the target data refers to the data of which the field set in each line of data is the target fragmentation field set.
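The remainder-based distribution and the second grouping pass can be sketched as follows. This is a sketch under stated assumptions: the source does not name a hash function, so CRC32 from Python's `zlib` is used here only because it gives the same value on every node; the helper names, the field separator, and the row in d3 are likewise hypothetical.

```python
import zlib
from collections import defaultdict

def target_node(row, n_nodes, key_indexes=(0, 1)):
    """Node index = hash(target fields) mod N (CRC32 is an assumed choice)."""
    key = "\x1f".join(str(row[i]) for i in key_indexes).encode("utf-8")
    return zlib.crc32(key) % n_nodes

def redistribute(initial_sets):
    """Send every row of every initial result set to the node selected by
    the remainder of its hash, yielding one hash fragmentation result set
    per node."""
    n = len(initial_sets)
    shards = [[] for _ in range(n)]
    for result_set in initial_sets:
        for row in result_set:
            shards[target_node(row, n)].append(row)
    return shards

def second_pass(shard):
    """Second grouping on one node: combine partial totals and counts."""
    acc = defaultdict(lambda: [0, 0])
    for cls, course, total, count in shard:
        acc[(cls, course)][0] += total
        acc[(cls, course)][1] += count
    return [(cls, course, total, count)
            for (cls, course), (total, count) in sorted(acc.items())]

d1 = [("Class One", "Chinese", 94, 1), ("Class Two", "Chinese", 90, 1)]
d2 = [("Class One", "Math", 98, 1), ("Class Two", "Chinese", 80, 1)]
d3 = [("Class Two", "Math", 85, 1)]  # hypothetical row

shards = redistribute([d1, d2, d3])
final = [row for shard in shards for row in second_pass(shard)]
```

Rows with equal target fields always produce the same remainder, so each group ends up whole on exactly one node and the server only has to concatenate the N target grouping statistical result sets.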
Optionally, the server is further configured to:
and acquiring the total row number of the N initial grouping statistical result sets, and judging whether the total row number is greater than the row number threshold.
Optionally, the server is further configured to:
detecting whether the data fragmentation type of a data set to be grouped is Hash fragmentation or not;
if the data fragmentation type of the data set to be grouped is Hash fragmentation, detecting whether the fragmentation field set of the Hash fragmentation of the data set to be grouped belongs to the subset of the target fragmentation field set or not to obtain a first detection result, and sending the first detection result to the N data nodes;
if the data fragmentation type of the data set to be grouped is not Hash fragmentation, obtaining a second detection result, and sending the second detection result to the N data nodes;
the first detection result indicates either that the data fragmentation type of the data set to be grouped is hash fragmentation and the fragmentation field set of the hash fragmentation of the data set to be grouped belongs to the subset of the target fragmentation field set, or that the data fragmentation type of the data set to be grouped is hash fragmentation and the fragmentation field set of the hash fragmentation of the data set to be grouped does not belong to the subset of the target fragmentation field set; the second detection result indicates that the data fragmentation type of the data set to be grouped is not hash fragmentation.
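The detection logic above, combined with the one-time and two-time grouping conditions stated earlier for the data nodes, can be sketched as a small decision function. The names `choose_plan`, `"one_pass"` and `"two_pass"`, and the string `"hash"` used to denote the fragmentation type, are hypothetical labels introduced for this illustration.

```python
def choose_plan(shard_type, shard_fields, target_fields):
    """Decide between one-time and two-time grouping statistics.

    One-time grouping applies only when the data set is hash-fragmented and
    its fragmentation field set is a subset of the target fragmentation
    field set; every other case takes the two-time path.
    """
    if shard_type == "hash" and set(shard_fields) <= set(target_fields):
        return "one_pass"
    return "two_pass"

plan_a = choose_plan("hash", ["class"], ["class", "course"])
plan_b = choose_plan("hash", ["student_id"], ["class", "course"])
plan_c = choose_plan("range", ["class"], ["class", "course"])
```

In the first call the data is assumed to be hash-fragmented on "class" alone; all rows of any (class, course) group then already sit on one node, which is why a single grouping pass suffices.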
It should be noted that, because the information interaction between the server and the N data nodes, the implementation process, and other details are based on the same concept as the method embodiments of the present application, their specific functions and technical effects can be found in the method embodiments and are not described again here.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A data grouping statistical method of a distributed database is applied to a data grouping statistical system of the distributed database, the data grouping statistical system comprises a server and N data nodes, N is an integer greater than 1, and is characterized in that a data set to be grouped is distributed in the N data nodes, and the data grouping statistical method comprises the following steps:
the server acquires the data fragment type of the data set to be grouped;
the N data nodes perform one-time grouping statistics or two-time grouping statistics on respective local data sets according to the data fragment type of the data set to be grouped to obtain N target grouping statistical result sets, wherein the local data set of a data node is the portion of the data set to be grouped that is distributed in that data node, and each data node corresponds to one target grouping statistical result set;
and the server carries out statistics on the N target grouping statistical result sets to determine a target statistical result.
2. The data grouping statistical method of claim 1, wherein the N data nodes perform one-time grouping statistics or two-time grouping statistics on respective local data sets according to the data fragmentation types of the data sets to be grouped to obtain N target grouping statistical result sets, and the method comprises:
and when the data fragmentation type of the data set to be grouped is Hash fragmentation and the fragmentation field set of the Hash fragmentation of the data set to be grouped belongs to the subset of the target fragmentation field set, the N data nodes carry out grouping statistics on respective local data sets to obtain N target grouping statistical result sets.
3. The data grouping statistical method of claim 1, wherein the N data nodes perform one-time grouping statistics or two-time grouping statistics on respective local data sets according to the data fragmentation types of the data sets to be grouped to obtain N target grouping statistical result sets, and the method comprises:
when the data fragmentation type of the data set to be grouped is not Hash fragmentation or the fragmentation field set of the Hash fragmentation of the data set to be grouped does not belong to the subset of the target fragmentation field set, the N data nodes carry out first grouping statistics on respective local data sets to obtain N initial grouping statistical result sets;
when the total row number of the N initial grouping statistical result sets is larger than a row number threshold value, the N data nodes perform Hash fragmentation on the respective initial grouping statistical result sets according to the target fragmentation field set to obtain respective Hash fragmentation result sets;
and the N data nodes carry out second grouping statistics on the respective Hash fragmentation result sets to obtain the N target grouping statistical result sets.
4. The data grouping statistical method of claim 3, characterized in that the data grouping statistical method further comprises:
and when the total row number is less than or equal to the row number threshold, the server carries out grouping statistics on the N initial grouping statistical result sets to obtain a target statistical result.
5. The data grouping statistic method according to claim 4, wherein said performing grouping statistics on said N initial grouping statistic result sets to obtain target statistic results comprises:
sorting each row of data in the N initial grouping statistical result sets according to the ascending order of the fragment field sets;
acquiring N first-line data of the N initial grouping statistical result sets;
grouping the data with the minimum fragment field set of the N first-row data into the same group and carrying out statistics to obtain a group statistical result;
removing the data with the minimum fragment field set from the corresponding initial grouping statistical result set, and taking the row that follows the removed data in that result set as its new first-row data;
and traversing each row of data in the N initial grouping statistical result sets to obtain all grouping statistical results, namely target statistical results.
6. The data grouping statistical method according to claim 3, wherein the N data nodes performing hash fragmentation on the respective initial grouping statistical result sets according to the target fragmentation field set to obtain respective hash fragmentation result sets comprises:
and the N data nodes distribute each row of data in the respective initial grouping statistical result set to the corresponding data nodes by utilizing the Hash fragmentation according to the target fragmentation field set to obtain the respective Hash fragmentation result sets.
7. The data grouping statistical method according to claim 6, wherein the distributing, by the N data nodes, each row of data in the respective initial grouping statistical result set to the corresponding data node by using hash fragmentation according to the target fragmentation field set to obtain the respective hash fragmentation result set comprises:
the N data nodes obtain the hash value of the target data in each row of data of the respective initial grouping statistic result set, the remainder of the hash value of the target data in each row of data of the respective initial grouping statistic result set is calculated, each row of data is distributed to the data nodes corresponding to the remainder in the N data nodes, and the respective hash fragmentation result set is obtained, wherein the target data refers to the data of which the field set in each row of data is the target fragmentation field set.
8. The data grouping statistical method of claim 3, further comprising:
and the server acquires the total row number of the N initial grouping statistical result sets and judges whether the total row number is greater than the row number threshold.
9. The data grouping statistical method according to claim 2 or 3, characterized in that the data grouping statistical method further comprises:
the server detects whether the data fragmentation type of the data set to be grouped is hash fragmentation or not;
if the data fragmentation type of the data set to be grouped is Hash fragmentation, detecting whether a fragmentation field set of the Hash fragmentation of the data set to be grouped belongs to a subset of a target fragmentation field set or not to obtain a first detection result, and sending the first detection result to the N data nodes;
if the data fragmentation type of the data set to be grouped is not Hash fragmentation, obtaining a second detection result, and sending the second detection result to the N data nodes;
the first detection result indicates either that the data fragmentation type of the data set to be grouped is hash fragmentation and the fragmentation field set of the hash fragmentation of the data set to be grouped belongs to a subset of the target fragmentation field set, or that the data fragmentation type of the data set to be grouped is hash fragmentation and the fragmentation field set of the hash fragmentation of the data set to be grouped does not belong to the subset of the target fragmentation field set; the second detection result indicates that the data fragmentation type of the data set to be grouped is not hash fragmentation.
10. A data grouping statistical system of a distributed database comprises a server and N data nodes, wherein N is an integer greater than 1, and is characterized in that a data set to be grouped is distributed in the N data nodes;
the server is used for acquiring the data fragment type of the data set to be grouped;
the N data nodes are configured to perform one-time grouping statistics or two-time grouping statistics on respective local data sets according to the data fragment type of the data set to be grouped to obtain N target grouping statistical result sets, where the local data set of a data node is the portion of the data set to be grouped that is distributed in that data node, and each data node corresponds to one target grouping statistical result set;
and the server is used for counting the N target grouping statistical result sets and determining a target statistical result.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010752066.1A CN113254493A (en) 2020-07-30 2020-07-30 Data grouping statistical method and system for distributed database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010752066.1A CN113254493A (en) 2020-07-30 2020-07-30 Data grouping statistical method and system for distributed database

Publications (1)

Publication Number Publication Date
CN113254493A true CN113254493A (en) 2021-08-13

Family

ID=77220104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010752066.1A Pending CN113254493A (en) 2020-07-30 2020-07-30 Data grouping statistical method and system for distributed database

Country Status (1)

Country Link
CN (1) CN113254493A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114327261A (en) * 2021-12-06 2022-04-12 神州融安数字科技(北京)有限公司 Data file storage method and data security agent

Similar Documents

Publication Publication Date Title
US11863439B2 (en) Method, apparatus and storage medium for application identification
CN110505179A (en) A kind of detection method and system of exception flow of network
US11706114B2 (en) Network flow measurement method, network measurement device, and control plane device
WO2021022875A1 (en) Distributed data storage method and system
CN111507479B (en) Feature binning method, device, equipment and computer-readable storage medium
CN113987002A (en) Data exchange method based on mass data analysis platform
CN112543145A (en) Method and device for selecting communication path of equipment node for sending data
CN113254493A (en) Data grouping statistical method and system for distributed database
CN110430138B (en) Data flow forwarding state recording method and network equipment
WO2024083270A1 (en) Photovoltaic device search method, management module, system, and storage medium
CN107483310B (en) Method and system for networking between terminal and forwarding node
CN117294497A (en) Network traffic abnormality detection method and device, electronic equipment and storage medium
CN115296904B (en) Domain name reflection attack detection method and device, electronic equipment and storage medium
CN104133907A (en) Cloud computing data automatic classifying and counting method and system
CN116980281A (en) Node selection method, node selection device, first node, storage medium and program product
CN115331400A (en) Alarm fusion method, system and medium based on distributed optical fiber sensing
CN112019589B (en) Multi-level load balancing data packet processing method
WO2017054515A1 (en) Method and system for detecting pornographic image
CN110061922B (en) Message forwarding method and device
CN110493144A (en) A kind of data processing method and device
CN115086186B (en) Method and device for generating network flow demand data of data center
CN117278660B (en) Protocol analysis method for flow filtering based on DPDK technology
CN116980378B (en) Method and system for marking repeated message of micro-channel group
CN112214290B (en) Log information processing method, edge node, center node and system
CN113542035B (en) Service port identification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination