CN111506621A

CN111506621A - Data statistical method and device

Info

Publication number: CN111506621A
Application number: CN202010246298.XA
Authority: CN
Inventors: 杨恒
Original assignee: New H3C Big Data Technologies Co Ltd
Current assignee: New H3C Big Data Technologies Co Ltd
Priority date: 2020-03-31
Filing date: 2020-03-31
Publication date: 2020-08-07
Anticipated expiration: 2040-03-31
Also published as: CN111506621B

Abstract

The invention provides a data statistics method and device. Based on a pre-configured, extensible label tree, the fields that can be converted to labels in a service data set extracted from each service system are converted into label fields, and the resulting label data set is preliminarily aggregated using the parallel processing capability of a distributed computing framework. As a result, the efficiency of big data analysis is improved and its cost is reduced.

Description

Data statistical method and device

Technical Field

The invention relates to the technical field of big data, in particular to a data statistical method and device.

Background

For example, on an e-commerce platform, analyzing order data typically means extracting the order data distributed across databases in various regions into one large relational database and then running statistical analysis with the statistical functions of SQL statements. As the data volume grows rapidly, the response speed of this approach drops sharply. In addition, when a new field is added or a field value switches between single-valued and multi-valued, both the table structure of the relational database and the SQL statements must be modified, so extensibility is poor.

Disclosure of Invention

The invention provides a data statistical method and a data statistical device, which are used for solving the technical problems of low speed and poor expansibility when mass data are counted based on a relational database.

Based on the embodiment of the invention, the data statistical method provided by the invention comprises the following steps:

converting data records in a service data set into label records based on a preset label tree, wherein the label tree is in a hierarchical tree structure, takes labels as nodes and is used for establishing a mapping relation between a field value in the service data set and the labels and expressing the hierarchical relation and the field attributes of the field value in the service data set;

performing preliminary Key-Value aggregation on the label records using a distributed parallel computing framework, and then loading the label records into a database of a distributed data statistical analysis engine;

and generating a retrieval condition based on the label of the label tree, and calling a statistical analysis interface of the distributed data statistical analysis engine to obtain a statistical result.

Further, the method for performing preliminary Key-Value aggregation on the label records comprises:

merging the label fields in a label record to generate a Key in the Map stage of the Map-Reduce component of Hadoop, and aggregating the label records with the same Key in the Reduce stage of the Map-Reduce component.

Further, the label tree supports multi-value attributes, and the database of the distributed data statistical analysis engine supports storage and statistics of multi-value label fields; for a label field with the multi-value attribute, after preliminary Key-Value aggregation is performed on the label records and before the label records are loaded into the database of the distributed data statistical analysis engine, the label records containing the label field with the multi-value attribute are merged, thereby generating label records that contain the multi-value label field.

Further, the statistical method further comprises a statistical expansion step of:

expanding the label tree and updating a configuration file for converting the data records of the service data set into label records, wherein the expansion comprises expanding any layer of label nodes below the root node;

and generating a statistical condition based on the expanded label tree, and calling a statistical analysis interface of the distributed data statistical analysis engine to obtain a statistical result.

Further, the distributed parallel computing framework is a Map-Reduce component or a Spark component, and the distributed data statistical analysis engine is ElasticSearch; the label records that have undergone preliminary Key-Value aggregation are loaded into an ElasticSearch database through an ES-Hadoop connector.

Based on the embodiment of the invention, the invention also provides a data statistical device, which comprises:

a conversion module, used for converting data records in a service data set into label records based on a preset label tree, wherein the label tree is in a hierarchical tree structure, takes labels as nodes and is used for establishing a mapping relation between a field value in the service data set and the labels and expressing the hierarchical relation and the field attributes of the field value in the service data set;

a warehousing module, used for loading the label records into a database of the distributed data statistical analysis engine after preliminary Key-Value aggregation is performed on the label records using a distributed parallel computing framework;

and the statistical module is used for generating a retrieval condition based on the label of the label tree and calling a statistical analysis interface of the distributed data statistical analysis engine to obtain a statistical result.

Further, the warehousing module merges the label fields in a label record to generate a Key in the Map stage of the Map-Reduce component of Hadoop, and then aggregates the label records with the same Key in the Reduce stage of the Map-Reduce component.

Further, the label tree supports multi-value attributes, and the database of the distributed data statistical analysis engine supports storage and statistics of multi-value label fields; for a label field with the multi-value attribute, after the warehousing module performs preliminary Key-Value aggregation on the label records and before the label records are loaded into the database of the distributed data statistical analysis engine, the label records containing the label field with the multi-value attribute are merged, thereby generating label records that contain the multi-value label field.

Further, the statistical device further comprises:

the configuration updating module is used for expanding the label tree and updating a configuration file for converting the data records of the service data set into the label records, wherein the expansion comprises expanding any layer of label nodes below the root node;

and the statistical module generates statistical conditions based on the expanded label tree and calls a statistical analysis interface of the distributed data statistical analysis engine to obtain statistical results.

Further, the distributed parallel computing framework is a Map-Reduce component or a Spark component, and the distributed data statistical analysis engine is ElasticSearch; the warehousing module loads the label records that have undergone preliminary Key-Value aggregation into an ElasticSearch database using an ES-Hadoop connector.

To meet the statistical analysis needs of big data, the invention provides a fast and highly extensible data statistics method and framework. Based on a pre-configured, extensible label tree, the method converts the fields that can be converted to labels in the service data set extracted from each service system into label fields, that is, it converts the service data set into a label data set. The label data set is then preliminarily aggregated using the parallel processing capability of a distributed computing framework such as Map-Reduce, and the preliminarily aggregated label data set is loaded into a database of a distributed data analysis engine, for example an ElasticSearch database. The strong search and aggregation capabilities of the distributed data statistical analysis engine then enable fast and flexible statistical analysis of massive data, which improves the efficiency and reduces the cost of big data analysis.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments of the present invention or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present invention, and for those skilled in the art, other drawings may be obtained according to the drawings of the embodiments of the present invention.

FIG. 1 is a flow chart of a data statistics method in one embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating a process of loading the converted tag records into a database of a distributed statistical analysis engine after performing preliminary aggregation on the tag records by Map-Reduce in an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a data statistics apparatus according to an embodiment of the present invention.

Detailed Description

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the invention. As used in the examples and claims of the present invention, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The term "and/or" as used herein is meant to encompass any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used to describe various information in embodiments of the present invention, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of embodiments of the present invention. Moreover, depending on the context, the word "if" as used herein may be interpreted as "at the time of", "when", or "in response to a determination".

ElasticSearch (ES) provides the concept of an Index, and user queries are executed against an Index. A Shard is a data fragment of ES; an Index may consist of several Shards to achieve distributed scalability. The Shards that store an Index are distributed across multiple Nodes, each Node runs an ES instance, and a Node is a single server.

ES provides an aggregation application program interface (aggregation API) that adds statistical analysis capabilities to ES, making it easier for users to extract statistics from large data. ES mainly provides two categories of aggregation: Metric aggregation and Bucketing aggregation. Bucketing aggregation places data that satisfies a particular rule into a bucket, each bucket being associated with a key; this is equivalent to grouping the data, with each group called a bucket. Metric aggregation computes statistics, such as minimum, maximum, sum, and average, over data that meets certain conditions or over the data within a bucket.
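The difference between the two aggregation categories can be illustrated without an ES cluster. The following is a minimal in-memory sketch, not ES code: the record values and the function names `bucket_by_key` and `metrics` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical tag records: (LOCATION tag value, COUNTER value).
records = [
    ("T02001001", 10),
    ("T02001003", 20),
    ("T02001001", 30),
    ("T02001003", 60),
]


def bucket_by_key(rows):
    """Bucketing aggregation: group rows into buckets, one bucket per key."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[key].append(value)
    return dict(buckets)


def metrics(values):
    """Metric aggregation: statistics over the values inside one bucket."""
    return {
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "avg": sum(values) / len(values),
    }


buckets = bucket_by_key(records)
result = {key: metrics(values) for key, values in buckets.items()}
```

Here the bucketing step plays the role of grouping, and the metric step is applied inside each bucket, which mirrors how a metric aggregation can be nested under a bucket aggregation.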

By combining the good extensibility of a label tree with the strong distributed statistical analysis capability of a distributed database, the present invention provides a data statistics method. FIG. 1 is a flow chart of the data statistics method provided by the embodiment of the present invention. The method includes:

Step 101, converting data records in a service data set into label records based on a preset label tree;

In an embodiment of the present invention, a front-end device responsible for data acquisition in the statistical system obtains a service data set from each service system. The fields in the service data set fall into two groups: fields that can be converted to labels, collectively referred to as convertible fields, and fields that do not require label conversion, collectively referred to as non-label fields. For example, the order generation location field is a convertible field whose values in the service data set may take the form "Beijing" or "Xi'an". When extracting data, the front-end device converts the convertible fields in the service data set into label fields based on the predefined label tree, for example converting "Beijing" into the corresponding label "T02001" in the label tree and "Xi'an" into "T02002001". Label conversion of the data records in the service data set yields label records. Some fields of a label record are converted from convertible fields, and the rest are the non-label fields of the service data set, which express specific numerical values such as time, quantity, and price.

For example, the field that reflects the order generation location may be named "LOCATION" in the original business database. The label corresponding to this field in the label tree may be defined as "T02", and a label value is defined for each province or municipality directly under the central government, similar to administrative division codes: for example, "Beijing" corresponds to "T02001" and "Haidian District, Beijing" corresponds to "T02001001".
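The conversion of a convertible field into its label can be sketched as a simple lookup. This is an illustrative sketch only: the dictionary `LOCATION_LABELS` and the functions `to_label` and `convert_record` are hypothetical names, and the mapping contains just the label values quoted in the text.

```python
# Hypothetical fragment of the order generation location branch: value -> label.
LOCATION_LABELS = {
    "Beijing": "T02001",
    "Haidian District, Beijing": "T02001001",
    "Xi'an": "T02002001",
}


def to_label(value, labels):
    """Return the label for a field value, or None if no label is configured."""
    return labels.get(value)


def convert_record(record, convertible_fields):
    """Convert convertible fields to labels; keep non-label fields unchanged."""
    converted = dict(record)
    for field, labels in convertible_fields.items():
        converted[field] = to_label(record[field], labels)
    return converted


order = {"LOCATION": "Beijing", "DATE": "20190520", "COUNTER": 1}
label_record = convert_record(order, {"LOCATION": LOCATION_LABELS})
```

Because the label values are built like administrative division codes, the label of a district extends the label of its city: in this sketch `"T02001001".startswith("T02001")` holds.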

Converting the data records in the service data set into label records means converting each field that can be converted to a label field into the corresponding label in the label tree. In some cases, the data set extracted from a business system has many fields, but the statistical analysis does not need all of them. A conversion configuration file therefore tells the statistical system which fields of the service data set to acquire, and each acquired field corresponds to a label node in the label tree. If a new field needs to be added later, only the configuration file needs to be modified; the program code does not.

Step 102, performing preliminary Key-Value aggregation on the label records using a distributed parallel computing framework, and then loading the label records into a database of a distributed data statistical analysis engine;

In an embodiment of the present invention, a method for performing preliminary Key-Value aggregation on the label records includes: merging the label fields in a label record to generate a Key in the Map stage of the Map-Reduce component of Hadoop, and aggregating the label records with the same Key in the Reduce stage of the Map-Reduce component;

In the embodiment of the invention, before the label records that have undergone preliminary Key-Value aggregation are loaded into the database of the distributed data statistical analysis engine, the generated Key needs to be decomposed back into the label fields of the label record, and the label fields are then loaded into the database together with the Value.

In an embodiment of the invention, the distributed parallel computing framework is a Map-Reduce or Spark component, and the distributed data statistical analysis engine is ElasticSearch.

Step 103, generating a retrieval condition based on the labels of the label tree, and calling a statistical analysis interface of the distributed data statistical analysis engine to obtain a statistical result.
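The key merging, aggregation, and reverse decomposition described above can be sketched in plain Python, without Hadoop. This is a minimal single-process sketch under stated assumptions: the separator `"|"`, the field list `TAG_FIELDS`, the sample records, and the choice of summation as the aggregation are all illustrative.

```python
from collections import defaultdict

TAG_FIELDS = ["MEMBER", "LOCATION", "TYPE"]
SEPARATOR = "|"  # assumed delimiter; must not occur inside any label value


def map_phase(record):
    """Map: merge the label fields of one record into a single Key."""
    key = SEPARATOR.join(record[field] for field in TAG_FIELDS)
    return key, record["COUNTER"]


def shuffle_and_reduce(pairs):
    """Shuffle + Reduce: bring equal Keys together and sum their Values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def split_key(key, value):
    """Reverse decomposition: split the Key back into label fields."""
    record = dict(zip(TAG_FIELDS, key.split(SEPARATOR)))
    record["COUNTER"] = value
    return record


records = [
    {"MEMBER": "T01101", "LOCATION": "T02001001", "TYPE": "T03001001", "COUNTER": 1},
    {"MEMBER": "T01101", "LOCATION": "T02001001", "TYPE": "T03001001", "COUNTER": 1},
    {"MEMBER": "T01103", "LOCATION": "T02001003", "TYPE": "T03002001", "COUNTER": 1},
]
aggregated = shuffle_and_reduce(map_phase(r) for r in records)
loaded = [split_key(key, value) for key, value in aggregated.items()]
```

The two identical records collapse into one record with a COUNTER of 2, which is the point of the preliminary aggregation: fewer, pre-summed records reach the analysis engine.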

In an embodiment of the present invention, the statistical method further includes a statistical expansion step, where the expansion includes expansion of a tag tree and expansion of a statistical condition:

expanding the label tree and updating a configuration file for converting the data records of the service data set into label records, wherein the expansion comprises expanding label nodes at any layer below the root node; the same approach also applies to deleting and modifying label nodes;

The data statistics method provided by the invention is highly extensible because the label tree is organized as a tree structure and the configuration file takes a form such as an eXtensible Markup Language (XML) file. If the fields participating in the statistics need to be extended, the statistical client device loads the label tree into memory; a user can then select statistical conditions through the interface, the statistical conditions are generated based on the labels of the expanded label tree, and the statistical analysis interface of the distributed data statistical analysis engine is called to obtain the statistical results.

In an embodiment of the present invention, the tag tree supports multi-value attributes, and the distributed data statistical analysis engine supports storage and statistics of multi-value tag fields. For a tag field with the multi-value attribute, after Key-Value aggregation is performed on the tag records and before the tag records are loaded into the database of the distributed data statistical analysis engine, the tag fields with the multi-value attribute are merged to generate tag records that contain the multi-value tag field, and these tag records are then loaded into the database of the distributed data statistical analysis engine.

Taking ElasticSearch as an example, the ES database supports storage of multi-value columns. A multi-value column can be stored directly as an array, and each value in it can be counted separately during statistics, which differs from a relational database. This characteristic of ES makes it well suited to defining and extending tag fields with multi-value attributes in the tag tree.

The implementation process of the data statistics method provided by the present invention is described in detail below with reference to specific service examples, and assuming that the embodiment needs to perform order quantity statistics based on the order generation location, the specific process is as follows:

The tag fields that need to be counted are first labeled, for example T01 for member properties, T02 for order generation location, and T03 for commodity category. The tag tree can be configured using an XML file. An example of a tag tree for a commodity order is as follows:

RootDimension represents the root node of the label tree, SubDimension represents the child nodes in the label tree, and the node at the lowest layer is called a leaf node. Code represents a tag value of a defined tag, name represents a name corresponding to the tag, name represents an english name of a tag field, and flag represents a multi-value attribute of the tag at the layer, wherein the flag attribute is exemplified as follows:

0: a single value tag, indicating that only one of the sub-tags appears in the tag field.

1: a multi-value label, indicating that one or more sub-labels can appear in the label field.

2: a multi-valued tag indicating that each sub-tag in the tag field will appear.

According to the above label tree example, T01, T02, etc. are first-level labels, T011, T010, etc. represent second-level labels, and so on, third-level, fourth-level, etc., and the hierarchical structure is easy to expand or prune at a later stage. Therefore, the label tree has great flexibility, and can support any expansion in the later period and any modification of the multi-value attribute of the label field by combining the query statistical process, so that the complexity of the expansion in the later period is reduced, and the expansibility of the service is further enhanced. The design of the multi-level tags can realize the function of hierarchical statistics, and if the query condition is the second-level tags, the statistical data are distributed on the third-level tags.

In the service embodiment, Hadoop is adopted as a basic framework of a distributed system, and ElasticSearch (ES for short) is used as a distributed data statistical analysis engine. The ES-Hadoop connector connects the storage and deep processing capacity of Hadoop mass data with the ES real-time searching and analyzing capacity. In the embodiment, data is stored in a Hadoop Distributed File System (HDFS), and the data is loaded into a database of an ES cluster by using the parallel processing advantage of a Map-Reduce component.

After the label tree is configured, each front-end data sampling device converts the acquired real data of the service system according to the defined label tree, converts the data records in the service data set into label records, and generates a label record set file, wherein the label record set file can be stored in a file server in a text file manner.

The data conversion program needs to extract the required service data field from the service data set according to the conversion configuration file and convert the service data field into the tag field, and the configuration content of the conversion configuration file is exemplified as follows:

dimension.column＝MEMBER:1,LOCATION:2,TYPE:3…

The configuration file sets the correspondence between each label field and the position of the corresponding service data field in the service data set. For example, the entry LOCATION:2 maps the LOCATION label to the field in column 2 of the service data set.

First, in the Map stage, the Map-Reduce component obtains the corresponding label values according to the positions of the label fields, and concatenates the label values of the label fields (such as MEMBER, LOCATION, and TYPE) of each label record, in order, into a character string that serves as the Key. The Key and the Value field form a Key-Value record. Records with the same Key are merged together in the shuffle stage, and the Values that share a Key are summed in the Reduce stage. Finally, before the aggregated label records are loaded into the ES database, each Key is split back into the label values of the individual label fields by reversing the original concatenation, and the results are loaded into the ES cluster through ES-Hadoop.
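Reading this configuration entry can be sketched as follows. The sketch assumes that positions are 1-based and that the entry sits on a single line; the function names `parse_dimension_columns` and `extract_fields` and the sample row are hypothetical.

```python
def parse_dimension_columns(line):
    """Parse 'dimension.column=NAME:POS,...' into a {NAME: POS} mapping."""
    # Accept both the ASCII '=' and the full-width '＝' (U+FF1D).
    normalized = line.replace("\uff1d", "=")
    key, _, value = normalized.partition("=")
    if key.strip() != "dimension.column":
        raise ValueError(f"unexpected key: {key!r}")
    mapping = {}
    for item in value.split(","):
        name, _, position = item.strip().partition(":")
        mapping[name] = int(position)
    return mapping


def extract_fields(row, mapping):
    """Pick the configured columns out of one row (positions assumed 1-based)."""
    return {name: row[position - 1] for name, position in mapping.items()}


config = parse_dimension_columns("dimension.column=MEMBER:1,LOCATION:2,TYPE:3")
row = ["gold member", "Beijing", "books", "unused column"]
fields = extract_fields(row, config)
```

Columns that are not listed in the configuration, such as the fourth one here, are simply not extracted, so adding a field later means adding one `NAME:POS` pair rather than changing code.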

The storage structure in the ES database after data warehousing is shown in table 1 below, for example:

| DATE | MEMBER | LOCATION | TYPE | TERMINAL | PERSONNEL | INTERVAL | COUPON | others | COUNTER |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 20190520 | T01101 | T02001001 | T03001001 | T04001001,T04002001 | T0501001 | T0602 | T0702 | … | 10 |
| 20190521 | T01103 | T02001003 | T02001001 | T04001006,T04002001 | T0501002 | T0604 | T0703 | … | 20 |
| 20190522 | T01101 | T02001001 | T03001001 | T04001001,T04002001 | T0501001 | T0601 | T0702 | … | 30 |
| 20190522 | T010 | T02002001001 | T03001001 | T04001001,T04002003 | T0501002 | T0602 | T0702 | … | 20 |
| 20190522 | T01103 | T02001003 | T03002001 | T04001002,T04002001 | T0501001 | T0604 | T0703 | … | 60 |
| … | | | | | | | | | |

TABLE 1

The ES database supports the storage and statistics of multi-value tag fields. For example, the TERMINAL tag field can store several tag values such as "T04001001, T04002001" as separate values rather than as one whole character string, and when querying or counting, ES treats each tag value in the multi-value tag field as a single value. Combined with the multi-value tag attribute flag in the tag tree, this enables flexible extension, a capability that existing relational databases lack.
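Counting each value of a multi-value field separately can be sketched in memory, using the TERMINAL and COUNTER values from Table 1. This is not ES code; the function name `count_multi_value` and the choice to weight each value by COUNTER are assumptions made for illustration.

```python
from collections import Counter

# TERMINAL (stored as an array) and COUNTER values taken from Table 1.
rows = [
    {"TERMINAL": ["T04001001", "T04002001"], "COUNTER": 10},
    {"TERMINAL": ["T04001006", "T04002001"], "COUNTER": 20},
    {"TERMINAL": ["T04001001", "T04002001"], "COUNTER": 30},
    {"TERMINAL": ["T04001001", "T04002003"], "COUNTER": 20},
    {"TERMINAL": ["T04001002", "T04002001"], "COUNTER": 60},
]


def count_multi_value(rows, field, weight="COUNTER"):
    """Treat every value of a multi-value field as its own single value."""
    totals = Counter()
    for row in rows:
        for value in row[field]:
            totals[value] += row[weight]
    return dict(totals)


terminal_totals = count_multi_value(rows, "TERMINAL")
```

A record with two TERMINAL values contributes to two totals, which is why the totals over all values can exceed the sum of the COUNTER column.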

Based on the embodiment of the invention, the following statistical capacity can be realized according to the requirements of actual service scenes:

First, the distribution of the data's tag values in each tag field can be counted according to the query condition. The query condition supports fuzzy queries, AND combinations of arbitrary conditions, OR combinations of arbitrary conditions, and other queries.

For example, the issued query condition LOCATION-T02001 AND DATE-20190520 asks for the distribution, over each label field, of the order information generated in Beijing on 20190520. Because each label field in the label tree defines multi-level labels, the following statistical rule can be defined. For a label field that appears in the statistical condition, if the label value is not a leaf label, the distribution over its next-level labels in the label records is counted; if the label value is a leaf label, the distribution over the label values at the label's own level is counted. For a dimension that does not appear in the statistical condition, only the distribution over its second-level labels needs to be counted.
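The leaf and non-leaf rule for a label field that appears in the condition can be sketched as follows. The `CHILDREN` mapping is a hypothetical subtree, the rows are a subset of the columns of Table 1, and the sketch assumes that a descendant's label value starts with its ancestor's label value. The rule for dimensions that do not appear in the condition is not covered here.

```python
from collections import Counter

# Hypothetical LOCATION subtree: parent label -> child labels.
CHILDREN = {
    "T02": ["T02001", "T02002"],
    "T02001": ["T02001001", "T02001003"],
    "T02002": ["T02002001"],
    "T02002001": ["T02002001001"],
}


def grouping_labels(condition_label):
    """Non-leaf condition: group by its children. Leaf: group by itself."""
    return CHILDREN.get(condition_label, [condition_label])


def distribution(rows, field, condition_label, weight="COUNTER"):
    """Sum the weights of matching rows under each grouping label."""
    totals = Counter()
    for row in rows:
        for label in grouping_labels(condition_label):
            # Assumption: a descendant's label starts with its ancestor's label.
            if row[field].startswith(label):
                totals[label] += row[weight]
                break
    return dict(totals)


rows = [
    {"DATE": "20190520", "LOCATION": "T02001001", "COUNTER": 10},
    {"DATE": "20190521", "LOCATION": "T02001003", "COUNTER": 20},
    {"DATE": "20190522", "LOCATION": "T02001001", "COUNTER": 30},
    {"DATE": "20190522", "LOCATION": "T02002001001", "COUNTER": 20},
    {"DATE": "20190522", "LOCATION": "T02001003", "COUNTER": 60},
]
on_date = [row for row in rows if row["DATE"] == "20190522"]
by_district = distribution(on_date, "LOCATION", "T02001")   # non-leaf label
by_leaf = distribution(on_date, "LOCATION", "T02001001")    # leaf label
```

For the non-leaf label T02001 the result is spread over its two child labels, while the record under T02002 is excluded; for the leaf label T02001001 only that label's own total is returned.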

In the whole statistical process, the tag tree is loaded in the memory, then the tag values needing to participate in the statistics are obtained according to the tag tree and are spliced into statistical conditions, and then the statistical analysis interface of the ES is called to perform analysis statistics.
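Splicing label values into a statistical condition can be sketched as building an ES search request body. The sketch uses the standard `bool`, `term`, `prefix`, `terms`, and `sum` constructs of the ES query DSL, but everything else is an assumption: the function name `build_request`, the aggregation names, the field names taken from Table 1, the idea of matching non-leaf labels by prefix, and the premise that the label fields are indexed as keyword fields. Sending the body to a cluster is left out.

```python
import json


def build_request(conditions, group_field, size=100):
    """Build a search body from (field, label, is_leaf) conditions."""
    filters = []
    for field, label, is_leaf in conditions:
        # Leaf labels match exactly; non-leaf labels match every value that
        # starts with them (assumes prefix-structured label values).
        clause = "term" if is_leaf else "prefix"
        filters.append({clause: {field: label}})
    return {
        "size": 0,  # return only aggregation results, no documents
        "query": {"bool": {"filter": filters}},
        "aggs": {
            "distribution": {
                "terms": {"field": group_field, "size": size},
                "aggs": {"orders": {"sum": {"field": "COUNTER"}}},
            }
        },
    }


body = build_request(
    [("LOCATION", "T02001", False), ("DATE", "20190520", True)],
    group_field="LOCATION",
)
print(json.dumps(body, indent=2))
```

The bucket aggregation here groups by the stored LOCATION value, so it yields the next-level distribution only when the stored values sit exactly one level below the queried label; deeper values would need an extra grouping step.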

Second, independent statistical analysis of single or multiple tag fields is performed based on the tag tree.

The data statistics method provided by the invention makes later extension very convenient. If a tag field needs to be added or the multi-value attribute of a tag field needs to be changed, only the tag tree needs to be modified; the warehousing code and the query code do not change. ES supports adding columns dynamically, each column can store multiple values, and each value can be counted separately during statistics.

Fig. 3 is a schematic structural diagram of a data statistics apparatus according to an embodiment of the present invention, where the data statistics apparatus 300 includes:

a converting module 301, configured to convert data records in a service data set into tag records based on a preset tag tree, where the tag tree is a hierarchical tree structure, takes a tag as a node, and is used to establish a mapping relationship between a field value in the service data set and the tag, and to express a hierarchical relationship and a field attribute of the field value in the service data set.

a warehousing module 302, configured to load the tag records into a database of the distributed data statistical analysis engine after preliminary Key-Value aggregation is performed on the tag records using a distributed parallel computing framework.

a statistical module 303, configured to generate a retrieval condition based on the tags of the tag tree, and call a statistical analysis interface of the distributed data statistical analysis engine to obtain a statistical result.

The warehousing module 302 may merge the tag fields in the tag records to generate a Key using the Map phase of the Map-Reduce component of Hadoop, and then aggregate the tag records with the same Key value through the Reduce phase of the Map-Reduce component.

The tag tree can support multi-value attributes, and the database of the distributed data statistical analysis engine supports storage and statistics of multi-value tag fields. For a tag field with the multi-value attribute, the warehousing module 302 merges the tag records containing the tag field with the multi-value attribute after preliminary Key-Value aggregation is performed on the tag records and before the tag records are loaded into the database of the distributed data statistical analysis engine, so as to generate tag records that contain the multi-value tag field.

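Further, the statistical apparatus 300 further includes:

a configuration updating module, configured to expand the tag tree and update the configuration file for converting the data records of the service data set into tag records, where the expansion includes expanding tag nodes at any layer below the root node;

the statistical module 303 generates statistical conditions based on the expanded tag tree, and invokes a statistical analysis interface of the distributed data statistical analysis engine to obtain statistical results.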

The distributed parallel computing framework can be a Map-Reduce component or a Spark component, the distributed data statistical analysis engine is ElasticSearch, and a database of the distributed data statistical analysis engine is an HDFS.

The warehousing module 302 uses an ES-Hadoop connector to load the tag records that have undergone preliminary Key-Value aggregation into an ElasticSearch database.

The above description is only an example of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims

1. A method of data statistics, the method comprising:

2. The method according to claim 1, wherein the method for performing preliminary Key-Value aggregation on the tag records is as follows:

3. The method of claim 1,

the label tree supports multi-value attributes, and a database of the distributed data statistical analysis engine supports storage and statistics of multi-value label fields; for a label field with the multi-value attribute, after preliminary Key-Value aggregation is performed on the label records and before the label records are loaded into the database of the distributed data statistical analysis engine, the label records containing the label field with the multi-value attribute are merged, thereby generating label records that contain the multi-value label field.

4. The method of claim 1, wherein the statistical method further comprises a statistical expansion step of:

5. The method of claim 1,

the distributed parallel computing framework is a Map-Reduce component or a Spark component, and the distributed data statistical analysis engine is ElasticSearch;

and the label records that have undergone preliminary Key-Value aggregation are loaded into an ElasticSearch database through an ES-Hadoop connector.

6. A data statistics apparatus, characterized in that the apparatus comprises:

7. The apparatus of claim 6,

the warehousing module merges the label fields in a label record to generate a Key in the Map stage of the Map-Reduce component of Hadoop, and then aggregates the label records with the same Key in the Reduce stage of the Map-Reduce component.

8. The apparatus of claim 6,

the label tree supports multi-value attributes, and a database of the distributed data statistical analysis engine supports storage and statistics of multi-value label fields;

for a label field with the multi-value attribute, after the warehousing module performs preliminary Key-Value aggregation on the label records and before the label records are loaded into the database of the distributed data statistical analysis engine, the label records containing the label field with the multi-value attribute are merged, thereby generating label records that contain the multi-value label field.

9. The apparatus of claim 6, wherein the statistical means further comprises:

10. The apparatus of claim 6,

and the warehousing module loads the label records that have undergone preliminary Key-Value aggregation into an ElasticSearch database using an ES-Hadoop connector.