CN111506621A - Data statistical method and device - Google Patents

Info

Publication number
CN111506621A
Authority
CN
China
Prior art keywords
label
value
records
statistical analysis
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010246298.XA
Other languages
Chinese (zh)
Other versions
CN111506621B (en)
Inventor
杨恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Big Data Technologies Co Ltd
Original Assignee
New H3C Big Data Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New H3C Big Data Technologies Co Ltd filed Critical New H3C Big Data Technologies Co Ltd
Priority to CN202010246298.XA priority Critical patent/CN111506621B/en
Publication of CN111506621A publication Critical patent/CN111506621A/en
Application granted granted Critical
Publication of CN111506621B publication Critical patent/CN111506621B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data statistical method and device. Fields that can be converted to labels in the service data sets extracted from each service system are converted into label fields based on a pre-configured, extensible label tree, and the resulting label data set is preliminarily aggregated using the parallel processing capacity of a distributed computing framework. This improves the efficiency of big data analysis and reduces its cost.

Description

Data statistical method and device
Technical Field
The invention relates to the technical field of big data, in particular to a data statistical method and device.
Background
Taking an e-commerce platform as an example, if order data needs to be analyzed, the usual approach is to extract the order data scattered across databases in various regions into one large relational database and then perform statistical analysis with the statistical functions of SQL statements. As the data volume expands rapidly, the response speed of this approach drops sharply. In addition, when a new field is added, or a field value switches between single-valued and multi-valued, the table structure of the relational database and the SQL statements must be modified, so extensibility is poor.
Disclosure of Invention
The invention provides a data statistical method and a data statistical device, which are used for solving the technical problems of low speed and poor expansibility when mass data are counted based on a relational database.
Based on the embodiment of the invention, the data statistical method provided by the invention comprises the following steps:
converting data records in a service data set into label records based on a preset label tree, wherein the label tree is in a hierarchical tree structure, takes labels as nodes and is used for establishing a mapping relation between a field value in the service data set and the labels and expressing the hierarchical relation and the field attributes of the field value in the service data set;
performing preliminary Key-Value aggregation on the label records using a distributed parallel computing framework, and then loading the label records into a database of a distributed data statistical analysis engine;
and generating a retrieval condition based on the labels of the label tree, and calling a statistical analysis interface of the distributed data statistical analysis engine to obtain a statistical result.
Further, the method for performing preliminary Key-Value aggregation on the label records comprises:
merging the label fields of each label record into a Key in the Map stage of the Map-Reduce component of Hadoop, and aggregating the label records with the same Key in the Reduce stage of the Map-Reduce component.
Further, the label tree supports multi-value attributes, and the database of the distributed data statistical analysis engine supports storage and statistics of multi-value label fields; for a label field with the multi-value attribute, after the preliminary Key-Value aggregation is performed on the label records and before the label records are loaded into the database of the distributed data statistical analysis engine, the label records containing the label field with the multi-value attribute are combined, thereby generating label records containing a multi-value label field.
Further, the statistical method further comprises a statistical expansion step of:
expanding the label tree and updating a configuration file for converting the data records of the service data set into label records, wherein the expansion comprises expanding any layer of label nodes below the root node;
and generating a statistical condition based on the expanded label tree, and calling a statistical analysis interface of the distributed data statistical analysis engine to obtain a statistical result.
Further, the distributed parallel computing framework is a Map-Reduce component or a Spark component, and the distributed data statistical analysis engine is ElasticSearch; the label records after preliminary Key-Value aggregation are loaded into the ElasticSearch database through an ES-Hadoop connector.
Based on the embodiment of the invention, the invention also provides a data statistical device, which comprises:
a conversion module, configured to convert data records in a service data set into label records based on a preset label tree, wherein the label tree is in a hierarchical tree structure, takes labels as nodes, and is used for establishing a mapping relation between the field values in the service data set and the labels and for expressing the hierarchical relation and the field attributes of the field values in the service data set;
a warehousing module, configured to perform preliminary Key-Value aggregation on the label records using a distributed parallel computing framework and then load the label records into a database of a distributed data statistical analysis engine;
and a statistical module, configured to generate a retrieval condition based on the labels of the label tree and call a statistical analysis interface of the distributed data statistical analysis engine to obtain a statistical result.
Further, the warehousing module merges the label fields of each label record into a Key in the Map stage of the Map-Reduce component of Hadoop, and then aggregates the label records with the same Key in the Reduce stage of the Map-Reduce component.
Further, the label tree supports multi-value attributes, and the database of the distributed data statistical analysis engine supports storage and statistics of multi-value label fields; for a label field with the multi-value attribute, after the warehousing module performs the preliminary Key-Value aggregation on the label records and before the label records are loaded into the database of the distributed data statistical analysis engine, the label records containing the label field with the multi-value attribute are combined, thereby generating label records containing a multi-value label field.
Further, the statistical device further comprises:
the configuration updating module is used for expanding the label tree and updating a configuration file for converting the data records of the service data set into the label records, wherein the expansion comprises expanding any layer of label nodes below the root node;
and the statistical module generates statistical conditions based on the expanded label tree and calls a statistical analysis interface of the distributed data statistical analysis engine to obtain statistical results.
Further, the distributed parallel computing framework is a Map-Reduce component or a Spark component, and the distributed data statistical analysis engine is ElasticSearch; the warehousing module loads the label records after preliminary Key-Value aggregation into the ElasticSearch database using an ES-Hadoop connector.
Addressing the statistical analysis requirements of big data, the invention provides a fast and highly extensible data statistical method and framework. Based on a pre-configured, extensible label tree, the method converts the fields that can be converted to labels in the service data sets extracted from each service system into label fields, that is, it converts the service data set into a label data set. The label data set is then preliminarily aggregated using the parallel processing capability of a distributed computing framework such as Map-Reduce, and the preliminarily aggregated label data set is loaded into a database of a distributed data analysis engine, for example an ElasticSearch database. The strong search and aggregation capability of the distributed data statistical analysis engine then enables fast and flexible statistical analysis of mass data, which improves the efficiency of big data analysis and reduces its cost.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art may obtain other drawings from them.
FIG. 1 is a flow chart of a data statistics method in one embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a process of loading the converted tag records into a database of a distributed statistical analysis engine after performing preliminary aggregation on the tag records by Map-Reduce in an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a data statistics apparatus according to an embodiment of the present invention.
Detailed Description
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the invention. As used in the examples and claims of the present invention, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The term "and/or" as used herein is meant to encompass any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used to describe various information in embodiments of the present invention, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of embodiments of the present invention. Depending on the context, moreover, the word "if" as used may be interpreted as "at … …" or "when … …" or "in response to a determination".
ElasticSearch (ES) is built around the concept of an Index, and user queries are executed against an Index. A Shard is a data fragment of ES; an Index may consist of several Shards to achieve distributed scalability. The Shards that store an Index are distributed over multiple Nodes; each Node runs an ES instance, and a Node is a single server.
ES provides an aggregation application program interface (aggregation API) that gives ES statistical analysis capabilities, so that users can handle extraction and statistics over large data with more ease. ES mainly provides two categories of aggregation: Metric aggregation and Bucketing aggregation. Bucketing aggregation places data that satisfies a particular rule into a bucket, each bucket being associated with a key; this is equivalent to grouping the data, with each group called a bucket. Metric aggregation computes statistics, such as minimum, maximum, sum, and average, over the data that meets certain conditions or over the data within a bucket.
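The combination of a bucketing aggregation with a nested metric aggregation can be sketched as an ElasticSearch search request body. This is only an illustrative sketch: the field names DATE, LOCATION and COUNTER are borrowed from Table 1 later in this description, the helper name and the aggregation names `by_bucket` and `total` are hypothetical, and the label fields are assumed to be mapped as keyword fields so that `term` and `terms` match whole label values.

```python
import json


def build_bucket_metric_query(bucket_field, metric_field, filters):
    """Build a search body: filter documents, bucket them by one field,
    and sum a metric field inside each bucket."""
    return {
        "size": 0,  # return aggregation results only, no individual hits
        "query": {
            "bool": {
                # every filter is an exact match on one field
                "filter": [{"term": {field: value}} for field, value in filters.items()]
            }
        },
        "aggs": {
            "by_bucket": {
                "terms": {"field": bucket_field},  # bucketing aggregation
                "aggs": {"total": {"sum": {"field": metric_field}}},  # metric aggregation
            }
        },
    }


body = build_bucket_metric_query("LOCATION", "COUNTER", {"DATE": "20190520"})
print(json.dumps(body, indent=2))
```

The body is plain JSON, so it can be sent to the search endpoint of an index with any HTTP or ElasticSearch client.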
By combining the good extensibility of a label tree with the strong distributed statistical analysis capability of a distributed database, the present invention provides a data statistical method. FIG. 1 is a flow chart of the data statistical method provided by an embodiment of the present invention; the method includes:
step 101, converting data records in a service data set into label records based on a preset label tree;
in an embodiment of the present invention, a front-end device responsible for data acquisition in the statistical system obtains a service data set from each service system. The fields in the service data set fall into two groups: fields that can be converted to labels, collectively called convertible fields, and fields that do not require label conversion, collectively called non-label fields. For example, the order generation location field is a convertible field whose values in the service data set may take forms such as "Beijing" or "Xi'an". When extracting data, the front-end device converts the convertible fields into label fields based on a predefined label tree, for example converting "Beijing" into the corresponding label "T02001" in the label tree, "Xi'an" into "T02002001", and so on. Label conversion of the data records in the service data set yields label records. Some fields of a label record are converted from convertible fields; the others are the non-label fields of the service data set, which carry specific numerical values such as time, quantity, and price.
For example, the field that records the order generation location in the original business database may be named "L%", and the label corresponding to this field in the label tree may be defined as "T02". A label value is then defined for each province or municipality: for example, "Beijing" corresponds to "T02001" and "Haidian District, Beijing" corresponds to "T02001001", similar to administrative division codes.
The method for converting the data records in the service data set into label records is as follows: each field in a data record that can be converted into a label field is converted to the corresponding label in the label tree. A data set extracted from a business system may contain many fields, not all of which are needed for statistical analysis. A conversion configuration file therefore tells the statistical system which fields of the service data set to acquire; the acquired fields correspond to label nodes in the label tree. If a new field needs to be added later, only the configuration file needs to be modified, not the program code.
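The conversion in step 101 can be illustrated with a minimal sketch. It assumes the label tree has already been flattened into a per-field lookup from field value to label code; the names `LABEL_LOOKUP` and `to_label_record` are hypothetical, and only the two Beijing codes come from this description.

```python
# Hypothetical lookup derived from a label tree: field name -> {field value: label code}.
LABEL_LOOKUP = {
    "LOCATION": {
        "Beijing": "T02001",
        "Haidian District, Beijing": "T02001001",
    },
}


def to_label_record(record, lookup=LABEL_LOOKUP):
    """Replace convertible field values with label codes; copy other fields unchanged."""
    converted = {}
    for field, value in record.items():
        mapping = lookup.get(field)
        if mapping is None:
            converted[field] = value           # non-label field (time, quantity, price)
        else:
            converted[field] = mapping[value]  # convertible field -> label code
    return converted


order = {"DATE": "20190520", "LOCATION": "Beijing", "COUNTER": 1}
label_record = to_label_record(order)
print(label_record)
```

A value that is missing from the lookup raises `KeyError` in this sketch; a real converter would need a policy for unmapped values, which the description does not specify.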
step 102, performing preliminary Key-Value aggregation on the label records using a distributed parallel computing framework, and then loading the label records into a database of a distributed data statistical analysis engine;
in an embodiment of the present invention, the preliminary Key-Value aggregation of the label records comprises: merging the label fields of each label record into a Key in the Map stage of the Map-Reduce component of Hadoop, and aggregating the label records with the same Key in the Reduce stage of the Map-Reduce component;
in an embodiment of the invention, before the label records after preliminary Key-Value aggregation are loaded into the database of the distributed data statistical analysis engine, the generated Key is decomposed back into the label fields of the label record, and the label fields are then loaded into the database together with the Value.
In an embodiment of the invention, the distributed parallel computing framework is a Map-Reduce or Spark component, and the distributed data statistical analysis engine is ElasticSearch.
step 103, generating a retrieval condition based on the labels of the label tree, and calling a statistical analysis interface of the distributed data statistical analysis engine to obtain a statistical result.
In an embodiment of the present invention, the statistical method further includes a statistical expansion step, where the expansion includes expansion of a tag tree and expansion of a statistical condition:
expanding the label tree and updating a configuration file for converting the data records of the service data set into label records, wherein the expansion comprises expanding any layer of label nodes below the root node; similarly, the extension is also suitable for deleting and modifying the label node;
and generating a statistical condition based on the expanded label tree, and calling a statistical analysis interface of the distributed data statistical analysis engine to obtain a statistical result.
The data statistical method provided by the invention has good extensibility. The label tree is organized as a tree structure, and the configuration file takes a form such as an Extensible Markup Language (XML) file. If the fields participating in statistics need to be extended, the statistical client device loads the label tree into memory, the user selects statistical conditions through the interface, the statistical conditions are generated from the labels of the extended label tree, and the statistical analysis interface of the distributed data statistical analysis engine is called to obtain the statistical results.
In an embodiment of the present invention, the tag tree supports multi-value attributes, and the distributed data statistical analysis engine supports storage and statistics of multi-value tag fields. For a tag field with the multi-value attribute, after the preliminary Key-Value aggregation is performed on the tag records and before they are loaded into the database of the distributed data statistical analysis engine, the tag fields with the multi-value attribute are combined to generate tag records containing a multi-value tag field, and these records are then loaded into the database of the distributed data statistical analysis engine.
Taking ElasticSearch as an example, the ES database supports the storage of multi-value columns. A multi-value column can be stored directly as an array, and each value in it can be counted separately during statistics, which differs from a relational database. This characteristic of ES supports the definition and extension of tag fields with multi-value attributes in the tag tree well.
The implementation process of the data statistics method provided by the present invention is described in detail below with reference to specific service examples, and assuming that the embodiment needs to perform order quantity statistics based on the order generation location, the specific process is as follows:
The tag fields to be counted are first expressed as tags, for example T01 for member properties, T02 for order generation location, T03 for commodity category, and so on. The tag tree can be configured in an XML file; an example of a tag tree for commodity orders is as follows:
[Figure: XML listing of the example tag tree for commodity orders]
RootDimension represents the root node of the label tree, SubDimension represents the child nodes in the label tree, and the nodes at the lowest layer are called leaf nodes. Code represents the tag value of a defined tag, name represents the name corresponding to the tag, name represents the English name of the tag field, and flag represents the multi-value attribute of the tags at that layer. The flag attribute takes the following values:
0: a single-value tag, indicating that only one of the sub-tags appears in the tag field.
1: a multi-value tag, indicating that one or more sub-tags can appear in the tag field.
2: a multi-value tag, indicating that every sub-tag appears in the tag field.
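How such a tag tree might be read and how the flag values constrain a field can be sketched as follows. The XML fragment is hypothetical and written only for this sketch: the element names RootDimension and SubDimension and the idea of code, name and flag attributes follow the description above, but the exact attribute spelling, the node names and the flattening of the TERMINAL subtree to a single level are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag-tree fragment (not the listing from the patent figure).
LABEL_TREE_XML = """
<RootDimension>
  <SubDimension code="T04" name="TERMINAL" flag="1">
    <SubDimension code="T04001001" name="terminal-a" flag="0"/>
    <SubDimension code="T04002001" name="terminal-b" flag="0"/>
  </SubDimension>
  <SubDimension code="T02" name="LOCATION" flag="0">
    <SubDimension code="T02001" name="Beijing" flag="0"/>
  </SubDimension>
</RootDimension>
"""


def load_tree(xml_text):
    """Return {code: (flag, [direct child codes])} for every SubDimension node."""
    root = ET.fromstring(xml_text)
    nodes = {}
    for node in root.iter("SubDimension"):
        children = [child.get("code") for child in node.findall("SubDimension")]
        nodes[node.get("code")] = (int(node.get("flag")), children)
    return nodes


def values_allowed(nodes, code, values):
    """Check the sub-tag values of one tag field against the flag of its tag node."""
    flag, children = nodes[code]
    if not values or not set(values) <= set(children):
        return False
    if flag == 0:
        return len(values) == 1            # exactly one sub-tag appears
    if flag == 1:
        return len(values) >= 1            # one or more sub-tags may appear
    return set(values) == set(children)    # flag 2: every sub-tag appears


nodes = load_tree(LABEL_TREE_XML)
print(values_allowed(nodes, "T04", ["T04001001", "T04002001"]))
```

For simplicity the check only looks at direct children; a real tree with deeper levels would compare against descendants at the appropriate level.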
According to the above label tree example, T01, T02, etc. are first-level labels, T011, T010, etc. are second-level labels, followed by third-level labels, fourth-level labels, and so on. This hierarchical structure is easy to extend or prune later. The label tree is therefore highly flexible: combined with the query statistics process, it supports arbitrary later extension and arbitrary modification of the multi-value attribute of a label field, which reduces the complexity of later extension and further enhances the extensibility of the service. The multi-level label design also enables hierarchical statistics: if the query condition is a second-level label, the statistics show the distribution of the data over the third-level labels.
In the service embodiment, Hadoop is adopted as a basic framework of a distributed system, and ElasticSearch (ES for short) is used as a distributed data statistical analysis engine. The ES-Hadoop connector connects the storage and deep processing capacity of Hadoop mass data with the ES real-time searching and analyzing capacity. In the embodiment, data is stored in a Hadoop Distributed File System (HDFS), and the data is loaded into a database of an ES cluster by using the parallel processing advantage of a Map-Reduce component.
After the label tree is configured, each front-end data sampling device converts the acquired real data of the service system according to the defined label tree, converts the data records in the service data set into label records, and generates a label record set file, wherein the label record set file can be stored in a file server in a text file manner.
The data conversion program needs to extract the required service data field from the service data set according to the conversion configuration file and convert the service data field into the tag field, and the configuration content of the conversion configuration file is exemplified as follows:
dimension.column=MEMBER:1,LOCATION:2,TYPE:3…
The configuration file sets the correspondence between each label field and the position of the corresponding service data field in the service data set; for example, the LOCATION label corresponds to the field in the second column of the service data set.
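Reading such a configuration line can be sketched as below. The sketch assumes that the positions are 1-based column numbers and that a business record arrives as a list of column values; the function names and the sample row are hypothetical, and only the three entries that are spelled out in the example line are used.

```python
def parse_column_config(line):
    """Parse 'dimension.column=NAME:pos,...' into {NAME: 1-based column position}."""
    key, _, value = line.partition("=")
    if key.strip() != "dimension.column":
        raise ValueError(f"unexpected key: {key!r}")
    positions = {}
    for item in value.split(","):
        name, _, pos = item.strip().partition(":")
        positions[name] = int(pos)
    return positions


def extract_fields(row, positions):
    """Pick the configured columns out of one business record."""
    return {name: row[pos - 1] for name, pos in positions.items()}


config = parse_column_config("dimension.column=MEMBER:1,LOCATION:2,TYPE:3")
row = ["member-a", "Beijing", "type-x", "unused column"]  # hypothetical record
fields = extract_fields(row, config)
print(fields)
```

Adding a field to the statistics then only means appending another `NAME:position` pair to the configuration line, which matches the point made above that the program code does not need to change.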
First, in the Map stage, the Map-Reduce component obtains the label value of each label field according to its position. The label values of the label fields (such as MEMBER, LOCATION, TYPE and the like) of each label record are concatenated in order into a character string that serves as the Key, and the Key and Value fields form a Key-Value record. In the shuffle stage, records with the same Key are grouped together, and in the Reduce stage the Values with the same Key are summed. Finally, before the aggregated label records are loaded into the ES database, each Key is split back into the label values of the individual label fields, reversing the original concatenation, and the results are loaded into the ES cluster through ES-Hadoop.
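The Map, shuffle and Reduce stages described above can be imitated in a single process to show the data flow. This is a plain-Python sketch, not Hadoop code: the separator `|`, the function names and the three sample records are assumptions, and COUNTER plays the role of the Value.

```python
from collections import defaultdict

LABEL_FIELDS = ["MEMBER", "LOCATION", "TYPE"]
SEP = "|"  # assumed separator; it must never occur inside a label value


def map_phase(record):
    """Concatenate the label values in a fixed order to form the Key."""
    key = SEP.join(record[field] for field in LABEL_FIELDS)
    return key, record["COUNTER"]


def shuffle(pairs):
    """Group the Values of identical Keys together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the Values that share a Key."""
    return {key: sum(values) for key, values in groups.items()}


def split_key(key, total):
    """Reverse the concatenation so each label field becomes its own column again."""
    doc = dict(zip(LABEL_FIELDS, key.split(SEP)))
    doc["COUNTER"] = total
    return doc


records = [
    {"MEMBER": "T01101", "LOCATION": "T02001001", "TYPE": "T03001001", "COUNTER": 10},
    {"MEMBER": "T01101", "LOCATION": "T02001001", "TYPE": "T03001001", "COUNTER": 30},
    {"MEMBER": "T01103", "LOCATION": "T02001003", "TYPE": "T03002001", "COUNTER": 60},
]
reduced = reduce_phase(shuffle(map(map_phase, records)))
docs = [split_key(key, total) for key, total in reduced.items()]
print(docs)
```

The first two records share the same label values, so they collapse into one document with COUNTER 40; the resulting documents are what would be handed to the loader.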
The storage structure in the ES database after data warehousing is shown in table 1 below, for example:
DATE      MEMBER  LOCATION      TYPE       TERMINAL             PERSONNEL  INTERVAL  COUPON  others  COUNTER
20190520  T01101  T02001001     T03001001  T04001001,T04002001  T0501001   T0602     T0702           10
20190521  T01103  T02001003     T02001001  T04001006,T04002001  T0501002   T0604     T0703           20
20190522  T01101  T02001001     T03001001  T04001001,T04002001  T0501001   T0601     T0702           30
20190522  T010    T02002001001  T03001001  T04001001,T04002003  T0501002   T0602     T0702           20
20190522  T01103  T02001003     T03002001  T04001002,T04002001  T0501001   T0604     T0703           60
TABLE 1
The ES database supports storage and statistics of multi-value tag fields. For example, the TERMINAL tag field can store several tag values such as "T04001001, T04002001" as separate values rather than as one whole character string, and when querying or counting, ES treats each tag value in a multi-value tag field as a single value. Combined with the multi-value attribute flag in the tag tree, this allows flexible extension; it is also a capability that existing relational databases do not have.
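The effect of treating each value of a multi-value field separately can be sketched in plain Python on the TERMINAL and COUNTER columns of Table 1. This imitates the behaviour in memory and is not an ElasticSearch call; the function name is hypothetical.

```python
from collections import Counter

# (TERMINAL values, COUNTER) pairs taken from the five rows of Table 1.
rows = [
    (["T04001001", "T04002001"], 10),
    (["T04001006", "T04002001"], 20),
    (["T04001001", "T04002001"], 30),
    (["T04001001", "T04002003"], 20),
    (["T04001002", "T04002001"], 60),
]


def sum_per_value(rows):
    """Add each record's COUNTER to every distinct tag value stored in its array."""
    totals = Counter()
    for terminals, counter in rows:
        for tag in set(terminals):  # a value counts once per record
            totals[tag] += counter
    return totals


totals = sum_per_value(rows)
print(totals.most_common())
```

If the field were stored as the single string "T04001001,T04002001", the same statistic would require string splitting at query time, which is the limitation the description attributes to relational storage.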
Based on the embodiment of the invention, the following statistical capacity can be realized according to the requirements of actual service scenes:
First, the distribution of the data over the tag values of each tag field can be counted according to a query condition. Query conditions support fuzzy matching and arbitrary combinations of conditions with AND and OR.
For example, the query condition LOCATION-T02001 AND DATE-20190520 means: for orders placed in Beijing on 20190520, count the distribution of the data over each label field. Because each label field in the label tree defines multi-level labels, the following statistical rule can be defined. For a label field that appears in the statistical condition, if the label value is not a leaf label, the distribution over its next-level labels in the label records is counted; if the label value is a leaf label, the distribution of label values at the level where the label sits is counted. For a dimension that does not appear in the statistical condition, only the distribution over its second-level labels needs to be counted.
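The statistical rule for a label that appears in the condition can be sketched on the LOCATION and COUNTER columns of Table 1. The parent map below is a hypothetical slice of the LOCATION subtree (the intermediate code T02002 in particular is assumed), and the function names are illustrative only.

```python
from collections import Counter

# Hypothetical slice of the LOCATION subtree: child code -> parent code.
PARENT = {
    "T02001": "T02",
    "T02001001": "T02001",
    "T02001003": "T02001",
    "T02002": "T02",
    "T02002001": "T02002",
    "T02002001001": "T02002001",
}


def ancestors_and_self(code):
    """Yield the code itself, then its parent, grandparent, and so on."""
    while code is not None:
        yield code
        code = PARENT.get(code)


def distribution(records, field, query_code):
    """Non-leaf query label: bucket by its next-level labels; leaf: bucket by itself."""
    is_leaf = query_code not in PARENT.values()
    result = Counter()
    for rec in records:
        chain = list(ancestors_and_self(rec[field]))
        if query_code not in chain:
            continue  # record lies outside the queried subtree
        idx = chain.index(query_code)
        bucket = query_code if is_leaf or idx == 0 else chain[idx - 1]
        result[bucket] += rec["COUNTER"]
    return result


records = [
    {"DATE": "20190520", "LOCATION": "T02001001", "COUNTER": 10},
    {"DATE": "20190521", "LOCATION": "T02001003", "COUNTER": 20},
    {"DATE": "20190522", "LOCATION": "T02001001", "COUNTER": 30},
    {"DATE": "20190522", "LOCATION": "T02002001001", "COUNTER": 20},
    {"DATE": "20190522", "LOCATION": "T02001003", "COUNTER": 60},
]
one_day = [rec for rec in records if rec["DATE"] == "20190520"]
print(distribution(one_day, "LOCATION", "T02001"))
print(distribution(records, "LOCATION", "T02"))
```

In the described system the same bucketing would be pushed down to the statistical analysis interface of the engine rather than computed in client memory; the sketch only shows which label level ends up as the bucket key.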
In the whole statistical process, the tag tree is loaded in the memory, then the tag values needing to participate in the statistics are obtained according to the tag tree and are spliced into statistical conditions, and then the statistical analysis interface of the ES is called to perform analysis statistics.
Second, independent statistical analysis of single or multiple tag fields is performed based on the tag tree.
The data statistical method provided by the invention greatly simplifies later extension. If a tag field needs to be added or the multi-value attribute of a tag field needs to be changed, only the tag tree needs to be modified; the warehousing code and the query code do not change. ES supports adding columns dynamically, each column can store multiple values, and each value can be counted separately during statistics.
Fig. 3 is a schematic structural diagram of a data statistics apparatus according to an embodiment of the present invention, where the data statistics apparatus 300 includes:
a converting module 301, configured to convert data records in a service data set into tag records based on a preset tag tree, where the tag tree is a hierarchical tree structure, takes a tag as a node, and is used to establish a mapping relationship between a field value in the service data set and the tag, and to express a hierarchical relationship and a field attribute of the field value in the service data set.
a warehousing module 302, configured to perform preliminary Key-Value aggregation on the label records using a distributed parallel computing framework and then load the label records into a database of the distributed data statistical analysis engine.
And the statistical module 303 is configured to generate a retrieval condition based on the label of the label tree, and call a statistical analysis interface of the distributed data statistical analysis engine to obtain a statistical result.
The warehousing module 302 may merge the tag fields of each tag record into a Key in the Map phase of the Map-Reduce component of Hadoop, and then aggregate the tag records with the same Key in the Reduce phase of the Map-Reduce component.
The tag tree can support multi-value attributes, and the database of the distributed data statistical analysis engine supports storage and statistics of multi-value tag fields. For a tag field with the multi-value attribute, after performing the preliminary Key-Value aggregation on the tag records and before loading them into the database of the distributed data statistical analysis engine, the warehousing module 302 combines the tag records containing the tag field with the multi-value attribute, thereby generating tag records containing a multi-value tag field.
Further, the data statistics apparatus 300 also includes:
a configuration updating module, configured to expand the tag tree and to update the configuration file used for converting the data records of the service data set into tag records, where the expansion includes expanding tag nodes at any layer below the root node;
the statistical module 303 then generates statistical conditions based on the expanded tag tree, and calls the statistical analysis interface of the distributed data statistical analysis engine to obtain statistical results.
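The effect of expanding the tag tree can be sketched as follows: the statistical conditions are derived from the tree, so adding a node changes the generated request without changing the code. The dictionary layout of `tag_tree`, the `by_<field>` aggregation names, and the `region` field are hypothetical; the request body uses the standard ElasticSearch `terms` aggregation and `bool`/`filter` query, and assumes the tag fields are indexed as keyword-type fields.

```python
from typing import Any, Dict, Optional


def build_stat_query(
    tag_tree: Dict[str, Any],
    filters: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build a search request body with one terms aggregation per tag field."""
    aggs = {
        f"by_{name}": {"terms": {"field": name}}
        for name in tag_tree["fields"]
    }
    if filters:
        query: Dict[str, Any] = {
            "bool": {"filter": [{"term": {k: v}} for k, v in filters.items()]}
        }
    else:
        query = {"match_all": {}}
    # size 0: only the aggregation buckets are wanted, not the matching documents
    return {"size": 0, "query": query, "aggs": aggs}


# Hypothetical tag tree, reduced to the part the query builder needs.
tag_tree = {
    "fields": {
        "gender": {"values": ["male", "female"], "multi_value": False},
        "hobby": {"values": ["reading", "sports"], "multi_value": True},
    }
}
before = build_stat_query(tag_tree)

# Expanding the tree: add a new tag field node below the root.
tag_tree["fields"]["region"] = {"values": ["north", "south"], "multi_value": False}
after = build_stat_query(tag_tree, filters={"gender": "male"})
```

Here `after` contains an additional `by_region` aggregation even though `build_stat_query` itself was not modified, which is the extension property the description attributes to the tag tree.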
The distributed parallel computing framework may be a Map-Reduce component or a Spark component, the distributed data statistical analysis engine is ElasticSearch, and the database of the distributed data statistical analysis engine is HDFS.
The warehousing module 302 uses an ES-Hadoop connector to load the tag records that have undergone preliminary Key-Value aggregation into an ElasticSearch database.
The above description is only an example of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (10)

1. A method of data statistics, the method comprising:
converting data records in a service data set into label records based on a preset label tree, wherein the label tree is in a hierarchical tree structure, takes labels as nodes and is used for establishing a mapping relation between a field value in the service data set and the labels and expressing the hierarchical relation and the field attributes of the field value in the service data set;
performing preliminary Key-Value aggregation on the label records by using a distributed parallel computing framework, and then loading the label records into a database of a distributed data statistical analysis engine;
and generating a retrieval condition based on the label of the label tree, and calling a statistical analysis interface of the distributed data statistical analysis engine to obtain a statistical result.
2. The method according to claim 1, wherein the preliminary Key-Value aggregation of the label records comprises:
merging the label fields in the label records to generate a Key in the Map stage of the Map-Reduce component of Hadoop, and aggregating the label records with the same Key value in the Reduce stage of the Map-Reduce component.
3. The method of claim 1, wherein
the label tree supports multi-value attributes, and a database of the distributed data statistical analysis engine supports storage and statistics of multi-value label fields; for a label field with the multi-value attribute, after the preliminary Key-Value aggregation is performed on the label records and before the label records are loaded into the database of the distributed data statistical analysis engine, the label records containing the label field with the multi-value attribute are combined, thereby generating label records containing a multi-value label field.
4. The method of claim 1, wherein the statistical method further comprises a statistical expansion step of:
expanding the label tree and updating a configuration file for converting the data records of the service data set into label records, wherein the expansion comprises expanding any layer of label nodes below the root node;
and generating a statistical condition based on the expanded label tree, and calling a statistical analysis interface of the distributed data statistical analysis engine to obtain a statistical result.
5. The method of claim 1, wherein
the distributed parallel computing framework is a Map-Reduce component or a Spark component, and the distributed data statistical analysis engine is ElasticSearch;
and the label records that have undergone preliminary Key-Value aggregation are loaded into an ElasticSearch database through an ES-Hadoop connector.
6. A data statistics apparatus, characterized in that the apparatus comprises:
a conversion module, configured to convert data records in a service data set into label records based on a preset label tree, wherein the label tree is in a hierarchical tree structure, takes labels as nodes, and is used for establishing a mapping relation between field values in the service data set and the labels and for expressing the hierarchical relation and the field attributes of the field values in the service data set;
a warehousing module, configured to perform preliminary Key-Value aggregation on the label records by using a distributed parallel computing framework, and then load the label records into a database of a distributed data statistical analysis engine;
and the statistical module is used for generating a retrieval condition based on the label of the label tree and calling a statistical analysis interface of the distributed data statistical analysis engine to obtain a statistical result.
7. The apparatus of claim 6, wherein
the warehousing module merges the label fields in the label records to generate a Key in the Map stage of the Map-Reduce component of Hadoop, and then aggregates the label records with the same Key value in the Reduce stage of the Map-Reduce component.
8. The apparatus of claim 6, wherein
the label tree supports multi-value attributes, and a database of the distributed data statistical analysis engine supports storage and statistics of multi-value label fields;
for a label field with the multi-value attribute, after the warehousing module performs the preliminary Key-Value aggregation on the label records and before the label records are loaded into the database of the distributed data statistical analysis engine, the warehousing module combines the label records containing the label field with the multi-value attribute, thereby generating label records containing a multi-value label field.
9. The apparatus of claim 6, wherein the apparatus further comprises:
the configuration updating module is used for expanding the label tree and updating a configuration file for converting the data records of the service data set into the label records, wherein the expansion comprises expanding any layer of label nodes below the root node;
and the statistical module generates statistical conditions based on the expanded label tree and calls a statistical analysis interface of the distributed data statistical analysis engine to obtain statistical results.
10. The apparatus of claim 6, wherein
the distributed parallel computing framework is a Map-Reduce component or a Spark component, and the distributed data statistical analysis engine is ElasticSearch;
and the warehousing module loads the label records that have undergone preliminary Key-Value aggregation into an ElasticSearch database by using an ES-Hadoop connector.
CN202010246298.XA 2020-03-31 2020-03-31 Data statistical method and device Active CN111506621B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010246298.XA CN111506621B (en) 2020-03-31 2020-03-31 Data statistical method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010246298.XA CN111506621B (en) 2020-03-31 2020-03-31 Data statistical method and device

Publications (2)

Publication Number Publication Date
CN111506621A (en) 2020-08-07
CN111506621B CN111506621B (en) 2023-03-31

Family

ID=71869072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010246298.XA Active CN111506621B (en) 2020-03-31 2020-03-31 Data statistical method and device

Country Status (1)

Country Link
CN (1) CN111506621B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160127569A1 (en) * 2014-11-01 2016-05-05 Somos, Inc. Real time, machine-based routing table creation and enhancement for toll-free telecommunications
CN105930446A (en) * 2016-04-20 2016-09-07 重庆重邮汇测通信技术有限公司 Telecommunication customer tag generation method based on Hadoop distributed technology
CN107066328A (en) * 2017-05-19 2017-08-18 成都四象联创科技有限公司 The construction method of large-scale data processing platform
CN107391752A (en) * 2017-08-16 2017-11-24 四川长虹电器股份有限公司 A kind of method based on hadoop platform construction user tag information
CN108183931A (en) * 2017-12-04 2018-06-19 中国电子科技集团公司第三十研究所 A kind of distribution subscription matching process based on demand management tree shape model
CN108416620A (en) * 2018-02-08 2018-08-17 杭州浮云网络科技有限公司 A kind of intelligent social advertisement launching platform of the representation data based on big data
CN109726209A (en) * 2018-09-07 2019-05-07 网联清算有限公司 Log aggregation method and device
CN110019078A (en) * 2019-02-25 2019-07-16 贵州格物数据有限公司 A kind of DNS log analysis aid decision-making system and method based on big data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHANGLI ZHANG; JINJIN ZHANG; MAODE YAN: ""Cluster tree based hybrid semantic similarity measure for social tagging systems"", 《2010 IEEE INTERNATIONAL CONFERENCE ON PROGRESS IN INFORMATICS AND COMPUTING》 *
孙涛: ""面向半结构化数据的数据模型和数据挖掘方法研究"", 《中国优秀博士论文全文数据库》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100211A (en) * 2020-09-27 2020-12-18 北京有竹居网络技术有限公司 Data storage method and device, electronic equipment and computer readable medium
CN112100159A (en) * 2020-09-27 2020-12-18 北京有竹居网络技术有限公司 Data processing method and device, electronic equipment and computer readable medium
CN112100211B (en) * 2020-09-27 2023-06-27 北京有竹居网络技术有限公司 Data storage method, apparatus, electronic device, and computer readable medium
CN112527881A (en) * 2020-12-16 2021-03-19 国家电网有限公司客户服务中心 Hive-based data aggregation method
CN112818048A (en) * 2021-01-28 2021-05-18 北京软通智慧城市科技有限公司 Hierarchical construction method and device of data warehouse, electronic equipment and storage medium
CN113569200A (en) * 2021-08-03 2021-10-29 北京金山云网络技术有限公司 Data statistics method and device and server
CN114186137A (en) * 2021-12-14 2022-03-15 聚好看科技股份有限公司 Server and media asset mixing recommendation method
CN114201545A (en) * 2022-02-16 2022-03-18 希维科技(广州)有限公司 Data processing method and device, terminal equipment and storage medium
CN114201545B (en) * 2022-02-16 2022-04-22 希维科技(广州)有限公司 Data processing method and device, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN111506621B (en) 2023-03-31

Similar Documents

Publication Publication Date Title
CN111506621B (en) Data statistical method and device
US10585913B2 (en) Apparatus and method for distributed query processing utilizing dynamically generated in-memory term maps
US10210236B2 (en) Storing and retrieving data of a data cube
US8171029B2 (en) Automatic generation of ontologies using word affinities
CN106294695A (en) A kind of implementation method towards the biggest data search engine
CN104239377A (en) Platform-crossing data retrieval method and device
CN106095951B (en) Data space multi-dimensional indexing method based on load balancing and inquiry log
CN111708805A (en) Data query method and device, electronic equipment and storage medium
CN106777343A (en) increment distributed index system and method
CN104462161A (en) Structural data query method based on distributed database
CN114564482A (en) Multi-entity-oriented label system and processing method
CN113779349A (en) Data retrieval system, apparatus, electronic device, and readable storage medium
Álvarez-García et al. Compact and efficient representation of general graph databases
KR20180077830A (en) Processing method for a relational query in distributed stream processing engine based on shared-nothing architecture, recording medium and device for performing the method
CN112214494B (en) Retrieval method and device
CN116049193A (en) Data storage method and device
CN116257636A (en) Unified management method and device for enumerated data dictionary, electronic equipment and storage medium
CN111984647B (en) Intelligent merging, displaying and storing method for table element
CN113868138A (en) Method, system, equipment and storage medium for acquiring test data
CN110609926A (en) Data tag storage management method and device
CN114443728B (en) Detection report searching method and device based on Elasticissearch
CN116644084B (en) Method, apparatus, device and storage medium for processing three-dimensional model member data
Chen et al. Vertical-Intersection-Based Top-Down Algorithm for Frequent Itemset Mining on MapReduce
CN115827700A (en) Common report extraction method and device
CN117149818A (en) Data searching method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant