CN111367956A - Data statistical method and device - Google Patents
- Publication number
- CN111367956A (CN201811589609.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- field
- index
- statistical
- field value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application provides a data statistics method and device. The method comprises: constructing an inverted index and a forward index according to source data; when a data statistical condition is received, acquiring a query condition from the data statistical condition; querying, from the inverted index, field values that meet the query condition, and determining a data ID set according to the data IDs to which the queried field values belong; acquiring an aggregation condition from the data statistical condition; and querying, from the forward index, the field value of the field to be aggregated contained in each data ID in the data ID set, and performing aggregation statistics to obtain a statistical result. An inverted index and a forward index are constructed for the source data stored in a big data platform; when statistical analysis is needed, a data statistical condition is input to the big data platform, and the statistical result is obtained by querying the inverted index and the forward index. This meets application requirements that call for both a large data volume and real-time statistics. In addition, the statistical result does not need to be written into a cache, which avoids excessive system load and loss of statistical results.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data statistics method and apparatus.
Background
With the rapid development of cloud computing and artificial intelligence, massive data, namely big data, is generated in various fields, and its value is being deeply mined and utilized across industries. At present, scheduled (timed) statistics are usually adopted for statistical requirements that involve a large data volume or a complex, time-consuming statistical process; for simple, short-duration statistical requirements, real-time statistics based on stream computing are usually adopted.
However, neither scheduled statistics nor real-time statistics can meet application requirements that call for both a large data volume and real-time results.
Disclosure of Invention
In view of this, the present application provides a data statistics method and apparatus to solve the problem that the related art cannot meet the application requirements of large data volume and real-time statistics.
According to a first aspect of embodiments of the present application, there is provided a data statistics method, the method including:
constructing an inverted index and a forward index according to stored source data, wherein the inverted index records the data IDs to which each field value belongs, and the forward index records the field values of the fields contained in each data ID;
when a data statistical condition is received, acquiring a query condition from the data statistical condition, wherein the query condition comprises at least one field value condition;
querying, from the inverted index, field values that meet the query condition, and determining a data ID set according to the data IDs to which the queried field values belong;
acquiring an aggregation condition from the data statistical condition, wherein the aggregation condition comprises at least a field to be aggregated;
and querying, from the forward index, the field value of the field to be aggregated contained in each data ID in the data ID set, and performing aggregation statistics to obtain a statistical result.
According to a second aspect of embodiments of the present application, there is provided a data statistics apparatus, the apparatus including:
a construction unit, configured to construct an inverted index and a forward index according to stored source data, wherein the inverted index records the data IDs to which each field value belongs, and the forward index records the field values of the fields contained in each data ID;
a first acquisition unit, configured to acquire a query condition from a data statistical condition when the data statistical condition is received, wherein the query condition comprises at least one field value condition;
a query unit, configured to query, from the inverted index, field values that meet the query condition, and to determine a data ID set according to the data IDs to which the queried field values belong;
a second acquisition unit, configured to acquire an aggregation condition from the data statistical condition, wherein the aggregation condition comprises at least a field to be aggregated;
and a statistics unit, configured to query, from the forward index, the field value of the field to be aggregated contained in each data ID in the data ID set, and to perform aggregation statistics to obtain a statistical result.
According to a third aspect of embodiments herein, there is provided an electronic device, the device comprising a readable storage medium and a processor;
wherein the readable storage medium is configured to store machine executable instructions;
the processor is configured to read the machine executable instructions on the readable storage medium and execute the instructions to implement the steps of the method according to the first aspect.
By applying the embodiments of the application, an inverted index and a forward index can be constructed according to the stored source data. When a data statistical condition is received, a query condition (comprising at least one field value condition) is acquired from it; if a query condition is acquired, the data IDs corresponding to the field values that meet the query condition are queried from the inverted index, and a data ID set is determined from the queried data IDs. An aggregation condition (comprising at least a field to be aggregated) is then acquired from the data statistical condition; if an aggregation condition is acquired, the field value of the field to be aggregated contained in each data ID in the data ID set is queried from the forward index, and aggregation statistics are performed to obtain a statistical result.
Based on the above description, an inverted index and a forward index are constructed for the source data stored in the big data platform; when statistical analysis is needed, a data statistical condition is input to the big data platform, and the statistical result is obtained by querying the inverted index and the forward index.
Drawings
FIG. 1 is a flow chart illustrating an embodiment of a data statistics method according to an exemplary embodiment of the present application;
FIG. 2 is a diagram illustrating a hardware configuration of a server according to an exemplary embodiment of the present application;
FIG. 3 is a block diagram of an embodiment of a data statistics apparatus according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
The existing scheduled statistics and real-time statistics are both customized to specific requirements: they compute statistics directly on the original data and write the statistical results into a cache for later retrieval. As a result, they cannot meet application requirements that call for both a large data volume and real-time statistics, and writing statistical results into a cache for retrieval can also cause system load problems and loss of statistical results.
To solve the above problems, the present application provides a data statistics method. An inverted index and a forward index are first constructed according to stored source data. When a data statistical condition is received, a query condition (comprising at least one field value condition) is acquired from it; if a query condition is acquired, the data IDs corresponding to the field values that meet the query condition are queried from the inverted index, and a data ID set is determined from the queried data IDs. An aggregation condition (comprising at least a field to be aggregated) is then acquired from the data statistical condition; if an aggregation condition is acquired, the field value of the field to be aggregated contained in each data ID in the data ID set is queried from the forward index, and aggregation statistics are performed to obtain a statistical result.
Based on the above description, an inverted index and a forward index are constructed for the source data stored in the big data platform; when statistical analysis is needed, a data statistical condition is input to the big data platform, and the statistical result is obtained by querying the inverted index and the forward index respectively. This meets application requirements that call for both a large data volume and real-time statistics. Because the statistical result does not need to be written into a cache and can be returned to the user terminal in real time, excessive system load and loss of statistical results are also avoided.
In an Internet of Things environment, an RFID (Radio Frequency Identification) chip is mounted on an electric vehicle, a bicycle, a tractor, a person, or another carrier, and collection devices that receive RFID chip data are installed in different places, so that each collection device can collect the data of the RFID chips around it. Assuming that the RFID chip mounted on one carrier is collected N times a day, N pieces of RFID chip data are collected for that carrier each day. If an RFID chip is mounted on every carrier, a massive number of pieces of RFID chip data are collected every day. If this data can be deeply mined, the convenience and intelligence of people's lives can be greatly improved, for example by tracking lost relatives or recovering lost electric vehicles. It is therefore very important to compute statistics over this mass of data accurately and quickly.
The technical scheme of the application is elaborated in detail by taking RFID chip data collected in the environment of the Internet of things as an example.
Fig. 1 is a flowchart illustrating an embodiment of a data statistics method according to an exemplary embodiment of the present application, where the data statistics method may be applied to a server based on a big data platform. As shown in fig. 1, the data statistics method includes the following steps:
Step 101: Construct an inverted index and a forward index according to the stored source data.
Before step 101 is executed, the source data needs to be collected. The collection process may be as follows: receive the collection device ID, chip ID, collection time, and chip data reported by a collection device; obtain the time category label corresponding to the chip ID from the stored registration records; store the collection device ID, chip ID, collection time, time category label, and chip data as one piece of source data; and assign that piece of source data a data ID that uniquely identifies it.
Taking the RFID chip as an example, the collection device may collect RFID chip data at a certain frequency and report it to the server (including the chip ID, collection time, chip data, and the like). The chip ID is a unique ID randomly assigned to the chip by the chip manufacturer according to a relevant protocol during manufacturing, and the chip data may be, for example, the current geographical location of the chip. To enable complex and unpredictable statistical analysis, RFID chips may be tagged with categories, which may consist of digits, characters, or a combination of both. For example, 51 represents an electric vehicle, 52 a bicycle, 53 a pet, 54 a person, and 55 another object. When an RFID chip is registered after being mounted on a carrier, the registration time and the category can be combined into a time category label, and the time category label and the chip ID can be recorded together as one registration record. For example, the time category label 18080351 represents an electric vehicle registered on August 3, 2018, and the chip ID 12345 recorded with it is the ID of the chip mounted on that electric vehicle. On this basis, after the collection device ID, chip ID, collection time, and chip data reported by a collection device are received, the corresponding time category label can be looked up in the registration records by chip ID and combined with the reported data into one complete piece of source data. If the chip ID does not exist in the registration records, the time category label is set to 00000055.
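The collection step above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary-based registry, the YYMMDD date layout inferred from the 18080351 example, and the placeholder chip data are all assumptions made for this sketch.

```python
from datetime import date

# Category codes taken from the example in the text.
CATEGORY_CODES = {
    "electric_vehicle": "51",
    "bicycle": "52",
    "pet": "53",
    "person": "54",
    "other": "55",
}
DEFAULT_LABEL = "00000055"  # label used when the chip ID has no registration record


def make_time_category_label(registered_on: date, category: str) -> str:
    """Concatenate the registration date (assumed YYMMDD) and the category code."""
    return registered_on.strftime("%y%m%d") + CATEGORY_CODES[category]


def build_source_record(data_id, dev_no, rfid_id, collect_time, chip_data, registry):
    """Combine one device report with the label looked up by chip ID."""
    return {
        "data_id": data_id,            # unique ID of this piece of source data
        "dev_no": dev_no,              # collection device ID
        "rfid_id": rfid_id,            # chip ID
        "collect_time": collect_time,  # collection time in milliseconds
        "time_type": registry.get(rfid_id, DEFAULT_LABEL),
        "chip_data": chip_data,
    }


# Hypothetical in-memory registry: chip ID -> time category label.
registry = {"12345": make_time_category_label(date(2018, 8, 3), "electric_vehicle")}

known = build_source_record(1, "devNo01", "12345", 1533268800000, "<location>", registry)
unknown = build_source_record(2, "devNo01", "99999", 1533268800000, "<location>", registry)
```

Here `known` carries the label 18080351 from the registry, while `unknown` falls back to the default label because chip 99999 was never registered.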
Given the large data volume, the source data can be stored in a distributed storage system such as HBase, where each piece of source data consists of the field values of different fields and corresponds to a data ID (primary key) that uniquely identifies it.
It should be noted that the registration records may also be kept in the memory of each collection device to speed up the lookup, thereby enabling near-real-time storage of the data.
In an embodiment, the inverted index may be constructed from the stored source data as follows: obtain the field values of the designated fields from the stored source data; for each obtained field value, search the existing inverted index for that field value; if it does not exist, store the field value together with its data ID as a new inverted index entry; if it exists, add the data ID corresponding to the field value to the inverted index entry where that field value is located.
To reduce the storage space occupied by index data and to improve query performance, only some of the fields may be designated for the inverted index, according to the user's application requirements. Because the inverted index records the correspondence between each field value and its data IDs, the complete source data can be retrieved from the storage system by data ID; if a field value consists of multiple words, it can be split by a tokenizer. Statistical analysis usually does not need the complete source data. For example, to compute a total, only the number of data IDs corresponding to the field values needs to be counted, so querying the inverted index alone improves query performance. In addition, the inverted index may also record, for each field value, its frequency of occurrence (TF) and its position of occurrence (POS) in the data ID to which it belongs.
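A minimal sketch of this construction, under stated assumptions: the sample records reuse values from the example scenario, the function and variable names are hypothetical, each posting is stored as a `(data_id, tf, [positions])` tuple, and the position is simply the 1-based order of the designated field. Tokenization and repeated values within one record are not handled.

```python
from collections import defaultdict

# Designated fields, in the order that determines each value's position (POS).
INDEXED_FIELDS = ["time_type", "rfid_id", "collect_time", "dev_no"]


def build_inverted_index(records):
    """Map each field value to its postings: (data_id, tf, [positions])."""
    inverted = defaultdict(list)
    for data_id, record in records.items():
        for pos, field in enumerate(INDEXED_FIELDS, start=1):
            value = str(record[field])
            # A new value creates an entry; an existing value gets one more posting.
            inverted[value].append((data_id, 1, [pos]))
    return dict(inverted)


# Illustrative source data keyed by data ID.
records = {
    1: {"time_type": "18080351", "rfid_id": "12345",
        "collect_time": 1533268800000, "dev_no": "devNo01"},
    3: {"time_type": "18090152", "rfid_id": "54321",
        "collect_time": 1535774400000, "dev_no": "devNo02"},
    4: {"time_type": "18090152", "rfid_id": "54321",
        "collect_time": 1535774400000, "dev_no": "devNo02"},
}

inverted = build_inverted_index(records)
total_for_dev02 = len(inverted["devNo02"])  # a total needs only the posting count
```

The last line illustrates the point about totals: counting postings answers "how many pieces of data did devNo02 collect" without touching the source data.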
In an embodiment, the forward index may be constructed from the stored source data as follows: for each data ID contained in the stored source data, obtain the field values of the designated fields contained in that data ID, and store the data ID together with the obtained field values as a forward index entry.
For statistical conditions that aggregate something other than a simple total, the inverted index and the forward index must be used together, so a forward index needs to be built for the designated fields to improve aggregation performance. While the forward index is being built, the offset of each designated field's value within a forward index entry may also be stored.
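The forward index and its offset table can be sketched as below. This is an illustrative layout only: the tuple-per-data-ID representation, the helper names, and the sample record are assumptions, and the offsets follow the field order used in the example scenario.

```python
# Offset of each designated field inside a forward index entry.
FIELD_OFFSETS = {"time_type": 0, "rfid_id": 1, "collect_time": 2, "dev_no": 3}


def build_forward_index(records):
    """Store, per data ID, the designated field values in offset order."""
    ordered_fields = sorted(FIELD_OFFSETS, key=FIELD_OFFSETS.get)
    return {
        data_id: tuple(str(record[field]) for field in ordered_fields)
        for data_id, record in records.items()
    }


def field_value(forward, data_id, field):
    """Read one field value from a forward index entry via its offset."""
    return forward[data_id][FIELD_OFFSETS[field]]


# Illustrative source data keyed by data ID.
records = {
    1: {"time_type": "18080351", "rfid_id": "12345",
        "collect_time": 1533268800000, "dev_no": "devNo01"},
}

forward = build_forward_index(records)
device = field_value(forward, 1, "dev_no")
```

Storing the offset once per field, rather than field names in every entry, keeps each entry compact and makes the lookup a single index operation.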
It should be noted that the timing for constructing the inverted index and the forward index may be determined from the size of each piece of source data, the total amount of source data collected each day, and the size of the storage medium. For example, a new inverted index and a new forward index may be constructed at regular intervals, or when the index data in the current inverted index or forward index reaches a certain threshold. Each constructed inverted index and forward index can be stored on an SSD (Solid State Disk) to improve statistical performance.
In an exemplary scenario, assume that the fields of the collected RFID chip data are the collection device ID, chip ID, collection time, time category label, and chip data, and that the designated fields for constructing the inverted index and the forward index are the collection device ID (dev_no), chip ID (rfid_id), collection time (collect_time), and time category label (time_type). Table 1 shows an exemplary inverted index structure: for term ID 1, the corresponding field value is 18080351; the data IDs containing this field value are 1 and 5; its frequency of occurrence in each data ID is 1; and its position of occurrence is the first position. Table 2 shows an exemplary forward index structure: data ID 1 contains four term values, 18080351, 12345, 1533268800000, and devNo01, each of which occurs once. Table 3 shows the offsets of the field values in the forward index: the offset of the field time_type is 0, indicating that the first value in a forward index entry is the value of time_type; the offset of the field rfid_id is 1, indicating that the second value is the value of rfid_id; and so on.
Term ID | Field value | Index data (DocID, TF, <POS>) |
1 | 18080351 | (1,1,<1>),(5,1,<1>) |
2 | 18090152 | (3,1,<1>),(4,1,<1>),(6,1,<1>) |
3 | 12345 | (1,1,<2>),(2,1,<2>),(5,1,<2>) |
4 | 54321 | (3,1,<2>),(4,1,<2>),(6,1,<2>) |
5 | 1533268800000 | (1,1,<3>),(2,1,<3>) |
6 | 1535774400000 | (3,1,<3>),(4,1,<3>),(6,1,<3>) |
7 | devNo01 | (1,1,<4>),(2,1,<4>),(5,1,<4>) |
8 | devNo02 | (3,1,<4>),(4,1,<4>),(6,1,<4>) |
TABLE 1
TABLE 2
Field | Offset |
time_type | 0 |
rfid_id | 1 |
collect_time | 2 |
dev_no | 3 |
TABLE 3
Step 102: When a data statistical condition is received, acquire a query condition from the data statistical condition, wherein the query condition comprises at least one field value condition.
In one embodiment, matching rules for fuzzy queries may be formulated in advance, and when a data statistical condition is received, the query condition is obtained from it based on these matching rules.
The matching rules may be as follows: "*" represents multiple placeholders, that is, a "*" in a query condition matches any number of characters; "?" represents a single placeholder, that is, a "?" in a query condition matches exactly one character. "*" and "?" may be combined arbitrarily. For example, suppose the received data statistical condition is: count the total number of electric vehicles registered in August 2018 that were collected between September 1, 2018 and September 10, 2018. The query condition obtained is: collect_time:[1535731200000,1536595199000] and time_type:1808*51. It contains two field value conditions: the condition 1808*51 on the time category label field, and the condition [1535731200000,1536595199000] on the collection time field. The query conditions obtained from the data statistical condition conform to the query protocol of the open-source component Elasticsearch.
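The two kinds of field value condition, a wildcard pattern and a closed range, can be sketched as simple predicates. This is an illustration only: it uses Python's standard `fnmatch.fnmatchcase`, whose `*` and `?` behave as described here but which also gives `[...]` a special meaning that these matching rules do not mention; the helper names and the toy inverted index are hypothetical.

```python
from fnmatch import fnmatchcase


def matches_wildcard(value: str, pattern: str) -> bool:
    """'*' matches any number of characters, '?' matches exactly one."""
    return fnmatchcase(value, pattern)


def matches_range(value: int, low: int, high: int) -> bool:
    """Closed interval [low, high], as in collect_time:[low,high]."""
    return low <= value <= high


def query_inverted(inverted, predicate):
    """Collect the data IDs of every field value accepted by the predicate."""
    ids = set()
    for value, data_ids in inverted.items():
        if predicate(value):
            ids.update(data_ids)
    return ids


# Hypothetical inverted index: field value -> data IDs (TF and POS omitted).
inverted = {"18080351": [1, 5], "18081552": [7], "18090151": [9]}

august_ev_ids = query_inverted(inverted, lambda v: matches_wildcard(v, "1808*51"))
in_window = matches_range(1535774400000, 1535731200000, 1536595199000)
```

With the toy index above, only the value 18080351 satisfies `1808*51`: the second value ends in 52 and the third starts with 1809.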
In an exemplary scenario based on the collected source data, assume that the data statistical condition is: for all electric vehicles registered in 2018, count the total amount of electric vehicle data collected by each collection device during the month from August 1, 2018 to August 31, 2018. Analysis yields the following. The query condition for all electric vehicles registered in 2018 is time_type:18*51. The query condition for the month from August 1, 2018 to August 31, 2018 is collect_time:[1533052800000,1535731199000], where 1533052800000 is the timestamp corresponding to August 1, 2018 and 1535731199000 is the timestamp corresponding to August 31, 2018. Combining the two gives the query condition: collect_time:[1533052800000,1535731199000] and time_type:18*51.
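The millisecond timestamps in this scenario can be reproduced with a short calculation. The sketch assumes, as the values themselves suggest, that the range runs from 00:00:00 on the first day to 23:59:59 on the last day in the UTC+8 time zone; the helper name and the query-string formatting are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the timestamps in the text are computed in UTC+8.
UTC_PLUS_8 = timezone(timedelta(hours=8))


def to_millis(moment: datetime) -> int:
    """Convert an aware datetime to a Unix timestamp in milliseconds."""
    return int(moment.timestamp()) * 1000


start = to_millis(datetime(2018, 8, 1, 0, 0, 0, tzinfo=UTC_PLUS_8))
end = to_millis(datetime(2018, 8, 31, 23, 59, 59, tzinfo=UTC_PLUS_8))

# Combine the range condition with the wildcard condition on the label.
query = f"collect_time:[{start},{end}] and time_type:18*51"
```

The resulting `query` string matches the combined query condition given for the scenario.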
Step 103: Query, from the inverted index, the field values that meet the query condition, and determine a data ID set according to the data IDs to which the queried field values belong.
In an embodiment, when determining the data ID set, if there is only one query condition, the data IDs to which the matching field values belong may be taken directly as the data ID set; if there are multiple query conditions, a logical operation (intersection, union, or difference) needs to be performed on the data IDs found by each query condition to obtain the data ID set.
In the scenario of step 102 there are two query conditions. Combined with the inverted index shown in Table 1, for the query condition collect_time:[1533052800000,1535731199000], the data IDs to which the matching field values belong are 1, 2, 3, 4, and 6. For the query condition time_type:18*51, Table 1 gives data IDs 1 and 5 for the field value 18080351, and data IDs 3, 4, and 6 for the field value 18090152. Because the statistical condition counts electric vehicles that were registered in 2018 and were collected by the collection devices during the month from August 1, 2018 to August 31, 2018, the two query conditions are in an AND relationship, and the intersection of their data IDs is 1, 3, 4, and 6.
As will be understood by those skilled in the art, if the statistical condition instead counts electric vehicles registered in 2018 or electric vehicles collected by the collection devices during the month from August 1, 2018 to August 31, 2018, the two query conditions are in an OR relationship, and the union of their data IDs is 1, 2, 3, 4, 5, and 6. If the statistical condition counts electric vehicles registered in 2018 but excludes those collected by the collection devices during the month from August 1, 2018 to August 31, 2018, the two query conditions are in a NOT relationship, and the difference of their data IDs, that is, the IDs found by the registration condition (1, 3, 4, 5, 6) that were not found by the collection time condition (1, 2, 3, 4, 6), is 5.
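The AND, OR, and NOT relationships map directly onto set operations. The sketch below uses the two data ID sets stated for this scenario; the variable names are illustrative, and Python's built-in `set` stands in for whatever structure an actual implementation would use.

```python
# Data IDs found by each query condition, as stated for this scenario.
collected_in_august = {1, 2, 3, 4, 6}   # collect_time range condition
registered_in_2018 = {1, 5} | {3, 4, 6}  # time_type wildcard condition

# AND: both conditions must hold (intersection).
both = sorted(collected_in_august & registered_in_2018)

# OR: either condition may hold (union).
either = sorted(collected_in_august | registered_in_2018)

# NOT: registered in 2018 but excluding those collected in August (difference).
registered_not_collected = sorted(registered_in_2018 - collected_in_august)

# The difference is not symmetric; the opposite direction gives another set.
collected_not_registered = sorted(collected_in_august - registered_in_2018)
```

Note that the order of the operands matters for the difference, so the statistical condition must make clear which condition is being excluded.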
Step 104: Acquire an aggregation condition from the data statistical condition, wherein the aggregation condition comprises at least a field to be aggregated.
In the scenario of step 102, the data statistical condition is: for all electric vehicles registered in 2018, count the total amount of electric vehicle data collected by each collection device during the month from August 1, 2018 to August 31, 2018. Analysis shows that, because the total is counted per collection device, the field to be aggregated is: field: "dev_no".
Step 105: Query, from the forward index, the field value of the field to be aggregated contained in each data ID in the data ID set, and perform aggregation statistics to obtain a statistical result.
In an embodiment, the field values of the field to be aggregated may be queried from the forward index and aggregated as follows. First, the forward index entry of each data ID in the data ID set is queried from the forward index, and these forward index entries together are taken as a subset. Next, one forward index entry is selected from the subset, the field value of the field to be aggregated is found in it, a statistical value is set for that field value with a preset initial value, and the selected entry is deleted from the subset. Then, for each remaining forward index entry in the subset, the field value of the field to be aggregated is found and compared with the field value for which the statistical value was set; if they are consistent, 1 is added to the statistical value and that entry is deleted from the subset. Finally, it is determined whether the subset is empty; if not, the process returns to the step of selecting a forward index entry from the subset.
The field to be aggregated is the field that the data statistical condition requires statistics on. Its field value can be located through the offset of the field in the forward index. Because only a subset is extracted from the forward index and that subset is held in memory for the aggregation operation, the amount of data to be examined is small and in-memory access is faster than disk access, so aggregation efficiency is improved.
In the scenario of steps 103 and 104, the data ID set is 1, 3, 4, and 6. First, the forward index entries of the data IDs in the set are filtered out of Table 2 to form the subset. The field to be aggregated is dev_no. A forward index entry is selected at random from the subset; assume the entry with data ID 1 is selected. According to Table 3, the offset of dev_no in the forward index is 3, so the value of dev_no in this entry is found to be devNo01. A statistical value is set for devNo01 and initialized to 1, and the entry with data ID 1 is deleted from the subset. For each remaining entry in the subset, the value of dev_no is devNo02, which is inconsistent with devNo01, so the final statistical value of devNo01 is 1. Another forward index entry is then selected at random from the subset; assume the entry with data ID 3 is selected. In the same way, its dev_no value is devNo02, a statistical value of 1 is set for devNo02, and the entry with data ID 3 is deleted from the subset. Traversing the remaining entries shows that the dev_no values of the entries with data IDs 4 and 6 are both devNo02, so the statistical value of devNo02 becomes 3 and the entries with data IDs 4 and 6 are deleted from the subset. The subset is now empty and the statistics end. The statistical result is: collection device devNo01 collected 1 piece of data, and collection device devNo02 collected 3 pieces of data.
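The select, compare, and delete loop walked through above can be sketched as follows. The forward index entries are written out from the values described for this scenario (the tuple layout and function name are assumptions), the initial statistical value is taken to be 1, and the sketch picks the first remaining entry rather than a random one, which does not change the totals.

```python
from collections import Counter


def aggregate_by_field(subset, offset):
    """Count occurrences of the field at `offset` using the select/compare/delete loop."""
    remaining = dict(subset)  # work on a copy so the caller's subset is untouched
    result = {}
    while remaining:
        # Select one entry, read its field value, and start its statistical value at 1.
        data_id = next(iter(remaining))
        value = remaining.pop(data_id)[offset]
        count = 1
        # Compare every remaining entry; matching entries are counted and deleted.
        for other_id in list(remaining):
            if remaining[other_id][offset] == value:
                count += 1
                del remaining[other_id]
        result[value] = count
    return result


# Forward index entries for the data ID set {1, 3, 4, 6}:
# (time_type, rfid_id, collect_time, dev_no).
subset = {
    1: ("18080351", "12345", "1533268800000", "devNo01"),
    3: ("18090152", "54321", "1535774400000", "devNo02"),
    4: ("18090152", "54321", "1535774400000", "devNo02"),
    6: ("18090152", "54321", "1535774400000", "devNo02"),
}
DEV_NO_OFFSET = 3  # offset of dev_no in a forward index entry

stats = aggregate_by_field(subset, DEV_NO_OFFSET)

# Equivalent single-pass count, shown only as a cross-check.
cross_check = dict(Counter(entry[DEV_NO_OFFSET] for entry in subset.values()))
```

The single-pass `Counter` version is a different technique that yields the same totals; it is included only to confirm the result of the loop described in the text.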
Those skilled in the art will understand that, after the statistical results are obtained, they can be further sorted, and a preset number of the top-ranked results can be returned to the user for viewing. For example, if the data statistical condition requires returning the top 10 collection devices by total amount of electric vehicle data collected, then after the total for each collection device is obtained, the totals are sorted in descending order and the totals of the first 10 collection devices are returned.
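Sorting and truncating the statistical result is a small step that can be sketched as below. The function name and the per-device totals are hypothetical values chosen for illustration.

```python
def top_n(stats, n):
    """Return the n (field value, total) pairs with the largest totals, descending."""
    return sorted(stats.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical totals per collection device.
stats = {"devNo01": 1, "devNo02": 3, "devNo03": 2}

top_two = top_n(stats, 2)
```

For very large result sets, selecting only the largest n items (for example with `heapq.nlargest`) avoids sorting everything, but a full sort is the most direct reading of the description.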
It should be noted that, each time a data statistical condition is received, it may be parsed to generate a statistical analysis model conforming to a predetermined protocol; the statistical analysis model is then executed against the constructed inverted index and forward index; and finally the statistical analysis model outputs the statistical result. The statistical analysis model can be described in JSON format.
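Purely as an illustration of what a JSON-described statistical analysis model might look like, the sketch below serializes one possible shape for the scenario discussed earlier. The key names (`query`, `and`, `range`, `wildcard`, `aggregation`, `top`) are invented for this sketch; the predetermined protocol itself is not specified in the text.

```python
import json

# Hypothetical model: every key name here is an assumption for illustration.
model = {
    "query": {
        "and": [
            {"field": "collect_time", "range": [1533052800000, 1535731199000]},
            {"field": "time_type", "wildcard": "18*51"},
        ]
    },
    "aggregation": {"field": "dev_no", "top": 10},
}

text = json.dumps(model, indent=2)  # JSON description of the model
parsed = json.loads(text)           # what an executor would read back
```

Keeping the query part and the aggregation part as separate top-level keys mirrors the two conditions that are extracted from a data statistical condition.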
It should be further explained that, if a query condition is obtained but no aggregation condition is obtained, the number of data IDs in the data ID set is counted after the set is obtained; if no query condition is obtained but an aggregation condition is obtained, aggregation statistics are performed with the aggregation condition directly on the field values contained in every data ID recorded in the forward index.
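The three cases (query only, aggregation only, and both) can be sketched with one small dispatcher. The function signature, the tuple-based forward index, and the use of `collections.Counter` for the counting are assumptions made for this sketch.

```python
from collections import Counter


def run_statistics(forward, offsets, id_set=None, agg_field=None):
    """Dispatch on which of the two conditions was obtained."""
    if id_set is None and agg_field is None:
        raise ValueError("at least one of the two conditions is required")
    if agg_field is None:
        # Query condition only: the result is the size of the data ID set.
        return len(id_set)
    # No query condition: aggregate over every data ID in the forward index.
    ids = forward.keys() if id_set is None else id_set
    offset = offsets[agg_field]
    return dict(Counter(forward[data_id][offset] for data_id in ids))


# Illustrative forward index: data ID -> (time_type, dev_no).
offsets = {"time_type": 0, "dev_no": 1}
forward = {
    1: ("18080351", "devNo01"),
    2: ("18080351", "devNo01"),
    3: ("18090152", "devNo02"),
}

count_only = run_statistics(forward, offsets, id_set={1, 3})
aggregate_all = run_statistics(forward, offsets, agg_field="dev_no")
aggregate_some = run_statistics(forward, offsets, id_set={1, 3}, agg_field="dev_no")
```

In the query-only case the forward index is never read, which is why a plain total is the cheapest kind of statistic.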
In the embodiment of the application, a reverse index and a forward index may be constructed according to stored source data, and when a data statistical condition is received, a query condition is obtained from the data statistical condition (the query condition includes at least one field value condition); if the data ID is acquired, inquiring a data ID corresponding to the field value according to the inquiry condition from the reverse index, determining a data ID set according to the inquired data ID, then acquiring an aggregation condition from the data statistical condition (the aggregation condition at least comprises a field to be aggregated), and if the data ID is acquired, inquiring the field value of the field to be aggregated contained in each data ID in the data ID set from the forward index, and performing aggregation statistics to obtain a statistical result.
Based on the above description, an inverted index and a forward index are constructed for the source data stored in a big data platform; when statistical analysis is needed, a data statistical condition is input to the big data platform, and the statistical result is obtained by querying the inverted index and the forward index respectively. This addresses applications that involve a large data volume and require real-time statistics. Because the statistical result does not need to be written into a cache and can be returned to the user terminal in real time, the problems of excessive system load and loss of statistical results are also avoided.
Fig. 2 is a hardware block diagram of a server according to an exemplary embodiment of the present application, where the server includes: a communication interface 201, a processor 202, a machine-readable storage medium 203, and a bus 204; wherein the communication interface 201, the processor 202 and the machine-readable storage medium 203 communicate with each other via a bus 204. The processor 202 may execute the above-described data statistics method by reading and executing machine executable instructions corresponding to the control logic of the data statistics method in the machine readable storage medium 203, and the details of the method are described in the above embodiments, which will not be described herein again.
The machine-readable storage medium 203 referred to herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, and the like. For example, the machine-readable storage medium may be: volatile memory, non-volatile memory, or similar storage media. In particular, the machine-readable storage medium 203 may be a RAM (Random Access Memory), a flash memory, a storage drive (e.g., a hard drive), any type of storage disk (e.g., an optical disk, a DVD, etc.), or similar storage medium, or a combination thereof.
Fig. 3 is a block diagram of an embodiment of a data statistics apparatus according to an exemplary embodiment of the present application, and as shown in fig. 3, the data statistics apparatus includes:
a constructing unit 310, configured to construct a reverse index and a forward index according to stored source data, where the reverse index records a data ID to which each field value belongs, and the forward index records a field value of each field included in each data ID;
a first obtaining unit 320, configured to, when a data statistic condition is received, obtain a query condition from the data statistic condition, where the query condition includes at least one field value condition;
a query unit 330, configured to query, from the inverted index, field values that satisfy the query condition, and determine a data ID set according to the data IDs to which the queried field values belong;
a second obtaining unit 340, configured to obtain an aggregation condition from the data statistics condition, where the aggregation condition at least includes a field to be aggregated;
a statistics unit 350, configured to query, from the forward index, the field value of the field to be aggregated contained in each data ID in the data ID set, and perform aggregation statistics to obtain a statistical result.
In an optional implementation manner, the constructing unit 310 is specifically configured to, in the process of constructing the inverted index according to stored source data: obtain the field values contained in a specified field from the stored source data; for each obtained field value, search whether the field value exists in the existing inverted index; if the field value does not exist, store the field value and the data ID corresponding to the field value as an inverted index entry; and if the field value exists, add the data ID corresponding to the field value to the inverted index entry in which the field value is located.
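A minimal sketch of this construction step is given below, assuming the source data is available as a dictionary from data ID to record and that index keys are `(field, value)` pairs. Both the data layout and the name `build_inverted_index` are assumptions made for illustration.

```python
def build_inverted_index(records, fields):
    """Map each (field, value) pair to the set of data IDs that contain it."""
    inverted = {}
    for data_id, record in records.items():
        for field in fields:
            key = (field, record[field])
            if key not in inverted:
                inverted[key] = {data_id}   # field value not indexed yet: create the entry
            else:
                inverted[key].add(data_id)  # field value already indexed: add the data ID
    return inverted


records = {
    1: {"device_id": "dev-A", "category": "ev"},
    2: {"device_id": "dev-B", "category": "ev"},
    3: {"device_id": "dev-A", "category": "car"},
}
inverted = build_inverted_index(records, fields=["device_id", "category"])
```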
In an optional implementation manner, the constructing unit 310 is specifically configured to, in a process of constructing the forward index according to stored source data, obtain, for each data ID included in the stored source data, a field value of a specified field included in the data ID; the data ID and the obtained field value are stored as a forward index.
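The forward index construction can be sketched in the same style. As before, the dictionary layout of the source data and the name `build_forward_index` are assumptions for illustration; only the specified fields are copied into each entry.

```python
def build_forward_index(records, fields):
    """Map each data ID to the values of the specified fields it contains."""
    return {
        data_id: {field: record[field] for field in fields}
        for data_id, record in records.items()
    }


records = {
    1: {"device_id": "dev-A", "category": "ev", "chip_data": "0x01"},
    2: {"device_id": "dev-B", "category": "ev", "chip_data": "0x02"},
}
forward = build_forward_index(records, fields=["device_id", "category"])
```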
In an alternative implementation, the apparatus further comprises (not shown in fig. 3):
a data collection unit, configured to: receive a collection device ID, a chip ID, a collection time, and chip data reported by a collection device, where the chip data is the data, received by the collection device, that was sent by the chip corresponding to the chip ID; obtain a time category label corresponding to the chip ID from stored record registration information, where the time category label indicates the record registration time of the carrier on which the chip is located and the category of the carrier; and store the collection device ID, the chip ID, the collection time, the time category label, and the chip data as one piece of source data, and assign to that piece of source data a data ID that uniquely identifies it.
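The collection step might be sketched as below. The class `SourceStore`, the in-memory storage, the sequential data IDs, the label format `"2018-electric_vehicle"`, and the handling of an unregistered chip (the label is stored as `None`) are all assumptions for illustration; the application does not prescribe any of them.

```python
import itertools


class SourceStore:
    """In-memory stand-in for the platform's source data storage (illustrative only)."""

    def __init__(self, registrations):
        # registrations: chip ID -> time category label (hypothetical format)
        self.registrations = registrations
        self.records = {}
        self._ids = itertools.count(1)  # source of unique data IDs

    def collect(self, device_id, chip_id, collected_at, chip_data):
        """Store one reported reading as a piece of source data and return its data ID."""
        label = self.registrations.get(chip_id)  # None if the chip is not registered (assumption)
        data_id = next(self._ids)
        self.records[data_id] = {
            "device_id": device_id,
            "chip_id": chip_id,
            "collected_at": collected_at,
            "time_category_label": label,
            "chip_data": chip_data,
        }
        return data_id


store = SourceStore({"chip-001": "2018-electric_vehicle"})
first_id = store.collect("dev-A", "chip-001", "2018-12-25T08:00:00", b"\x01\x02")
```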
In an optional implementation manner, the statistics unit 350 is specifically configured to: query the forward index entry of each data ID in the data ID set from the forward index, and take the forward index entries of the data IDs as a subset; select one forward index entry from the subset, find the field value of the field to be aggregated in the selected forward index entry, set a statistical value for that field value with a preset initial value, and delete the selected forward index entry from the subset; for each forward index entry remaining in the subset, find the field value of the field to be aggregated in that entry and compare it with the field value for which the statistical value was set; if the two are consistent, add 1 to the statistical value and delete that forward index entry from the subset; judge whether the subset is empty; and if not, return to the step of selecting one forward index entry from the subset.
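The counting loop described above can be followed step by step in the sketch below. It assumes the preset initial value of the statistical value is 1 and that the forward index is a dictionary from data ID to a dictionary of field values; the name `count_by_field` and the sample data are hypothetical.

```python
def count_by_field(forward, ids, agg_field, initial=1):
    """Count occurrences of each value of agg_field using the select/compare/delete loop."""
    subset = [forward[data_id] for data_id in ids]  # forward index entries of the data ID set
    stats = {}
    while subset:                        # repeat until the subset is empty
        selected = subset.pop(0)         # select one entry and delete it from the subset
        value = selected[agg_field]
        stats[value] = initial           # statistical value starts at the preset value
        remaining = []
        for entry in subset:
            if entry[agg_field] == value:
                stats[value] += 1        # consistent field value: add 1 and drop the entry
            else:
                remaining.append(entry)  # inconsistent: keep the entry for a later pass
        subset = remaining
    return stats


forward = {
    1: {"device_id": "dev-A"},
    2: {"device_id": "dev-B"},
    3: {"device_id": "dev-A"},
    4: {"device_id": "dev-A"},
}
print(count_by_field(forward, [1, 2, 3, 4], "device_id"))  # {'dev-A': 3, 'dev-B': 1}
```

Each pass removes every entry that shares the selected field value, so the number of passes equals the number of distinct field values. A single pass that accumulates counts in a hash map would produce the same result.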
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application. One of ordinary skill in the art can understand and implement it without inventive effort.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.
Claims (11)
1. A method of data statistics, the method comprising:
constructing a reverse index and a forward index according to stored source data, wherein the reverse index records a data ID to which each field value belongs, and the forward index records field values of each field contained in each data ID;
when receiving a data statistical condition, acquiring a query condition from the data statistical condition, wherein the query condition comprises at least one field value condition;
querying, from the inverted index, field values that satisfy the query condition, and determining a data ID set according to the data IDs to which the queried field values belong;
acquiring an aggregation condition from the data statistical condition, wherein the aggregation condition at least comprises a field to be aggregated;
and querying, from the forward index, the field value of the field to be aggregated contained in each data ID in the data ID set, and performing aggregation statistics to obtain a statistical result.
2. The method of claim 1, wherein constructing the inverted index from the stored source data comprises:
acquiring the field values contained in a specified field from the stored source data;
for each obtained field value, searching whether the field value exists in the existing inverted index;
if the field value does not exist, storing the field value and the data ID corresponding to the field value as an inverted index entry;
and if the field value exists, adding the data ID corresponding to the field value to the inverted index entry in which the field value is located.
3. The method of claim 1, wherein constructing the forward index from the stored source data comprises:
for each data ID contained in the stored source data, acquiring the field value of a specified field contained in the data ID;
and storing the data ID and the obtained field value as a forward index entry.
4. A method according to claim 2 or 3, wherein the source data is collected by:
receiving a collection device ID, a chip ID, a collection time, and chip data reported by a collection device, wherein the chip data is the data, received by the collection device, that was sent by the chip corresponding to the chip ID;
acquiring a time category label corresponding to the chip ID from stored record registration information, wherein the time category label indicates the record registration time of the carrier on which the chip is located and the category of the carrier;
and storing the collection device ID, the chip ID, the collection time, the time category label, and the chip data as one piece of source data, and assigning to that piece of source data a data ID that uniquely identifies it.
5. The method of claim 1, wherein querying field values of the fields to be aggregated included in each data ID in the data ID set from the forward index and performing aggregation statistics comprises:
querying the forward index entry of each data ID in the data ID set from the forward index, and taking the forward index entries of the data IDs as a subset;
selecting one forward index entry from the subset, finding the field value of the field to be aggregated in the selected forward index entry, setting a statistical value for that field value with a preset initial value, and deleting the selected forward index entry from the subset;
for each forward index entry remaining in the subset, finding the field value of the field to be aggregated in that entry, and comparing it with the field value for which the statistical value was set; if the two are consistent, adding 1 to the statistical value and deleting that forward index entry from the subset;
judging whether the subset is empty;
and if not, returning to the step of selecting one forward index entry from the subset.
6. A data statistics apparatus, characterized in that the apparatus comprises:
a constructing unit, configured to construct an inverted index and a forward index according to stored source data, where the inverted index records the data ID to which each field value belongs, and the forward index records the field value of each field contained in each data ID;
a first obtaining unit, configured to, when a data statistical condition is received, obtain a query condition from the data statistical condition, where the query condition includes at least one field value condition;
a query unit, configured to query, from the inverted index, field values that satisfy the query condition, and determine a data ID set according to the data IDs to which the queried field values belong;
a second obtaining unit, configured to obtain an aggregation condition from the data statistical condition, where the aggregation condition at least includes a field to be aggregated;
and a statistics unit, configured to query, from the forward index, the field value of the field to be aggregated contained in each data ID in the data ID set, and perform aggregation statistics to obtain a statistical result.
7. The apparatus according to claim 6, wherein the constructing unit is specifically configured to, in the process of constructing the inverted index according to the stored source data: obtain the field values contained in a specified field from the stored source data; for each obtained field value, search whether the field value exists in the existing inverted index; if the field value does not exist, store the field value and the data ID corresponding to the field value as an inverted index entry; and if the field value exists, add the data ID corresponding to the field value to the inverted index entry in which the field value is located.
8. The apparatus according to claim 6, wherein the constructing unit is specifically configured to, in the process of constructing the forward index according to the stored source data, obtain, for each data ID included in the stored source data, a field value of a specified field included in the data ID; the data ID and the obtained field value are stored as a forward index.
9. The apparatus of claim 7 or 8, further comprising:
a data collection unit, configured to: receive a collection device ID, a chip ID, a collection time, and chip data reported by a collection device, where the chip data is the data, received by the collection device, that was sent by the chip corresponding to the chip ID; obtain a time category label corresponding to the chip ID from stored record registration information, where the time category label indicates the record registration time of the carrier on which the chip is located and the category of the carrier; and store the collection device ID, the chip ID, the collection time, the time category label, and the chip data as one piece of source data, and assign to that piece of source data a data ID that uniquely identifies it.
10. The apparatus according to claim 6, wherein the statistics unit is specifically configured to: query the forward index entry of each data ID in the data ID set from the forward index, and take the forward index entries of the data IDs as a subset; select one forward index entry from the subset, find the field value of the field to be aggregated in the selected forward index entry, set a statistical value for that field value with a preset initial value, and delete the selected forward index entry from the subset; for each forward index entry remaining in the subset, find the field value of the field to be aggregated in that entry and compare it with the field value for which the statistical value was set; if the two are consistent, add 1 to the statistical value and delete that forward index entry from the subset; judge whether the subset is empty; and if not, return to the step of selecting one forward index entry from the subset.
11. An electronic device, characterized in that the device comprises a readable storage medium and a processor;
wherein the readable storage medium is configured to store machine executable instructions;
and the processor is configured to read the machine executable instructions on the readable storage medium and execute the instructions to implement the steps of the method of any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811589609.1A CN111367956B (en) | 2018-12-25 | 2018-12-25 | Data statistics method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111367956A (en) | 2020-07-03 |
CN111367956B (en) | 2023-09-26 |
Family
ID=71207858
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811589609.1A Active CN111367956B (en) | 2018-12-25 | 2018-12-25 | Data statistics method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111367956B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100205172A1 (en) * | 2009-02-09 | 2010-08-12 | Robert Wing Pong Luk | Method for using dual indices to support query expansion, relevance/non-relevance models, blind/relevance feedback and an intelligent search interface |
EP2833278A1 (en) * | 2013-07-31 | 2015-02-04 | Linkedin Corporation | Method and apparatus for real-time indexing of data for analytics |
CN103914543A (en) * | 2014-04-03 | 2014-07-09 | 北京百度网讯科技有限公司 | Search result displaying method and device |
US20160378769A1 (en) * | 2015-06-23 | 2016-12-29 | Microsoft Technology Licensing, Llc | Preliminary ranker for scoring matching documents |
CN108595489A (en) * | 2018-03-15 | 2018-09-28 | 北京雷石天地电子技术有限公司 | A kind of data retrieval method and device |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112199463A (en) * | 2020-10-21 | 2021-01-08 | 新华三信息安全技术有限公司 | Data query method, device and equipment |
CN112818013A (en) * | 2021-01-27 | 2021-05-18 | 北京百度网讯科技有限公司 | Time sequence database query optimization method, device, equipment and storage medium |
CN112818013B (en) * | 2021-01-27 | 2023-07-21 | 北京百度网讯科技有限公司 | Time sequence database query optimization method, device, equipment and storage medium |
CN114265849A (en) * | 2022-02-28 | 2022-04-01 | 杭州广立微电子股份有限公司 | Data aggregation method and system |
CN114265849B (en) * | 2022-02-28 | 2022-06-10 | 杭州广立微电子股份有限公司 | Data aggregation method and system |
Also Published As
Publication number | Publication date |
---|---|
CN111367956B (en) | 2023-09-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||