CN111367956B - Data statistics method and device - Google Patents


Info

Publication number
CN111367956B
Authority
CN
China
Prior art keywords
data
field value
statistics
field
forward index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811589609.1A
Other languages
Chinese (zh)
Other versions
CN111367956A (en
Inventor
李聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201811589609.1A priority Critical patent/CN111367956B/en
Publication of CN111367956A publication Critical patent/CN111367956A/en
Application granted granted Critical
Publication of CN111367956B publication Critical patent/CN111367956B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data statistics method and device. The method comprises the following steps: constructing an inverted index and a forward index according to source data; when a data statistics condition is received, acquiring a query condition from the data statistics condition; querying field values meeting the query condition from the inverted index, and determining a data ID set according to the data ID to which each field value belongs; acquiring an aggregation condition from the data statistics condition; and querying, from the forward index, the field value of the field to be aggregated contained in each data ID in the data ID set, and performing aggregation statistics to obtain a statistical result. The inverted index and the forward index are constructed for the source data stored in the big data platform; when statistical analysis is needed, a data statistics condition is input to the big data platform, and the statistical result is obtained by querying the inverted index and the forward index. This meets the application requirement of a large data volume combined with real-time statistics. Meanwhile, the statistical result does not need to be written into a cache, so the problems of heavy system load and lost statistical results can be avoided.

Description

Data statistics method and device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data statistics method and apparatus.
Background
With the rapid development of cloud computing and artificial intelligence, massive data, i.e., big data, is generated in various fields, and the value of big data is being deeply mined and utilized in various industries. At present, a timing statistics mode is often adopted for statistical requirements with a large data volume or a complex and time-consuming statistical flow; for simple statistics that take little time, a real-time statistics mode based on stream computing is often adopted.
However, neither timing statistics nor real-time statistics can meet the application requirement of a large data volume combined with real-time statistics.
Disclosure of Invention
In view of this, the present application provides a data statistics method and apparatus, so as to solve the problem that the related art cannot meet the application requirement of large data size and requiring real-time statistics.
According to a first aspect of an embodiment of the present application, there is provided a data statistics method, the method including:
constructing an inverted index and a forward index according to stored source data, wherein the inverted index records a data ID (identity) to which each field value belongs, and the forward index records field values of various fields contained in each data ID;
when receiving a data statistics condition, acquiring a query condition from the data statistics condition, wherein the query condition comprises at least one field value condition;
querying field values meeting the query condition from the inverted index, and determining a data ID set according to the data ID to which each field value belongs;
acquiring an aggregation condition from the data statistics condition, wherein the aggregation condition at least comprises a field to be aggregated;
and querying, from the forward index, the field value of the field to be aggregated contained in each data ID in the data ID set, and performing aggregation statistics to obtain a statistical result.
According to a second aspect of an embodiment of the present application, there is provided a data statistics apparatus, the apparatus comprising:
a construction unit, configured to construct an inverted index and a forward index according to stored source data, where the inverted index records a data ID to which each field value belongs, and the forward index records a field value of each field included in each data ID;
a first obtaining unit, configured to obtain a query condition from a data statistics condition when the data statistics condition is received, where the query condition includes at least one field value condition;
the query unit is used for querying the field values meeting the query conditions from the inverted index and determining a data ID set according to the data ID to which each field value belongs;
the second acquisition unit is used for acquiring aggregation conditions from the data statistics conditions, wherein the aggregation conditions at least comprise fields to be aggregated;
and the statistics unit is used for querying, from the forward index, the field value of the field to be aggregated contained in each data ID in the data ID set, and performing aggregation statistics to obtain a statistical result.
According to a third aspect of embodiments of the present application, there is provided an electronic device comprising a readable storage medium and a processor;
wherein the readable storage medium is for storing machine executable instructions;
the processor is configured to read the machine executable instructions on the readable storage medium and execute the instructions to implement the steps of the method of the first aspect.
By applying the embodiments of the present application, an inverted index and a forward index can be constructed according to the stored source data. When a data statistics condition is received, a query condition (comprising at least one field value condition) is obtained from the data statistics condition; if the query condition is obtained, the data IDs corresponding to the field values meeting the query condition are queried from the inverted index, and a data ID set is determined according to the queried data IDs. An aggregation condition (comprising at least a field to be aggregated) is then obtained from the data statistics condition; if the aggregation condition is obtained, the field value of the field to be aggregated contained in each data ID in the data ID set is queried from the forward index, and aggregation statistics are performed to obtain a statistical result.
Based on the description, the inverted index and the forward index are constructed for the stored source data in the big data platform, when statistical analysis is needed, data statistical conditions are input to the big data platform, and statistical results are obtained by inquiring the inverted index and the forward index.
Drawings
FIG. 1 is a flow chart illustrating an embodiment of a data statistics method according to an exemplary embodiment of the present application;
FIG. 2 is a hardware block diagram of a server according to an exemplary embodiment of the present application;
FIG. 3 is a block diagram illustrating an embodiment of a data statistics apparatus according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination", depending on the context.
The current timing statistics and real-time statistics methods are both customized to specific requirements and compute statistics directly on the original data, so they cannot meet the application requirement of a large data volume combined with real-time statistics. In addition, both write the statistical results into a cache for retrieval, which increases the system load and risks losing the statistical results.
In order to solve the above problems, the present application proposes a data statistics method that constructs an inverted index and a forward index according to stored source data. When a data statistics condition is received, a query condition (comprising at least one field value condition) is obtained from the data statistics condition; if the query condition is obtained, the data IDs corresponding to the field values meeting the query condition are queried from the inverted index, and a data ID set is determined according to the queried data IDs. An aggregation condition (comprising at least a field to be aggregated) is then obtained from the data statistics condition; if the aggregation condition is obtained, the field value of the field to be aggregated contained in each data ID in the data ID set is queried from the forward index, and aggregation statistics are performed to obtain a statistical result.
Based on the above description, the inverted index and the forward index are constructed for the source data stored in the big data platform; when statistical analysis is needed, a data statistics condition is input to the big data platform, and the statistical result is obtained by querying the inverted index and the forward index respectively. This meets the application requirement of a large data volume combined with real-time statistics. The statistical result does not need to be written into a cache and is returned to the user side in real time, so the problems of excessive system load and lost statistical results can be avoided.
In an Internet of Things environment, an RFID (Radio Frequency Identification) chip is mounted on an electric vehicle, a bicycle, a tractor, a person, or another carrier, and acquisition devices for receiving RFID chip data are installed at different places, so that each acquisition device can collect the RFID chip data in its vicinity. Assuming that one RFID chip is mounted on a carrier and is collected N times a day, N pieces of RFID chip data are collected for it each day. If an RFID chip is mounted on every carrier, a massive amount of RFID chip data is collected every day. If this data can be deeply mined, the convenience and intelligence of people's lives can be greatly improved; for example, lost relatives can be tracked and lost electric vehicles can be traced. How to compute statistics over such massive data accurately and rapidly is therefore important.
The technical scheme of the application is described in detail below by taking RFID chip data collected in the environment of the Internet of things as an example.
FIG. 1 is a flow chart of an embodiment of a data statistics method according to an exemplary embodiment of the present application. The method may be applied to a server based on a big data platform. As shown in FIG. 1, the data statistics method includes the following steps:
Step 101: Construct an inverted index and a forward index according to the stored source data.
Before step 101 is performed, source data needs to be collected. The collection process may be: receiving the acquisition device ID, the chip ID, the acquisition time, and the chip data reported by an acquisition device; obtaining the time category tag corresponding to the chip ID from the stored record registration information; storing the acquisition device ID, the chip ID, the acquisition time, the time category tag, and the chip data as one piece of source data; and setting, for the piece of source data, a data ID that uniquely identifies it.
Taking an RFID chip as an example, the acquisition device may collect RFID chip data at a certain frequency and report it to the server (including the chip ID, the acquisition time, the chip data, and the like). The chip ID is a unique ID that the chip manufacturer randomly sets for the chip according to a related protocol when the chip is manufactured, and the chip data may be data such as the geographical location where the chip is currently located. To support complex and unpredictable statistical analysis, category tags may be added for RFID chips; a category tag may consist of digits, characters, or a combination of both. For example, 51 represents an electric vehicle, 52 a bicycle, 53 a pet, 54 a person, and 55 another object. When record registration is performed after an RFID chip is mounted on a carrier, the registration time and the category can be combined into a time category tag, and the time category tag and the chip ID of the RFID chip can be registered as one piece of record registration information. For example, the registered time category tag 18080351 represents an electric vehicle registered on August 3, 2018, and the chip ID 12345 is the ID of the chip mounted on that electric vehicle. On this basis, after the acquisition device ID, the chip ID, the acquisition time, and the chip data reported by the acquisition device are received, the corresponding time category tag can be looked up in the record registration information by the chip ID, and the time category tag and the reported data are combined into one complete piece of source data. If the chip ID does not exist in the record registration information, the time category tag is set to 00000055.
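The assembly of one piece of source data described above can be sketched as follows. This is a minimal illustration only: the names SourceRecord, build_source_record, make_time_category_tag, and the in-memory registrations dictionary are hypothetical and do not appear in the application, and the yymmdd-plus-category layout of the tag is inferred from the example value 18080351.

```python
from dataclasses import dataclass
from datetime import date
from itertools import count

# Tag used in the text when a chip ID has no record registration information.
DEFAULT_TAG = "00000055"


def make_time_category_tag(registered_on: date, category: str) -> str:
    """Combine the registration date (yymmdd) with a category code such as '51'."""
    return registered_on.strftime("%y%m%d") + category


# Hypothetical record registration information: chip ID -> time category tag.
registrations = {"12345": make_time_category_tag(date(2018, 8, 3), "51")}

# Hypothetical data ID generator; a real system would use the store's primary key.
_data_ids = count(1)


@dataclass(frozen=True)
class SourceRecord:
    data_id: int
    dev_no: str
    rfid_id: str
    collect_time: int  # millisecond timestamp
    time_type: str     # time category tag
    chip_data: str


def build_source_record(dev_no: str, rfid_id: str, collect_time: int, chip_data: str) -> SourceRecord:
    """Look up the tag by chip ID and merge it with the reported data."""
    tag = registrations.get(rfid_id, DEFAULT_TAG)
    return SourceRecord(next(_data_ids), dev_no, rfid_id, collect_time, tag, chip_data)


known = build_source_record("devNo01", "12345", 1533268800000, "location-a")
unknown = build_source_record("devNo02", "99999", 1533268800000, "location-b")
```

Here the registered chip receives its registered tag, while the unregistered chip falls back to the default tag, mirroring the rule stated in the text.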
Because the data volume is large, the source data can be stored in a distributed storage system such as HBase, wherein each piece of source data is composed of field values of different fields, and each piece of source data corresponds to a data ID (primary key) that uniquely identifies it.
It should be noted that the record registration information may also be stored in the memory of each acquisition device to increase the query speed, thereby realizing near-real-time data storage.
In an embodiment, the process of constructing the inverted index according to the stored source data may be as follows: the field values contained in the specified fields are obtained from the stored source data; for each obtained field value, the existing inverted index is searched to determine whether the field value already exists; if it does not exist, the field value and its corresponding data ID are stored as a new inverted index entry; if it exists, the data ID corresponding to the field value is added to the inverted index entry in which the field value is located.
In order to reduce the storage space occupied by index data and improve query performance, the inverted index can be constructed on only some of the fields, designated according to the application requirements of users. Because the inverted index records the correspondence between each field value and its data IDs, the complete source data can be queried from the storage system through a data ID; if a field value consists of multiple words, it can be segmented by a tokenizer. In statistical analysis it is generally unnecessary to obtain the complete source data. For example, to count a total, only the number of data IDs corresponding to the field values needs to be calculated, so query performance can be improved by querying the inverted index. In addition, the frequency (TF) and the position (POS) at which each field value appears in the data ID to which it belongs may also be recorded in the inverted index.
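The construction procedure above can be sketched as follows. This is an illustrative sketch, not the implementation of the application: build_inverted_index and postings are hypothetical names, the sample records are the values listed for data IDs 1, 3, and 4 in Table 1, and the sketch assumes each field value is indexed as a single term (no word segmentation).

```python
# Fields designated for indexing, in the order that defines the position (POS).
INDEXED_FIELDS = ["time_type", "rfid_id", "collect_time", "dev_no"]

# Sample source data keyed by data ID (values taken from Table 1).
source = {
    1: {"time_type": "18080351", "rfid_id": "12345", "collect_time": 1533268800000, "dev_no": "devNo01"},
    3: {"time_type": "18090152", "rfid_id": "54321", "collect_time": 1535774400000, "dev_no": "devNo02"},
    4: {"time_type": "18090152", "rfid_id": "54321", "collect_time": 1535774400000, "dev_no": "devNo02"},
}


def build_inverted_index(records, fields):
    """Map each field value to {data_id: [positions]}."""
    index = {}
    for data_id, record in records.items():
        for pos, field in enumerate(fields, start=1):
            value = str(record[field])
            # A new entry is created if the value is absent; otherwise the
            # data ID is added to the existing entry.
            entry = index.setdefault(value, {})
            entry.setdefault(data_id, []).append(pos)
    return index


def postings(index, value):
    """Return (data ID, TF, positions) triples for a field value, sorted by data ID."""
    return [(doc, len(pos), pos) for doc, pos in sorted(index.get(value, {}).items())]


inverted = build_inverted_index(source, INDEXED_FIELDS)
device_two = postings(inverted, "devNo02")
total_for_device_two = len(inverted["devNo02"])  # a total needs only the ID count
```

Counting the total for a field value only requires the number of data IDs in its entry, which is why a plain total can be answered from the inverted index without reading the source data.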
In an embodiment, the process of constructing the forward index according to the stored source data may be as follows: for each data ID contained in the stored source data, the field values of the specified fields contained in the data ID are obtained, and the data ID and the obtained field values are stored as a forward index entry.
For aggregation statistics other than a plain total, the statistics can be realized more efficiently by combining the inverted index and the forward index. The forward index therefore also needs to be established for the designated fields, and the performance of aggregation statistics is improved through the forward index. When the forward index is constructed, the offset of the field value of each specified field in the forward index may also be stored.
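A forward index with stored field offsets might look like the sketch below. The function names are hypothetical, and the offsets follow the layout given in Table 3 (time_type at 0, rfid_id at 1, collect_time at 2, dev_no at 3); the sample records reuse values from Table 1.

```python
# Offset of each designated field inside a forward index entry (as in Table 3).
FIELD_OFFSETS = {"time_type": 0, "rfid_id": 1, "collect_time": 2, "dev_no": 3}

source = {
    1: {"time_type": "18080351", "rfid_id": "12345", "collect_time": 1533268800000, "dev_no": "devNo01"},
    3: {"time_type": "18090152", "rfid_id": "54321", "collect_time": 1535774400000, "dev_no": "devNo02"},
}


def build_forward_index(records, offsets):
    """Store, for each data ID, the designated field values at their offsets."""
    forward = {}
    for data_id, record in records.items():
        entry = [None] * len(offsets)
        for field, offset in offsets.items():
            entry[offset] = record[field]
        forward[data_id] = tuple(entry)
    return forward


def field_value(forward, offsets, data_id, field):
    """Read one field value from a forward index entry through its offset."""
    return forward[data_id][offsets[field]]


forward_index = build_forward_index(source, FIELD_OFFSETS)
device_of_record_one = field_value(forward_index, FIELD_OFFSETS, 1, "dev_no")
```

Storing the offsets once lets every entry be a compact tuple, so reading the field to be aggregated is a single indexed access per data ID.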
It should be noted that the time at which the inverted index and the forward index are constructed may be determined by jointly considering the size of each piece of source data, the total amount of source data collected every day, and the size of the storage medium. For example, a new inverted index and a new forward index may be constructed at regular intervals, or a new inverted index and a new forward index may be constructed when the index data in the current inverted index or forward index reaches a certain threshold. Each constructed inverted index and forward index can be stored on an SSD (Solid State Disk) to improve statistical performance.
In an exemplary scenario, assume that the collected RFID chip data contains the fields acquisition device ID, chip ID, acquisition time, time category tag, and chip data, and that the designated fields for constructing the inverted index and the forward index are the acquisition device ID (dev_no), the chip ID (rfid_id), the acquisition time (collect_time), and the time category tag (time_type). Table 1 shows an exemplary inverted index structure. For example, the term with term ID 1 has the field value 18080351; the data IDs containing this field value are 1 and 5; the field value appears once in each of these data IDs (TF is 1); and it appears at the first position. Table 2 shows an exemplary forward index structure, in which the data with data ID 1 contains four term values, namely 18080351, 12345, 1533268800000, and devNo01, each appearing once. Table 3 shows the offset of the field value of each field in the forward index: the offset of the field time_type is 0, indicating that the first value in a forward index entry is the field value of time_type; the offset of the field rfid_id is 1, indicating that the second value is the field value of rfid_id; and so on.
Term ID   Field value     Index data (DocID, TF, <POS>)
1         18080351        (1,1,<1>),(5,1,<1>)
2         18090152        (3,1,<1>),(4,1,<1>),(6,1,<1>)
3         12345           (1,1,<2>),(2,1,<2>),(5,1,<2>)
4         54321           (3,1,<2>),(4,1,<2>),(6,1,<2>)
5         1533268800000   (1,1,<3>),(2,1,<3>)
6         1535774400000   (3,1,<3>),(4,1,<3>),(6,1,<3>)
7         devNo01         (1,1,<4>),(2,1,<4>),(5,1,<4>)
8         devNo02         (3,1,<4>),(4,1,<4>),(6,1,<4>)
TABLE 1
TABLE 2
Field          Offset
time_type      0
rfid_id        1
collect_time   2
dev_no         3
TABLE 3
Step 102: upon receiving a data statistics condition, a query condition is obtained from the data statistics condition, the query condition comprising at least one field value condition.
In one embodiment, a matching rule of the fuzzy query may be formulated in advance, and when the data statistics condition is received, the query condition is obtained from the data statistics condition based on the matching rule.
The formulated matching rule may be: "*" represents multiple placeholders, i.e., "*" in the query condition matches multiple characters; "?" represents one placeholder, i.e., "?" matches one character. "*" and "?" can be arbitrarily combined. For example, if the received data statistics condition is to count the total number of all electric vehicles registered in August 2018 that were collected in the period from September 1, 2018 to September 10, 2018, the query condition is: collect_time:[1535731200000,1536595199000] and time_type:1808*51. It contains two field value conditions: the field value condition 1808*51 of the time category tag field, and the field value condition [1535731200000,1536595199000] of the acquisition time field. The query condition obtained from the data statistics condition conforms to the query protocol of the open source component Elasticsearch.
In an exemplary scenario, based on the source data collected above, assume that the data statistics condition is to count, for all electric vehicles registered in 2018 and within the period from August 1, 2018 to August 31, 2018, the total amount of electric vehicle data collected by each acquisition device. Analysis shows that the query condition for all electric vehicles registered in 2018 is time_type:18*51, and the query condition for the period from August 1, 2018 to August 31, 2018 is collect_time:[1533052800000,1535731199000], where 1533052800000 is the timestamp corresponding to August 1, 2018 and 1535731199000 is the timestamp corresponding to August 31, 2018. Combining the two query conditions by intersection gives: collect_time:[1533052800000,1535731199000] and time_type:18*51.
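The two kinds of field value condition used above, a wildcard pattern and a closed numeric range, can be evaluated with a sketch like the following. The function names are hypothetical, and the sketch assumes that "*" matches zero or more characters (as in common wildcard query syntax), whereas the text only says that it matches multiple characters.

```python
import re


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate '*' (any run of characters) and '?' (one character) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # every other character matches literally
    return re.compile("".join(parts))


def matches_wildcard(pattern: str, value: str) -> bool:
    """True if the whole field value matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None


def in_range(low: int, high: int, value: int) -> bool:
    """True if value lies in the closed interval [low, high]."""
    return low <= value <= high


registered_in_2018 = matches_wildcard("18*51", "18080351")
collected_in_august = in_range(1533052800000, 1535731199000, 1533268800000)
```

A field value condition of this kind would be applied to the terms of the inverted index, and the data IDs of every matching term would then be collected.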
Step 103: Query the field values meeting the query condition from the inverted index, and determine a data ID set according to the data ID to which each queried field value belongs.
In an embodiment, in the process of determining the data ID set according to the data ID to which each queried field value belongs, if there is one query condition, the data IDs to which the queried field values belong may be determined as the data ID set; if there are multiple query conditions, a predicate operation (such as intersection, union, or difference) needs to be performed on the data IDs obtained for each query condition to obtain the data ID set.
Based on the scenario shown in step 102, there are two query conditions. In combination with the inverted index shown in Table 1, for the query condition collect_time:[1533052800000,1535731199000], the data IDs to which the field values meeting the query condition belong, as queried from Table 1, are 1, 2, 3, 4, and 6. For the query condition time_type:18*51, the data IDs to which the field value 18080351 belongs are 1 and 5, and the data IDs to which the field value 18090152 belongs are 3, 4, and 6. Since the statistical condition is to count all electric vehicles registered in 2018 that were collected by the acquisition devices within the period from August 1, 2018 to August 31, 2018, the two query conditions are in an AND relation, so the intersection of the data IDs of the two query conditions is taken, giving 1, 3, 4, and 6.
It will be appreciated by those skilled in the art that if the statistical condition becomes counting all electric vehicles registered in 2018, or electric vehicles collected by the acquisition devices within the period from August 1, 2018 to August 31, 2018, then the two query conditions are in an OR relation, and the union of the data IDs of the two query conditions is 1, 2, 3, 4, 5, and 6. If the statistical condition becomes counting all electric vehicles registered in 2018 excluding those collected by the acquisition devices within the period from August 1, 2018 to August 31, 2018, the two query conditions are in a NOT relation, and the difference set of the data IDs (the IDs matching the first condition but not the second) is 5.
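The AND, OR, and NOT relations map directly onto set intersection, union, and difference. The sketch below takes as inputs the two data ID sets stated in the text for the time_type condition and the collect_time condition; the combine function and the operator names are hypothetical.

```python
# Data ID sets stated in the text for each query condition.
time_type_ids = {1, 3, 4, 5, 6}      # time_type:18*51
collect_time_ids = {1, 2, 3, 4, 6}   # collect_time:[1533052800000,1535731199000]


def combine(relation: str, left: set, right: set) -> set:
    """Combine the data ID sets of two query conditions."""
    if relation == "and":
        return left & right   # both conditions hold
    if relation == "or":
        return left | right   # either condition holds
    if relation == "not":
        return left - right   # first condition holds, second does not
    raise ValueError(f"unknown relation: {relation}")


and_ids = combine("and", time_type_ids, collect_time_ids)
or_ids = combine("or", time_type_ids, collect_time_ids)
not_ids = combine("not", time_type_ids, collect_time_ids)
```

Because the operations work on data IDs only, the complete source data never has to be read when the conditions are combined.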
Step 104: Acquire an aggregation condition from the data statistics condition, wherein the aggregation condition at least comprises a field to be aggregated.
Based on the scenario shown in step 102, the data statistics condition is to count, for all electric vehicles registered in 2018 and within the period from August 1, 2018 to August 31, 2018, the total amount of electric vehicle data collected by each acquisition device. Analysis shows that the field to be aggregated for the total amount of electric vehicle data collected by each acquisition device is: field: "dev_no".
Step 105: Query, from the forward index, the field value of the field to be aggregated contained in each data ID in the data ID set, and perform aggregation statistics to obtain a statistical result.
In an embodiment, the process of querying, from the forward index, the field value of the field to be aggregated contained in each data ID in the data ID set and performing aggregation statistics may be as follows. The forward index entry of each data ID in the data ID set is queried from the forward index, and these entries are taken as a subset. One forward index entry is selected from the subset, the field value of the field to be aggregated is queried from the selected entry, a statistical value is set for that field value with its initial value set to a preset value, and the selected entry is deleted from the subset. Then, for each forward index entry remaining in the subset, the field value of the field to be aggregated is queried from the entry and compared with the field value for which the statistical value was set; if the two are consistent, the statistical value is incremented by 1 and the entry is deleted from the subset. Finally, whether the subset is empty is judged; if it is not empty, the step of selecting a forward index entry from the subset is executed again; if it is empty, the statistics end.
The field to be aggregated refers to the field for which statistics are required in the data statistics condition. The field value of the field to be aggregated can be queried through the offset of the field to be aggregated in the forward index. By determining a subset from the forward index and placing the subset in memory for the aggregation operation, aggregation efficiency can be improved, because the subset queried each time is small and querying in memory is faster than querying on a hard disk.
Based on the scenarios shown in step 103 and step 104, the data ID set is 1, 3, 4, 6. First, the forward index entries of the data IDs in the data ID set are filtered from Table 2 to form a subset. The field to be aggregated is dev_no. A forward index entry is randomly selected from the subset; assume the entry with data ID 1 is selected. According to Table 3, the offset of dev_no in a forward index entry is 3, so the field value of dev_no found in the entry is devNo01. A statistical value is set for devNo01 with an initial value of 1, and the entry with data ID 1 is deleted from the subset. Each remaining entry in the subset is then traversed; since the field value of dev_no in every remaining entry is devNo02, which is inconsistent with devNo01, the final statistical value of devNo01 is 1. Next, another forward index entry is randomly selected from the subset; assume the entry with data ID 3 is selected. Its dev_no field value is devNo02, so a statistical value is set for devNo02 with an initial value of 1, and the entry with data ID 3 is deleted from the subset. Traversing the remaining entries shows that the dev_no field value in the entries with data IDs 4 and 6 is also devNo02, so the statistical value of devNo02 becomes 3, and the entries with data IDs 4 and 6 are deleted from the subset. The subset is now empty, and the statistics end. The statistical result is: acquisition device devNo01 collected 1 piece of data, and acquisition device devNo02 collected 3 pieces of data.
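The select, compare, and delete procedure walked through above can be sketched as follows. The function name aggregate_by_deletion is hypothetical; the forward index entries for data IDs 1, 3, 4, and 6 are assembled from the values in Table 1, and the preset initial value is assumed to be 1, as in the worked example.

```python
from collections import Counter

FIELD_OFFSETS = {"time_type": 0, "rfid_id": 1, "collect_time": 2, "dev_no": 3}

# Forward index entries for the data ID set {1, 3, 4, 6} (values from Table 1).
forward_index = {
    1: ("18080351", "12345", 1533268800000, "devNo01"),
    3: ("18090152", "54321", 1535774400000, "devNo02"),
    4: ("18090152", "54321", 1535774400000, "devNo02"),
    6: ("18090152", "54321", 1535774400000, "devNo02"),
}


def aggregate_by_deletion(forward, id_set, field, offsets, initial=1):
    """Count entries per field value by repeatedly selecting and deleting entries."""
    offset = offsets[field]
    subset = {data_id: forward[data_id] for data_id in id_set}  # in-memory subset
    result = {}
    while subset:                                # repeat until the subset is empty
        selected = next(iter(subset))            # select any remaining entry
        value = subset.pop(selected)[offset]     # read the field, delete the entry
        statistic = initial
        for other in list(subset):               # compare every remaining entry
            if subset[other][offset] == value:
                statistic += 1
                del subset[other]
        result[value] = statistic
    return result


stats = aggregate_by_deletion(forward_index, {1, 3, 4, 6}, "dev_no", FIELD_OFFSETS)
# The same totals can be obtained in one pass with a hash-based counter.
one_pass = dict(Counter(forward_index[i][FIELD_OFFSETS["dev_no"]] for i in (1, 3, 4, 6)))
```

The deletion-based loop follows the description literally; the Counter line is shown only as an equivalent formulation and is not taken from the application.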
It will be appreciated by those skilled in the art that after the statistical results are obtained, they may be further sorted, and a preset number of top-ranked statistical results may be fed back to the user for viewing. For example, assuming that the data statistics condition requires returning the top 10 acquisition devices by total amount of electric vehicle data collected, the totals of all acquisition devices are sorted from largest to smallest after they are obtained, and the totals of the first 10 acquisition devices are selected for feedback.
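Ranking the aggregated totals and keeping a preset number of results can be sketched as below. The function name top_n is hypothetical, the input reuses the per-device totals from the worked example, and breaking ties by device ID is an assumption made only to keep the order deterministic.

```python
def top_n(stats: dict, n: int) -> list:
    """Return the n largest totals as (field value, total) pairs, largest first."""
    ranked = sorted(stats.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


device_totals = {"devNo01": 1, "devNo02": 3}
best_device = top_n(device_totals, 1)
top_ten = top_n(device_totals, 10)  # fewer than 10 devices: all are returned
```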
It should be noted that, each time a data statistics condition is received, the data statistics condition may be analyzed to generate a statistical analysis model conforming to a predetermined protocol; then, executing the statistical analysis model based on the constructed inverted index and the forward index; finally, the statistical analysis model outputs a statistical result. The statistical analysis model may be described in JSON format.
It should be further noted that if the query condition is acquired but the aggregation condition is not, the number of data IDs in the data ID set is counted after the set is obtained. If the aggregation condition is acquired but the query condition is not, the aggregation condition is applied directly, and aggregation statistics are performed on the field values contained in every data ID recorded in the forward index.
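The three cases (both conditions present, query condition only, aggregation condition only) can be dispatched as in the following sketch. The function run_statistics and its convention of passing None for a missing condition are hypothetical; the forward index entries are assembled from Table 1 and cover only data IDs 1, 3, 4, and 6.

```python
from collections import Counter

FIELD_OFFSETS = {"time_type": 0, "rfid_id": 1, "collect_time": 2, "dev_no": 3}

forward_index = {
    1: ("18080351", "12345", 1533268800000, "devNo01"),
    3: ("18090152", "54321", 1535774400000, "devNo02"),
    4: ("18090152", "54321", 1535774400000, "devNo02"),
    6: ("18090152", "54321", 1535774400000, "devNo02"),
}


def run_statistics(forward, offsets, id_set=None, agg_field=None):
    """id_set is None when no query condition was acquired;
    agg_field is None when no aggregation condition was acquired."""
    if agg_field is None:
        if id_set is None:
            raise ValueError("neither a query nor an aggregation condition was given")
        return len(id_set)                       # query only: count the data IDs
    ids = forward.keys() if id_set is None else id_set   # no query: every data ID
    offset = offsets[agg_field]
    return dict(Counter(forward[data_id][offset] for data_id in ids))


count_only = run_statistics(forward_index, FIELD_OFFSETS, id_set={1, 3, 4, 6})
aggregate_all = run_statistics(forward_index, FIELD_OFFSETS, agg_field="dev_no")
both = run_statistics(forward_index, FIELD_OFFSETS, id_set={3, 4}, agg_field="dev_no")
```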
In the embodiment of the application, an inverted index and a forward index can be constructed according to stored source data. When a data statistics condition is received, a query condition (comprising at least one field value condition) is obtained from the data statistics condition. If the query condition is obtained, the data IDs corresponding to the field values that meet the query condition are queried from the inverted index, and a data ID set is determined according to the queried data IDs. An aggregation condition (comprising at least a field to be aggregated) is then obtained from the data statistics condition. If the aggregation condition is obtained, the field value of the field to be aggregated contained in each data ID in the data ID set is queried from the forward index, and aggregation statistics are performed to obtain a statistical result.
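The query step against the inverted index can be illustrated with the sketch below. Several points are assumptions rather than statements of the application: the inverted index is keyed by a (field, value) pair, each field value condition is an equality test, multiple conditions are combined by intersection, and the field names and sample values are invented.

```python
# Hypothetical inverted index: (field, value) -> data IDs containing that value.
INVERTED_INDEX = {
    ("category", "electric_vehicle"): [1, 3, 4, 6],
    ("category", "bicycle"): [2, 5],
    ("dev_no", "devNo01"): [1, 2],
    ("dev_no", "devNo02"): [3, 4, 5, 6],
}


def query_id_set(inverted_index, conditions):
    """Return the data IDs that satisfy every (field, value) condition."""
    id_set = None
    for field, value in conditions:
        # Look up the data IDs recorded for this field value, if any.
        ids = set(inverted_index.get((field, value), ()))
        # Assumption: conditions are ANDed, so the sets are intersected.
        id_set = ids if id_set is None else id_set & ids
    return id_set if id_set is not None else set()


ids = query_id_set(INVERTED_INDEX, [("category", "electric_vehicle")])
print(sorted(ids))
```

The resulting data ID set is what the aggregation step then looks up in the forward index.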
Based on the above description, an inverted index and a forward index are constructed for the source data stored in the big data platform. When statistical analysis is needed, a data statistics condition is input to the big data platform, and the statistical result is obtained by querying the inverted index and the forward index respectively. This meets the application demand for real-time statistics over a large data volume. Because the statistical result is returned to the user side in real time and does not need to be written into a cache, the problems of excessive system load and loss of statistical results can be avoided.
Fig. 2 is a hardware configuration diagram of a server according to an exemplary embodiment of the present application, the server including: a communication interface 201, a processor 202, a machine-readable storage medium 203, and a bus 204; wherein the communication interface 201, the processor 202, and the machine-readable storage medium 203 communicate with each other via the bus 204. The processor 202 may perform the data statistics method described above by reading and executing the machine-executable instructions in the machine-readable storage medium 203 that correspond to the control logic of the data statistics method; for details, refer to the above embodiments, which are not repeated here.
The machine-readable storage medium 203 of the present application may be any electronic, magnetic, optical, or other physical storage device that can contain or store information, such as executable instructions, data, and the like. For example, a machine-readable storage medium may be: volatile memory, nonvolatile memory, or similar storage medium. In particular, the machine-readable storage medium 203 may be RAM (Random Access Memory), flash memory, a storage drive (e.g., hard drive), any type of storage disk (e.g., optical disk, DVD, etc.), or a similar storage medium, or a combination thereof.
Fig. 3 is a block diagram of an embodiment of a data statistics apparatus according to an exemplary embodiment of the present application, and as shown in fig. 3, the data statistics apparatus includes:
a construction unit 310, configured to construct an inverted index and a forward index according to stored source data, where the inverted index records a data ID to which each field value belongs, and the forward index records a field value of each field included in each data ID;
a first obtaining unit 320, configured to obtain, when a data statistics condition is received, a query condition from the data statistics condition, where the query condition includes at least one field value condition;
a query unit 330, configured to query field values meeting the query condition from the inverted index, and determine a data ID set according to the data IDs to which each field value belongs;
a second obtaining unit 340, configured to obtain an aggregation condition from the data statistics condition, where the aggregation condition at least includes a field to be aggregated;
and a statistics unit 350, configured to query the field value of the field to be aggregated included in each data ID in the data ID set from the forward index, and perform aggregation statistics to obtain a statistics result.
In an optional implementation manner, the construction unit 310 is specifically configured to obtain, from the stored source data, a field value included in the specified field in a process of constructing the inverted index according to the stored source data; for each acquired field value, searching whether the field value exists in the existing inverted index; if not, the field value and the data ID corresponding to the field value are stored as an inverted index; if the field value exists, the data ID corresponding to the field value is added to the inverted index where the field value exists.
In an optional implementation manner, the construction unit 310 is specifically configured to, in constructing the forward index according to the stored source data, obtain, for each data ID included in the stored source data, a field value of a specified field included in the data ID; the data ID and the acquired field value are stored as a forward index.
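The two construction procedures can be sketched together as follows. This is a minimal sketch under assumptions: the source data is modeled as a dictionary from data ID to record, the inverted index is keyed by a (field, value) pair so that equal values in different fields do not collide, and the field names and sample records are hypothetical.

```python
def build_indexes(source_data, fields):
    """Build an inverted index and a forward index over the specified fields."""
    inverted = {}  # (field, value) -> list of data IDs
    forward = {}   # data ID -> list of field values, in the order of `fields`
    for data_id, record in source_data.items():
        # Forward index: store the data ID with its specified field values.
        forward[data_id] = [record[field] for field in fields]
        for field in fields:
            key = (field, record[field])
            if key not in inverted:
                # Value not indexed yet: create a new inverted index item.
                inverted[key] = [data_id]
            else:
                # Value already indexed: append the data ID to its item.
                inverted[key].append(data_id)
    return inverted, forward


# Hypothetical source data for illustration only.
SOURCE = {
    1: {"category": "electric_vehicle", "dev_no": "devNo01"},
    3: {"category": "electric_vehicle", "dev_no": "devNo02"},
    4: {"category": "electric_vehicle", "dev_no": "devNo02"},
}
inverted, forward = build_indexes(SOURCE, ["category", "dev_no"])
print(inverted[("dev_no", "devNo02")], forward[1])
```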
In an alternative implementation, the apparatus further comprises (not shown in fig. 3):
the data collection unit is used for receiving the acquisition equipment ID, the chip ID, the acquisition time and the chip data reported by the acquisition equipment; the chip data refers to the data sent by the chip corresponding to the chip ID received by the acquisition equipment; acquiring a time category label corresponding to the chip ID from stored record registration information, wherein the time category label is used for indicating record registration time of a carrier where the chip is located and the category of the carrier; and storing the acquisition equipment ID, the chip ID, the acquisition time, the time category label and the chip data as one piece of source data, and setting a data ID for uniquely identifying the piece of source data for the piece of source data.
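A sketch of how the data collection unit might assemble one piece of source data is shown below. The class and attribute names, the string encoding of the time category label, and the use of a sequential counter as the data ID are assumptions for illustration; the text only requires that the data ID uniquely identify the piece of source data.

```python
import itertools
from dataclasses import dataclass


@dataclass
class SourceRecord:
    data_id: int         # uniquely identifies this piece of source data
    dev_no: str          # acquisition equipment ID
    chip_id: str
    collected_at: str    # acquisition time
    time_category: str   # registration time and carrier category label
    chip_data: bytes


class Collector:
    def __init__(self, registrations):
        # Stored record registration information: chip ID -> time category label.
        self._registrations = registrations
        self._next_id = itertools.count(1)
        self.store = []

    def receive(self, dev_no, chip_id, collected_at, chip_data):
        # Raises KeyError for an unregistered chip; the text does not say
        # how that case is handled, so this behavior is an assumption.
        label = self._registrations[chip_id]
        record = SourceRecord(next(self._next_id), dev_no, chip_id,
                              collected_at, label, chip_data)
        self.store.append(record)
        return record


collector = Collector({"chipA": "2018-06|electric_vehicle"})
first = collector.receive("devNo01", "chipA", "2018-12-25T08:00:00", b"\x01")
second = collector.receive("devNo02", "chipA", "2018-12-25T08:05:00", b"\x02")
print(first.data_id, second.data_id, first.time_category)
```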
In an optional implementation manner, the statistics unit 350 is specifically configured to query the forward index item of each data ID in the data ID set from the forward index, and determine the forward index items of the data IDs as a subset; select a forward index item from the subset, search for the field value of the field to be aggregated in the selected forward index item, set a statistical value for the field value, set the initial value of the statistical value to a preset value, and delete the selected forward index item from the subset; for each of the forward index items remaining in the subset, search for the field value of the field to be aggregated in the forward index item, and compare the found field value with the field value for which the statistical value is set; if they are consistent, add 1 to the statistical value and delete the forward index item from the subset; judge whether the subset is empty; if not, continue to execute the step of selecting one forward index item from the subset.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
Since the device embodiments essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art can understand and implement the present application without undue burden.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The foregoing description covers only preferred embodiments of the application and is not intended to limit the application; any modification, equivalent replacement, improvement or the like made within the spirit and principles of the application shall fall within its scope of protection.

Claims (9)

1. A method of data statistics, the method comprising:
constructing an inverted index and a forward index according to stored source data, wherein the inverted index records the data ID to which each field value belongs, and the forward index records the field values of the fields contained in each data ID;
when receiving a data statistics condition, acquiring a query condition from the data statistics condition, wherein the query condition comprises at least one field value condition;
querying field values meeting the query condition from the inverted index, and determining a data ID set according to the data ID to which each field value belongs;
acquiring an aggregation condition from the data statistics condition, wherein the aggregation condition at least comprises a field to be aggregated;
querying the field value of the field to be aggregated contained in each data ID in the data ID set from the forward index, and carrying out aggregation statistics to obtain a statistical result;
the step of querying the field value of the field to be aggregated contained in each data ID in the data ID set from the forward index and performing aggregation statistics includes:
querying a forward index item of each data ID in the data ID set from the forward index, and determining the forward index item of each data ID as a subset;
selecting a forward index item from the subset, searching a field value of the field to be aggregated in the selected forward index item, setting a statistical value for the field value, setting an initial value of the statistical value as a preset value, and deleting the selected forward index item from the subset;
for each of the forward index items remaining in the subset, searching for the field value of the field to be aggregated in the forward index item, and comparing the found field value with the field value for which the statistical value is set; if they are consistent, adding 1 to the statistical value, and deleting the forward index item from the subset;
judging whether the subset is empty or not;
if not, continuing to execute the step of selecting one forward index item from the subset.
2. The method of claim 1, wherein constructing an inverted index from stored source data comprises:
acquiring a field value contained in a specified field from stored source data;
for each acquired field value, searching whether the field value exists in the existing inverted index;
if not, the field value and the data ID corresponding to the field value are stored as an inverted index;
if the field value exists, the data ID corresponding to the field value is added to the inverted index where the field value exists.
3. The method of claim 1, wherein constructing the forward index from the stored source data comprises:
for each data ID contained in the stored source data, acquiring a field value of a specified field contained in the data ID;
the data ID and the acquired field value are stored as a forward index.
4. A method according to claim 2 or 3, characterized in that the source data is collected by:
receiving acquisition equipment ID, chip ID, acquisition time and chip data reported by acquisition equipment; the chip data refers to the data sent by the chip corresponding to the chip ID received by the acquisition equipment;
acquiring a time category label corresponding to the chip ID from stored record registration information, wherein the time category label is used for indicating record registration time of a carrier where the chip is located and the category of the carrier;
and storing the acquisition equipment ID, the chip ID, the acquisition time, the time category label and the chip data as one piece of source data, and setting a data ID for uniquely identifying the piece of source data for the piece of source data.
5. A data statistics apparatus, the apparatus comprising:
a construction unit, configured to construct an inverted index and a forward index according to stored source data, where the inverted index records a data ID to which each field value belongs, and the forward index records a field value of each field included in each data ID;
a first obtaining unit, configured to obtain a query condition from a data statistics condition when the data statistics condition is received, where the query condition includes at least one field value condition;
the query unit is used for querying the field values meeting the query conditions from the inverted index and determining a data ID set according to the data ID to which each field value belongs;
the second acquisition unit is used for acquiring aggregation conditions from the data statistics conditions, wherein the aggregation conditions at least comprise fields to be aggregated;
the statistics unit is used for inquiring the field value of the field to be aggregated contained in each data ID in the data ID set from the forward index and carrying out aggregation statistics to obtain a statistics result;
the statistics unit is specifically configured to query a forward index item of each data ID in the data ID set from the forward index, and determine the forward index items of the data IDs as a subset; select a forward index item from the subset, search for the field value of the field to be aggregated in the selected forward index item, set a statistical value for the field value, set the initial value of the statistical value to a preset value, and delete the selected forward index item from the subset; for each of the forward index items remaining in the subset, search for the field value of the field to be aggregated in the forward index item, and compare the found field value with the field value for which the statistical value is set; if they are consistent, add 1 to the statistical value, and delete the forward index item from the subset; judge whether the subset is empty; if not, continue to execute the step of selecting one forward index item from the subset.
6. The apparatus according to claim 5, wherein the construction unit is specifically configured to obtain, from the stored source data, a field value included in the specified field in constructing the inverted index from the stored source data; for each acquired field value, searching whether the field value exists in the existing inverted index; if not, the field value and the data ID corresponding to the field value are stored as an inverted index; if the field value exists, the data ID corresponding to the field value is added to the inverted index where the field value exists.
7. The apparatus according to claim 5, wherein the construction unit is specifically configured to, in constructing the forward index from the stored source data, obtain, for each data ID included in the stored source data, a field value of a specified field included in the data ID; the data ID and the acquired field value are stored as a forward index.
8. The apparatus according to claim 6 or 7, characterized in that the apparatus further comprises:
the data collection unit is used for receiving the acquisition equipment ID, the chip ID, the acquisition time and the chip data reported by the acquisition equipment; the chip data refers to the data sent by the chip corresponding to the chip ID received by the acquisition equipment; acquiring a time category label corresponding to the chip ID from stored record registration information, wherein the time category label is used for indicating record registration time of a carrier where the chip is located and the category of the carrier; and storing the acquisition equipment ID, the chip ID, the acquisition time, the time category label and the chip data as one piece of source data, and setting a data ID for uniquely identifying the piece of source data for the piece of source data.
9. An electronic device comprising a readable storage medium and a processor;
wherein the readable storage medium is for storing machine executable instructions;
the processor is configured to read the machine-executable instructions on the readable storage medium and execute the instructions to implement the steps of the method of any of claims 1-4.
CN201811589609.1A 2018-12-25 2018-12-25 Data statistics method and device Active CN111367956B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811589609.1A CN111367956B (en) 2018-12-25 2018-12-25 Data statistics method and device

Publications (2)

Publication Number Publication Date
CN111367956A CN111367956A (en) 2020-07-03
CN111367956B true CN111367956B (en) 2023-09-26

Family

ID=71207858

Country Status (1)

Country Link
CN (1) CN111367956B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant