CN113420263B

CN113420263B - Data statistics method, device, equipment and storage medium

Info

Publication number: CN113420263B
Application number: CN202110739424.XA
Authority: CN
Inventors: 彭阳; 杨浩; 封磊; 廖伟达; 严海林; 芦华楠
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2023-08-04
Anticipated expiration: 2041-06-30
Also published as: CN113420263A

Abstract

The disclosure provides a data statistics method, apparatus, device and storage medium, relates to the technical field of data processing, and in particular to information flow technology. The specific implementation scheme is as follows: sequentially determining the numerical codes corresponding to the pieces of identification information in a pulled service data stream; storing each numerical code in a bitmap to obtain target bitmap data; and determining the quantity of identification information for at least one statistical period according to the target bitmap data. The technology of the disclosure improves the accuracy of the statistical result and the statistical efficiency, while reducing the amount of data storage space occupied.

Description

Data statistics method, device, equipment and storage medium

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular, to information flow technologies.

Background

With the development of information technology, the volume of business data that depends on the Internet has exploded. Such business data is very important: it directly affects the normal operation of related Internet services and also provides guidance for the operation and management of those services.

Disclosure of Invention

The present disclosure provides a data statistics method, apparatus, device and storage medium.

According to an aspect of the present disclosure, there is provided a data statistics method including:

sequentially determining numerical codes corresponding to all the identification information in the pulled service data stream;

storing each numerical code in a bitmap to obtain target bitmap data;

and determining the identification information quantity of at least one statistical period according to the target bitmap data.

According to another aspect of the present disclosure, there is also provided a data statistics apparatus including:

the numerical code determining module is used for sequentially determining numerical codes corresponding to all the identification information in the pulled service data stream;

the target bitmap data obtaining module is used for storing each numerical code in a bitmap to obtain target bitmap data;

and the identification information statistics module is used for determining the identification information quantity of at least one statistical period according to the target bitmap data.

According to another aspect of the present disclosure, there is also provided an electronic apparatus including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the data statistics methods provided by the embodiments of the present disclosure.

According to another aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform any one of the data statistics methods provided by the embodiments of the present disclosure.

According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements any of the data statistics methods provided by the embodiments of the present disclosure.

According to the technology disclosed by the invention, the accuracy and the statistics efficiency of the statistics result are improved, and meanwhile, the occupied amount of the data space is reduced.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a flow chart of a data statistics method provided by an embodiment of the present disclosure;

FIG. 2 is a flow chart of another data statistics method provided by an embodiment of the present disclosure;

FIG. 3 is a flow chart of another data statistics method provided by an embodiment of the present disclosure;

FIG. 4 is a flow chart of another data statistics method provided by an embodiment of the present disclosure;

FIG. 5 is a flow chart of another data statistics method provided by an embodiment of the present disclosure;

FIG. 6A is a block diagram of a data statistics system provided by an embodiment of the present disclosure;

FIG. 6B is a diagram of a numerical encoding process for representing information provided by an embodiment of the present disclosure;

FIG. 7 is a block diagram of a data statistics apparatus provided by an embodiment of the present disclosure;

FIG. 8 is a block diagram of an electronic device for implementing a data statistics method of an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the prior art, when service data generated on the Internet is counted, it is generally necessary to store the full set of detail data and to count the service data by high-precision calculation or approximate calculation based on the stored detail data. However, high-precision calculation has low computational efficiency, while approximate calculation gives up part of the calculation precision in order to meet the timeliness requirement. The prior-art schemes therefore cannot achieve both calculation accuracy and timeliness. In addition, data statistics performed this way occupy a large amount of storage space, which wastes storage resources.

The data statistics method and device provided by the embodiment of the disclosure are suitable for the situation of carrying out statistics on the real-time flow of the service data generated in the Internet, so that the purpose of considering accuracy and timeliness of the statistics result is achieved, and meanwhile, the occupied amount of storage space is reduced. The data statistics methods provided by the present disclosure may be performed by a data statistics apparatus, which may be implemented in software and/or hardware, and specifically configured in an electronic device. The electronic device may be a single server device or a cluster of servers constructed from at least two servers.

For ease of understanding, the present disclosure first describes each data statistics method in detail.

Referring to fig. 1, a data statistics method includes:

S101, sequentially determining numerical codes corresponding to all the identification information in the pulled service data stream.

The service data stream is usually real-time stream data generated upstream, and may be, for example, service data generated by a monitored website, an application program, or the like.

In order to effectively handle data congestion during peak periods, the real-time stream data generated upstream can be stored in message middleware so as to decouple the data generating end from the data processing end. Correspondingly, each message middleware can be used directly as a data input source from which the service data stream is pulled, which effectively relieves the processing pressure during peak periods and allows the load on the subsequent data processing to be controlled. Message middleware may include, but is not limited to, Kafka, HDFS (Hadoop Distributed File System), and the like.

The identification information may be an account identifier that uniquely characterizes the data producer, and is typically a character string of a set length. The numerical code is a number of a set size expressed in a set numeral base, and may be, for example, a decimal number.

It can be appreciated that by generating a one-to-one correspondence of the respective identification information to the numerical code, the numerical code is also provided with a unique characteristic of the data producer.

If the data statistics method is executed by the server cluster, in an alternative embodiment, at least one preset primary server may pull the service data stream from each data input source, and distribute the pulled data stream to at least one secondary server in the server cluster, so as to execute subsequent operations such as numerical code generation. It can be appreciated that by dividing each server in the server cluster into a primary server and a secondary server by function, both perform their own roles, which helps to improve overall processing efficiency.

If the data statistics method is executed by the server cluster, in another alternative embodiment, each server in the server cluster may respectively pull the service data stream from each data input source, and perform subsequent operations such as numerical code generation. It will be appreciated that management and maintenance of servers is facilitated by the indiscriminate treatment of each server in a server cluster.

Since the same identification information may generate a plurality of pieces of service data, the service data of the same identification information is distributed to different servers, which will bring about an increase in calculation amount, affect the calculation efficiency, and also affect the accuracy of the statistical result of the subsequent identification information due to repeated statistics of the data. In order to avoid the above situation, in an alternative embodiment, each piece of service data in the service data stream pulled for a single time may be further grouped according to the identification information, so that service data with the same identification information are located in the same group; the same set of service data is distributed to the executors of the same server to determine the numerical code of the corresponding identification information.

Optionally, whether the identification information of each service data in the service data stream is the same or not may be determined by comparing the identification information with each other, so as to implement grouping of each service data in the service data stream. In order to simplify the calculation, optionally, hash values of identification information of each piece of service data in the service data stream can be calculated, and whether the identification information of each piece of service data is the same or not is determined by comparing the hash values, so that grouping of each piece of service data in the service data stream is realized.
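The hash-based grouping described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the record layout (a dict with an `id` field), the function names `executor_for` and `group_batch`, and the use of SHA-256 as the hash are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def executor_for(identifier: str, num_executors: int) -> int:
    """Map an identifier to an executor index using a stable hash.

    SHA-256 is an arbitrary choice here; any hash that is stable across
    processes would serve the same purpose.
    """
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_executors


def group_batch(batch, num_executors):
    """Group one pulled batch so records with equal identifiers share a group."""
    groups = defaultdict(list)
    for record in batch:
        groups[executor_for(record["id"], num_executors)].append(record)
    return dict(groups)


# Hypothetical batch: two records belong to the same account.
batch = [
    {"id": "user-a", "event": "click"},
    {"id": "user-b", "event": "view"},
    {"id": "user-a", "event": "view"},
]
groups = group_batch(batch, num_executors=4)
```

Because the executor index depends only on the identifier, every record of a given account lands in the same group regardless of which server pulled it.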

The numerical code corresponding to each piece of identification information in the service data stream may be determined, for example, by a global identification generator that encodes each piece of identification information to obtain its numerical code, so as to prevent the subsequent target bitmap data from being too sparse. The global identification generator can be implemented based on the auto-increment facility of a preset database. The preset database may be set or adjusted by a skilled person as needed or based on experience, which is not limited in this disclosure. Examples include the SEQUENCE object provided by Oracle, or an AUTO_INCREMENT primary key in MySQL (a relational database management system).

S102, storing each numerical code in a bitmap to obtain target bitmap data.

Because the numerical code is numerical data, the storage space of the service data stream can be obviously reduced by a bitmap storage mode, and the numerical statistics accuracy and the statistics efficiency of the identification information can be improved.

For example, the access identifier may be preset, and when the numerical code exists, the data bit corresponding to the numerical code in the target bitmap data is marked as the access identifier. Wherein the access identifier is used for marking the access condition of the numeric code corresponding to the identification information. For ease of computation, access case flags are typically made in a binary fashion, with access being indicated by a "1" and no access being indicated by a "0".
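As a rough illustration of this marking step, the sketch below keeps one bit per numerical code and sets it to 1 when the code is seen. The `Bitmap` class and its method names are hypothetical; a production system would more likely rely on the bitmap type of its storage engine.

```python
class Bitmap:
    """Minimal growable bitmap: bit i is 1 if numerical code i has been seen."""

    def __init__(self):
        self._bits = bytearray()

    def set(self, code: int) -> None:
        """Mark the data bit of `code` with the access identifier 1."""
        byte_index, bit_index = divmod(code, 8)
        if byte_index >= len(self._bits):
            # Grow with zero bytes, i.e. "not accessed" for all new codes.
            self._bits.extend(b"\x00" * (byte_index + 1 - len(self._bits)))
        self._bits[byte_index] |= 1 << bit_index

    def contains(self, code: int) -> bool:
        """Return True if the data bit of `code` is 1."""
        byte_index, bit_index = divmod(code, 8)
        if byte_index >= len(self._bits):
            return False
        return bool(self._bits[byte_index] & (1 << bit_index))

    def size_in_bytes(self) -> int:
        return len(self._bits)


bitmap = Bitmap()
for code in [0, 3, 3, 10]:  # code 3 appears twice but occupies a single bit
    bitmap.set(code)
```

Repeated occurrences of the same code cost no extra space, which is why compact, nearly contiguous numerical codes keep the bitmap small.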

S103, determining the identification information quantity of at least one statistic period according to the target bitmap data.

The statistical period may be preset or adjusted by a technician according to the requirement or an empirical value.

It can be understood that, by setting the access identifier, the target bitmap data records whether the identification information corresponding to each numerical code has accessed the service function, that is, whether service data corresponding to the service function has been generated. The quantity of identification information can therefore be determined by counting bits in the bitmap, which simplifies the calculation and improves the accuracy of the result.
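A minimal sketch of such bitmap counting is given below, using Python integers as bitmaps. The period labels and bit patterns are made-up sample data, and merging periods with a bitwise OR to obtain a deduplicated count over a longer period is an assumption of this example rather than something the text prescribes.

```python
def popcount(bitmap: int) -> int:
    """Count the bits set to 1, i.e. the number of distinct numerical codes."""
    return bin(bitmap).count("1")


# Hypothetical per-period bitmaps; bit i set means code i was seen in the period.
period_bitmaps = {
    "08:00-08:30": 0b01011,  # codes 0, 1, 3
    "08:30-09:00": 0b11001,  # codes 0, 3, 4
}

# Quantity of identification information in each statistical period.
per_period = {period: popcount(bm) for period, bm in period_bitmaps.items()}

# Deduplicated quantity over both periods: OR the bitmaps, then count.
merged = 0
for bm in period_bitmaps.values():
    merged |= bm
hourly = popcount(merged)
```

Here each half-hour period contains three codes, while the merged hour contains four, since codes 0 and 3 occur in both periods and are counted once.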

For example, when it is recognized that the query time has arrived, the quantity of identification information for at least one statistical period is determined according to the target bitmap data. The query time can be set or adjusted by a technician as needed or based on experience.

Optionally, a queryable time may be preset, and if the current time reaches the queryable time, it is determined that the query time has arrived. Alternatively, if it is recognized that a query request has been received, it is determined that the query time has arrived.

In a specific implementation manner, the query request may further include a query field, and accordingly, based on the query field, the identification information amount of at least one statistical period is determined according to the target bitmap data.

In this embodiment of the disclosure, the identification information in the service data stream is converted into numerical codes and stored in a bitmap to generate target bitmap data, and the quantity of identification information for at least one statistical period is determined according to the target bitmap data. The detail data in the service data stream therefore does not need to be stored, which reduces the amount of data storage space occupied. Meanwhile, counting the identification information through the target bitmap data avoids missed or erroneous counts and ensures the accuracy of the statistical result. In addition, replacing complex calculation with statistics over the target bitmap data reduces the amount of computation in the data statistics process and thus also serves statistical efficiency.

Based on the above technical solutions, the present disclosure further provides an optional embodiment, in which the numerical code generation process of the identification information is optimized and improved. In the parts of the disclosure not described in detail, reference may be made to the foregoing embodiments, which are not described in detail herein.

Referring to fig. 2, a data statistics method includes:

S201, for the identification information of each piece of service data in the service data stream, sequentially acquiring a numerical code from a locally stored self-increasing number segment and using it as the numerical code of the identification information of that piece of service data.

The self-increasing number section is a numerical code sequence with a set length read in advance from a preset database. The set length may be set or adjusted by a technician as needed or as experienced, or may be determined repeatedly through a number of tests. The same or different set lengths may be set for the execution devices of different data statistics methods, which is not limited in this disclosure.

When numerical coding is performed based on the preset database, a numerical code is pulled from the preset database each time a piece of service data is processed. When the service data flow surges, the generation efficiency of the numerical codes drops severely, which in turn affects the timeliness of the data statistics result. Therefore, numerical codes of a set length are read from the preset database in advance and stored locally on the execution device of the data statistics method, forming a self-increasing number segment. Correspondingly, during data statistics, the numerical codes in the locally stored self-increasing number segment are read in sequence, which greatly reduces the load on the preset database. Meanwhile, most of the numerical code generation process is migrated to the execution device and carried out locally, which improves the numerical code generation efficiency of each execution device and thereby the data statistics efficiency.

For example, if the set length is m, the frequency of reading the numerical codes from the preset database is reduced from 1 to 1/m, so that the burden of the preset database is reduced, the data statistics efficiency is improved, and the requirement of real-time stream timeliness is ensured.

In an alternative embodiment, after the locally stored self-increasing number segment has been at least partially consumed, the next self-increasing number segment can be read from the preset database, so that numerical code generation is not blocked while waiting for a segment to be read. Preferably, when half of the self-increasing number segment has been consumed, a new self-increasing number segment is read from the preset database and stored locally for subsequent use.
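The segment mechanism can be sketched as below. `FakeDatabase` is a hypothetical in-memory stand-in for the auto-increment facility of the preset database, and `SegmentAllocator` is an illustrative name; the prefetch is done synchronously here for simplicity, whereas a real system would presumably fetch the standby segment in the background.

```python
class FakeDatabase:
    """Hypothetical stand-in for the preset database's auto-increment facility."""

    def __init__(self):
        self._next = 0
        self.reads = 0  # number of round trips to the "database"

    def reserve(self, length: int) -> range:
        """Reserve `length` consecutive codes and return them as a range."""
        self.reads += 1
        start = self._next
        self._next += length
        return range(start, start + length)


class SegmentAllocator:
    """Hands out codes from a local segment and prefetches the next one at half."""

    def __init__(self, db: FakeDatabase, length: int):
        self._db = db
        self._length = length
        self._current = iter(db.reserve(length))
        self._consumed = 0
        self._standby = None

    def next_code(self) -> int:
        if self._consumed == self._length:
            # Current segment exhausted: switch to the prefetched standby segment.
            self._current = iter(self._standby)
            self._standby = None
            self._consumed = 0
        code = next(self._current)
        self._consumed += 1
        if self._standby is None and self._consumed >= self._length // 2:
            # Half of the segment is used up: fetch the next segment early.
            self._standby = self._db.reserve(self._length)
        return code


db = FakeDatabase()
allocator = SegmentAllocator(db, length=4)
codes = [allocator.next_code() for _ in range(10)]
```

With a segment length of 4, ten codes are produced with only a handful of database round trips instead of ten, and the codes remain consecutive across segment boundaries.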

Because the service data stream is both time-sensitive and massive, in order to avoid repeated numerical codes, in an alternative embodiment a preset cache area may further be set, in which the identification information of processed service data is stored in correspondence with the generated numerical codes. Correspondingly, for the identification information of each piece of service data in the service data stream, the preset cache area is searched for a numerical code corresponding to that identification information; if one exists, the found numerical code is used directly as the numerical code of the identification information of that piece of service data; otherwise, the numerical code generation operation is triggered.

The preset cache area may include a local cache area of the execution device, or may include a cache database in a preset cache system associated with the execution device.

It should be noted that, by storing the corresponding relation between the identification information and the numerical codes in the preset cache area, the duplication elimination processing of the numerical codes is realized, the occurrence of the generation condition of different numerical codes for the same identification information is avoided, the global uniqueness of the numerical codes is ensured, and a foundation is laid for the accuracy of the data statistics result.
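The lookup-before-generate logic can be sketched as follows. The `CodeAssigner` class is hypothetical, the cache is a plain in-process dict, and `itertools.count()` stands in for whatever actually supplies fresh numerical codes (for example a local number segment).

```python
import itertools


class CodeAssigner:
    """Returns a stable numerical code per identifier, generating one only on a miss."""

    def __init__(self, code_source):
        self._code_source = code_source  # iterator yielding fresh numerical codes
        self._cache = {}                 # identifier -> numerical code

    def code_for(self, identifier: str) -> int:
        code = self._cache.get(identifier)
        if code is None:
            # Cache miss: trigger numerical code generation and remember the result.
            code = next(self._code_source)
            self._cache[identifier] = code
        return code


assigner = CodeAssigner(itertools.count())
codes = [assigner.code_for(i) for i in ["user-a", "user-b", "user-a", "user-c"]]
```

The second occurrence of `user-a` reuses its earlier code, so no fresh code is consumed and each identifier keeps exactly one numerical code.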

In an alternative embodiment, the service data stream may comprise real-time service data of at least one service function. In order to process different service data simultaneously and to reuse the self-increasing number segment within the same execution device, different target bitmap data can be set for different service functions.

S202, storing each numerical code in a bitmap to obtain target bitmap data.

S203, determining the identification information quantity of at least one statistic period according to the target bitmap data.

In this embodiment of the disclosure, the numerical code generation process is refined as follows: for the identification information of each piece of service data in the service data stream, a numerical code is acquired in sequence from the locally stored self-increasing number segment and used as the numerical code of that identification information, where the self-increasing number segment is a numerical code sequence of a set length read in advance from the preset database. This reduces the frequency of reading data from the preset database and lightens its burden. Meanwhile, most of the numerical code generation process is migrated to the execution device of the data statistics method and carried out locally, which improves the numerical code generation efficiency and thereby the data statistics efficiency.

Based on the above technical solutions, the present disclosure also provides an alternative embodiment. In this alternative embodiment, the target bitmap data is refined into conventional bitmap data and out-of-order bitmap data, so that the data statistics result can be repaired when service data arrives out of order, which further improves the accuracy of the data statistics result.

Referring to fig. 3, a data statistics method includes:

S301, sequentially determining numerical codes corresponding to all the identification information in the pulled service data stream.

S302, for each piece of service data in the service data stream, recording the numerical code of the service data in the conventional bitmap data corresponding to the generation timestamp of the service data.

The conventional bitmap data records, for the period corresponding to the generation timestamp, the access identifier of each numerical code, thereby forming a data record of the numerical code. Correspondingly, the out-of-order bitmap data backs up the data records of the numerical codes of service data that arrived before the out-of-order data, so that the statistical result can be corrected.

Aiming at each service data in the service data stream, whether the service data is disordered data is not required to be concerned, and corresponding data records are directly formed in the conventional bitmap data, so that data omission is avoided.

S303, if the service data is out-of-order data, copying into the out-of-order bitmap data the data record, in the conventional bitmap data, of the historical service data whose generation timestamp is later than that of the service data and whose arrival is adjacent to that of the service data.

Out of order data may be understood as traffic data with arrival time stamps exceeding the generation time stamp. Correspondingly, for any business data, whether the business data is out-of-order data can be determined according to the generation time stamp and the arrival time stamp of the business data.

During data statistics, some statistical results are affected by the generation timestamp of the service data, for example, when counting first visits to a service function. To prevent out-of-order arrival of service data, caused by environmental factors such as network delay, from affecting the accuracy of subsequent statistical results, the present disclosure further sets out-of-order bitmap data for out-of-order data, which is used to correct statistical errors caused by the out-of-order data.

S304, for each statistical period, determining the quantity of identification information of the statistical period according to the difference set of the conventional bitmap data and the out-of-order bitmap data of the statistical period.

For a given statistical period, the conventional bitmap data records the numerical codes of the service data, while the out-of-order bitmap data backs up the records of the numerical codes that were recorded earlier and are associated with out-of-order data. Therefore, by determining the difference set of the conventional bitmap data and the out-of-order bitmap data, the numerical codes that are not repeatedly recorded in the statistical period can be determined, and the counted quantity of identification information is corrected accordingly.

For example, suppose account A clicks a set application at 8:00 and clicks it again at 8:45. For network reasons, the service data generated by account A at 8:00 arrives later than the service data generated at 8:45, so the service data generated at 8:45 is processed first and the service data generated at 8:00 is processed afterwards. If no out-of-order correction is performed, the access count of account A is 1 for 8:00-8:30, 1 for 8:30-9:00, and 2 for 8:00-9:00. When out-of-order correction is performed as described above, the numerical codes for 8:00 and 8:45 are recorded in the conventional bitmap data of account A, and the numerical code for 8:45 is recorded in the out-of-order bitmap data. It can thus be determined that the access count of account A is 1 for 8:00-8:30, 0 for 8:30-9:00, and 1 for 8:00-9:00, so the access count statistics are corrected.
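The account A example can be reproduced with the sketch below. Several details are assumptions of this illustration and are not stated in the text: statistical periods are 30-minute slots indexed by minutes since midnight, bitmaps are Python integers, a record is treated as out of order when the same numerical code is already recorded in a later slot, and only the nearest such later slot is copied into the out-of-order bitmap.

```python
from collections import defaultdict

# slot index -> bitmap (Python int); a slot is a 30-minute statistical period
conventional = defaultdict(int)   # keyed by the generation timestamp of each record
out_of_order = defaultdict(int)   # backups of records superseded by late, earlier data


def slot_of(minutes: int, width: int = 30) -> int:
    """Map minutes since midnight to a statistical-period index."""
    return minutes // width


def ingest(code: int, generated_at: int) -> None:
    """Record one piece of service data in arrival order."""
    slot = slot_of(generated_at)
    bit = 1 << code
    # Assumption: the record is out of order if the same code is already
    # present in a later slot of the conventional bitmap data.
    later_slots = [s for s, bm in conventional.items() if s > slot and bm & bit]
    conventional[slot] |= bit  # always recorded, out of order or not
    if later_slots:
        # Copy the nearest later record into the out-of-order bitmap data.
        out_of_order[min(later_slots)] |= bit


def corrected_count(slot: int) -> int:
    """Count the difference set: conventional minus out-of-order."""
    return bin(conventional[slot] & ~out_of_order[slot]).count("1")


ACCOUNT_A = 0                      # numerical code of account A
ingest(ACCOUNT_A, 8 * 60 + 45)     # generated 8:45, arrives first
ingest(ACCOUNT_A, 8 * 60)          # generated 8:00, arrives late

first_half = corrected_count(slot_of(8 * 60))        # 8:00-8:30
second_half = corrected_count(slot_of(8 * 60 + 30))  # 8:30-9:00
whole_hour = first_half + second_half                # 8:00-9:00
```

Under these assumptions the corrected counts match the example: 1 for 8:00-8:30, 0 for 8:30-9:00, and 1 for the whole hour.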

In this embodiment of the disclosure, the target bitmap data is refined to comprise conventional bitmap data and out-of-order bitmap data. The numerical code of each piece of service data is recorded in the conventional bitmap data corresponding to its generation timestamp, and, for out-of-order data, the data record in the conventional bitmap data of the historical service data that was generated later and arrived adjacent to it is copied into the out-of-order bitmap data. The records in the conventional bitmap data are corrected in this way, which avoids repeated counting in the identification information statistics and improves their accuracy.

On the basis of the technical schemes, the present disclosure also provides another alternative embodiment. In this embodiment, the storage condition of the target bitmap data is optimally improved.

Referring to fig. 4, a data statistics method includes:

S401, sequentially determining numerical codes corresponding to all the identification information in the pulled service data stream.

S402, determining the bucket identifier corresponding to each numerical code according to a preset bucket count and identification information estimated data.

S403, according to the bucket identifiers, storing each numerical code in sequence into the corresponding storage bucket to obtain target bitmap data.

S404, determining the identification information quantity of at least one statistic period according to the target bitmap data.

The preset bucket count is the preset number of storage buckets. The identification information estimated data is the quantity of identification information estimated in advance for a set statistical period, and can be obtained by estimation from the identification information data of a set associated history period. The set associated history period may be set by a technician as needed or based on experience, or may be determined through repeated trials. For example, if the statistical period is a certain period in daily-active service data, the set associated history period may be the same period in monthly-active or quarterly-active service data. It should be noted that determining the identification information estimated data from the service data of the set associated history period may be implemented by at least one determination method in the prior art, which is not limited in this disclosure.

The storage bucket may be a data storage structure in a distributed cluster that supports bitmap storage, and may include, for example, MySQL, PALO (Baidu's data warehouse), and the like.

In an alternative embodiment, determining the bucket identifier corresponding to each numerical code according to the preset bucket count and the identification information estimated data may be: determining the ratio of the identification information estimated data to the preset bucket count, and determining from this ratio the number of numerical codes stored in each storage bucket; starting from the smallest numerical code not yet stored and using that storage number as the step size, determining the numerical code segment of each storage bucket; and determining the bucket identifier corresponding to each numerical code according to which numerical code segment the numerical code belongs to.

In another alternative embodiment, determining the bucket identifier corresponding to each numerical code according to the preset bucket count and the identification information estimated data may be: determining the ratio of the identification information estimated data to the preset bucket count; for each numerical code, determining the ratio of the numerical code to that ratio result; and rounding the resulting ratio to obtain the bucket identifier corresponding to the numerical code.

For example, if the numerical codes are 0-5 and the number of storage buckets is 3, then, determined in the above manner, the bucket with identifier 0 stores numerical codes 0 and 1; the bucket with identifier 1 stores numerical codes 2 and 3; and the bucket with identifier 2 stores numerical codes 4 and 5.
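The example above is consistent with dividing each numerical code by the per-bucket capacity and rounding down, which the sketch below implements. Rounding down, guarding against a zero capacity, and clamping codes that exceed the estimate into the last bucket are assumptions of this illustration; the function name `bucket_id` is hypothetical.

```python
def bucket_id(code: int, estimated_ids: int, num_buckets: int) -> int:
    """Return the bucket identifier for a numerical code.

    estimated_ids: estimated quantity of identification information
    num_buckets:   preset bucket count
    """
    # Number of numerical codes each bucket is expected to hold.
    per_bucket = max(1, estimated_ids // num_buckets)
    # Assumption: round down, and keep overflow codes in the last bucket.
    return min(code // per_bucket, num_buckets - 1)


# Codes 0-5 spread over 3 buckets, as in the example.
assignment = {code: bucket_id(code, estimated_ids=6, num_buckets=3) for code in range(6)}
```

Since the result depends only on the code, the estimate, and the bucket count, any execution device computes the same bucket for the same code, and consecutive codes stay together in one bucket.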

It can be understood that, in an application scenario in which at least two execution devices generate numerical codes, the bucket identifier is determined in the above manner using a common calculation method, without regard to which execution device generated the numerical code. This avoids inconsistent bucketing, facilitates subsequent data statistics, and improves statistical efficiency.

It should be noted that, by calculating bucket identifiers, the numerical codes are stored in sequence in their corresponding storage buckets, so that the numerical codes stored in each storage bucket are as contiguous as possible. Because contiguous numerical codes can be queried more efficiently in a bitmap structure, this technical solution helps improve data statistics efficiency when counting the identification information of the service data of a set function.

Based on the above technical solutions, the present disclosure further provides an optional embodiment, in which a storage validation identifier is introduced when storing the numerical code, to indicate validity of the stored numerical code. It should be noted that, in the parts of the disclosure not described in detail, reference may be made to the description of the foregoing embodiments, and the description is omitted here.

Referring to fig. 5, a data statistics method includes:

S501, sequentially determining numerical codes corresponding to all the identification information in the pulled service data stream.

S502, generating a storage validation identifier for the service data stream pulled in a single pull.

The storage validation identifier indicates the validity of each numerical code in the service data stream pulled this time.

It should be noted that the storage validation identifier may be used to distinguish service data streams pulled at different times, and the specific generation mode of the storage validation identifier is not limited in this disclosure. That is, the same storage validation identifier is generated for all identification information in the service data stream of a single pull, and different storage validation identifiers are generated for the identification information in service data streams pulled at different times.

S503, after the numerical codes of all the identification information in the service data stream pulled this time have been stored, storing the storage validation identifier in correspondence with those numerical codes so as to obtain the target bitmap data.

It should be noted that, for each pulled service data stream, the storage validation identifier is stored only after the numerical codes of all identification information in that stream have been stored. When the storage validation identifier has been stored for the stored numerical codes, the numerical codes of the identification information in the service data stream pulled this time are valid; when no storage validation identifier has been stored for the stored numerical codes, the identification information in the service data stream pulled this time is invalid or temporarily invalid.

S504, determining the identification information quantity of at least one statistic period according to the target bitmap data.

The identification information quantity of at least one statistical period is determined according to the numerical codes in the target bitmap data whose storage has taken effect.

According to the embodiment of the disclosure, a storage validation identifier is generated for each pulled service data stream, and the identifier is stored only after the numerical codes of all identification information in the stream pulled this time have been stored. This prevents part of the data from being omitted during storage and improves the accuracy of the statistical result.
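The ordering described in S502 and S503 can be illustrated with a minimal sketch. The class below is a toy in-memory stand-in for a storage bucket, and every name in it is hypothetical; the disclosure does not limit how the identifier is generated, so a random UUID is assumed here.

```python
import uuid


class BucketStore:
    """Toy in-memory stand-in for one storage bucket (hypothetical)."""

    def __init__(self):
        self.codes_by_marker = {}     # identifier -> numerical codes written under it
        self.stored_markers = set()   # identifiers stored after a complete write

    def store_pull(self, codes):
        marker = uuid.uuid4().hex     # assumed generation mode: one identifier per pull
        written = self.codes_by_marker.setdefault(marker, set())
        for code in codes:
            written.add(code)         # first store every numerical code
        self.stored_markers.add(marker)  # only then store the identifier
        return marker

    def valid_codes(self):
        """Codes whose pull has a stored identifier; all others are ignored."""
        valid = set()
        for marker in self.stored_markers:
            valid |= self.codes_by_marker[marker]
        return valid


store = BucketStore()
store.store_pull([0, 1, 2])
store.codes_by_marker["interrupted-pull"] = {7}  # simulated crash before the identifier
valid = sorted(store.valid_codes())
print(valid)  # [0, 1, 2]
```

Because the identifier is written last, a pull that is interrupted part-way leaves codes that can be recognized as not yet valid.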

Based on the above technical solutions, if the data statistics system becomes abnormal and a numerical code without a storage validation identifier is identified in a storage bucket, the previously pulled service data stream was not completely stored. That service data stream may be discarded directly, or its pulling operation may be re-executed. Anomalies of the data statistics system may include, but are not limited to, equipment power failure, downtime, and restart.

However, directly discarding the previously pulled service data stream inevitably causes that stream's service data to be missing from storage, while directly re-executing its pulling operation may store the already stored part of the data a second time. Therefore, both approaches affect the accuracy of the data statistics result when the data statistics system is abnormal.

In an alternative embodiment, if the data statistics system is abnormal and a numerical code without a storage validation identifier is identified in a storage bucket, the numerical code determining operation for each piece of identification information in the previous service data stream is rolled back and re-executed. This avoids both repeated storage of stored data and omission of unstored data, and guarantees the accuracy of the data statistics result when the data statistics system is abnormal.

In a specific implementation, the service data stream can be pulled and consumed again based on a checkpoint (Checkpoint) mechanism; by introducing the storage validation identifier and relying on idempotence, repeated consumption is prevented from affecting the stored data, so that the data in the data input source and the data in the target bitmap data remain consistent.
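A minimal sketch of such a recovery is given below under simplifying assumptions: the checkpoint is modeled as a plain index into a list of pulls, a Python integer stands in for the bitmap, and all names are hypothetical. It is not the API of any particular streaming engine.

```python
def set_bits(bitmap, codes):
    """Set one bit per numerical code; repeating a write leaves the bitmap unchanged."""
    for code in codes:
        bitmap |= 1 << code
    return bitmap


def recover(state, pulls, checkpoint):
    """Delete codes stored without an identifier, then replay pulls from the checkpoint."""
    state["unmarked"] = 0
    for codes in pulls[checkpoint:]:
        state["valid"] = set_bits(state["valid"], codes)
    return state


pulls = [[0, 1], [2, 3, 4]]                  # two pulls from the input source
state = {
    "valid": set_bits(0, pulls[0]),          # pull 0 was stored with its identifier
    "unmarked": set_bits(0, [2]),            # pull 1 crashed after storing code 2
}
state = recover(state, pulls, checkpoint=1)
once = state["valid"]
state = recover(state, pulls, checkpoint=1)  # a second replay is harmless
print(bin(once).count("1"), once == state["valid"])  # 5 True
```

Setting a bit that is already set changes nothing, which is why replaying a pull cannot inflate the count.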

Based on the above technical solutions, the present disclosure further provides a preferred embodiment for implementing the data statistics method. The preferred embodiment is particularly applicable to the scenario of counting the number of unique visitors (UV) that access a set function of a set application for the first time.

For ease of understanding, referring to fig. 6A, a detailed description will be first given of the specific structure of the data statistics system employed in this embodiment.

The data statistics system includes a data input source 10, a real-time computation engine 20, a caching system 30, and a bitmap storage system 40.

The data input source 10 is provided with at least one message middleware (for example, Kafka, HDFS, etc.) for storing, in order, the service data generated by the data generator, so as to decouple the real-time computing engine 20 from the data generator and to cope with data congestion at data peaks.

The real-time computing engine 20 is provided with at least one executor for pulling the service data stream from the data input source 10 and processing each piece of service data in the service data stream.

The processing of the service data may be, for example, extracting, cleaning, format converting, etc. the service data, so as to select the service data and the identification information of the corresponding dimension required by the service party.

The service data are grouped according to the hash value of each piece of identification information in the service data stream, so that service data corresponding to the same identification information are located in the same group; service data of the same group are distributed to the same executor in the real-time computing engine 20.

Alternatively, the hash value of each piece of identification information may be taken modulo the number of executors in the real-time computing engine 20, and the service data may be grouped according to the result.
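The modulo grouping can be sketched as follows. The record layout and function names are hypothetical, and SHA-256 is an assumed choice of hash; it is used because Python's built-in hash() of strings is randomized per process and would not give a stable grouping.

```python
import hashlib
from collections import defaultdict


def executor_index(identifier, num_executors):
    """Map an identifier to an executor: stable hash modulo the executor count."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_executors


def group_records(records, num_executors):
    """Group service data so that equal identifiers land in the same group."""
    groups = defaultdict(list)
    for record in records:
        groups[executor_index(record["id"], num_executors)].append(record)
    return groups


records = [
    {"id": "user-a", "page": 1},
    {"id": "user-b", "page": 2},
    {"id": "user-a", "page": 3},
]
groups = group_records(records, 4)
```

Since both records of "user-a" reach the same executor, only that executor ever needs to decide whether a numerical code for "user-a" already exists.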

The cache system 30 is provided with a first database (such as MySQL) and a second database (such as Redis). The first database cooperates with each executor in the real-time computing engine 20 to generate the numerical code (i.e., global ID) of each piece of identification information, and the second database is used for storing the correspondence between the generated numerical codes and the identification information.

For each piece of service data, the executor queries the local cache and the second database for the identification information of the service data, and if no numerical code corresponding to the identification information is found, the executor triggers the numerical code generation operation.
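The lookup order can be sketched as follows, with plain dictionaries standing in for the local cache and the second database and a counter standing in for the generation operation; all names are hypothetical.

```python
import itertools


def resolve_code(identifier, local_cache, second_db, generate_code):
    """Return the numerical code for an identifier, generating one only on a double miss."""
    code = local_cache.get(identifier)
    if code is None:
        code = second_db.get(identifier)   # stand-in for a key-value lookup
    if code is None:
        code = generate_code()             # triggers the generation operation
        second_db[identifier] = code       # store the new correspondence
    local_cache[identifier] = code
    return code


counter = itertools.count()
cache, db = {}, {}
first = resolve_code("user-a", cache, db, lambda: next(counter))
second = resolve_code("user-b", cache, db, lambda: next(counter))
again = resolve_code("user-a", cache, db, lambda: next(counter))
print(first, second, again)  # 0 1 0
```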

Wherein, the bitmap storage system 40 is provided with at least one storage bucket supporting a bitmap structure for sequentially storing the generated numerical codes.

The executor in the real-time computing engine 20 also performs statistics of the identification information from the bitmap storage system according to the query requirement.

For ease of understanding, in an alternative embodiment, the numerical encoding process of the identification information will be described in detail in connection with fig. 6B.

The executor is provided with a number-fetching thread, which reads a self-increasing number segment of a set length from the first database and stores it locally. For each piece of service data, the executor sequentially reads a numerical code from the locally stored self-increasing number segment and establishes the correspondence between that numerical code and the identification information of the service data. When the unconsumed part of the self-increasing number segment in the executor meets a set proportion (for example, less than 1/2 of the set length), the number-fetching thread is called to read a further self-increasing number segment of the set length from the first database.
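A simplified sketch of this number-segment scheme follows. It is an assumption-laden illustration: the first database is replaced by an in-memory counter, the refill runs synchronously rather than in a separate number-fetching thread, and all class and method names are hypothetical.

```python
from collections import deque


class FirstDatabase:
    """In-memory stand-in for the self-increasing source (hypothetical)."""

    def __init__(self):
        self.next_start = 0

    def reserve(self, length):
        start = self.next_start
        self.next_start += length          # the next caller gets a disjoint segment
        return start


class SegmentAllocator:
    """Hands out numerical codes from locally held self-increasing number segments."""

    def __init__(self, reserve, segment_length):
        self.reserve = reserve
        self.segment_length = segment_length
        self.codes = deque()               # unconsumed codes held locally
        self._refill()

    def _refill(self):
        start = self.reserve(self.segment_length)
        self.codes.extend(range(start, start + self.segment_length))

    def next_code(self):
        code = self.codes.popleft()
        if len(self.codes) < self.segment_length / 2:  # set proportion from the example
            self._refill()
        return code


database = FirstDatabase()
executor_a = SegmentAllocator(database.reserve, 4)
executor_b = SegmentAllocator(database.reserve, 4)
codes_a = [executor_a.next_code() for _ in range(5)]
code_b = executor_b.next_code()
print(codes_a, code_b)  # [0, 1, 2, 3, 8] 4
```

The two executors never hand out the same numerical code, and each contacts the database once per segment rather than once per code.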

For example, in a multi-service statistics scenario, multiplexing of the same numerical codes may also be implemented for different services.

In an alternative embodiment, the bucketed storage process of the numerical codes will be described in detail with continued reference to FIG. 6A.

Illustratively, the bucket identity for each numerical code is determined using the following formula:

tid = rounding(V / (M / N));

wherein tid is bucket identification, V is the numerical code of identification information, M is the estimated identification information quantity, and N is the storage bucket quantity.

Correspondingly, each numerical code is stored in the storage bucket corresponding to the determined bucket identifier, so that the numerical codes are tightly and continuously stored.
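As an illustrative sketch only, bucketed bitmap storage using this formula might look as follows. Python integers stand in for the bitmaps, rounding is assumed to be rounding down, and storing each code at an offset relative to the start of its bucket is an assumption not stated in the source; all names are hypothetical.

```python
class BucketedBitmap:
    """N bitmaps, each holding one contiguous range of numerical codes."""

    def __init__(self, estimated_total, num_buckets):
        self.per_bucket = estimated_total // num_buckets  # M / N
        self.buckets = [0] * num_buckets                  # one int per bitmap

    def add(self, code):
        tid = code // self.per_bucket           # tid = rounding(V / (M / N))
        offset = code - tid * self.per_bucket   # assumed: position inside the bucket
        self.buckets[tid] |= 1 << offset

    def count(self):
        """Total number of distinct numerical codes across all buckets."""
        return sum(bin(bitmap).count("1") for bitmap in self.buckets)


bitmap = BucketedBitmap(estimated_total=6, num_buckets=3)
for code in [0, 1, 4, 4]:
    bitmap.add(code)
print(bitmap.buckets, bitmap.count())  # [3, 0, 1] 3
```

A code larger than the estimate M would fall outside the bucket list in this sketch; a real system would need to handle that case.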

Because the service data is real-time stream data, a crash or restart of the data statistics system affects the accuracy of the data statistics result. To cope effectively with such anomalies, the executor adopts a Checkpoint mechanism to roll back and re-execute the numerical code determining operation for each piece of identification information in the previously processed service data stream. Meanwhile, a storage validation identifier is introduced when the numerical codes of the identification information in a pulled service data stream are stored: if and only if all numerical codes of the pulled service data stream have been stored is the storage validation identifier of that stream stored, indicating that all numerical codes of the pulled service data stream have taken effect. Repeated consumption under system anomalies is thus made safe through idempotence, and end-to-end data consistency is ensured. It should be noted that, when the system is abnormal, stored numerical codes without a storage validation identifier are deleted.

Due to environmental factors such as network delay, service data may reach the executor out of order, that is, service data generated earlier may arrive later, which affects the accuracy of the statistical result. In view of this, in an alternative embodiment, a regular bitmap storage part is provided at the time of numerical code storage for storing the numerical codes of the respective service data, and an out-of-order bitmap storage part is also provided for correcting the data of out-of-order service data, so that the statistical result in an out-of-order scenario is corrected and its accuracy is further improved.

For each piece of service data, after determining its numerical code, the executor records the numerical code in the regular bitmap storage part at the position corresponding to the generation timestamp of that service data; if the piece of service data is out-of-order data, the data records in the regular bitmap storage part of the historical service data whose generation timestamps fall after the generation timestamp of this piece of service data, up to the most recently arrived service data, are copied into the out-of-order bitmap storage part.

Correspondingly, when the identification information quantity of a set statistical period is counted, it is counted according to the difference set of the regular bitmap storage part and the out-of-order bitmap storage part within that period.
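The following sketch shows one possible reading of this correction, not a definitive implementation. It assumes that the statistic is the number of first visits per time slot, that the records copied to the out-of-order part are those of the same numerical code, and that a helper map of each code's current first slot is kept; the class, its fields, and the slot granularity are all hypothetical.

```python
from collections import defaultdict


class FirstVisitCounter:
    """Regular and out-of-order bitmaps keyed by generation-time slot."""

    def __init__(self):
        self.regular = defaultdict(int)       # slot -> bitmap of recorded codes
        self.out_of_order = defaultdict(int)  # slot -> bitmap of superseded records
        self.first_slot = {}                  # assumed helper: code -> slot of its record

    def record(self, code, generated_slot):
        bit = 1 << code
        known = self.first_slot.get(code)
        if known is not None and generated_slot >= known:
            return                            # not earlier than the known first visit
        self.regular[generated_slot] |= bit   # record at the generation timestamp
        if known is not None:
            # Out-of-order arrival: back up the later record that was stored first.
            self.out_of_order[known] |= self.regular[known] & bit
        self.first_slot[code] = generated_slot

    def count(self, slot):
        """Size of the difference set: regular minus out-of-order for the slot."""
        difference = self.regular[slot] & ~self.out_of_order[slot]
        return bin(difference).count("1")


counter = FirstVisitCounter()
counter.record(code=7, generated_slot=5)
before = counter.count(5)
counter.record(code=7, generated_slot=3)      # generated earlier, arrived later
after = (counter.count(3), counter.count(5))
print(before, after)  # 1 (1, 0)
```

Under this reading the regular bitmap is never rewritten; the late arrival is corrected purely by what the out-of-order bitmap subtracts at query time.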

Based on the above technical solutions, the present disclosure further provides an optional embodiment of an apparatus for implementing the above data statistics method.

Referring to fig. 7, a data statistics apparatus 700 includes: a numerical code determination module 701, a target bitmap data obtaining module 702, and an identification information statistics module 703. Wherein,

the numerical code determining module 701 is configured to sequentially determine numerical codes corresponding to each piece of identification information in the pulled service data stream;

a target bitmap data obtaining module 702, configured to store each numerical value encoding bitmap, so as to obtain target bitmap data;

The identification information statistics module 703 is configured to determine, according to the target bitmap data, an identification information amount of at least one statistics period.

In an alternative embodiment, the numerical code determination module 701 includes:

the numerical code determining unit is used for sequentially acquiring numerical codes in the locally stored self-increasing number segments aiming at the identification information of each piece of service data in the service data stream, and taking the numerical codes as the numerical codes of the identification information of the piece of service data;

the self-increasing number section is a numerical code sequence with a set length read in advance from a preset database.

In an alternative embodiment, the apparatus further comprises:

the numerical code searching module is used for searching whether numerical codes corresponding to the identification information of each piece of service data exist in a preset cache area aiming at the identification information of each piece of service data in the service data flow;

the numerical code generation module is used for, if so, directly taking the found numerical code as the numerical code of the identification information of the piece of service data; otherwise, triggering the numerical code generation operation.

In an alternative embodiment, the target bitmap data includes regular bitmap data and out-of-order bitmap data;

the target bitmap data obtaining module 702 includes:

a regular bitmap data generating unit, configured to record, for each piece of service data in the service data stream, the numerical code of the service data in the regular bitmap data at the position corresponding to the generation timestamp of the service data;

an out-of-order bitmap data generating unit, configured to copy into the out-of-order bitmap data, if the service data is out-of-order data, the data records in the regular bitmap data of the historical service data located after the generation timestamp, up to the most recently arrived service data;

the identification information statistics module 703 includes:

and the identification information statistics unit is used for determining, for each statistical period, the identification information quantity of the statistical period according to the difference set of the regular bitmap data and the out-of-order bitmap data of the statistical period.

In an alternative embodiment, the target bitmap data obtaining module 702 includes:

the bucket identifier generating unit is used for determining the bucket identifier corresponding to each numerical code according to the preset number of buckets and the estimated amount of identification information;

and the bucket storage unit is used for sequentially storing the numerical codes into the corresponding storage buckets according to the bucket identifiers, so as to obtain the target bitmap data.

In an alternative embodiment, the target bitmap data obtaining module 702 includes:

the storage validation identifier generation unit is used for generating a storage validation identifier for the service data stream pulled in a single pull;

and the storage validation identifier storage unit is used for storing the storage validation identifier in correspondence with each numerical code after the numerical codes of all identification information in the service data stream pulled this time have been stored, so as to obtain the target bitmap data.

In an alternative embodiment, the target bitmap data obtaining module 702 further includes:

and the abnormal rollback unit is used for rolling back and re-executing the numerical code determining operation for each piece of identification information in the previous service data stream if the data statistics system is abnormal and a numerical code without a storage validation identifier is identified in a storage bucket.

In an alternative embodiment, the numerical code determination module 701 includes:

a service data stream grouping unit, configured to group each piece of service data in the service data stream of a single pull according to the identification information, so that service data with the same identification information are located in the same group;

and the service data distribution unit is used for distributing the service data of the same group to the same executor so as to determine the numerical code of the corresponding identification information.

The data statistics device can execute the data statistics method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of executing the data statistics method.

In the technical solution of the disclosure, the acquisition, storage, and application of the service data streams involved all comply with the relevant laws and regulations and do not violate public order and good customs.

According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.

Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as the data statistics method. For example, in some embodiments, the data statistics method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When the computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the data statistics method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the data statistics method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome. The server may also be a server of a distributed system or a server that incorporates a blockchain.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions provided by the present disclosure are achieved, and no limitation is imposed herein.

The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims

1. A method of data statistics, comprising:

storing each numerical value coding bitmap to obtain target bitmap data;

determining the number of identification information of at least one statistical period according to the target bitmap data;

wherein the target bitmap data comprises regular bitmap data and out-of-order bitmap data;

storing each numerical coding bitmap to obtain target bitmap data, wherein the method comprises the following steps:

for each piece of service data in the service data stream, recording the numerical code of the service data in the regular bitmap data at the position corresponding to the generation timestamp of the service data;

if the service data is out-of-order data, copying, into the out-of-order bitmap data, the data records in the regular bitmap data of the historical service data located after the generation timestamp, up to the most recently arrived service data;

the determining the number of the identification information of at least one statistical period according to the target bitmap data comprises the following steps:

for each statistical period, determining the number of the identification information of the statistical period according to the difference set of the regular bitmap data and the out-of-order bitmap data of the statistical period;

the regular bitmap data is used for recording an access identifier of each numerical code at the corresponding generation timestamp so as to form the data record of the corresponding numerical code; the out-of-order bitmap data is used for backing up the data records of the numerical codes of service data that arrived before the out-of-order data, so as to correct the statistical result; the out-of-order data is service data whose arrival timestamp exceeds its generation timestamp.

2. The method of claim 1, wherein the sequentially determining the numerical codes corresponding to the identification information in the pulled service data stream includes:

for the identification information of each piece of service data in the service data stream, sequentially acquiring a numerical code from a locally stored self-increasing number segment as the numerical code of the identification information of the piece of service data;

the self-increasing number section is a numerical code sequence with a set length which is read in advance from a preset database.

3. The method of claim 2, further comprising:

for the identification information of each piece of service data in the service data stream, searching whether a numerical code corresponding to the identification information of the piece of service data exists in a preset cache area;

if so, directly taking the found numerical code as the numerical code of the identification information of the piece of service data; otherwise, triggering the numerical code generation operation.

4. A method according to any one of claims 1-3, wherein storing each numerical encoding bitmap to obtain target bitmap data comprises:

determining barrel identifications corresponding to the numerical codes according to preset barrel dividing numbers and identification information pre-estimated data;

and according to the bucket identification, sequentially storing the numerical codes into corresponding storage buckets to obtain the target bitmap data.

5. A method according to any one of claims 1-3, wherein storing each numerical encoding bitmap to obtain target bitmap data comprises:

generating a storage validation identifier for the service data stream pulled in a single pull;

after the numerical codes of all the identification information in the service data stream pulled for the time are stored, storing the storage effective identification corresponding to each numerical code so as to obtain the target bitmap data;

wherein the storage validation identifier is used for indicating the validity of each numerical code in the pull service data stream.

6. The method of claim 5, further comprising:

if the data statistics system is abnormal and a numerical code without the storage validation identifier is identified in a storage bucket, rolling back and re-executing the numerical code determining operation for each piece of identification information in the previous service data stream.

7. A method according to any one of claims 1 to 3, wherein said sequentially determining numerical codes corresponding to respective identification information in the pulled service data stream comprises:

grouping each piece of service data in the service data stream pulled in a single pull according to the identification information, so that service data with the same identification information are in the same group;

assigning the service data of the same group to the same executor to determine the numerical code of the corresponding identification information.
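By way of non-limiting illustration, grouping by identification information and routing each group to one executor might be sketched in Python as follows. The record layout (a dictionary with an `"id"` key), the function name, and the use of a CRC32 hash to pick an executor index are assumptions made for this sketch.

```python
import zlib
from collections import defaultdict

def assign_to_executors(records, num_executors: int):
    """Route records so that equal identifiers always reach the same executor."""
    assignments = defaultdict(list)  # executor index -> records it will process
    for rec in records:
        ident = rec["id"]  # hypothetical field holding the identification information
        # A stable hash keeps the mapping identical across processes and runs.
        idx = zlib.crc32(ident.encode("utf-8")) % num_executors
        assignments[idx].append(rec)
    return dict(assignments)
```

Since a given identifier is only ever handled by one executor, two executors cannot generate different numerical codes for it concurrently.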

8. A data statistics apparatus, comprising:

an identification information statistics module, configured to determine the number of identification information of at least one statistics period according to the target bitmap data;

the target bitmap data obtaining module includes:

a regular bitmap data generating unit, configured to record, for each service data in the service data stream, a numerical code of the service data in the regular bitmap data corresponding to a generation timestamp of the service data;

an out-of-order bitmap data generating unit, configured to, if the service data is out-of-order data, copy into the out-of-order bitmap data the data records, in the regular bitmap data, of the historical service data located after the generation timestamp and up to the historical service data that arrived adjacent to the service data;

the identification information statistics module comprises:

an identification information statistics unit, configured to determine, for each statistics period, the number of identification information of the statistics period according to a difference set of the regular bitmap data and the out-of-order bitmap data of the statistics period;
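By way of non-limiting illustration, the difference-set step alone might be sketched in Python as follows, with each bitmap held in a Python integer. The function names are hypothetical, and how the two bitmaps are populated for a statistics period is outside this sketch.

```python
def bitmap_from_codes(codes) -> int:
    """Build an integer bitmap with one set bit per numerical code."""
    bitmap = 0
    for code in codes:
        bitmap |= 1 << code
    return bitmap

def count_for_period(regular: int, out_of_order: int) -> int:
    """Count codes present in the regular bitmap but not in the out-of-order bitmap."""
    difference = regular & ~out_of_order  # set difference on bit positions
    return bin(difference).count("1")
```

For example, a regular bitmap holding codes 0, 1, 2 and 5 and an out-of-order bitmap holding codes 1, 5 and 7 leave codes 0 and 2 in the difference set, giving a count of 2.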

9. The apparatus of claim 8, wherein the numerical code determination module comprises:

10. The apparatus of claim 9, further comprising:

11. The apparatus according to any one of claims 8-10, wherein the target bitmap data obtaining module comprises:

a bucket identifier generating unit, configured to determine the bucket identifier corresponding to each numerical code according to a preset number of buckets and estimated data for the identification information;

and a bucketed storage unit, configured to sequentially store the numerical codes into the corresponding storage buckets according to the bucket identifiers so as to obtain the target bitmap data.

12. The apparatus according to any one of claims 8-10, wherein the target bitmap data obtaining module comprises:

a storage validity identifier storage unit, configured to store the storage validity identifier corresponding to each numerical code after the numerical codes of all the identification information in the service data stream pulled in that pull have been stored, so as to obtain the target bitmap data;

13. The apparatus of claim 12, wherein the target bitmap data obtaining module further comprises:

14. The apparatus of any of claims 8-10, wherein the numerical code determination module comprises:

15. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a data statistics method according to any of claims 1-7.

16. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a data statistics method according to any of claims 1-7.