CN110851758A

CN110851758A - Webpage visitor number statistical method and device

Info

Publication number: CN110851758A
Application number: CN201911044278.8A
Authority: CN
Inventors: 卢道和; 罗锶; 陈晓峰; 胡思文; 李勇
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2019-10-30
Filing date: 2019-10-30
Publication date: 2020-02-28
Anticipated expiration: 2039-10-30
Also published as: CN110851758B; WO2021082936A1

Abstract

The invention discloses a statistical method and a statistical device for the number of webpage visitors, wherein the method comprises the following steps: acquiring first front-end page behavior data of a webpage in a preset period; determining M bloom filters of the N bloom filters; the M bloom filters are: the bloom filter to which the plurality of pieces of data are mapped in the N bloom filters according to the mapping relation; determining the number of newly added visitors in the preset period according to the M bloom filters and the first front-end page behavior data; the newly added visitor quantity is the visitor quantity which is not counted by the N bloom filters. When the method is applied to financial technology (Fintech), the number of required bloom filters is reduced, the storage occupation is reduced, and the cost is also reduced.

Description

Webpage visitor number statistical method and device

Technical Field

The invention relates to the field of computer software of financial technology (Fintech), in particular to a webpage visitor number statistical method and device.

Background

With the development of computer technology, more and more technologies (big data, distributed computing, blockchain, artificial intelligence, etc.) are applied in the financial field, and the traditional financial industry is gradually shifting to financial technology (Fintech). At present, in the Fintech field, financial products, financial policies and the like are often presented as webpages, and the number of visitors to a webpage reflects, to a certain extent, user interest, product value and the like. Therefore, it is necessary to count the number of visitors to a financial webpage.

The current statistical method extracts and parses the log data of the webpage, loads a bloom filter, and determines whether each piece of log data has already appeared by checking whether the bits of the bloom filter to which it maps are set, thereby counting the number of webpage visitors. However, when the data volume is large, a correspondingly large bloom filter is required, which occupies a large amount of memory. Therefore, in the prior art, the bloom filter occupies a large amount of memory when the number of visitors is counted, which results in high energy consumption and high cost, and this is a problem to be solved urgently.

Disclosure of Invention

The embodiments of the present application provide a webpage visitor number statistical method and device, which solve the prior-art problem that the bloom filter occupies a large amount of memory when the number of visitors is counted.

In a first aspect, an embodiment of the present application provides a webpage visitor number statistical method, including: acquiring first front-end page behavior data of a webpage in a preset period; the first front-end page behavior data is recorded data generated when a visitor clicks the webpage; the first front-end page behavior data comprises a plurality of pieces of data, each piece of data comprises a preset field for uniquely identifying the visitor, and the value of each preset field is within a preset value range; determining M bloom filters of N bloom filters; M and N are positive integers; the N bloom filters are used for recording the counted visitors according to preset fields; the N bloom filters establish, in advance, mapping relations with the preset fields whose values are within the preset value range; each preset field in the preset value range uniquely corresponds to one bloom filter; the M bloom filters are the bloom filters to which the plurality of pieces of data are mapped among the N bloom filters according to the mapping relation; determining the number of newly added visitors in the preset period according to the M bloom filters and the first front-end page behavior data; the number of newly added visitors is the number of visitors not yet counted by the N bloom filters.

In this method, the acquired front-end page behavior data includes a plurality of pieces of data. Because the values of the preset fields all fall within the preset value range, and the N bloom filters are obtained by dividing the preset value range into N segments in advance according to a set rule so that each preset field uniquely corresponds to one bloom filter, the M bloom filters to which the plurality of pieces of data map can always be determined among the N bloom filters according to the set rule. Because the N bloom filters record the preset fields of the visitors who have already visited the front-end page, the number of newly added visitors in the preset period can be determined from the M bloom filters alone, without considering the bloom filters to which none of the pieces of data is mapped. Obviously, M is less than or equal to N.

In an optional implementation, the determining, according to the M bloom filters, the number of newly added visitors in the first front-end page behavior data includes: dividing the plurality of pieces of data into a plurality of groups of data; wherein the data mapped to the same bloom filter in the M bloom filters is a group; aiming at each group of data in the multiple groups of data, determining the number of newly added visitors of the group of data according to the bloom filter mapped to the group of data in the N bloom filters; and taking the sum of the number of newly added visitors of each bloom filter in the M bloom filters as the number of newly added visitors in the first front-end page behavior data.

In this manner, the plurality of pieces of data are divided into groups such that the data mapped to the same bloom filter among the M bloom filters form one group. For each group, the number of newly added visitors is determined according to the bloom filter to which that group is mapped. Because the groups can be counted at the same time, the efficiency of counting the newly added visitors in the first front-end page behavior data is improved.

In an alternative embodiment, the N bloom filters are numbered sequentially from 0 to N-1; the dividing the plurality of pieces of data into a plurality of groups of data includes: acquiring a hash value of each piece of data according to a preset field of each piece of data in the plurality of pieces of data; and calculating the remainder of the hash value of each piece of data in the plurality of pieces of data for N, and taking the data with the same remainder as a group to map the data in each group with the bloom filter with the same number as the remainder.

In this manner, a hash value of each piece of data is obtained from its preset field, and the remainder of that hash value modulo N is calculated. Because hash values are effectively random, the data sharing the same remainder also form a random group, so the plurality of pieces of data are mapped to the bloom filters uniformly and randomly. This provides a method for mapping the plurality of pieces of data to the bloom filters.

In an alternative embodiment, the number of counted visitors is stored in a database; after determining the number of newly added visitors in the preset period according to the M bloom filters, the method further includes: taking the sum of the number of counted visitors and the number of newly added visitors as the real-time statistical visitor number; the real-time statistical visitor number is the number of visitors currently counted by the N bloom filters; and storing the real-time statistical visitor number in the database.

In this manner, the number of counted visitors is stored in the database, so it can be kept for a long time. After the real-time statistical visitor number is obtained, it is also stored in the database, so the updated count is persisted and is not lost when data in the computer's memory is lost.

In an alternative embodiment, the database further stores a first position offset amount of the message middleware; the message middleware is used for storing front-end page behavior data collected from the webpage; the first position offset amount indicates the position, in the message middleware, of second front-end page behavior data acquired from the message middleware; the acquiring of the first front-end page behavior data of the webpage in the preset period includes: obtaining the first position offset amount from the database; starting from the data after the first position offset amount in the message middleware, taking the front-end page behavior data acquired from the message middleware in the preset period as the first front-end page behavior data; and updating the first position offset amount to a second position offset amount according to the first front-end page behavior data, and storing the second position offset amount in the database.

In this manner, the message middleware stores the front-end page behavior data collected from the webpage, which makes that data easy to acquire, and the first position offset amount is stored stably and persistently in the database. Taking the front-end page behavior data acquired from the message middleware in the preset period, starting from the data after the first position offset amount, as the first front-end page behavior data allows the position of the already acquired data in the message middleware to be recorded more stably and accurately. The front-end page behavior data can therefore be acquired accurately, and the real-time statistical visitor number can be determined accurately.

In an alternative embodiment, the first position offset amount is stored in a first data table of the database; after updating the first position offset amount to the second position offset amount and before storing the real-time statistical visitor number in the database, the method further includes: storing the second position offset amount in a second data table of the database; after storing the real-time statistical visitor number in the database, the method further includes: storing the second position offset amount from the second data table in the first data table.

In this manner, the first position offset amount is stored in a first data table of the database. After the first position offset amount is updated to the second position offset amount and before the real-time statistical visitor number is stored in the database, the second position offset amount is first stored in a second data table of the database. Only after the real-time statistical visitor number is stored in the database is the second position offset amount in the second data table stored in the first data table. This avoids the situation in which obtaining the real-time statistical visitor number fails but the first position offset amount has already been overwritten, so the position in the message middleware of the acquired front-end page behavior data can be recorded more stably and accurately.

In a second aspect, the present application provides a webpage visitor number statistical device, including: an acquisition module, used for acquiring first front-end page behavior data of a webpage in a preset period; the first front-end page behavior data is recorded data generated when a visitor clicks the webpage; the first front-end page behavior data comprises a plurality of pieces of data, each piece of data comprises a preset field for uniquely identifying the visitor, and the value of each preset field is within a preset value range; and a processing module, used for determining M bloom filters of N bloom filters; M and N are positive integers; the N bloom filters are used for recording the counted visitors according to preset fields; the N bloom filters establish, in advance, mapping relations with the preset fields whose values are within the preset value range; each preset field in the preset value range uniquely corresponds to one bloom filter; the M bloom filters are the bloom filters to which the plurality of pieces of data are mapped among the N bloom filters according to the mapping relation; the processing module is further used for determining the number of newly added visitors in the preset period according to the M bloom filters and the first front-end page behavior data; the number of newly added visitors is the number of visitors not yet counted by the N bloom filters.

In an optional implementation manner, the processing module is specifically configured to: dividing the plurality of pieces of data into a plurality of groups of data; wherein the data mapped to the same bloom filter in the M bloom filters is a group; aiming at each group of data in the multiple groups of data, determining the number of newly added visitors of the group of data according to the bloom filter mapped to the group of data in the N bloom filters; and taking the sum of the number of newly added visitors of each bloom filter in the M bloom filters as the number of newly added visitors in the first front-end page behavior data.

In an alternative embodiment, the N bloom filters are numbered sequentially from 0 to N-1; the processing module is specifically configured to: acquiring a hash value of each piece of data according to a preset field of each piece of data in the plurality of pieces of data; and calculating the remainder of the hash value of each piece of data in the plurality of pieces of data for N, and taking the data with the same remainder as a group to map the data in each group with the bloom filter with the same number as the remainder.

In an alternative embodiment, the number of counted visitors is stored in a database; the processing module is further configured to: take the sum of the number of counted visitors and the number of newly added visitors as the real-time statistical visitor number; the real-time statistical visitor number is the number of visitors currently counted by the N bloom filters; and store the real-time statistical visitor number in the database.

In an alternative embodiment, the database further stores a first position offset amount of the message middleware; the message middleware is used for storing front-end page behavior data collected from the webpage; the first position offset amount indicates the position, in the message middleware, of second front-end page behavior data acquired from the message middleware; the acquisition module is specifically configured to: obtain the first position offset amount from the database; starting from the data after the first position offset amount in the message middleware, take the front-end page behavior data acquired from the message middleware in the preset period as the first front-end page behavior data; and update the first position offset amount to a second position offset amount according to the first front-end page behavior data, and store the second position offset amount in the database.

In an alternative embodiment, the first position offset amount is stored in a first data table of the database; the processing module is further configured to: after updating the first position offset amount to a second position offset amount and before storing the real-time statistical visitor number in the database, store the second position offset amount in a second data table of the database; and after storing the real-time statistical visitor number in the database, store the second position offset amount from the second data table in the first data table.

For the advantages of the second aspect and the embodiments of the second aspect, reference may be made to the advantages of the first aspect and the embodiments of the first aspect, which are not described herein again.

In a third aspect, an embodiment of the present application provides a computer device, which includes a program or instructions, and when the program or instructions are executed, the computer device is configured to perform the method of the first aspect and of each embodiment of the first aspect.

In a fourth aspect, an embodiment of the present application provides a storage medium, which includes a program or instructions, and when the program or instructions are executed, the program or instructions are configured to perform the method of the first aspect and the embodiments of the first aspect.

Drawings

Fig. 1 is a schematic flowchart illustrating steps of a webpage visitor number statistical method according to an embodiment of the present application;

Fig. 2 is a block diagram illustrating a webpage visitor number statistical method according to an embodiment of the present application;

Fig. 3 is a schematic flowchart illustrating steps of the SparkStreaming-based UV real-time calculation module 204 according to an embodiment of the present application;

Fig. 4 is a schematic flowchart illustrating specific steps from step 302 to step 305 according to an embodiment of the present application;

Fig. 5 is a schematic flowchart illustrating specific steps in step 404 according to an embodiment of the present application;

Fig. 6 is a schematic flowchart illustrating specific steps of processing single-segment data according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a webpage visitor number statistical device according to an embodiment of the present application.

Detailed Description

In order to better understand the technical solutions, the technical solutions will be described in detail below with reference to the drawings and the specific embodiments of the specification, and it should be understood that the specific features in the embodiments and examples of the present application are detailed descriptions of the technical solutions of the present application, but not limitations of the technical solutions of the present application, and the technical features in the embodiments and examples of the present application may be combined with each other without conflict.

Abbreviations and key terms appearing in the following application are first introduced.

Message middleware: used for storing, publishing and subscribing to massive offline data messages. For example, the message middleware may be Kafka, Flume, etc.

Big Data Platform (BDP): a platform for big data storage and analysis. It comprises a distributed computing framework (such as Hadoop), a data warehouse (such as HBase), a data warehouse tool (such as Hive), message middleware, a distributed application coordination service (such as ZooKeeper) and the like. The data warehouse is an important component of a big data platform. For example, the data warehouse may be HBase, a high-reliability, high-performance, column-oriented, scalable distributed storage system belonging to the Hadoop ecosystem; with HBase, large-scale structured storage clusters can be built on inexpensive servers. It is a distributed key-value database used for storing and querying massive data.

Independent visitor (unique visitor, UV): a natural person who accesses and browses a webpage through the internet. The page view count is not equal to the number of independent visitors: the page view count is the number of times the webpage is browsed, whereas the number of independent visitors is the number of distinct visitors accessing the webpage. An independent visitor is recorded only once no matter how many times that visitor browses. In the embodiments of the present application, independent visitors may be distinguished according to a preset field. For example, if visitors are distinguished by Internet Protocol (IP) address, all visits to a webpage from the same IP address are regarded as coming from one independent visitor.
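
The difference between page views and independent visitors can be illustrated with a minimal Python sketch. The click log below is hypothetical, and using the IP address as the preset field simply follows the example given in the definition.

```python
# Hypothetical click log: one preset-field value (here an IP address) per page view.
clicks = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"]

page_views = len(clicks)            # every browse of the page counts
unique_visitors = len(set(clicks))  # each distinct IP address counts once

print(page_views, unique_visitors)  # prints "5 3"
```

An exact set like this grows with the number of distinct visitors, which is why a more compact structure is attractive at large data volumes.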

A streaming processing system: used for processing big data. For example, SparkStreaming is a streaming system that performs high-throughput, fault-tolerant processing of real-time data streams. It can perform complex operations such as joins on data from various sources (e.g., Kafka, Flume, Twitter, ZeroMQ, and Transmission Control Protocol (TCP) sockets) and save the results to an external file system or a database, or feed them to a real-time dashboard.

Bloom filter (BF): a space-efficient probabilistic data structure that uses a bit array to represent a set very compactly and can determine whether an element belongs to the set. It is a fast probabilistic algorithm for testing whether an element exists in a set. Its advantage is that its space efficiency and query time far exceed those of ordinary algorithms; its disadvantage is a certain false positive rate.

Front-end behavioral data toolkit (which may be denoted wa-sdk): a software development kit for collecting and reporting front-end data. It actively collects the front-end operation behaviors of a user by means such as event tracking points, including the user identifier (which may be denoted openid), the user IP address, data stored on the user's local terminal (cookies), code error information, click logs and the like.

Offset (which may be denoted offset): marks the position of the consumed data in Kafka, so that consumption can continue from that position the next time data is consumed.
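
As a rough illustration of the structure just described, the following Python sketch builds a single bloom filter. The class name, the use of SHA-256 with double hashing, and the textbook sizing formulas are assumptions made for illustration; the application does not specify how its filters are constructed.

```python
import hashlib
import math


class BloomFilter:
    """Minimal bloom filter sketch (illustrative, not the application's implementation)."""

    def __init__(self, num_elements: int, false_pos_prob: float):
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(1, math.ceil(-num_elements * math.log(false_pos_prob) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / num_elements * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bf = BloomFilter(num_elements=1000, false_pos_prob=0.01)
bf.add("10.0.0.1")
print("10.0.0.1" in bf)  # prints "True"
```

With these assumed formulas, the bit array grows linearly with the expected number of elements, which is the memory cost that motivates splitting one large filter into several smaller ones.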

The parameters involved in the embodiments of the present application are shown in Table 1:

Parameter	Description
batchDuration	Time interval of SparkStreaming batch submission
bfNumElements	Expected data size of a single BF; parameter for constructing the BF
bfFalsePosProb	False positive rate of a single BF; parameter for constructing the BF
bfnumPartition	Number of data segments
bfnumBfPartition	Number of BF fragments
pvUVTableName	Table in HBase storing the PV and UV results
bfTableName	Table in HBase storing the BFs
offsetsTableName	Final table in HBase storing the Kafka offset
offsetTmpTableName	Intermediate table in HBase storing the Kafka offset

Table 1

In the operation of a financial institution (a banking, insurance or securities institution) conducting business (such as a bank's loan business or deposit business), it is often necessary to count the number of visitors to a financial webpage. In the prior art, when the data volume is large, a large bloom filter is needed to count the number of visitors, so the bloom filter occupies a large amount of memory. This does not meet the requirements of financial institutions such as banks, and the efficient operation of their services cannot be guaranteed.

Therefore, as shown in Fig. 1, an embodiment of the present application provides a webpage visitor number statistical method.

Step 101: acquiring first front-end page behavior data of a webpage in a preset period.

Step 102: determining M bloom filters of the N bloom filters.

M and N are positive integers.

Step 103: determining the number of newly added visitors in the preset period according to the M bloom filters and the first front-end page behavior data.

In step 101, the first front-end page behavior data is recorded data generated when a visitor clicks the webpage; the first front-end page behavior data comprises a plurality of pieces of data, each piece of data comprises a preset field (such as an Internet Protocol (IP) address of a user) for uniquely identifying the visitor, and values of the preset fields are all in a preset value range.

The specific implementation of step 101 may be as follows:

The first position offset amount of the message middleware (e.g., Kafka or Flume) is pre-stored in a database (e.g., HBase). The message middleware is used for storing front-end page behavior data collected from the webpage. The first position offset amount indicates the position, in the message middleware, of the second front-end page behavior data acquired from the message middleware.

The first position offset amount is obtained from the database. Starting from the data after the first position offset amount in the message middleware, the front-end page behavior data acquired from the message middleware in the preset period is taken as the first front-end page behavior data. The first position offset amount is then updated to a second position offset amount according to the first front-end page behavior data, and the second position offset amount is stored in the database.

For example, the database is HBase_1, the message middleware is kafka_1, and the preset period is 5 seconds. The first position offset amount offset_1 of kafka_1 is stored in advance in HBase_1. Starting from the data after offset_1, 5 seconds of front-end page behavior data is read continuously from kafka_1 as the first front-end page behavior data. The last position of the data read is recorded as the second position offset amount offset_2, and offset_2 is stored in HBase_1.
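
The offset handling in this example can be sketched in Python as follows. Everything here is a simplified stand-in: a list plays the role of the message middleware, a dict plays the role of the database, the function name `read_batch` is hypothetical, and the parameter `max_records` substitutes for "whatever arrived during the preset period". The offset is assumed to be the count of records already consumed.

```python
from typing import Dict, List, Tuple


def read_batch(log: List[str], db: Dict[str, int], max_records: int) -> Tuple[List[str], int]:
    """Read the records after the stored offset and persist the new offset."""
    first_offset = db.get("offset", 0)          # offset_1, loaded from the database
    batch = log[first_offset:first_offset + max_records]
    second_offset = first_offset + len(batch)   # offset_2, just past the last record read
    db["offset"] = second_offset                # store offset_2 back in the database
    return batch, second_offset


log = ["rec0", "rec1", "rec2", "rec3", "rec4"]  # hypothetical records in the middleware
db = {"offset": 2}                              # two records were consumed earlier
batch, offset_2 = read_batch(log, db, max_records=2)
print(batch, offset_2)  # prints "['rec2', 'rec3'] 4"
```

A later call would resume from the stored value, so records are neither skipped nor read twice as long as the offset is saved reliably.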

In step 102, the N bloom filters are used for recording the counted visitors according to a preset field; the N bloom filters establish mapping relations with preset fields with values within the preset value range in advance; each preset field in the preset value range uniquely corresponds to one bloom filter; the M bloom filters are: and the bloom filters to which the plurality of pieces of data are mapped in the N bloom filters according to the mapping relation.

For example, the preset value range is mapped to a total of 10 bloom filters (i.e., N = 10). Each bloom filter is mapped to one sub-range of the preset value range: bloom filter 1 is mapped to preset field values within 0-1000, bloom filter 2 to values within 1001-2000, …, and bloom filter 10 to values within 9001-10000. The plurality of pieces of data are 8 pieces of data (500, 1500, …, 7500, respectively), and the bloom filters to which these 8 pieces of data are mapped according to the mapping relation are bloom filter 1 to bloom filter 8 (i.e., M = 8).
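
The range-based mapping in this example can be written as a small Python sketch. The function name `filter_number` is hypothetical, and the list of 8 values assumes that the elided values continue in steps of 1000.

```python
def filter_number(value: int, width: int = 1000, n_filters: int = 10) -> int:
    """Map a preset-field value to a bloom filter number in 1..n_filters.

    Ranges follow the example: filter 1 covers 0-1000, filter 2 covers 1001-2000, and so on.
    """
    if not 0 <= value <= width * n_filters:
        raise ValueError("value outside the preset value range")
    return max(1, (value - 1) // width + 1)


# Assumed expansion of the example data: 500, 1500, ..., 7500.
values = [500 + 1000 * i for i in range(8)]
selected = sorted({filter_number(v) for v in values})
print(selected)  # prints "[1, 2, 3, 4, 5, 6, 7, 8]"
```

Only the 8 selected filters need to be consulted for this batch; filters 9 and 10 are left untouched.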

An alternative implementation of step 102 is as follows:

dividing the plurality of pieces of data into a plurality of groups of data; wherein the data mapped to the same bloom filter in the M bloom filters is a group; aiming at each group of data in the multiple groups of data, determining the number of newly added visitors of the group of data according to the bloom filter mapped to the group of data in the N bloom filters; and taking the sum of the number of newly added visitors of each bloom filter in the M bloom filters as the number of newly added visitors in the first front-end page behavior data.

Continuing the example, each of the 8 pieces of data forms its own group, and each group corresponds to one bloom filter. For each piece of data, whether its preset field has already been recorded can be determined through the bloom filter to which the piece of data is mapped. If the preset field is not stored in that bloom filter, the piece of data is regarded as the data of a newly added visitor, and the count of newly added visitors is increased by 1. Adding up the numbers of newly added visitors of all groups gives the number of newly added visitors in the first front-end page behavior data.
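
The group-wise counting can be sketched in Python as below. For brevity, a plain `set` is used as an exact stand-in for each bloom filter; any object supporting `in` and `add` could be substituted. The function name `count_new_visitors` and the `route` callback, which returns the filter number for a preset field, are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set


def count_new_visitors(fields: Iterable[str],
                       filters: Dict[int, Set[str]],
                       route: Callable[[str], int]) -> int:
    """Count preset fields not yet recorded, consulting only the filters that receive data."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for field in fields:
        groups[route(field)].append(field)   # data mapped to the same filter form one group

    total = 0
    for number, group in groups.items():     # only the M mapped filters are visited
        seen = filters[number]
        for field in group:
            if field not in seen:            # not recorded yet: a newly added visitor
                seen.add(field)
                total += 1
    return total


filters = {i: set() for i in range(4)}       # N = 4 stand-in filters
route = lambda field: int(field) % 4         # hypothetical mapping rule
print(count_new_visitors(["1", "5", "2", "1"], filters, route))  # prints "3"
print(count_new_visitors(["1", "9"], filters, route))            # prints "1"
```

Because each group touches only its own filter, the groups could also be processed independently of one another.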

In the above alternative embodiment, the plurality of pieces of data may be divided into a plurality of groups of data as follows:

The N bloom filters are numbered sequentially from 0 to N-1 in advance. A hash value of each piece of data is obtained from the preset field of that piece of data; the remainder of each hash value modulo N is calculated, and the data with the same remainder are taken as one group, so that the data in each group are mapped to the bloom filter whose number equals the remainder.
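
A minimal Python sketch of this hash-and-remainder grouping follows. The choice of SHA-256 as the hash function and the function names are assumptions for illustration; the application does not name a specific hash. A stable hash is used because Python's built-in `hash` of strings varies between processes.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def bloom_filter_index(preset_field: str, n: int) -> int:
    """Return the number (0..n-1) of the bloom filter for a preset field."""
    digest = hashlib.sha256(preset_field.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:8], "big")
    return hash_value % n                    # remainder of the hash value modulo N


def group_by_filter(fields: Iterable[str], n: int) -> Dict[int, List[str]]:
    """Group preset fields so that fields with the same remainder share one group."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for field in fields:
        groups[bloom_filter_index(field, n)].append(field)
    return dict(groups)


fields = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.1"]  # hypothetical preset fields
groups = group_by_filter(fields, n=4)
print(len(groups) <= 4)  # prints "True"
```

The same preset field always lands in the same group, so a visitor is always checked against the same filter across batches.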

In step 103, the number of newly added visitors is the number of visitors which are not counted by the N bloom filters.

In the method of steps 101 to 103, the number of counted visitors can be stored in the database for persistent storage. After step 103, the number of counted visitors may be updated in the following manner:

taking the sum of the number of counted visitors and the number of newly added visitors as the real-time statistical visitor number; the real-time statistical visitor number is the number of visitors currently counted by the N bloom filters; and storing the real-time statistical visitor number in the database.

With this storage manner, the updated real-time statistical visitor number is persisted in the database, which prevents it from being lost when data in the computer's memory is lost.

It should be noted that, based on the above manner, in the process of updating the first position offset amount to the second position offset amount in step 101, the first position offset amount and the second position offset amount may be stored in the database in the following manner:

The first position offset amount is initially stored in a first data table of the database. After the first position offset amount is updated to a second position offset amount, the second position offset amount may be stored in a second data table of the database. After the real-time statistical visitor number is stored in the database, the second position offset amount in the second data table is stored in the first data table. Note that the second position offset amount then overwrites the first position offset amount.

For example, offset_1 (the first position offset amount) is stored in data table A (the first data table), and after offset_1 is updated to offset_2 (the second position offset amount), offset_2 is stored in data table B (the second data table). If a fault occurs before the real-time statistical visitor number is stored in the database, the first position offset amount can be obtained again because it has not been overwritten in the first data table. Once the real-time statistical visitor number is stored in the database, the statistics have been updated and the first position offset amount is no longer needed, so the second position offset amount overwrites it.
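
The ordering of the three writes can be sketched in Python as follows. Nested dicts stand in for the database tables, and the function name `commit_batch`, the exception `SimulatedCrash`, the table keys and the numeric values are all hypothetical. The `crash_before_count` flag exists only to imitate a fault between the first and second writes.

```python
from typing import Dict


class SimulatedCrash(Exception):
    """Imitates a failure after staging the offset but before saving the count."""


def commit_batch(db: Dict[str, dict], new_visitors: int, second_offset: int,
                 crash_before_count: bool = False) -> None:
    # Write 1: stage offset_2 in the second (intermediate) data table.
    db["offset_tmp"]["offset"] = second_offset
    if crash_before_count:
        raise SimulatedCrash("failure before the visitor count was stored")
    # Write 2: store the real-time statistical visitor number.
    db["uv"]["count"] = db["uv"].get("count", 0) + new_visitors
    # Write 3: copy offset_2 into the first (final) data table, overwriting offset_1.
    db["offsets"]["offset"] = db["offset_tmp"]["offset"]


db = {"offsets": {"offset": 100}, "offset_tmp": {}, "uv": {"count": 40}}

try:
    commit_batch(db, new_visitors=5, second_offset=160, crash_before_count=True)
except SimulatedCrash:
    pass
state_after_crash = (db["offsets"]["offset"], db["uv"]["count"])

commit_batch(db, new_visitors=5, second_offset=160)
state_after_retry = (db["offsets"]["offset"], db["uv"]["count"])
print(state_after_crash, state_after_retry)  # prints "(100, 40) (160, 45)"
```

After the simulated fault the first table still holds offset_1, so the batch can be read again from the same position.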

The method for counting the number of web page visitors provided by the embodiment of the application is described by specific examples.

As shown in Fig. 2, the whole process of the webpage visitor number statistical method can be performed by the following five modules: a data reporting module 201, a data acquisition access module 202, a message middleware 203, a UV real-time calculation module 204 and a UV query module 205.

In the first step, the data reporting module 201 obtains front-end behavior data of a page to be counted. The data may be obtained using the method described in step 101 and its optional implementations.

Specifically, the data reporting module 201 may collect the front-end behavior data through a toolkit (wa-sdk) embedded in the front-end page or the terminal page of the access party, and report the front-end behavior data collected through wa-sdk to the data collection access module 202.

It should be noted that wa-sdk may define its field formats in one-to-one correspondence with the field formats in the database, so that the fields can be stored in the database directly without parsing. If the security requirement is high, an encryption algorithm can be added to encrypt each whole piece of data, which is decrypted with the agreed key after it is received.

In the second step, the data acquisition access module 202 receives the front-end behavior data reported by the data reporting module 201, and forwards the front-end behavior data to the message middleware 203.

In the third step, the message middleware 203 receives the front-end behavior data forwarded by the data acquisition access module 202 and stores the front-end behavior data in the message middleware 203.

In the fourth step, the UV real-time calculation module 204 extracts the front-end behavior data from the message middleware 203, calculates the UV value, and stores the UV value in the database.

The UV real-time calculation module 204 may calculate the UV value once every preset batch duration (batchDuration). In the fourth step the UV value is stored not in memory but in the database, which is persistent storage. The timeliness of the calculation depends on the batch interval parameter batchDuration, which is generally set to 5 to 30 seconds. The specific setting depends on the data volume and on the computing resources available to the batch.

In the fifth step, the UV query module 205 provides the interface to query the UV value counted in real time.

In the fourth step, the UV real-time calculation uses the SparkStreaming real-time computing framework, and deduplicated statistics are computed with fragmented bloom filters (BF) to obtain the UV value.

The parameters involved include: batchDuration (the interval at which SparkStreaming submits batches), bfNumElements (the expected amount of data for a single BF, a parameter for constructing the BF), bfFalsePosProb (the error rate of a single BF, a parameter for constructing the BF), bfnumPartition (the number of data fragments), and bfnumBfPartition (the number of BF fragments). As shown in fig. 3, the main flow of the SparkStreaming-based UV real-time calculation module 204 is as follows:

Step 301: The parameters are initialized.

Step 302: one batch task is submitted every batchDuration (preset period) to consume data from kafka.

Step 303: UV1 of the batch data is calculated based on the fragmented BF.

Step 304: the UV1 (number of new visitors) of the batch data plus the previously calculated UV2 (number of visitors counted up) gives the total UV3 (number of real time visitors) up to the batch.

Step 305: UV3 is persisted to a database.
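
The accumulation in steps 302 to 305 can be illustrated with a minimal Python sketch. It is not the SparkStreaming implementation: the function name `run_batches` is hypothetical, each inner list stands in for one batch consumed from kafka, and an exact Python set is used as a placeholder for the fragmented BF.

```python
def run_batches(batches):
    """Yield the running total UV3 after each batch.

    `batches` is a list of lists of visitor identifiers. `seen` is an exact
    set used only as a placeholder for the fragmented bloom filters.
    """
    seen = set()
    uv2 = 0                              # visitors counted before this batch
    for batch in batches:                # step 302: one batch per batchDuration
        new_ids = set(batch) - seen      # step 303: UV1, newly added visitors
        seen |= new_ids
        uv3 = uv2 + len(new_ids)         # step 304: UV3 = UV1 + UV2
        yield uv3                        # step 305: the caller persists UV3
        uv2 = uv3                        # UV3 becomes UV2 for the next batch


totals = list(run_batches([["u1", "u2", "u1"], ["u2", "u3"], ["u1"]]))
print(totals)  # [2, 3, 3]
```

The third batch contains only a visitor that was already counted, so the total stays at 3.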

As shown in fig. 4, the specific process from step 302 to step 305 is as follows:

Step 401: The position offset amount (first position offset amount) reached when kafka data was last consumed is read from the offsetTableName table of HBase.

Step 402: Data in kafka continues to be consumed into memory, starting from the first position offset amount.

Step 403: The latest offset (second position offset amount) reached by the kafka data consumed in step 402 (the first front-end page behavior data) is recorded in the offsetTmpTableName table of HBase.

Step 404: The UV generated by the batch (number of newly added visitors) is calculated in real time using the fragment-based BF.

Step 405: the UV1 (number of new visitors) of the batch data plus the previously calculated UV2 (number of visitors counted up) gives the total UV3 (number of real time visitors) up to the batch.

It should be noted that UV2, the total up to the previous batch, can be obtained by querying the pvUVTableName table of HBase.

Step 406: UV3 is persisted to a database.

Specifically, UV3 is written to the pvUVTableName table of HBase for use in the next batch. After step 406, UV3 is in the database and can be queried through the query interface.

Step 407: The position offset amount (second position offset amount) in the offsetTmpTableName table (second data table) of HBase is written to the offsetTableName table (first data table).

In steps 401 to 407, kafka data is consumed starting from the offset recorded in the offsetTableName table, and after the kafka data is consumed, the latest offset is cached in the offsetTmpTableName table. After the UV of the batch has been calculated and persisted to the database, the latest offset in the offsetTmpTableName table is written back to the offsetTableName table. This prevents a problem in the calculation process, a task failure, or a restart from leaving kafka data consumed but not counted in the UV.
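
The offset handling of steps 401 to 407 can be sketched in Python as follows. This is an illustration under stated assumptions, not the HBase and kafka implementation: a list stands in for the kafka log, a dict stands in for the HBase tables, the names `run_batch`, `BatchFailure` and the dict keys are hypothetical, and an exact set replaces the BF. For simplicity the deduplication state is updated only together with the UV, so the simulated crash leaves it untouched.

```python
class BatchFailure(Exception):
    """Simulated task failure."""


def run_batch(state, log, batch_size, fail=False):
    """One pass through steps 401-407 over an in-memory stand-in for kafka."""
    start = state["offsetTable"]                     # step 401
    batch = log[start:start + batch_size]            # step 402
    state["offsetTmpTable"] = start + len(batch)     # step 403
    new_ids = set(batch) - state["seen"]             # step 404 (exact set, not a BF)
    if fail:
        raise BatchFailure("crash before UV3 is persisted")
    state["seen"] |= new_ids
    state["uv"] += len(new_ids)                      # steps 405-406
    state["offsetTable"] = state["offsetTmpTable"]   # step 407
    return state["uv"]


log = ["u1", "u2", "u1", "u3", "u2", "u4"]
state = {"offsetTable": 0, "offsetTmpTable": 0, "seen": set(), "uv": 0}

run_batch(state, log, 3)                  # counts u1 and u2
try:
    run_batch(state, log, 3, fail=True)   # consumes u3, u2, u4, then crashes
except BatchFailure:
    pass
offset_after_crash = state["offsetTable"]  # still 3: the commit never happened
run_batch(state, log, 3)                  # the same three records are replayed
print(state["uv"], state["offsetTable"])  # 4 6
```

Because the committed offset only advances in step 407, the failed batch is consumed again on the next run and its visitors are still counted.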

With the method shown in fig. 4, a low-cost real-time UV statistical method is implemented on SparkStreaming: the UV is calculated in real time with fragmented BFs, and the BFs are persisted in HBase. One very large memory-resident BF is replaced by bfnumBfPartition small BFs that are persisted in HBase, and the corresponding BF is loaded into memory only when a membership check is needed, which greatly reduces the memory consumed by any single BF. In addition, managing the kafka offsets in HBase avoids duplicate computation of data caused by a restart or failure. The configuration parameters can be traded off against one another according to the actual data volume. The method thereby offers faster calculation, lower memory consumption, and higher accuracy.
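
The memory effect can be estimated with the standard bloom filter sizing formula m = -n ln(p) / (ln 2)^2, where n is the expected number of elements and p is the false positive rate. The Python sketch below uses hypothetical values for the visitor count, bfFalsePosProb and bfnumBfPartition; none of these figures comes from the application.

```python
import math


def bloom_bits(num_elements, false_pos_prob):
    """Optimal bit count of a bloom filter: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-num_elements * math.log(false_pos_prob) / math.log(2) ** 2)


total_visitors = 100_000_000   # hypothetical number of distinct visitors
bf_false_pos_prob = 0.01       # hypothetical value for bfFalsePosProb
bf_num_bf_partition = 100      # hypothetical value for bfnumBfPartition

single = bloom_bits(total_visitors, bf_false_pos_prob)
per_shard = bloom_bits(total_visitors // bf_num_bf_partition, bf_false_pos_prob)

print(f"one resident BF: {single / 8 / 2**20:.1f} MiB")
print(f"one small BF   : {per_shard / 8 / 2**20:.2f} MiB loaded at a time")
```

The total number of bits across all small BFs is about the same as for the single large BF. The saving comes from holding only the BFs needed by the current check in memory, while the rest stay in HBase.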

As shown in fig. 5, the specific process of step 404 may be as follows:

Step 501: A batch of data is fragmented.

Specifically, the data is divided into bfnumPartition fragments (that is, the plurality of pieces of data is divided into a plurality of groups of data, and each fragment is one group of data). To assign a fragment, the preset field of each piece of data is obtained (for UV statistics the preset field is, for example, the IP address), a hash value is computed from the preset field, and the hash value modulo bfnumPartition is the fragment number. Each fragment contains a plurality of pieces of data.

Step 502: The UV of the data inside each fragment is calculated. Specifically, the newly added UV of the fragment data is calculated based on the BF of the fragment. The newly added UV of each fragment is calculated in parallel.

After step 502, the newly added UVs of all fragments are summed to obtain the newly added UV of the batch.
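
The fragmentation in step 501 can be sketched in Python as below. The application does not name a hash function, so `zlib.crc32` is an assumption chosen because it gives the same value in every process. The record layout with an `ip` key and the function names `shard_of` and `fragment` are also hypothetical.

```python
import zlib
from collections import defaultdict


def shard_of(preset_field, num_partitions):
    # zlib.crc32 is an assumed stand-in for the unspecified hash function.
    # Unlike Python's built-in hash() on strings, it is stable across runs.
    return zlib.crc32(preset_field.encode("utf-8")) % num_partitions


def fragment(batch, num_partitions):
    """Step 501: group records by hash(preset field) mod bfnumPartition."""
    shards = defaultdict(list)
    for record in batch:
        shards[shard_of(record["ip"], num_partitions)].append(record)
    return shards


batch = [{"ip": "10.0.0.1"}, {"ip": "10.0.0.2"}, {"ip": "10.0.0.1"}]
shards = fragment(batch, num_partitions=4)

# Every record lands in exactly one fragment.
assert sum(len(records) for records in shards.values()) == len(batch)
```

Records with the same preset field always fall into the same fragment, so no visitor is split across fragments. That is what allows the per-fragment results of step 502 to be computed in parallel and then summed.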

Steps 501 to 502 describe how a batch of data is fragmented and how the newly added UV of each fragment is calculated with the fragmented BF. Calculating the UV requires deduplicating the new data against historical data. Following the idea of fragmenting the BF, bfnumBfPartition BFs are maintained and persisted in an HBase table. To decide whether a piece of data has already appeared, the BF corresponding to that piece of data is first looked up according to a given condition, and the piece of data is then checked against that BF. As shown in fig. 6, the deduplication method for a single fragment of data is as follows:

Step 601: Each piece of data in the fragment is processed in a loop.

Step 602: For each piece of data, the preset field of the data is obtained, and the hash value of the preset field is taken modulo bfnumBfPartition.

Step 603: The bloom filter corresponding to the piece of data is obtained.

Specifically, the bfTableName table of HBase is queried for the corresponding BF. If no corresponding BF is found, a new BF is constructed from the parameters bfFalsePosProb and bfNumElements; if one is found, the found BF is used.

Step 604: Using the bloom filter from step 603, it is determined whether the preset field of the piece of data already exists.

If it exists, the piece of data does not count toward the newly added UV, that is, its newly added UV is 0; if it does not exist, the piece of data is counted toward the newly added UV and is added to the bloom filter obtained in step 603.

Step 605: The BF updated in step 604 is stored in HBase.

That is, the updated BF is written to the bfTableName table of HBase according to the preset field in step 602, for use in the next data deduplication.

Step 606: The newly added UV of each piece of data in the fragment is summed to obtain the total newly added UV of the fragment data.
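
Steps 601 to 606 can be illustrated with the following Python sketch. It is a simplified model rather than the described implementation: a dict stands in for the bfTableName table of HBase, `zlib.crc32` and SHA-256 are assumed hash functions, the `BloomFilter` class is a minimal hand-written one, and the default values for bfNumElements and bfFalsePosProb are hypothetical.

```python
import hashlib
import math
import zlib


class BloomFilter:
    """Minimal bloom filter sized from expected elements and error rate."""

    def __init__(self, num_elements, false_pos_prob):
        self.m = math.ceil(-num_elements * math.log(false_pos_prob) / math.log(2) ** 2)
        self.k = max(1, round(self.m / num_elements * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def __contains__(self, key):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(key))

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)


def dedup_fragment(keys, bf_table, num_bf_partition,
                   num_elements=1000, false_pos_prob=0.01):
    """Return the newly added UV of one fragment; bf_table mimics bfTableName."""
    new_uv = 0
    for key in keys:                                                # step 601
        bf_id = zlib.crc32(key.encode("utf-8")) % num_bf_partition  # step 602
        bf = bf_table.get(bf_id)                                    # step 603
        if bf is None:
            bf = BloomFilter(num_elements, false_pos_prob)
        if key not in bf:                                           # step 604
            bf.add(key)
            new_uv += 1
        bf_table[bf_id] = bf                                        # step 605
    return new_uv                                                   # step 606


table = {}
first = dedup_fragment(["10.0.0.1", "10.0.0.2", "10.0.0.1"], table, num_bf_partition=8)
second = dedup_fragment(["10.0.0.2", "10.0.0.3"], table, num_bf_partition=8)
print(first, second)
```

Barring a bloom filter false positive, the first call counts two new visitors and the second call counts one, because 10.0.0.2 is already recorded in its BF. A false positive can only cause an undercount, never a double count.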

The application provides a webpage visitor number statistical device, including: an obtaining module 701, configured to obtain first front-end page behavior data of a webpage in a preset period; the first front-end page behavior data is recorded data generated when a visitor clicks the webpage; the first front-end page behavior data comprises a plurality of pieces of data, each piece of data comprises a preset field for uniquely identifying the visitor, and the value of each preset field is within a preset value range; and a processing module 702, configured to determine M bloom filters of the N bloom filters; M and N are positive integers; the N bloom filters are used for recording the counted visitors according to preset fields; the N bloom filters establish mapping relations in advance with preset fields whose values are within the preset value range; each preset field in the preset value range uniquely corresponds to one bloom filter; the M bloom filters are the bloom filters, among the N bloom filters, to which the plurality of pieces of data are mapped according to the mapping relation; the processing module 702 is further configured to determine the number of newly added visitors in the preset period according to the M bloom filters and the first front-end page behavior data; the number of newly added visitors is the number of visitors not yet counted by the N bloom filters.

In an optional implementation manner, the processing module 702 is specifically configured to: dividing the plurality of pieces of data into a plurality of groups of data; wherein the data mapped to the same bloom filter in the M bloom filters is a group; aiming at each group of data in the multiple groups of data, determining the number of newly added visitors of the group of data according to the bloom filter mapped to the group of data in the N bloom filters; and taking the sum of the number of newly added visitors of each bloom filter in the M bloom filters as the number of newly added visitors in the first front-end page behavior data.

In an alternative embodiment, the N bloom filters are numbered sequentially from 0 to N-1; the processing module 702 is specifically configured to: acquiring a hash value of each piece of data according to a preset field of each piece of data in the plurality of pieces of data; and calculating the remainder of the hash value of each piece of data in the plurality of pieces of data for N, and taking the data with the same remainder as a group to map the data in each group with the bloom filter with the same number as the remainder.

In an alternative embodiment, the number of counted visitors is stored in a database; the processing module 702 is further configured to: take the sum of the number of counted visitors and the number of newly added visitors as the real-time statistical visitor number; the real-time statistical visitor number is the number of visitors currently counted by the N bloom filters; and store the real-time statistical visitor number in the database.

In an alternative embodiment, the database further stores a first amount of position offset of the message middleware; the message middleware is used for storing front-end page behavior data acquired from the webpage; the first position offset is used for indicating the position of second front-end page behavior data acquired from the message middleware in the message middleware; the obtaining module 701 is specifically configured to: obtaining the first position offset amount from the database; starting from the data after the first position offset of the message middleware, taking front-end page behavior data acquired from the message middleware in the preset period as the first front-end page behavior data; and updating the first position offset amount into a second position offset amount according to the first front-end page behavior data, and storing the second position offset amount in the database.

In an alternative embodiment, the first amount of positional offset is stored in a first data table of the database; the processing module 702 is further configured to: after updating the first position offset amount to a second position offset amount, and before storing the real-time statistical visitor number into the database, storing the second position offset amount in a second data table of the database; storing the second location offset in the second data table in the first data table after storing the real-time statistical visitor number in the database.

The embodiment of the application provides computer equipment, which comprises a program or an instruction, and when the program or the instruction is executed, the program or the instruction is used for executing the webpage visitor number statistical method and any optional method provided by the embodiment of the application.

The embodiment of the application provides a storage medium, which comprises a program or an instruction, and when the program or the instruction is executed, the program or the instruction is used for executing the webpage visitor number statistical method and any optional method provided by the embodiment of the application.

Finally, it should be noted that: as will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A statistical method for webpage visitor number is characterized by comprising the following steps:

acquiring first front-end page behavior data of a webpage in a preset period; the first front-end page behavior data is recorded data generated when a visitor clicks the webpage; the first front-end page behavior data comprises a plurality of pieces of data, each piece of data comprises a preset field for uniquely identifying the visitor, and the value of each preset field is within a preset value range;

determining M bloom filters of the N bloom filters; M and N are positive integers; the N bloom filters are used for recording the counted visitors according to preset fields; the N bloom filters establish mapping relations with preset fields with values within the preset value range in advance; each preset field in the preset value range uniquely corresponds to one bloom filter; the M bloom filters are: the bloom filter to which the plurality of pieces of data are mapped in the N bloom filters according to the mapping relation;

determining the number of newly added visitors in the preset period according to the M bloom filters and the first front-end page behavior data; the newly added visitor quantity is the visitor quantity which is not counted by the N bloom filters.

2. The method of claim 1, wherein said determining a number of newly added visitors in said first front-end page behavior data based on said M bloom filters comprises:

dividing the plurality of pieces of data into a plurality of groups of data; wherein the data mapped to the same bloom filter in the M bloom filters is a group;

aiming at each group of data in the multiple groups of data, determining the number of newly added visitors of the group of data according to the bloom filter mapped to the group of data in the N bloom filters;

and taking the sum of the number of newly added visitors of each bloom filter in the M bloom filters as the number of newly added visitors in the first front-end page behavior data.

3. The method of claim 2, wherein the N bloom filters are numbered sequentially from 0 to N-1; the dividing the plurality of pieces of data into a plurality of groups of data includes:

acquiring a hash value of each piece of data according to a preset field of each piece of data in the plurality of pieces of data;

and calculating the remainder of the hash value of each piece of data in the plurality of pieces of data for N, and taking the data with the same remainder as a group to map the data in each group with the bloom filter with the same number as the remainder.

4. A method according to any of claims 1-3, wherein the number of counted visitors is stored in a database; after determining the number of newly added visitors in the preset period according to the M bloom filters, the method further comprises the following steps:

5. The method of claim 4, wherein the database further stores a first amount of position offset for message middleware; the message middleware is used for storing front-end page behavior data acquired from the webpage; the first position offset is used for indicating the position of second front-end page behavior data acquired from the message middleware in the message middleware; the acquiring of the first front-end page behavior data of the webpage in the preset period includes:

obtaining the first position offset amount from the database;

starting from the data after the first position offset of the message middleware, taking front-end page behavior data acquired from the message middleware in the preset period as the first front-end page behavior data; and updating the first position offset amount into a second position offset amount according to the first front-end page behavior data, and storing the second position offset amount in the database.

6. The method of claim 5, wherein the first amount of position offset is stored in a first data table of the database; after updating the first location offset amount to the second location offset amount and before storing the real-time statistical visitor number in the database, the method further includes:

storing the second amount of position offset in a second data table of the database;

after the storing the real-time statistical visitor number into the database, the method further includes:

storing the second position offset amount in the second data table in the first data table.

7. A webpage visitor number counting device is characterized by comprising:

the acquisition module is used for acquiring first front-end page behavior data of a webpage in a preset period; the first front-end page behavior data is recorded data generated when a visitor clicks the webpage; the first front-end page behavior data comprises a plurality of pieces of data, each piece of data comprises a preset field for uniquely identifying the visitor, and the value of each preset field is within a preset value range;

a processing module for determining M bloom filters of the N bloom filters; M and N are positive integers; the N bloom filters are used for recording the counted visitors according to preset fields; the N bloom filters establish mapping relations with preset fields with values within the preset value range in advance; each preset field in the preset value range uniquely corresponds to one bloom filter; the M bloom filters are: the bloom filter to which the plurality of pieces of data are mapped in the N bloom filters according to the mapping relation; and for determining the number of newly added visitors in the preset period according to the M bloom filters and the first front-end page behavior data; the newly added visitor quantity is the visitor quantity which is not counted by the N bloom filters.

8. The apparatus of claim 7, wherein the processing module is specifically configured to:

9. The apparatus of claim 8, wherein the N bloom filters are numbered sequentially from 0 to N-1; the processing module is specifically configured to:

10. The apparatus according to any of claims 7-9, wherein the number of counted visitors is stored in a database; the processing module is further configured to:

11. A computer device comprising a program or instructions that, when executed, perform the method of any of claims 1 to 6.

12. A storage medium comprising a program or instructions which, when executed, perform the method of any one of claims 1 to 6.