CN110851758B - Webpage visitor quantity counting method and device - Google Patents


Info

Publication number
CN110851758B
CN110851758B CN201911044278.8A CN201911044278A CN110851758B CN 110851758 B CN110851758 B CN 110851758B CN 201911044278 A CN201911044278 A CN 201911044278A CN 110851758 B CN110851758 B CN 110851758B
Authority
CN
China
Prior art keywords
data
bloom filters
visitor
bloom
visitors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911044278.8A
Other languages
Chinese (zh)
Other versions
CN110851758A (en
Inventor
卢道和
罗锶
陈晓峰
胡思文
李勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN201911044278.8A priority Critical patent/CN110851758B/en
Publication of CN110851758A publication Critical patent/CN110851758A/en
Priority to PCT/CN2020/121112 priority patent/WO2021082936A1/en
Application granted granted Critical
Publication of CN110851758B publication Critical patent/CN110851758B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for counting the number of web page visitors. The method comprises: acquiring first front-end page behavior data of a webpage in a preset period; determining M bloom filters among N bloom filters, the M bloom filters being the bloom filters, among the N bloom filters, to which the pieces of data in the first front-end page behavior data are mapped according to a mapping relation; and determining the number of newly added visitors in the preset period according to the M bloom filters and the first front-end page behavior data, the number of newly added visitors being the number of visitors not yet counted by the N bloom filters. When the method is applied to financial technology (Fintech), fewer bloom filters are required, which reduces storage occupation and cost.

Description

Webpage visitor quantity counting method and device
Technical Field
The invention relates to the field of computer software of financial science and technology (Fintech), in particular to a method and a device for counting the number of web page visitors.
Background
With the development of computer technology, more and more technologies (big data, distributed, blockchain, artificial intelligence, etc.) are applied in the financial field, and the traditional financial industry is gradually changing to financial technology (Fintech). At present, in the field of financial science and technology, financial products, financial policies and the like are often displayed in the form of web pages, and the number of visitors of the web pages characterizes interests of users, values of the products and the like to a certain extent. Therefore, it is very necessary to count the number of visitors to one financial web page.
The current statistical method extracts and parses the log data of a webpage, loads a bloom filter, and determines whether each piece of log data has already appeared by checking whether it maps to bits already set in the bloom filter, thereby counting the number of webpage visitors. However, when the amount of data is large, a correspondingly large bloom filter is required, resulting in a large memory footprint. The large memory occupation of the bloom filter in the prior art therefore leads to high energy consumption and high cost, and is a problem that urgently needs to be solved.
Disclosure of Invention
The embodiment of the application provides a webpage visitor quantity counting method and device, which solve the problem that the memory occupation of a bloom filter is large when visitor quantity is counted in the prior art.
In a first aspect, an embodiment of the present application provides a method for counting the number of web page visitors, including: acquiring first front-end page behavior data of a webpage in a preset period, wherein the first front-end page behavior data is record data generated when visitors click the webpage and comprises a plurality of pieces of data, each piece of data comprises a preset field for uniquely identifying a visitor, and the values of the preset fields are all within a preset value range; determining M bloom filters among N bloom filters, wherein M and N are positive integers, the N bloom filters are used for recording the counted visitors according to the preset field, a mapping relation is pre-established between the N bloom filters and the preset fields whose values are within the preset value range, each preset field within the preset value range uniquely corresponds to one bloom filter, and the M bloom filters are the bloom filters, among the N bloom filters, to which the plurality of pieces of data are mapped according to the mapping relation; and determining the number of newly added visitors in the preset period according to the M bloom filters and the first front-end page behavior data, wherein the number of newly added visitors is the number of visitors not yet counted by the N bloom filters.
In the method, the acquired behavior data of the front-end page comprises a plurality of pieces of data, and the N bloom filters are obtained by dividing the preset value range of the preset field in N segments according to a set rule in advance, and each preset field is uniquely corresponding to one bloom filter, so that the plurality of pieces of data can necessarily determine M bloom filters in the N bloom filters according to the set rule. Because the N bloom filters record the preset fields of the visitors who have accessed the front-end page, the bloom filters to which a plurality of pieces of data are not mapped do not need to be considered, the number of newly-increased visitors in the preset period can be determined by only M bloom filters, and obviously M is smaller than or equal to N, therefore, compared with the prior art, the method reduces the number of the bloom filters required, reduces the storage occupation and also reduces the cost.
In an optional embodiment, the determining, according to the M bloom filters, the number of newly added guests in the first front-end page behavior data includes: dividing the plurality of pieces of data into a plurality of groups of data; wherein the data mapped to the same bloom filter in the M bloom filters is a group; for each set of data in the plurality of sets of data, determining the number of newly added visitors of the set of data according to bloom filters mapped by the set of data in the N bloom filters; and taking the sum of the number of the newly added visitors of each bloom filter in the M bloom filters as the number of the newly added visitors in the first front-end page behavior data.
In this manner, the plurality of pieces of data are divided into a plurality of groups, the data mapped to the same bloom filter among the M bloom filters forming one group. For each group, the number of newly added visitors is determined according to the bloom filter to which that group is mapped, and the groups can be counted at the same time, which improves the efficiency of counting the newly added visitors in the first front-end page behavior data.
In an alternative embodiment, the N bloom filters are numbered sequentially from 0 to N-1; the dividing the plurality of pieces of data into a plurality of sets of data includes: acquiring a hash value of each piece of data according to a preset field of the data in the pieces of data; and calculating a remainder for N according to the hash value of each piece of data in the plurality of pieces of data, and taking the data with the same remainder as a group, so that the data of each group is mapped with a bloom filter with the same number as the remainder.
In this manner, the hash value of each piece of data is obtained from its preset field and reduced modulo N, and the data with the same remainder form one group. Because the preset fields of the data are random, the resulting remainders are also random, so the plurality of pieces of data are mapped to the bloom filters uniformly and randomly. This provides one way of mapping the plurality of pieces of data to the bloom filters.
In an alternative embodiment, the number of guests of the counted guests is stored in a database; after determining the number of the newly added visitors in the preset period according to the M bloom filters, the method further comprises the following steps: taking the counted visitor number and the newly-added visitor number as real-time visitor number statistics; the real-time statistics visitor number is the visitor number counted by the N bloom filters currently; and storing the real-time statistical visitor quantity into the database.
In this manner, the number of counted visitors is stored in the database and can therefore be kept long term. After the real-time statistical visitor number is obtained, it is also stored in the database, so that the updated number is persisted and is not lost when in-memory data disappears because of a computer failure.
In an alternative embodiment, the database also stores a first position offset of the message middleware; the message middleware is used for storing front-end page behavior data acquired from the webpage; the first position offset is used for indicating the position of second front-end page behavior data which is acquired from the message middleware in the message middleware; the acquiring the first front-end page behavior data of the webpage in the preset period comprises the following steps: acquiring the first position offset from the database; starting from the data after the first position offset of the message middleware, taking front-end page behavior data acquired from the message middleware in the preset period as the first front-end page behavior data; and updating the first position offset into a second position offset according to the first front-end page behavior data, and storing the second position offset into the database.
In this manner, the front-end page behavior data acquired from the webpage is stored in the message middleware, from which it can be conveniently obtained, and the first position offset is stored stably and persistently in the database. Starting from the data after the first position offset, the front-end page behavior data acquired from the message middleware in the preset period is taken as the first front-end page behavior data. The position, within the message middleware, of the data already acquired can thus be recorded more stably and accurately, so that front-end page behavior data is acquired accurately and the real-time statistical visitor number is determined accurately.
In an alternative embodiment, the first position offset is stored in a first data table of the database; after the updating the first position offset to the second position offset, before the storing the real-time statistics visitor number in the database, the method further includes: storing the second position offset in a second data table of the database; after the real-time statistics visitor number is stored in the database, the method further comprises: the second position offset in the second data table is stored in the first data table.
In the above manner, the first position offset is stored in a first data table of the database, and after the first position offset is updated to a second position offset, the second position offset is stored in a second data table of the database before the real-time statistics visitor number is stored in the database; storing the second position offset in the second data table in the first data table after the real-time statistical visitor number is stored in the database; the situation that the acquisition of the number of the visitors fails but the first position offset is covered is avoided, so that the position of the front-end page behavior data acquired from the message middleware in the message middleware can be further and more stably and accurately recorded.
In a second aspect, the present application provides a device for counting the number of web page visitors, including: an acquisition module for acquiring first front-end page behavior data of a webpage in a preset period, wherein the first front-end page behavior data is record data generated when visitors click the webpage and comprises a plurality of pieces of data, each piece of data comprises a preset field for uniquely identifying a visitor, and the values of the preset fields are all within a preset value range; and a processing module for determining M bloom filters among N bloom filters, wherein M and N are positive integers, the N bloom filters are used for recording the counted visitors according to the preset field, a mapping relation is pre-established between the N bloom filters and the preset fields whose values are within the preset value range, each preset field within the preset value range uniquely corresponds to one bloom filter, and the M bloom filters are the bloom filters, among the N bloom filters, to which the plurality of pieces of data are mapped according to the mapping relation; the processing module is further used for determining the number of newly added visitors in the preset period according to the M bloom filters and the first front-end page behavior data, wherein the number of newly added visitors is the number of visitors not yet counted by the N bloom filters.
In an alternative embodiment, the processing module is specifically configured to: dividing the plurality of pieces of data into a plurality of groups of data; wherein the data mapped to the same bloom filter in the M bloom filters is a group; for each set of data in the plurality of sets of data, determining the number of newly added visitors of the set of data according to bloom filters mapped by the set of data in the N bloom filters; and taking the sum of the number of the newly added visitors of each bloom filter in the M bloom filters as the number of the newly added visitors in the first front-end page behavior data.
In an alternative embodiment, the N bloom filters are numbered sequentially from 0 to N-1; the processing module is specifically configured to: acquiring a hash value of each piece of data according to a preset field of the data in the pieces of data; and calculating a remainder for N according to the hash value of each piece of data in the plurality of pieces of data, and taking the data with the same remainder as a group, so that the data of each group is mapped with a bloom filter with the same number as the remainder.
In an alternative embodiment, the number of guests of the counted guests is stored in a database; the processing module is further configured to: taking the counted visitor number and the newly-added visitor number as real-time visitor number statistics; the real-time statistics visitor number is the visitor number counted by the N bloom filters currently; and storing the real-time statistical visitor quantity into the database.
In an alternative embodiment, the database also stores a first position offset of the message middleware; the message middleware is used for storing front-end page behavior data acquired from the webpage; the first position offset is used for indicating the position of second front-end page behavior data which is acquired from the message middleware in the message middleware; the acquisition module is specifically configured to: acquiring the first position offset from the database; starting from the data after the first position offset of the message middleware, taking front-end page behavior data acquired from the message middleware in the preset period as the first front-end page behavior data; and updating the first position offset into a second position offset according to the first front-end page behavior data, and storing the second position offset into the database.
In an alternative embodiment, the first position offset is stored in a first data table of the database; the processing module is further configured to: after updating the first position offset to a second position offset, storing the second position offset in a second data table of the database before storing the real-time statistics visitor number in the database; after the real-time statistic visitor number is stored in the database, the second position offset in the second data table is stored in the first data table.
The advantages of the second aspect and the embodiments of the second aspect may be referred to the advantages of the first aspect and the embodiments of the first aspect, and will not be described here again.
In a third aspect, embodiments of the present application provide a computer device, including a program or instructions, which when executed, is configured to perform the method of the first aspect and the embodiments of the first aspect.
In a fourth aspect, embodiments of the present application provide a storage medium including a program or instructions, which when executed, are configured to perform the method of the first aspect and the respective embodiments of the first aspect.
Drawings
Fig. 1 is a schematic step flow diagram of a method for counting the number of web page visitors according to an embodiment of the present application;
Fig. 2 is a schematic block diagram of a method for counting the number of web page visitors according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of the steps of the SparkStreaming-based UV real-time calculation module 204 provided in the embodiment of the present application;
Fig. 4 is a schematic flowchart illustrating specific steps from step 302 to step 305 provided in the embodiment of the present application;
Fig. 5 is a flowchart illustrating steps specifically described in step 404 according to an embodiment of the present disclosure;
Fig. 6 is a flowchart illustrating steps for providing single-slice data according to an embodiment of the present disclosure;
Fig. 7 is a schematic structural diagram of a device for counting the number of web page visitors according to an embodiment of the present application.
Detailed Description
In order to better understand the above technical solutions, the following detailed description will be made with reference to the accompanying drawings and specific embodiments, and it should be understood that specific features in the embodiments and examples of the present application are detailed descriptions of the technical solutions of the present application, and not limit the technical solutions of the present application, and the technical features in the embodiments and examples of the present application may be combined with each other without conflict.
Abbreviations and key terms appearing in the following application are first introduced.
Message middleware: used for storing, publishing and subscribing to massive offline data messages. For example, the message middleware may be Kafka, Flume, or the like.
Big data platform (BDP): a platform for big data storage and analysis. It includes distributed computing frameworks (e.g., Hadoop), data warehouses (e.g., HBase), data warehouse tools (e.g., Hive), message middleware, distributed application coordination services (e.g., ZooKeeper), and the like. The data warehouse is one of the more important components of a big data platform. For example, the data warehouse may be HBase, a highly reliable, high-performance, column-oriented, scalable distributed storage system with which a large-scale structured storage cluster can be built on inexpensive servers. HBase belongs to the Hadoop ecosystem and is a distributed key-value database used for storing and querying massive data.
Independent visitor (UV): a natural person who accesses and browses a web page through the internet. The amount of browsing is not equal to the number of independent visitors: the amount of browsing is how many times the web page is browsed, whereas the number of independent visitors is the number of distinct visitors accessing the web page. A visitor counts as one independent visitor no matter how many times it browses the page. It should be noted that, in the embodiments of the present application, whether a visitor is an independent visitor may be determined according to a preset field; for example, using the Internet protocol (IP) address, all visits to the web page from the same IP address are attributed to one independent visitor.
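The distinction between the amount of browsing and the number of independent visitors can be illustrated with a small sketch. The records and the field name `ip` below are hypothetical; the IP address simply plays the role of the preset field, and an exact Python set is used here instead of a bloom filter.

```python
# Hypothetical click records; "ip" stands in for the preset field.
records = [
    {"ip": "10.0.0.1", "page": "/loan"},
    {"ip": "10.0.0.2", "page": "/loan"},
    {"ip": "10.0.0.1", "page": "/loan"},
    {"ip": "10.0.0.1", "page": "/loan"},
]


def page_views(records):
    """Amount of browsing: every record counts."""
    return len(records)


def unique_visitors(records, field="ip"):
    """Independent visitors: distinct values of the preset field."""
    return len({record[field] for record in records})


pv = page_views(records)
uv = unique_visitors(records)
```

Here four page views come from two distinct IP addresses, so the amount of browsing and the number of independent visitors differ.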
Streaming processing system: used for processing big data. For example, SparkStreaming is a streaming system that performs high-throughput, fault-tolerant processing of real-time data streams. It can perform complex operations such as joins on data from various sources (e.g., Kafka, Flume, Twitter, ZeroMQ, and transmission control protocol (TCP) sockets) and save the results to an external file system, a database, or a real-time dashboard.
Bloom filter (BF): a very space-efficient probabilistic data structure that uses a bit array to represent a set compactly and to determine whether an element belongs to the set. It is a fast probabilistic method for testing whether an element exists in a set. Its advantage is that its space efficiency and query time far exceed those of ordinary algorithms; its disadvantage is a certain false recognition (false positive) rate.
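As a rough illustration of the structure described above, the following is a minimal bloom filter sketch. It is not the implementation used in the embodiments: the class name, the use of SHA-256, and the double-hashing scheme for deriving bit positions are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: a bit array plus k derived hash positions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: derive k positions from one SHA-256 digest by
        # double hashing (h1 + i * h2) mod num_bits.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add("10.0.0.1")
```

A membership test on an added item always succeeds, while a test on an item that was never added may occasionally succeed as well, which is the false recognition rate mentioned above.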
Front-end behavior data toolkit (may be denoted wa-sdk): a software development kit for collecting and reporting front-end data. It actively collects the user's front-end operation behavior, for example by means of data tracking points embedded in the page, including the user identifier (may be denoted openid), the user's IP address, data (code) stored on the user's local terminal, code error information, click logs, and the like.
Offset (may be denoted offset): used herein to mark the position of consumed data in Kafka, so that the next consumption can continue from that position.
The parameters involved in the examples of the present application are shown in Table 1:
Parameter            Description
batchDuration        Time interval for SparkStreaming batch submission
bfNumElements        Expected data volume of a single BF (parameter for constructing the BF)
bfFalsePosProb       Error rate of a single BF (parameter for constructing the BF)
bfnumPartition       Number of data fragments
bfnumBfPartition     Number of BF fragments
pvUVTableName        Result table of PV and UV in HBase
bfTableName          Table in HBase for storing the BFs
offsetsTableName     Final table in HBase for saving the Kafka offset
offsetTmpTableName   Intermediate table in HBase for saving the Kafka offset
TABLE 1
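The text does not say how bfNumElements and bfFalsePosProb are turned into a concrete bloom filter. One common choice, shown here only as an assumption, is the textbook sizing rule m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The function name below is hypothetical.

```python
import math


def bloom_parameters(bf_num_elements, bf_false_pos_prob):
    """Textbook sizing of one bloom filter (an assumption, not from the patent).

    Returns (number of bits, number of hash functions) for the expected
    element count and target false positive probability.
    """
    num_bits = math.ceil(
        -bf_num_elements * math.log(bf_false_pos_prob) / (math.log(2) ** 2)
    )
    num_hashes = max(1, round(num_bits / bf_num_elements * math.log(2)))
    return num_bits, num_hashes


num_bits, num_hashes = bloom_parameters(1_000_000, 0.01)
```

With one million expected elements and a 1% error rate this gives a little under 10 bits per element, which shows why the size of each filter grows with the expected data volume.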
In the operation of a financial institution (banking institution, insurance institution or securities institution) in doing business (e.g., loan business, deposit business, etc. of a bank), it is often necessary to count the number of visitors to a financial web page. In the prior art, when the data volume is large, a large bloom filter is needed to count the number of visitors, so that the memory occupation is large. Therefore, in the prior art, when the number of visitors is counted, the memory occupation of the bloom filter is large, and the situation does not meet the requirements of financial institutions such as banks, and the efficient operation of various businesses of the financial institutions cannot be guaranteed.
For this reason, as shown in fig. 1, the embodiment of the application provides a method for counting the number of web page visitors.
Step 101: and acquiring first front-end page behavior data of the webpage in a preset period.
Step 102: m bloom filters of the N bloom filters are determined.
M and N are positive integers.
Step 103: and determining the number of newly-increased visitors in the preset period according to the M bloom filters and the first front-end page behavior data.
In step 101, the first front-end page behavior data is record data generated when visitors click the webpage. It comprises a plurality of pieces of data, each of which comprises a preset field for uniquely identifying a visitor (such as the user's Internet protocol (IP) address), and the values of the preset fields are all within a preset value range.
The specific implementation process of step 101 may be as follows:
The first position offset of the message middleware (e.g., Kafka, Flume) is pre-stored in a database (e.g., HBase). The message middleware is used for storing front-end page behavior data acquired from the webpage. The first position offset indicates the position, within the message middleware, of the second front-end page behavior data, that is, the data that has already been acquired from the message middleware.
The first position offset is obtained from the database. Starting from the data after the first position offset of the message middleware, taking the front-end page behavior data acquired from the message middleware in the preset period as the first front-end page behavior data. And updating the first position offset into a second position offset according to the first front-end page behavior data, and storing the second position offset into the database.
For example, the database is hbase_1, the message middleware is kafka_1, and the preset period is 5 seconds. The hbase_1 stores therein a first positional offset amount offset_1 of kafka_1 in advance. Then starting from the data after offset_1, continuously reading front-end page behavior data from kafka_1 for 5 seconds as first front-end page behavior data, recording the last position of the read front-end page behavior data, namely, the second position offset offset_2, and storing offset_2 into hbase_1.
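The offset handling in this example can be sketched as follows. This is a simplified model under stated assumptions: a Python list stands in for kafka_1, a dict stands in for hbase_1, the offset is taken to be the count of records already consumed, and a fixed batch size replaces the 5-second window so that the sketch is deterministic. All names are hypothetical.

```python
def read_batch(log, database, max_records):
    """Read the next batch after the stored offset and persist the new offset."""
    offset_1 = database.get("offset", 0)           # first position offset
    batch = log[offset_1:offset_1 + max_records]   # data after offset_1
    offset_2 = offset_1 + len(batch)               # second position offset
    database["offset"] = offset_2
    return batch


kafka_1 = ["rec-a", "rec-b", "rec-c", "rec-d", "rec-e"]  # stand-in for the middleware
hbase_1 = {"offset": 2}                                  # stand-in for the database

first = read_batch(kafka_1, hbase_1, max_records=2)
second = read_batch(kafka_1, hbase_1, max_records=2)
third = read_batch(kafka_1, hbase_1, max_records=2)
```

Each call resumes where the previous one stopped, so no record is read twice and an exhausted log simply yields an empty batch.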
In step 102, the N bloom filters are used for recording the counted visitors according to the preset field. A mapping relation is pre-established between the N bloom filters and the preset fields whose values are within the preset value range, and each preset field within the preset value range uniquely corresponds to one bloom filter. The M bloom filters are the bloom filters, among the N bloom filters, to which the plurality of pieces of data are mapped according to the mapping relation.
For example, the preset value range is mapped to 10 bloom filters (i.e., N = 10), each of which is mapped to a sub-range of the preset value range: bloom filter 1 is mapped to preset field values within 0-1000, bloom filter 2 to values within 1001-2000, …, and bloom filter 10 to values within 9001-10000. Suppose the plurality of pieces of data are 8 pieces of data (with values 500, 1500, …, 7500 respectively). According to the mapping relation, these 8 pieces of data are mapped to bloom filters 1 to 8 among the N bloom filters (i.e., M = 8).
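The range-based mapping of this example can be written out as a short sketch. The function name is hypothetical, and the intermediate values between 1500 and 7500 are assumed to continue in steps of 1000, which the text only implies.

```python
def filter_number(value, range_size=1000, num_filters=10):
    """Return the 1-based number of the bloom filter whose sub-range holds value."""
    if not 0 <= value <= range_size * num_filters:
        raise ValueError("value outside the preset value range")
    # 0-1000 -> 1, 1001-2000 -> 2, ..., 9001-10000 -> 10
    return max(1, (value + range_size - 1) // range_size)


# Assumed batch: 500, 1500, ..., 7500 in steps of 1000 (8 values).
values = list(range(500, 8000, 1000))
touched = sorted({filter_number(v) for v in values})
```

Only the filters listed in `touched` need to be consulted for this batch; bloom filters 9 and 10 are never loaded.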
An alternative embodiment of step 102 is as follows:
dividing the plurality of pieces of data into a plurality of groups of data; wherein the data mapped to the same bloom filter in the M bloom filters is a group; for each set of data in the plurality of sets of data, determining the number of newly added visitors of the set of data according to bloom filters mapped by the set of data in the N bloom filters; and taking the sum of the number of the newly added visitors of each bloom filter in the M bloom filters as the number of the newly added visitors in the first front-end page behavior data.
Continuing the example, each of the 8 pieces of data forms its own group, and each group corresponds to one bloom filter. For each piece of data, whether its preset field has already been recorded can be determined through the bloom filter to which the piece of data is mapped: if the preset field is not stored in that bloom filter, the data is regarded as data of a newly added visitor, and the count of newly added visitors is increased by 1. The numbers of newly added visitors of all groups are then added to obtain the number of newly added visitors in the first front-end page behavior data.
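The per-group counting can be sketched as below. To keep the result exact, Python sets stand in for the bloom filters, which is an assumption of this example. The sketch also records each new field in its filter after counting it, a step the text implies but does not spell out. The function name and sample values are hypothetical.

```python
def count_new_visitors(groups, filters):
    """Sum the newly added visitors over all groups.

    groups:  filter number -> preset field values mapped to that filter
    filters: filter number -> set standing in for that bloom filter
    """
    total = 0
    for filter_id, values in groups.items():
        seen = filters[filter_id]
        new_in_group = 0
        for value in values:
            if value not in seen:
                seen.add(value)       # record the visitor so it is counted once
                new_in_group += 1
        total += new_in_group
    return total


filters = {1: {"ip-a"}, 2: set(), 3: {"ip-z"}}
groups = {1: ["ip-a", "ip-b", "ip-b"], 2: ["ip-c"]}

first_pass = count_new_visitors(groups, filters)
second_pass = count_new_visitors(groups, filters)
```

Filter 3 is never touched because no data in the batch maps to it, and replaying the same batch adds no further visitors.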
In the above alternative embodiment, the plurality of pieces of data may be divided into a plurality of groups of data in the following manner:
the N bloom filters are numbered in sequence from 0 to N-1 in advance. Acquiring a hash value of each piece of data according to a preset field of the data in the pieces of data; and calculating a remainder for N according to the hash value of each piece of data in the plurality of pieces of data, and taking the data with the same remainder as a group, so that the data of each group is mapped with a bloom filter with the same number as the remainder.
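The hash-and-remainder grouping can be sketched as follows. The choice of MD5 as the hash function, the field name `ip`, and the function names are assumptions of this example; the text does not name a hash function. A digest from `hashlib` is used because Python's built-in `hash` of strings varies between processes.

```python
import hashlib
from collections import defaultdict


def filter_index(field_value, num_filters):
    """Hash the preset field and take the remainder modulo N (0 to N-1)."""
    digest = hashlib.md5(field_value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_filters


def group_by_filter(records, num_filters, field="ip"):
    """Group records so that each group shares one bloom filter number."""
    groups = defaultdict(list)
    for record in records:
        groups[filter_index(record[field], num_filters)].append(record)
    return dict(groups)


records = [{"ip": f"10.0.0.{i % 5}"} for i in range(20)]  # hypothetical batch
groups = group_by_filter(records, num_filters=4)
```

All records with the same preset field value land in the same group, so each group can be checked against exactly one bloom filter.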
In step 103, the newly added number of visitors is the number of visitors not counted by the N bloom filters.
In the method of steps 101 to 103, the number of counted visitors may be stored in a database, so that it is stored persistently. After step 103, the number of counted visitors may be updated in the following manner:
Taking the sum of the number of counted visitors and the number of newly added visitors as the real-time statistical visitor number; the real-time statistical visitor number is the number of visitors currently counted by the N bloom filters; and storing the real-time statistical visitor number in the database.
With this storage manner, the updated real-time statistical visitor number is persisted in the database, which prevents it from being lost when a computer failure wipes the data held in memory.
In the process of updating the first position offset to the second position offset in step 101, the first position offset and the second position offset may be stored in the database according to the following manner:
The first position offset is initially stored in a first data table of the database; after the first position offset is updated to the second position offset, the second position offset may be stored in a second data table of the database; and after the real-time statistical visitor number is stored in the database, the second position offset in the second data table is stored in the first data table, where it overwrites the first position offset.
For example, offset_1 (the first position offset) is stored in data table A (the first data table), and after offset_1 is updated to offset_2 (the second position offset), offset_2 is stored in data table B (the second data table). If a fault occurs before the real-time statistical visitor number is stored in the database, the first position offset in the first data table has not been overwritten and can be read again. Once the real-time statistical visitor number has been stored in the database, the visitor statistics have been updated and the first position offset is no longer needed, so the second position offset overwrites it.
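The two-table offset handling can be illustrated with a small simulation. This is a sketch under stated assumptions: the log is a Python list, the two data tables and the database are plain in-memory objects, an exact set stands in for the bloom filters, and all names (`OffsetStore`, `run_batch`, `count_new`) are hypothetical.

```python
class OffsetStore:
    """In-memory stand-in for the two offset tables (hypothetical)."""

    def __init__(self, start=0):
        self.table_a = start  # first data table: last committed offset
        self.table_b = None   # second data table: staged (pending) offset


def run_batch(log, store, db, batch_size, count_new, fail=False):
    first = store.table_a                      # read the first position offset
    batch = log[first:first + batch_size]      # consume data from that offset
    store.table_b = first + len(batch)         # stage the second position offset
    if fail:
        raise RuntimeError("simulated failure during calculation")
    new_visitors = count_new(batch)
    db["uv"] = db.get("uv", 0) + new_visitors  # persist the real-time count
    store.table_a = store.table_b              # overwrite first offset with second


seen = set()  # exact set used in place of bloom filters, for illustration only


def count_new(batch):
    new = [v for v in dict.fromkeys(batch) if v not in seen]
    seen.update(new)
    return len(new)


log = ["u1", "u2", "u1", "u3"]
store, db = OffsetStore(), {}
try:
    run_batch(log, store, db, 2, count_new, fail=True)
except RuntimeError:
    pass
offset_after_failure = store.table_a     # still 0: the batch will be re-read
run_batch(log, store, db, 2, count_new)  # retries the first batch
run_batch(log, store, db, 2, count_new)  # processes the second batch
```

After the simulated failure the committed offset is unchanged, so the retry consumes the same data again and nothing is skipped; the committed offset only advances once the count has been persisted.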
The method for counting the number of web page visitors provided in the embodiment of the present application is described below by a specific example.
As shown in fig. 2, the whole flow of the web page visitor number statistical method can be performed by the following five modules: a data reporting module 201, a data acquisition access module 202, a message middleware 203, a UV real-time calculation module 204 and a UV query module 205.
In the first step, the data reporting module 201 obtains front-end behavior data of the page to be counted. The data may be acquired by the method described in step 101 or by any of its alternative methods.
Specifically, the data reporting module 201 may collect front-end behavior data through a tool package (wa-sdk) embedded in a front-end page or a terminal page of an access party, and report the front-end behavior data collected through wa-sdk to the data collection access module 202.
It should be noted that the fields reported by wa-sdk can be defined in one-to-one correspondence with the field formats in the database, so that the fields can be stored directly in the database without parsing. For security, an encryption algorithm may be added: the whole data is encrypted before reporting and decrypted with the agreed secret key after it is received.
In the second step, the data acquisition access module 202 receives the front end behavior data reported by the data reporting module 201, and forwards the front end behavior data to the message middleware 203.
In the third step, the message middleware 203 receives the front-end behavior data forwarded by the data acquisition access module 202 and stores it.
In the fourth step, the UV real-time calculation module 204 takes the front-end behavior data from the message middleware 203, calculates the UV value, and stores the UV value in the database.
The UV real-time calculation module 204 may calculate the UV value once every preset batch duration. The UV value stored in the fourth step is kept not in memory but in the database, that is, in persistent storage. The timeliness of the calculation depends on the batch interval parameter (batch duration), which is typically 5 to 30 seconds; the specific setting depends on the data volume and on the computing resources available to each batch.
In the fifth step, the UV query module 205 provides an interface for querying the UV value counted in real time.
In the fourth step, the UV real-time calculation uses the SparkStreaming real-time computing framework and performs deduplicated statistics with sliced bloom filters to obtain the UV value.
The parameters involved include: batchDuration (the interval at which SparkStreaming submits batches), bfNumElements (the expected data amount of a single BF, a parameter for constructing a BF), bfFalsePosProb (the error rate of a single BF, a parameter for constructing a BF), and bfNumPartition (the number of data slices). As shown in fig. 3, the main flow of the SparkStreaming-based UV real-time calculation module 204 is as follows:
Step 301: initialize the parameters.
Step 302: submit a batch task once every batch duration (the preset period) and consume data from Kafka.
Step 303: calculate UV1 of the batch data based on the sliced BFs.
Step 304: add UV1 (the number of newly added visitors) of the batch data to the previously calculated UV2 (the number of counted visitors) to obtain the total UV3 (the real-time statistical visitor number) up to this batch.
Step 305: persist UV3 to the database.
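The two BF construction parameters initialized in step 301 (expected data amount and error rate of a single BF) determine how large each filter must be. The source does not say which sizing rule its BF library applies; the sketch below uses the standard textbook formulas, m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, and the function name `bloom_parameters` is hypothetical.

```python
import math


def bloom_parameters(num_elements: int, false_pos_prob: float):
    """Return (bits, hash_count) for a standard bloom filter.

    num_elements   -- expected number of distinct items in one filter
    false_pos_prob -- target false positive probability of one filter
    """
    bits = math.ceil(-num_elements * math.log(false_pos_prob) / (math.log(2) ** 2))
    hashes = max(1, round(bits / num_elements * math.log(2)))
    return bits, hashes


# Example values chosen for illustration only.
bits, hashes = bloom_parameters(1_000_000, 0.01)
megabytes = bits / 8 / 1_000_000
```

With one million expected elements and a 1% error rate, a single filter needs about 9.6 million bits (roughly 1.2 MB) and 7 hash functions, which gives a feel for the memory saved by loading only the filter a given slice needs.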
As shown in fig. 4, the specific process of steps 302 to 305 is as follows:
Step 401: read the position offset (first position offset) reached by the previous Kafka consumption from the offsettstablename table of HBase.
Step 402: continue consuming data from Kafka into memory, starting from the first position offset.
Step 403: record the latest offset (second position offset) reached by the data consumed from Kafka in step 402 (the first front-end page behavior data) in the offsetTmpTableName table of HBase.
Step 404: calculate in real time the UV generated by the batch (the number of newly added visitors) using the sliced BFs.
Step 405: add UV1 (the number of newly added visitors) of the batch data to the previously calculated UV2 (the number of counted visitors) to obtain the total UV3 (the real-time statistical visitor number) up to this batch.
It should be noted that the value from the previous batch can be looked up in the pvUVTableName table of HBase in order to calculate UV3.
Step 406: persist UV3 to the database.
Specifically, UV3 is written to the pvUVTableName table of HBase for use by the next batch. After step 406, UV3 has been written to the database and can be queried through the query interface.
Step 407: write the position offset (second position offset) in the offsetTmpTableName table (second data table) of HBase to the offsettstablename table (first data table).
In steps 401 to 407, Kafka data is consumed starting from the offset recorded in the offsettstablename table, and after the data has been consumed, the latest offset is first buffered in offsetTmpTableName. After the UV calculation of the batch is completed and persisted to the database, the latest offset in the offsetTmpTableName table is written back to offsettstablename. This prevents a situation in which a problem during calculation, a task failure, or a restart leaves Kafka data consumed but not counted in the UV.
The method shown in fig. 4 implements a low-cost, SparkStreaming-based real-time UV statistical method that computes in real time with sliced BFs and persists the BFs in HBase. One very large memory-resident BF is replaced by bfNumPartition small BFs, which are persisted in HBase; the corresponding BF is loaded into memory only when a judgment is needed, so that the memory consumption of a single BF is greatly reduced. In addition, managing the Kafka offset in HBase avoids repeated calculation of data when a restart or failure occurs. Depending on the actual data volume, the configuration parameters can be tuned to trade off among calculation speed, memory consumption and accuracy.
As shown in fig. 5, the specific process of step 404 may be as follows:
Step 501: slice the data of one batch.
Specifically, the data is divided into bfNumPartition slices (that is, the plurality of pieces of data are divided into a plurality of groups of data, and each slice is one group). For each piece of data, the preset field of the data is obtained (for example, the preset field for UV statistics may be an IP address), a hash value of the preset field is computed, and the remainder of the hash value modulo bfNumPartition is the slice number. Each slice contains a plurality of pieces of data.
Step 502: calculate the UV of the data inside each slice. Specifically, the newly added UV of a slice is calculated based on the BF of that slice. The slices calculate their newly added UV in parallel.
After step 502, the newly added UV of all slices is added up to give the newly added UV of the batch of data.
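Steps 501 and 502 can be sketched as a parallel map over slices followed by a sum. This is an illustration only: a thread pool stands in for the parallel execution of the real-time computing framework, an exact set per slice stands in for that slice's BF, CRC32 is an assumed hash, and all function names are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def slice_number(field: str, num_partitions: int) -> int:
    # Assumption: CRC32 as the hash; any stable hash would serve.
    return zlib.crc32(field.encode("utf-8")) % num_partitions


def new_uv_in_slice(fields, seen):
    """Count the fields not seen before in this slice and remember them."""
    new = 0
    for field in fields:
        if field not in seen:
            seen.add(field)
            new += 1
    return new


def new_uv_in_batch(batch_fields, seen_by_slice):
    n = len(seen_by_slice)
    slices = [[] for _ in range(n)]
    for field in batch_fields:                  # step 501: slice the batch
        slices[slice_number(field, n)].append(field)
    with ThreadPoolExecutor() as pool:          # step 502: slices in parallel
        return sum(pool.map(new_uv_in_slice, slices, seen_by_slice))


seen_by_slice = [set() for _ in range(4)]       # one stand-in "BF" per slice
first = new_uv_in_batch(["1.1.1.1", "2.2.2.2", "1.1.1.1"], seen_by_slice)
second = new_uv_in_batch(["2.2.2.2", "3.3.3.3"], seen_by_slice)
```

Each slice touches only its own state, so the slices can run concurrently without coordination; here the first batch yields 2 new visitors and the second yields 1.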
Steps 501 to 502 describe how the batch data is sliced; each slice then calculates its newly added UV by the sliced-BF method. Calculating UV requires deduplication against historical data. The idea of the sliced BF is to maintain bfNumPartition BFs persisted in an HBase table: to judge whether a piece of data has appeared before, the BF corresponding to the data is first located according to a certain condition, and the data is then looked up in that BF. As shown in fig. 6, the deduplication method for the data of a single slice is as follows:
Step 601: process each piece of data of the slice in a loop.
Step 602: for each piece of data, obtain the preset field of the data, and take the remainder of the hash value of the preset field modulo bfNumPartition.
Step 603: acquire the bloom filter corresponding to the piece of data.
Specifically, the bfTableName table of HBase is queried for the corresponding BF; if none is found, a new BF is constructed with the parameters bfFalsePosProb and bfNumElements; if one is found, the BF that was found is used.
Step 604: according to the bloom filter from step 603, determine whether the preset field of the piece of data has appeared before.
If it has appeared, the piece of data is not counted in the newly added UV, that is, its newly added UV is 0; if it has not appeared, the piece of data is counted in the newly added UV and its preset field is added to the bloom filter from step 603.
Step 605: store the BF updated in step 604 to HBase.
That is, according to the value obtained in step 602, the BF is written to the bfTableName table of HBase so that it can be used to deduplicate subsequent data.
Step 606: add up the newly added UV of each piece of data in the slice to obtain the total newly added UV of the slice.
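Steps 601 to 606 can be sketched with a tiny hand-written bloom filter. Everything here is an assumption made for illustration: the filter size and hash scheme are arbitrary, a Python dict stands in for the bfTableName table, the slice number is passed in rather than recomputed per record, and the filter is written back once per slice rather than after every update.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)


def dedup_slice(slice_id, fields, bf_table):
    bf = bf_table.get(slice_id)      # step 603: look up the slice's BF
    if bf is None:
        bf = BloomFilter()           # not found: construct a new BF
    new_uv = 0
    for field in fields:             # steps 601/602: loop over preset fields
        if field not in bf:          # step 604: has this field appeared?
            bf.add(field)            # no: record it and count it
            new_uv += 1
    bf_table[slice_id] = bf          # step 605: persist the updated BF
    return new_uv                    # step 606: new UV of the slice


bf_table = {}                        # stand-in for the persisted BF table
first = dedup_slice(0, ["a", "b", "a"], bf_table)
second = dedup_slice(0, ["b", "c"], bf_table)
```

A bloom filter never misses an item it has stored, but it can report a false positive, in which case a genuinely new visitor is not counted; the error-rate parameter bounds how often that happens.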
The present application provides a web page visitor quantity counting device, including: an acquiring module 701, configured to acquire first front-end page behavior data of a web page in a preset period; the first front-end page behavior data are record data generated when a visitor clicks the web page; the first front-end page behavior data comprise a plurality of pieces of data, each piece of data comprises a preset field for uniquely identifying a visitor, and the values of the preset fields are all within a preset value range; and a processing module 702, configured to determine M bloom filters of N bloom filters; M and N are positive integers; the N bloom filters are used for recording the counted visitors according to the preset field; a mapping relation is pre-established between the N bloom filters and the preset fields whose values are within the preset value range; wherein each preset field in the preset value range uniquely corresponds to one bloom filter; the M bloom filters are the bloom filters, among the N bloom filters, to which the plurality of pieces of data are mapped according to the mapping relation; the processing module 702 is further configured to determine the number of newly added visitors in the preset period according to the M bloom filters and the first front-end page behavior data; the number of newly added visitors is the number of visitors not yet counted by the N bloom filters.
In an alternative embodiment, the processing module 702 is specifically configured to: dividing the plurality of pieces of data into a plurality of groups of data; wherein the data mapped to the same bloom filter in the M bloom filters is a group; for each set of data in the plurality of sets of data, determining the number of newly added visitors of the set of data according to bloom filters mapped by the set of data in the N bloom filters; and taking the sum of the number of the newly added visitors of each bloom filter in the M bloom filters as the number of the newly added visitors in the first front-end page behavior data.
In an alternative embodiment, the N bloom filters are numbered sequentially from 0 to N-1; the processing module 702 is specifically configured to: acquiring a hash value of each piece of data according to a preset field of the data in the pieces of data; and calculating a remainder for N according to the hash value of each piece of data in the plurality of pieces of data, and taking the data with the same remainder as a group, so that the data of each group is mapped with a bloom filter with the same number as the remainder.
In an alternative embodiment, the number of counted visitors is stored in a database; the processing module 702 is further configured to: take the sum of the number of counted visitors and the number of newly added visitors as the real-time statistical visitor number; the real-time statistical visitor number is the number of visitors currently counted by the N bloom filters; and store the real-time statistical visitor number in the database.
In an alternative embodiment, the database also stores a first position offset of the message middleware; the message middleware is used for storing front-end page behavior data acquired from the web page; the first position offset indicates the position, in the message middleware, of second front-end page behavior data that has already been acquired from the message middleware; the acquiring module 701 is specifically configured to: acquire the first position offset from the database; starting from the data after the first position offset of the message middleware, take the front-end page behavior data acquired from the message middleware in the preset period as the first front-end page behavior data; and update the first position offset to a second position offset according to the first front-end page behavior data, and store the second position offset in the database.
In an alternative embodiment, the first position offset is stored in a first data table of the database; the processing module 702 is further configured to: after updating the first position offset to a second position offset, storing the second position offset in a second data table of the database before storing the real-time statistics visitor number in the database; after the real-time statistic visitor number is stored in the database, the second position offset in the second data table is stored in the first data table.
An embodiment of the present application provides a computer device comprising a program or instructions which, when executed, perform the web page visitor quantity counting method and any optional method provided by the embodiments of the present application.
An embodiment of the present application provides a storage medium comprising a program or instructions which, when executed, perform the web page visitor quantity counting method and any optional method provided by the embodiments of the present application.
Finally, it should be noted that: it will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (10)

1. A method for counting the number of web page visitors, comprising the steps of:
acquiring first front-end page behavior data of a webpage in a preset period; the first front-end page behavior data are record data generated when a visitor clicks the webpage; the first front-end page behavior data comprises a plurality of pieces of data, each piece of data comprises a preset field for uniquely identifying a visitor, and the values of the preset fields are all in a preset value range;
determining M bloom filters of N bloom filters; M and N are positive integers; the N bloom filters are used for recording the counted visitors according to the preset field; a mapping relation is pre-established between the N bloom filters and the preset fields whose values are within the preset value range; wherein each preset field in the preset value range uniquely corresponds to one bloom filter; the M bloom filters are the bloom filters, among the N bloom filters, to which the plurality of pieces of data are mapped according to the mapping relation;
dividing the plurality of pieces of data into a plurality of groups of data; wherein the data mapped to the same bloom filter in the M bloom filters is a group; for each set of data in the plurality of sets of data, determining the number of newly added visitors of the set of data according to bloom filters mapped by the set of data in the N bloom filters; taking the sum of the number of newly added visitors of each bloom filter in the M bloom filters as the number of newly added visitors in the first front-end page behavior data; the newly added visitor number is the visitor number which is not counted by the N bloom filters.
2. The method of claim 1, wherein the N bloom filters are numbered sequentially from 0 to N-1; the dividing the plurality of pieces of data into a plurality of sets of data includes:
acquiring a hash value of each piece of data according to a preset field of the data in the pieces of data;
and calculating a remainder for N according to the hash value of each piece of data in the plurality of pieces of data, and taking the data with the same remainder as a group, so that the data of each group is mapped with a bloom filter with the same number as the remainder.
3. The method of claim 1 or 2, wherein the number of counted visitors is stored in a database; after the sum of the number of the newly added visitors of each bloom filter in the M bloom filters is used as the number of the newly added visitors in the first front-end page behavior data, the method further comprises:
taking the sum of the number of counted visitors and the number of newly added visitors as the real-time statistical visitor number; the real-time statistical visitor number is the number of visitors currently counted by the N bloom filters; and storing the real-time statistical visitor number in the database.
4. The method of claim 3, wherein the database further stores a first position offset for message middleware; the message middleware is used for storing front-end page behavior data acquired from the webpage; the first position offset is used for indicating the position of second front-end page behavior data which is acquired from the message middleware in the message middleware; the acquiring the first front-end page behavior data of the webpage in the preset period comprises the following steps:
acquiring the first position offset from the database;
starting from the data after the first position offset of the message middleware, taking front-end page behavior data acquired from the message middleware in the preset period as the first front-end page behavior data; and updating the first position offset into a second position offset according to the first front-end page behavior data, and storing the second position offset into the database.
5. The method of claim 4, wherein the first positional offset is stored in a first data table of the database; after the updating the first position offset to the second position offset, before the storing the real-time statistics visitor number in the database, the method further includes:
storing the second position offset in a second data table of the database;
after the real-time statistics visitor number is stored in the database, the method further comprises:
the second position offset in the second data table is stored in the first data table.
6. A web page visitor quantity counting device, characterized by comprising:
the acquisition module is used for acquiring first front-end page behavior data of the webpage in a preset period; the first front-end page behavior data are record data generated when a visitor clicks the webpage; the first front-end page behavior data comprises a plurality of pieces of data, each piece of data comprises a preset field for uniquely identifying a visitor, and the values of the preset fields are all in a preset value range;
a processing module for determining M bloom filters of N bloom filters; M and N are positive integers; the N bloom filters are used for recording the counted visitors according to the preset field; a mapping relation is pre-established between the N bloom filters and the preset fields whose values are within the preset value range; wherein each preset field in the preset value range uniquely corresponds to one bloom filter; the M bloom filters are the bloom filters, among the N bloom filters, to which the plurality of pieces of data are mapped according to the mapping relation; and for dividing the plurality of pieces of data into a plurality of sets of data; wherein the data mapped to the same bloom filter in the M bloom filters is a group; for each set of data in the plurality of sets of data, determining the number of newly added visitors of the set of data according to the bloom filter to which the set of data is mapped in the N bloom filters; taking the sum of the number of newly added visitors of each bloom filter in the M bloom filters as the number of newly added visitors in the first front-end page behavior data; the number of newly added visitors is the number of visitors not yet counted by the N bloom filters.
7. The apparatus of claim 6, wherein the N bloom filters are numbered sequentially from 0 to N-1; the processing module is specifically configured to:
acquiring a hash value of each piece of data according to a preset field of the data in the pieces of data;
and calculating a remainder for N according to the hash value of each piece of data in the plurality of pieces of data, and taking the data with the same remainder as a group, so that the data of each group is mapped with a bloom filter with the same number as the remainder.
8. The apparatus of claim 6 or 7, wherein the number of counted visitors is stored in a database; the processing module is further configured to:
take the sum of the number of counted visitors and the number of newly added visitors as the real-time statistical visitor number; the real-time statistical visitor number is the number of visitors currently counted by the N bloom filters; and store the real-time statistical visitor number in the database.
9. A computer device comprising a program or instructions which, when executed, performs the method of any of claims 1 to 5.
10. A storage medium comprising a program or instructions which, when executed, perform the method of any one of claims 1 to 5.
CN201911044278.8A 2019-10-30 2019-10-30 Webpage visitor quantity counting method and device Active CN110851758B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911044278.8A CN110851758B (en) 2019-10-30 2019-10-30 Webpage visitor quantity counting method and device
PCT/CN2020/121112 WO2021082936A1 (en) 2019-10-30 2020-10-15 Method and apparatus for counting number of webpage visitors


Publications (2)

Publication Number Publication Date
CN110851758A CN110851758A (en) 2020-02-28
CN110851758B true CN110851758B (en) 2024-02-06

Family

ID=69598950


Country Status (2)

Country Link
CN (1) CN110851758B (en)
WO (1) WO2021082936A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110851758B (en) * 2019-10-30 2024-02-06 深圳前海微众银行股份有限公司 Webpage visitor quantity counting method and device
CN113486025B (en) * 2021-07-28 2023-07-25 北京腾云天下科技有限公司 Data storage method, data query method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9436751B1 (en) * 2013-12-18 2016-09-06 Google Inc. System and method for live migration of guest
CN106598984A (en) * 2015-10-16 2017-04-26 北京国双科技有限公司 Data processing method and device of web crawler
CN108900619A (en) * 2018-07-06 2018-11-27 阿里巴巴集团控股有限公司 A kind of independent Statistics of accessing population method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101383034B (en) * 2008-09-18 2016-05-18 腾讯科技(深圳)有限公司 The method and system of a kind of advertistics and input
US8990243B2 (en) * 2011-11-23 2015-03-24 Red Hat, Inc. Determining data location in a distributed data store
CN102761627B (en) * 2012-06-27 2015-12-09 北京奇虎科技有限公司 Based on cloud network address recommend method and system and the relevant device of terminal access statistics
CN105577455A (en) * 2016-03-07 2016-05-11 达而观信息科技(上海)有限公司 Method and system for performing real-time UV statistic of massive logs
CN110851758B (en) * 2019-10-30 2024-02-06 深圳前海微众银行股份有限公司 Webpage visitor quantity counting method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
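Web page deduplication using an improved bloom filter algorithm on Hadoop; Huang Weijian; Yang Hailong; Computer Engineering & Science (Issue 02); full text *
Design and implementation of a web crawler system for virtual reality content; Wen Tianle; China High-Tech (Issue 07); full text *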

Also Published As

Publication number Publication date
CN110851758A (en) 2020-02-28
WO2021082936A1 (en) 2021-05-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant