CN115525652A

CN115525652A - User access data processing method and device

Info

Publication number: CN115525652A
Application number: CN202211163989.9A
Authority: CN
Inventors: 陆志君
Original assignee: Shanghai Bilibili Technology Co Ltd
Current assignee: Shanghai Bilibili Technology Co Ltd
Priority date: 2022-09-23
Filing date: 2022-09-23
Publication date: 2022-12-27

Abstract

The embodiment of the application provides a user access data processing method and device, wherein the user access data processing method comprises the following steps: the method comprises the steps of obtaining user access data of a target service in a target dimension, compressing access time information contained in the user access data according to a data structure corresponding to a target compression bitmap, generating and storing a corresponding compression result, constructing a data storage model corresponding to the access time information based on a preset data storage model, the user access data and the compression result, and determining an index value statistical result corresponding to a target statistical index in the target dimension according to the compression result and the data storage model.

Description

User access data processing method and device

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to a user access data processing method. One or more embodiments of the present application are also directed to a user access data processing apparatus, a computing device, and a computer readable storage medium.

Background

In fields such as user behavior analysis and artificial intelligence learning, tag data must be used and stored. For example, in user behavior analysis, data that marks the gender, age, city, recent active time and the like of a user portrait are statistical tags; data defined by a rule, such as the number of transactions in the last 30 days being greater than or equal to 2, are rule tags; and data generated through data mining in artificial intelligence learning, such as a user's degree of preference for a certain commodity inferred from the user's consumption habits, are machine learning mining tags.

Storing such tag data requires a large amount of storage space. If the tag data is stored with a technique that saves storage space, the occupied space is reduced, so that less storage is required for the same performance; when the amount of tag data is large, the cost saving is very significant. Therefore, an effective method is needed to solve this problem.

Disclosure of Invention

In view of this, the embodiment of the present application provides a method for processing user access data. One or more embodiments of the present application also relate to a user access data processing apparatus, a computing device, and a computer readable storage medium, so as to solve the technical defect that, in the prior art, when user behavior analysis is performed, a large amount of storage space is consumed for storing a large amount of tag data, and occupation of the storage space cannot be reduced.

According to a first aspect of the embodiments of the present application, there is provided a method for processing user access data, including:

acquiring user access data of a target service in a target dimension;

compressing the access time information contained in the user access data according to a data structure corresponding to a target compression bitmap, generating and storing a corresponding compression result;

constructing a data storage model corresponding to the access time information based on a preset data storage model, the user access data and the compression result;

and determining an index value statistical result corresponding to the target statistical index under the target dimension according to the compression result and the data storage model.

According to a second aspect of embodiments of the present application, there is provided a user access data processing apparatus, including:

the acquisition module is configured to acquire user access data of the target service in a target dimension;

the compression module is configured to compress the access time information contained in the user access data according to a data structure corresponding to a target compression bitmap, generate a corresponding compression result and store the compression result;

the construction module is configured to construct a data storage model corresponding to the access time information based on a preset data storage model, the user access data and the compression result;

and the determining module is configured to determine an index value statistical result corresponding to the target statistical index under the target dimension according to the compression result and the data storage model.

According to a third aspect of embodiments herein, there is provided a computing device comprising:

a memory and a processor;

the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions, wherein the processor implements the steps of the user access data processing method when executing the computer-executable instructions.

According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the user access data processing method.

An embodiment of the application realizes a user access data processing method and a device, wherein the user access data processing method comprises the steps of obtaining user access data of a target service in a target dimension, compressing access time information contained in the user access data according to a data structure corresponding to a target compression bitmap, generating and storing a corresponding compression result, constructing a data storage model corresponding to the access time information based on a preset data storage model, the user access data and the compression result, and determining an index value statistical result corresponding to a target statistical index in the target dimension according to the compression result and the data storage model.

According to the embodiment of the application, a series of statistical indexes (access tag data) corresponding to any behaviors can be covered through one data storage model by utilizing the marking characteristic of a target compression bitmap, so that when user behavior analysis is carried out, the storage technology is utilized to store the related access tag data, the occupation of storage space can be reduced, and the data storage cost is favorably reduced.

Drawings

FIG. 1 is an architecture diagram of a method for processing user access data provided by an embodiment of the present application;

FIG. 2 is a flowchart of a method for processing user access data according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a user access data processing procedure according to an embodiment of the present application;

FIG. 4 is a flowchart illustrating a processing procedure of a method for processing user access data according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a user access data processing apparatus according to an embodiment of the present application;

FIG. 6 is a block diagram of a computing device according to an embodiment of the present application.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from its spirit; the application is therefore not limited to the specific implementations disclosed below.

The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application. The word "if" as used herein may be interpreted as "at ...", "when ..." or "in response to a determination", depending on the context.

First, the terms used in one or more embodiments of the present application are explained.

DAU: Daily Active Users, i.e., the deduplicated active users within one day.

WAU: Weekly Active Users, i.e., the deduplicated active users within one week, which may be a calendar week or a business-defined week.

MTD: Month To Date, i.e., from the first day of the month to today.

Task idempotency: when the same calculation task is executed at different points in time or over different time ranges, its result is unaffected no matter how many times the task is executed.

UDF: User Defined Function.

BitMap: a common data structure. Bitmap indexes are widely used in databases and search engines; they can quickly determine whether a value exists, store data compactly, and significantly speed up queries. However, because the memory occupied by a BitMap still grows linearly with the range of values, the BitMap generally needs to be compressed.

Hive: a Hadoop-based data warehouse tool used for data extraction, transformation and loading; it provides a mechanism for storing, querying and analyzing large-scale data stored in Hadoop. Hive can map structured data files to database tables, provide SQL query capability, and convert SQL statements into MapReduce tasks for execution.

ClickHouse: an open-source, column-oriented database management system (DBMS), CK for short, mainly used for online analytical processing (OLAP) queries; it can generate analytical data reports in real time using SQL queries.

A user generates a series of behavior events on a given device. The underlying detail data can be obtained through user authorization, client-side tracking-point reporting, server-side reporting and similar means, and data warehouse modeling is then used to meet the needs of various analysis scenarios or data products. From a data warehouse perspective, user behavior has seven elements: 1. the user (unique device identifier); 2. the time of occurrence; 3. the environment (e.g., model, brand, page, module); 4. the type of service (e.g., live streaming, game, community); 5. the specific behavior (e.g., exposure, click, comment, like, spending); 6. the number of behaviors; 7. the behavior depth (e.g., exposure duration, click frequency, comment word count, spending amount). The general behavior marking approach records daily incremental detail data and builds a separate model for each specific usage scenario.

Taking user access behavior as an example, data indexes related to user access are used very frequently. Apart from DAU, calculations such as new activation, retention, access frequency, MAU, MTD and WAU are complex and variable, so in the modeling process, in addition to retaining the user access detail data, separate models must be built for different scenarios.

From the perspective of general model construction, this consumes huge resources, scatters behavior indexes of different time spans across models, and is unfriendly to users; a unified user access marking scheme is therefore needed.

In summary, the current processing method has the following defects:

1. For one and the same scenario, the general behavior marking scheme re-reads and recomputes historical data every time the model is built, which wastes a great deal of storage and computation. For example, in MTD calculation, the users who accessed the service from the start of the month to today must be computed every day, so the part from the start of the month to yesterday, which has already been computed, is computed and stored again today.

2. The general behavior marking scheme builds different models for different scenarios and cannot abstract what they have in common, which wastes a great deal of storage and computation. For example, retention over the next N days and access frequency over the next N days are different scenarios with similar calculations, yet the general approach models and stores them separately.

3. The models produced by the general behavior marking scheme are self-dependent, so data cannot be backfilled at an arbitrary point in time; that is, backfills are not guaranteed to be idempotent. For example, new activation calculation requires joining each day's incremental accesses (70 million+) against all access information from history up to yesterday (2 billion+); if the data of any one day in between is wrong, the heavy computation of every subsequent day is wasted.

Against a background of massive data, data must be estimated, computed and staged quickly, which has given rise to a family of data structures designed for massive data. For example, HyperLogLog, BloomFilter and the like can quickly estimate a specified data volume using little storage. These are all probability-based algorithms which, although fast, cannot produce exact counts. BitMap solves this problem and is a data structure that appeared very early in the data field and in search engines; for example, in Java a BitSet can replace a HashSet for exact deduplication of numbers, and setbit and getbit in Redis operate directly on a BitMap. Its underlying implementation is a binary structure that maps directly to 0s and 1s. However, it has two obvious problems.

First, the long[] array inside BitSet is vector-based, that is, it expands dynamically with the largest number stored in the set. The maximum array length is given by ((maxValue - 1) >> 6) + 1, so storing a large value can directly occupy memory at the megabyte (MB) level or above, and an excessively large value range will cause an OOM (for example, long.

Second, taking a BitMap that stores 4 billion values as an example, based on 32-bit unsigned int it consumes roughly 2^32 bit = 2^29 B = 2^9 MB = 512 MB of memory; when the data is sparse, such a large memory space must still be allocated, so storage efficiency is poor.

To address the unsuitability of bitmaps for sparse storage, researchers have proposed various algorithms that compress sparse bitmaps, reducing memory usage and improving efficiency. Representative ones are WAH [1], EWAH [2], Concise [3] and Roaring bitmap [4]. The first three are based on run-length encoding (RLE), and Roaring bitmap can be regarded as an improvement on them. This scheme therefore introduces RBM as the core data structure of an access marking model and provides a unified behavior marking method, which ensures that historical calculation results are reused so that no computation or storage is wasted for the same scenario. In addition, a batch of general calculation functions is preset so that similar calculations share unified processing logic, no computation or storage is wasted across different scenarios, and task idempotency is guaranteed at any point in time.
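The memory figures quoted above for an uncompressed bitmap can be checked with a short calculation. The following is a minimal sketch in Python; the function names are hypothetical, and it assumes the word-count rule of a BitSet-style structure backed by 64-bit words.

```python
WORD_BITS = 64  # a BitSet-style bitmap is backed by an array of 64-bit words


def dense_bitmap_words(nbits: int) -> int:
    """Words needed to address bit positions 0 .. nbits-1."""
    if nbits <= 0:
        return 0
    return ((nbits - 1) >> 6) + 1


def dense_bitmap_bytes(nbits: int) -> int:
    """Bytes occupied by the backing array, ignoring object overhead."""
    return dense_bitmap_words(nbits) * (WORD_BITS // 8)


# The whole 32-bit unsigned range costs 512 MB even if only a few bits are set.
full_range_mb = dense_bitmap_bytes(2 ** 32) // (1024 * 1024)
print(full_range_mb, "MB")
```

The cost depends only on the largest position that must be addressable, not on how many values are actually present, which is why sparse data is stored so inefficiently.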

In the present application, a user access data processing method is provided. One or more embodiments of the present application also relate to a user access data processing apparatus, a computing device, and a computer-readable storage medium, each of which is described in detail in the following embodiments.

Referring to fig. 1, fig. 1 is an architecture diagram illustrating a method for processing user access data according to an embodiment of the present application.

In fig. 1, user access data of a target service in a target dimension is obtained first, access time information included in the user access data is compressed according to a data structure corresponding to a target compressed bitmap (RBM), a corresponding compression result is generated and stored, a data storage model corresponding to the access time information is constructed based on a preset data storage model, the user access data, and the compression result, and an index value statistical result corresponding to a target statistical index in the target dimension is determined according to the compression result and the data storage model, so that a series of statistical indexes (access label data) corresponding to any behavior can be covered through one data storage model by using a marking characteristic of the target compressed bitmap, and thus when user behavior analysis is performed, related access label data is stored by using the storage technology, occupation of a storage space can be reduced, and data storage cost can be reduced.

Referring to fig. 2, fig. 2 is a flowchart illustrating a user access data processing method according to an embodiment of the present application, including the following steps:

Step 202, acquiring user access data of the target service in the target dimension.

Specifically, the target service may be an object recommendation service, such as an advertisement recommendation service or a commodity recommendation service. In the embodiment of the application, object recommendation can be performed by analyzing user access data related to the target service, for example by analyzing the user activity of a certain APP, the number of active users, user retention and the like. The target dimension is a dimension related to the user's access behavior; because users generally access an APP or an object through a smart device, the target dimension may be the device system (iOS, Android, etc.) and the device model of the smart device used. The user access data includes, but is not limited to, the user (unique device identifier), the access time, the environment (e.g., model, brand, page, module), the type of service accessed (e.g., live streaming, game, community), the specific access behavior (e.g., exposure, click, comment, like, spending), the number of behaviors, and the behavior depth (e.g., exposure duration, click frequency, comment word count, spending amount).

In the embodiment of the application, when a target statistical index of the target service is computed from user access data, the related user access data must be stored, and its volume is usually huge. Storing it in a conventional manner would consume a large amount of storage space. The embodiment therefore stores it in a space-saving manner, which reduces the storage space occupied by the user access data and saves storage cost.

Step 204, compressing the access time information contained in the user access data according to a data structure corresponding to the target compression bitmap, generating a corresponding compression result and storing the compression result.

Specifically, the target compression bitmap may be a RoaringBitmap (hereinafter RBM), which currently has two versions, for storing 32-bit and 64-bit integers respectively. According to its storage structure, when storing 32-bit integers, the RBM divides 32-bit unsigned int data into 2^16 buckets (that is, the upper 16 bits of a value are used as the bucket number, so there can be at most 2^16 = 65536 buckets); each bucket has a Container that stores the lower 16 bits of the values, and an RBM is a collection of such Containers.

Based on this, compressing the access time information contained in the user access data according to the data structure corresponding to the target compression bitmap, and generating and storing a corresponding compression result, including:

converting access time information contained in the user access data into a target data type according to a data structure of a target compression bitmap, wherein the target data type comprises 32-bit binary number;

performing data splitting on the access time information of the target data type to generate a first access time identifier and a second access time identifier, wherein the first access time identifier and the second access time identifier respectively comprise 16-bit binary numbers;

and based on the first access time identifier, indexing to obtain a target bucket index number in the target compression bitmap, and storing the second access time identifier to a container corresponding to the target bucket index number.

Specifically, if the target compression bitmap is an RBM and a 32-bit integer is stored according to the storage structure principle, the access time information included in the user access data can be converted into a 32-bit binary number according to the data structure, and then the conversion result can be split to generate a first access time identifier and a second access time identifier.

When 32-bit integers are stored through the RBM, 32-bit unsigned int data is divided into 2^16 buckets (that is, the upper 16 bits, i.e., the first 16 binary digits, of a value are used as the bucket number, so there can be at most 2^16 = 65536 buckets); each bucket has a Container that stores the lower 16 bits, i.e., the last 16 binary digits, of the values, and one RBM is a collection of such Containers.

Therefore, the first access time identifier and the second access time identifier generated by splitting are both 16-bit binary numbers, the first access time identifier is the first 16 bits of the 32-bit binary number, and the second access time identifier is the last 16 bits of the 32-bit binary number.

The corresponding bucket may then be indexed according to the value corresponding to the first 16-bit binary, and then the value corresponding to the last 16-bit binary may be stored in the corresponding Container (Container) of the bucket.
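The splitting and bucket lookup just described can be sketched as follows. This is a minimal illustration in Python rather than the RoaringBitmap library itself; `split_32`, `join_32` and `add_value` are hypothetical names, and a plain dictionary of sets stands in for the buckets and their Containers.

```python
def split_32(value: int) -> tuple[int, int]:
    """Split a 32-bit unsigned value into (high 16 bits, low 16 bits)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 unsigned bits")
    return value >> 16, value & 0xFFFF


def join_32(high: int, low: int) -> int:
    """Inverse of split_32: rebuild the original 32-bit value."""
    return (high << 16) | low


def add_value(buckets: dict[int, set[int]], value: int) -> None:
    """Index the bucket by the high half and store the low half inside it."""
    high, low = split_32(value)
    buckets.setdefault(high, set()).add(low)


buckets: dict[int, set[int]] = {}
for v in (0x00010002, 0x00010003, 0x0002FFFF):
    add_value(buckets, v)
print(buckets)  # two buckets: 1 and 2
```

Values that share the same upper 16 bits land in the same bucket, so only their lower halves need to be stored there.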

Taking the case where the access time information is the date of the access behavior as an example, only the 32-bit version of the RBM is needed. Since the number of distinct access dates within 10 years does not exceed 4000, ArrayContainer is ultimately used for storage. ArrayContainer exists to support sparse storage and stores two-byte values; it holds at most 2^16 / (2 × 8) = 4096 values, its space overhead is (2 + 2c) B, where c is the cardinality, and its lookup time is O(log n).

When storing and inquiring the value, dividing the value k into 16 high bits and 16 low bits, finding the corresponding bucket according to the 16 high bit value, and then storing the 16 low bit value in the corresponding Container.

When a new Container is created and only one element is inserted, the RBM stores it in an ArrayContainer by default. When the number of elements in the ArrayContainer (each element is a short int occupying two bytes, and the elements are kept sorted in ascending order) exceeds 4096 (4096 short ints occupy 8 KB), the ArrayContainer is automatically converted into a BitmapContainer (which always occupies 8 KB) for storage.
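The switch from an array-style container to a bitmap-style container at the 4096-element threshold can be illustrated with a small sketch. The `Container` class below is a hypothetical simplification written for this explanation, not the library's ArrayContainer or BitmapContainer classes; it tracks only the payload size and ignores per-object overhead.

```python
import bisect

ARRAY_MAX = 4096     # above this cardinality the array form is no longer cheaper
BITMAP_WORDS = 1024  # 1024 words x 64 bits = 65536 bits = 8 KB


class Container:
    """Holds the low 16 bits of values that share one bucket (illustrative only)."""

    def __init__(self) -> None:
        self.array: list[int] = []            # sorted ascending while sparse
        self.bitmap: list[int] | None = None  # fixed 8 KB form once dense

    def add(self, low: int) -> None:
        if self.bitmap is not None:
            self.bitmap[low >> 6] |= 1 << (low & 63)
            return
        i = bisect.bisect_left(self.array, low)  # O(log n) binary search
        if i < len(self.array) and self.array[i] == low:
            return                               # already present
        self.array.insert(i, low)
        if len(self.array) > ARRAY_MAX:
            self._to_bitmap()

    def _to_bitmap(self) -> None:
        words = [0] * BITMAP_WORDS
        for v in self.array:
            words[v >> 6] |= 1 << (v & 63)
        self.bitmap, self.array = words, []

    def contains(self, low: int) -> bool:
        if self.bitmap is not None:
            return ((self.bitmap[low >> 6] >> (low & 63)) & 1) == 1
        i = bisect.bisect_left(self.array, low)
        return i < len(self.array) and self.array[i] == low

    def kind(self) -> str:
        return "bitmap" if self.bitmap is not None else "array"

    def payload_bytes(self) -> int:
        # 2 bytes per element in array form, a constant 8 KB in bitmap form
        return BITMAP_WORDS * 8 if self.bitmap is not None else 2 * len(self.array)
```

At exactly 4096 elements the two forms cost the same 8 KB, which is why the conversion happens only once that count is exceeded. A container holding a few thousand access dates stays well below the threshold and therefore remains in array form.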

In specific implementation, compressing the access time information contained in the user access data according to a data structure corresponding to the target compression bitmap, and generating and storing a corresponding compression result, including:

compressing the access time information contained in the user access data according to a data structure corresponding to a target compression bitmap to generate a corresponding compression result;

and encrypting the compression result through a preset encryption algorithm, and storing the encryption result to a data warehouse.

Specifically, the preset encryption algorithm may be the Base64 algorithm, which converts the binary compression result into text.

When storing the access time information, dividing each access time information into high 16 bits and low 16 bits, finding out a corresponding bucket according to the high 16 bit value, and then storing the low 16 bit value in a corresponding Container.

The compression result produced with the target compression bitmap in the embodiment of the application is usually stored in Hive. Because the data type of the compression result differs from the data types of the data stored in Hive, the compression result is processed with the Base64 encryption algorithm before being stored, so that the data type of the encryption result is consistent with that of the data stored in Hive; the generated encryption result can then be stored in Hive.
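The Base64 step that makes a binary compression result storable as text can be sketched with the Python standard library. The serialization used here (sorted values packed as big-endian 32-bit integers) is an assumption made purely for illustration and is not the RoaringBitmap serialization format; the function names are likewise hypothetical.

```python
import base64
import struct


def serialize(values) -> bytes:
    """Stand-in serialization: sorted unique values as big-endian uint32."""
    vals = sorted(set(values))
    return struct.pack(f">{len(vals)}I", *vals)


def to_text_column(values) -> str:
    """Binary result -> Base64 text that fits a string-typed column."""
    return base64.b64encode(serialize(values)).decode("ascii")


def from_text_column(text: str) -> list[int]:
    """Base64 text -> the original values."""
    raw = base64.b64decode(text)
    return list(struct.unpack(f">{len(raw) // 4}I", raw))


encoded = to_text_column([19258, 19000, 19258])
print(encoded, from_text_column(encoded))
```

The transformation is fully reversible, so a query-side function can decode the text and rebuild the bitmap before computing on it.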

Or, the access time information includes an access date;

correspondingly, the compressing the access time information contained in the user access data according to the data structure corresponding to the target compression bitmap includes:

converting the access date contained in the user access data according to the reference time to generate a corresponding conversion result;

and compressing the conversion result according to a data structure corresponding to the target compression bitmap.

Specifically, the reference Time is Coordinated Universal Time (UTC).

When the access time information is compressed and stored, the access time information can be converted into UTC data, then each UTC data is divided into 16 high bits and 16 low bits, a corresponding bucket is found according to the 16 high bits, and then the 16 low bits are stored in a corresponding Container.

In addition, when the access time information is the date of the access behavior, the number of access dates is small, so the 32-bit version of the RBM is usually sufficient, and the dates are stored in an ArrayContainer.

Further, a compression result of the UTC data is obtained, the compression result can be encrypted through a Base64 encryption algorithm, the data type corresponding to the encryption result is consistent with the data type of the data stored in the Hive, and then the generated encryption result can be stored in the Hive.
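The conversion of an access date against the UTC reference, followed by the 16/16 split, might look like the sketch below. It assumes that the conversion result is the Unix timestamp in seconds of midnight UTC on the access date; the text does not specify the exact representation, so this choice and the function names are assumptions.

```python
from datetime import date, datetime, timezone


def date_to_utc_seconds(d: date) -> int:
    """Assumed conversion: seconds since 1970-01-01T00:00:00Z at midnight UTC."""
    return int(datetime(d.year, d.month, d.day, tzinfo=timezone.utc).timestamp())


def split_32(value: int) -> tuple[int, int]:
    """High 16 bits select the bucket, low 16 bits go into its Container."""
    return value >> 16, value & 0xFFFF


ts = date_to_utc_seconds(date(2022, 1, 1))
high, low = split_32(ts)
print(ts, high, low)
```

Under this assumption the value fits in 32 unsigned bits, so the 32-bit version of the bitmap is sufficient. Other encodings, such as a day count from a fixed reference date, would follow the same split.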

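Alternatively, compressing the access time information contained in the user access data according to the data structure corresponding to the target compression bitmap, and generating and storing a corresponding compression result, includes:

compressing the access time information contained in the user access data according to a data structure corresponding to the target compression bitmap to generate a corresponding compression result;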

determining a target position of the access time information in the target compression bitmap according to the compression result;

and adjusting the value corresponding to the target position according to the compression result, and storing the adjusted target compression bitmap into a data warehouse.

Specifically, when the access time information is compressed and stored with the 32-bit version of the RBM, it is compressed according to the data structure corresponding to the target compression bitmap: the access time information is divided into its upper 16 bits and lower 16 bits, where the upper 16 bits determine the position of the bucket and the lower 16 bits determine the specific storage position of the access time information within the bucket, i.e., within its Container. The value at the target position is then adjusted according to the compression result; specifically, the upper 16 bits are stored as a key in short[] keys, and the lower 16 bits are stored as a value in one of the Containers in Container[] values.

After the compression result is stored in the RBM, the compression result can be encrypted through a Base64 encryption algorithm, so that the data type corresponding to the encryption result is consistent with the data type of the data stored in Hive, and then the generated encryption result can be stored in Hive.
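The pairing of a key array with a container array can be sketched as two parallel lists. The `TinyRoaring` class below is a hypothetical teaching model, not the library implementation: `keys` plays the role of short[] keys, and each entry of `values` is a sorted list standing in for a Container.

```python
import bisect


class TinyRoaring:
    """Two parallel arrays: keys[i] is a bucket number, values[i] its container."""

    def __init__(self) -> None:
        self.keys: list[int] = []          # sorted high 16-bit halves
        self.values: list[list[int]] = []  # sorted low 16-bit halves per key

    def add(self, x: int) -> None:
        high, low = x >> 16, x & 0xFFFF
        i = bisect.bisect_left(self.keys, high)
        if i == len(self.keys) or self.keys[i] != high:
            self.keys.insert(i, high)      # new bucket, kept in key order
            self.values.insert(i, [])
        container = self.values[i]
        j = bisect.bisect_left(container, low)
        if j == len(container) or container[j] != low:
            container.insert(j, low)

    def __contains__(self, x: int) -> bool:
        high, low = x >> 16, x & 0xFFFF
        i = bisect.bisect_left(self.keys, high)
        if i == len(self.keys) or self.keys[i] != high:
            return False
        container = self.values[i]
        j = bisect.bisect_left(container, low)
        return j < len(container) and container[j] == low

    def __len__(self) -> int:
        return sum(len(c) for c in self.values)


rbm = TinyRoaring()
for x in (70000, 5, 70001, 5):
    rbm.add(x)
print(rbm.keys, rbm.values)
```

Both lookups are binary searches, first over the keys and then inside the selected container, which keeps membership tests fast without allocating space for absent buckets.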

Step 206, constructing a data storage model corresponding to the access time information based on a preset data storage model, the user access data and the compression result.

Specifically, the preset data storage model may be a data table that is constructed in advance, and the data table may include preset field information, for example, fields included in the data table are: the type, field name, field data type, field meaning and remark, and the corresponding field value can be stored under each field.

The preset data storage model constructed in the embodiment of the application is shown in table 1.

TABLE 1

On the basis of constructing the preset data storage model, the constructing a data storage model corresponding to the access time information based on the preset data storage model, the user access data and the compression result includes:

determining at least two fields contained in a preset data storage model;

determining a first target field corresponding to the user access data, and adding the user access data as a field value to the first target field;

and determining a second target field corresponding to the encryption result, adding the encryption result as a field value to the second target field, and generating a data storage model corresponding to the access time information.

Specifically, as described above, the preset data storage model includes different fields, and the data storage model corresponding to each access time information is constructed based on the preset data storage model, the user access data, and the encryption result of the access time information, specifically, a first target field corresponding to the user access data and a second target field corresponding to the encryption result of the access time information in the preset data storage model are determined according to the field information included in the preset data storage model, the user access data is added to the first target field as the field value of the first target field, and the encryption result of the access time information is added to the second target field as the field value of the second target field, so as to generate the data storage model corresponding to the access time information.
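Filling the preset fields can be sketched as building one record per user. Every field name below is hypothetical, because the actual fields of the preset data storage model are those listed in Table 1; `encoded_result` is assumed to be the already encoded compression result in text form.

```python
# Hypothetical field layout; the real fields are defined by the preset model.
ACCESS_FIELDS = ("user_id", "device_system", "device_model")  # first target fields
RESULT_FIELD = "access_marks"                                 # second target field


def build_record(access: dict, encoded_result: str) -> dict:
    """Copy user access data into its fields, then attach the encoded result."""
    missing = [f for f in ACCESS_FIELDS if f not in access]
    if missing:
        raise KeyError(f"user access data lacks fields: {missing}")
    record = {f: access[f] for f in ACCESS_FIELDS}
    record[RESULT_FIELD] = encoded_result
    return record


row = build_record(
    {"user_id": "device-001", "device_system": "Android", "device_model": "X1"},
    "AAAAAQ==",
)
print(row)
```

Because the access marks travel in a single field, one record can carry a user's whole access history alongside the dimension values.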

In addition, if the access time information is access dates and each access date corresponds to one data storage model, the data storage model may include a first data storage model and a second data storage model, where the first data storage model is constructed based on historical user access data, a compression result of historical access time information of the historical user access data, and the preset data storage model; the second data storage model is constructed on the basis of incremental user access data, a compression result of current access time information of the incremental user access data and the preset data storage model;

accordingly, the method further comprises:

and determining a user identifier contained in the second data storage model, and establishing an association relation between the first data storage model and the second data storage model under the condition that the first data storage model is determined to contain the user identifier.

Specifically, in the case that the access time information is access dates and each access date corresponds to one data storage model, at least one first data storage model and at least one second data storage model can be generated, wherein the first data storage model is constructed based on historical user access data, a compression result of historical access time information of the historical user access data, and the preset data storage model; and the second data storage model is constructed based on incremental user access data, a compression result of current access time information of the incremental user access data, and the preset data storage model.

In addition, if the second data storage model generated based on the incremental user access data and the first data storage model generated based on the historical user access data both include the same user identifier, the first data storage model and the second data storage model may be associated with each other, and based on the association relationship, all the user access data related to the user identifier may be obtained.

In the embodiment of the application, the preset data storage model can be implemented in various ways. At the Hive level, a full join operation needs to be performed between the partition data of T-2 and the incremental data of T-1; if an engine such as Iceberg or Hudi is used, the data can be updated incrementally and directly, which is faster.
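As a minimal sketch of the full join between the T-2 partition and the T-1 increment, the following Python example joins two per-user mappings on the user identifier and unions the access days of matching users. Plain Python sets stand in for the compressed bitmaps, and the function name `full_join_merge` and the sample data are assumptions made for illustration.

```python
from typing import Dict, Set


def full_join_merge(
    history: Dict[str, Set[int]],
    increment: Dict[str, Set[int]],
) -> Dict[str, Set[int]]:
    """Full outer join on user id; access days of matching users are unioned."""
    merged: Dict[str, Set[int]] = {}
    for user_id in history.keys() | increment.keys():
        merged[user_id] = history.get(user_id, set()) | increment.get(user_id, set())
    return merged


# T-2 partition (history) and T-1 increment, as day numbers since 1970-01-01.
history = {"u1": {18262, 18263}, "u2": {18262}}
increment = {"u1": {18265}, "u3": {18265}}
merged = full_join_merge(history, increment)
for user_id in sorted(merged):
    print(user_id, sorted(merged[user_id]))
```

Users present on only one side are kept, which is what distinguishes a full join from an inner join in this setting.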

Step 208, determining an index value statistical result corresponding to the target statistical index in the target dimension according to the compression result and the data storage model.

Specifically, as described above, the target service may be an object recommendation service, such as an advertisement recommendation service or a commodity recommendation service. In the embodiment of the application, object recommendation can be performed by analyzing user access data related to the target service, for example, by analyzing the user activity of a certain APP, the number of active users, user retention, and the like. The target dimension, that is, a dimension related to the access behavior of the user, may be the device system (an iOS system, an Android system, etc.) or the device model of the smart device used by the user, because the user generally uses a smart device to access an APP or an object.

The target statistical index in the target dimension includes: an index to be counted for accesses, within a target time interval, to a service object of the target service based on the target dimension, wherein the index to be counted includes the number of access users, the corresponding access duration and/or the object transaction amount;

correspondingly, the index value statistical result includes: a statistical result obtained by counting, within the target time interval, the number of access users accessing the service object of the target service based on the target dimension, the corresponding access duration and/or the corresponding object transaction amount.

For example, when the target service is an advertisement recommendation service and the target dimension is a device model, the target statistical index may be the number of users who access the target advertisement by using a smart device of model 1 in the target time interval, or the access duration of the users who access the target advertisement by using a smart device of model 1 in the target time interval, and the like; in the case that the target service is a commodity recommendation service and the target dimension is a device system, the target statistical index may be the number of users who access the target commodity through a smart device using system A in the target time interval, or the number of users who purchase the target commodity through a smart device using system A in the target time interval.

Based on this, since each data storage model includes the compression result of the access time information, when the index value statistical result corresponding to the target statistical index in the target dimension of the target service needs to be determined, the number of access users accessing the service object of the target service based on the target dimension, the corresponding access duration and/or the corresponding object transaction amount in the target time interval may be counted according to the user access data and the compression result of the access time information included in each data storage model, so as to obtain a statistical result.
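The counting described above can be sketched as follows, taking the device model as the target dimension. The record shape `AccessRecord`, its field names, and the sample values are hypothetical; the day numbers are days since 1970-01-01 and durations are in seconds.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class AccessRecord:  # hypothetical record shape, not defined in the source
    user_id: str
    device_model: str
    duration_by_day: Dict[int, float]  # day number -> access duration (seconds)


def count_access_users(
    records: Iterable[AccessRecord], device_model: str, start_day: int, end_day: int
) -> int:
    """Number of distinct users of the model with an access in [start_day, end_day]."""
    users = {
        r.user_id
        for r in records
        if r.device_model == device_model
        and any(start_day <= d <= end_day for d in r.duration_by_day)
    }
    return len(users)


def total_access_duration(
    records: Iterable[AccessRecord], device_model: str, start_day: int, end_day: int
) -> float:
    """Summed access duration of the model's users inside the interval."""
    return sum(
        seconds
        for r in records
        if r.device_model == device_model
        for day, seconds in r.duration_by_day.items()
        if start_day <= day <= end_day
    )


records = [
    AccessRecord("u1", "model-1", {18262: 30.0, 18263: 12.5}),
    AccessRecord("u2", "model-1", {18265: 5.0}),
    AccessRecord("u3", "model-2", {18262: 9.0}),
    AccessRecord("u4", "model-1", {18400: 7.0}),
]
print(count_access_users(records, "model-1", 18262, 18270))
print(total_access_duration(records, "model-1", 18262, 18270))
```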

In specific implementation, the determining an index value statistical result corresponding to the target statistical index in the target dimension according to the compression result and the data storage model includes:

determining a target statistical index to be counted under the target dimension, and determining target access time information related to the target statistical index;

and determining a target data storage model containing the encryption result of the target access time information, and determining an index value statistical result corresponding to the target statistical index according to the user access data contained in the target data storage model.

Further, after determining the target access time information related to the target statistical indicator, the method further includes:

acquiring the encryption result stored in the data warehouse, and decrypting the encryption result to obtain a compression result of the access time information;

determining whether a target data storage model containing an encryption result of the target access time information exists according to the compression result;

and if so, executing the step of determining the target data storage model containing the encryption result of the target access time information.

Specifically, when an index value statistical result corresponding to a target statistical index in the target dimension needs to be determined according to the compression result of the access time information and the data storage model, the target access time information related to the target statistical index may be determined first. For example, in the case where the access time information is an access date, if the target statistical index is the number of users who accessed advertisement G1 in August by using a smart device of system A, the target access time information related to the target statistical index may be 08-01, 08-02, ..., 08-31. A related target data storage model can then be determined according to the target access time information, and the index value statistical result corresponding to the target statistical index can be determined according to the user access data contained in the target data storage model.

Since the data storage model includes the encryption result corresponding to the compression result of the access time information, when the target data storage model including the target access time information is determined according to the target access time information, the encryption result included in the data storage model may be decrypted by using a base64Decode function to obtain the compression result of the access time information, then the target data storage model including the encryption result of the target access time information may be determined according to the compression result, and then the index value statistical result corresponding to the target statistical index in the target dimension may be determined according to the user access data included in each target data storage model.
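The decode-then-select flow described above can be sketched in Python. This is only an illustration under stated assumptions: the "compression result" is replaced by a simple stand-in (sorted 16-bit day numbers packed little-endian) rather than a real RBM, `base64.b64decode` plays the role of the base64Decode function, and the field name `access_dates_b64`, the helper names, and the year 2022 are hypothetical.

```python
import base64
import calendar
import struct
from datetime import date
from typing import Dict, List, Set

EPOCH = date(1970, 1, 1)


def days_of_month(year: int, month: int) -> Set[int]:
    """Day numbers (days since 1970-01-01) for every date in the month."""
    day_count = calendar.monthrange(year, month)[1]
    first_day = (date(year, month, 1) - EPOCH).days
    return set(range(first_day, first_day + day_count))


def encode_days(days: Set[int]) -> str:
    """Stand-in for RBM compression plus Base64: packs sorted 16-bit day numbers."""
    ordered = sorted(days)
    packed = struct.pack(f"<{len(ordered)}H", *ordered)
    return base64.b64encode(packed).decode("ascii")


def decode_days(text: str) -> Set[int]:
    """Reverse of encode_days: Base64-decode, then unpack the day numbers."""
    raw = base64.b64decode(text)
    return set(struct.unpack(f"<{len(raw) // 2}H", raw))


def select_target_models(
    models: List[Dict[str, str]], target_days: Set[int]
) -> List[Dict[str, str]]:
    """Keep the models whose decoded access days overlap the target days."""
    return [m for m in models if decode_days(m["access_dates_b64"]) & target_days]


august = days_of_month(2022, 8)  # the year is an assumption for this example
models = [
    {"user_id": "u1", "access_dates_b64": encode_days({(date(2022, 8, 15) - EPOCH).days})},
    {"user_id": "u2", "access_dates_b64": encode_days({(date(2022, 9, 1) - EPOCH).days})},
]
print([m["user_id"] for m in select_target_models(models, august)])
```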

For example, in the case that the user access data in the target data storage model includes a user (unique device identifier), an access time, an environment (e.g., a model, a brand, a page, a module), an accessed service type (e.g., live broadcast, game, community, etc.), a specific access behavior (e.g., exposure, click, comment, like, cost, etc.), a behavior frequency, a behavior depth (e.g., exposure duration, click frequency, number of comment words, amount spent, etc.), etc., the index value statistical result corresponding to the target statistical index in the target dimension may be determined by aggregating the user access data in the manner defined by the target statistical index.

Or, the determining an index value statistical result corresponding to the target statistical index according to the user access data included in the target data storage model includes:

and calling a pre-created statistical function, processing the user access data contained in the target data storage model, and generating an index value statistical result corresponding to the target statistical index.

Specifically, the pre-created statistical function may be a user-defined function (UDF). By using pre-created UDFs, the embodiment of the present application can perform real-time calculation directly according to actual requirements; specifically, the UDF is called to perform statistical processing on the user access data included in each target data storage model, and the index value statistical result corresponding to the target statistical index is generated.
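To give a rough idea of what such pre-created statistical functions might look like, the sketch below defines three small functions over a sorted list of decoded day numbers. The function names and signatures are hypothetical and are not taken from the application; a production UDF would typically be registered with the query engine rather than written as plain Python.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional


def udf_contains(sorted_days: List[int], day: int) -> bool:
    """True if the user was active on the given day."""
    i = bisect_left(sorted_days, day)
    return i < len(sorted_days) and sorted_days[i] == day


def udf_count_between(sorted_days: List[int], start_day: int, end_day: int) -> int:
    """Number of active days inside the closed interval [start_day, end_day]."""
    return bisect_right(sorted_days, end_day) - bisect_left(sorted_days, start_day)


def udf_last_active(sorted_days: List[int]) -> Optional[int]:
    """Most recent active day, or None if the user has no recorded access."""
    return sorted_days[-1] if sorted_days else None


days = [18201, 18257, 18262, 18263, 18265, 18312, 18341]
print(udf_contains(days, 18263))
print(udf_count_between(days, 18262, 18292))  # 18262..18292 spans January 2020
print(udf_last_active(days))
```

Binary search keeps each call logarithmic in the number of active days, which matters when the function is evaluated once per row.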

A schematic diagram of a user access data processing process provided in an embodiment of the application is shown in fig. 3, which includes a serialization process and a deserialization process. In the serialization process, the access dates to be serialized include 2019-11-01, 2019-12-27, 2020-01-01, 2020-01-02, 2020-01-04, 2020-02-20, and 2020-03-20. Each access date is converted into UTC data (here, the number of days elapsed since 1970-01-01), and the corresponding conversion results are 18201, 18257, 18262, 18263, 18265, 18312, and 18341. The conversion results are compressed according to the data structure of the RBM, the compression result is encrypted by a Base64 encryption algorithm to obtain the encryption result OjAAAAAAAAYAEAAAABIHUUUdWR 1dHWUeIR6VH, and the encryption result is then stored in Hive. In the deserialization process, the encryption result is first decrypted to obtain a corresponding decryption result, which is the UTC data; whether a target data storage model containing a target access date exists is then determined according to the UTC data, and if so, a UDF can be called to process the user access data contained in the target data storage model and generate the index value statistical result corresponding to the target statistical index.
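The serialization and deserialization process of fig. 3 can be sketched in Python as follows. The sketch assumes that the "UTC data" is the number of days since 1970-01-01, which reproduces the seven conversion results listed above. The byte layout is an assumption based on the publicly documented portable Roaring serialization format, restricted to its simplest case (a single array container with key 0 and no run containers); it is not a substitute for a real Roaring Bitmap library, and the helper names are hypothetical.

```python
import base64
import struct
from datetime import date
from typing import List

EPOCH = date(1970, 1, 1)
SERIAL_COOKIE_NO_RUN = 12346  # cookie for streams without run containers (assumed)
HEADER = "<IIHHI"  # cookie, container count, key, cardinality - 1, byte offset


def to_day_number(iso_date: str) -> int:
    """Convert an ISO date string to days since 1970-01-01."""
    return (date.fromisoformat(iso_date) - EPOCH).days


def serialize_days(day_numbers: List[int]) -> bytes:
    """Serialize 16-bit day numbers as one array container with key 0."""
    values = sorted(set(day_numbers))
    if not values or len(values) > 4096:
        raise ValueError("sketch supports 1..4096 values (array container only)")
    if values[0] < 0 or values[-1] > 0xFFFF:
        raise ValueError("sketch supports values in the first bucket only")
    header = struct.pack(
        HEADER, SERIAL_COOKIE_NO_RUN, 1, 0, len(values) - 1, struct.calcsize(HEADER)
    )
    return header + struct.pack(f"<{len(values)}H", *values)


def deserialize_days(data: bytes) -> List[int]:
    """Inverse of serialize_days for the same restricted layout."""
    cookie, count, key, card_minus_one, offset = struct.unpack_from(HEADER, data, 0)
    if cookie != SERIAL_COOKIE_NO_RUN or count != 1 or key != 0:
        raise ValueError("only a single array container with key 0 is supported")
    return list(struct.unpack_from(f"<{card_minus_one + 1}H", data, offset))


dates = ["2019-11-01", "2019-12-27", "2020-01-01", "2020-01-02",
         "2020-01-04", "2020-02-20", "2020-03-20"]
day_numbers = [to_day_number(d) for d in dates]
encoded = base64.b64encode(serialize_days(day_numbers)).decode("ascii")
decoded = deserialize_days(base64.b64decode(encoded))
print(day_numbers)
print(encoded)
print(decoded == day_numbers)
```

Under these assumptions the encoded text begins with `OjAA`, the same prefix as the encryption result quoted above, because that prefix is the Base64 form of the cookie bytes.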

In addition, the UDF function pre-constructed in the embodiment of the present application is shown in table 2, where table 2 only schematically exemplifies a part of the UDF function, and the specifically constructed UDF function may be determined according to actual requirements.

TABLE 2

In the embodiment of the application, one data storage model can directly cover the data of different downstream scenarios such as DAU, MAU, new users, retention, and active frequency. By using the data storage model, the number of models needed to carry these scenarios can be greatly reduced in the data warehouse modeling logic, so that high cohesion and low coupling of user access data are achieved and computing and storage resources are saved.

The embodiment of the application can mark the historical user access data related to the target service completely in one pass, and applies a uniform RBM (Roaring Bitmap) mark to incremental data such as access dates. Because the RBM has a high compression ratio and fast reads and writes, storage overhead can be further reduced while important change information is recorded completely and without distortion. In addition, the marking characteristic of the RBM makes it possible to cover a series of calculation indexes in the target dimension. Taking access behavior as an example, one data storage model can directly cover the data of different downstream scenarios such as DAU, MAU, new users, retention, and active frequency, and the number of models needed to carry these scenarios can be greatly reduced in the data warehouse modeling logic, so that high cohesion and low coupling of behavior information are achieved and computing and storage resources are saved to a great extent. Furthermore, because the historical user access data in the target dimension is marked completely in one pass, the self-dependency is moved from the model level into the field information of the data storage model and is transparent to external consumers, so that the stability of model backfilling is ensured.
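The claim that a single model can serve DAU, MAU, new-user, and retention scenarios can be illustrated with a small Python sketch in which each user carries one set of active day numbers. Plain sets stand in for the RBM field, and the function names, the next-day definition of retention, and the sample data are assumptions made for illustration.

```python
from typing import Dict, Set


def dau(user_days: Dict[str, Set[int]], day: int) -> int:
    """Daily active users: users with an access on the given day."""
    return sum(1 for days in user_days.values() if day in days)


def mau(user_days: Dict[str, Set[int]], first_day: int, last_day: int) -> int:
    """Monthly active users: users with any access in [first_day, last_day]."""
    return sum(
        1 for days in user_days.values()
        if any(first_day <= d <= last_day for d in days)
    )


def new_users(user_days: Dict[str, Set[int]], day: int) -> Set[str]:
    """Users whose earliest recorded access falls on the given day."""
    return {u for u, days in user_days.items() if days and min(days) == day}


def next_day_retention(user_days: Dict[str, Set[int]], day: int) -> float:
    """Share of the day's new users who were active again on the next day."""
    cohort = new_users(user_days, day)
    if not cohort:
        return 0.0
    retained = sum(1 for u in cohort if day + 1 in user_days[u])
    return retained / len(cohort)


user_days = {
    "u1": {18262, 18263, 18265},
    "u2": {18262},
    "u3": {18263, 18312},
}
print(dau(user_days, 18262))
print(mau(user_days, 18262, 18292))
print(sorted(new_users(user_days, 18262)))
print(next_day_retention(user_days, 18262))
```

All four indexes are derived from the same per-user field, which is the sense in which one data storage model replaces several scenario-specific models.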

An embodiment of the application realizes a user access data processing method and device, wherein the user access data processing method comprises the steps of obtaining user access data of a target service in a target dimension, compressing access time information contained in the user access data according to a data structure corresponding to a target compression bitmap, generating and storing a corresponding compression result, building a data storage model corresponding to the access time information based on a preset data storage model, the user access data and the compression result, and determining an index value statistical result corresponding to a target statistical index in the target dimension according to the compression result and the data storage model.

According to the embodiment of the application, a series of statistical indexes (access tag data) corresponding to any behaviors can be covered through one data storage model by utilizing the marking characteristic of the RBM, so that when user behavior analysis is carried out, the related access tag data is stored by utilizing the storage technology, the occupation of storage space can be reduced, and the data storage cost is favorably reduced.

Referring to fig. 4, taking an application of the user access data processing method provided in the embodiment of the present application in an actual scene as an example, the user access data processing method is further described. Fig. 4 shows a processing flow chart of a user access data processing method according to an embodiment of the present application, which specifically includes the following steps:

Step 402, obtaining user access data of the target service in the target dimension.

Step 404, converting the access date contained in the user access data into UTC data.

Step 406, compressing the UTC data according to the data structure corresponding to the RBM to generate a corresponding compression result.

Step 408, encrypting the compression result through a Base64 encryption algorithm, and storing the encryption result to a data warehouse.

Step 410, determining at least two fields contained in the preset data storage model.

Step 412, determining a first target field corresponding to the user access data, and adding the user access data to the first target field as a field value.

Step 414, determining a second target field corresponding to the encryption result of the access date, adding the encryption result of the access date as a field value to the second target field, and generating a data storage model corresponding to the access date.

Step 416, determining the target statistical index to be counted in the target dimension, and determining the target access date related to the target statistical index.

Step 418, acquiring the encryption result stored in the data warehouse, and decrypting the encryption result to obtain a compression result of the access date.

Step 420, determining whether a target data storage model containing the encryption result of the target access date exists according to the compression result.

Step 422, if so, determining the target data storage model containing the encryption result of the target access date, calling a pre-created statistical function, processing the user access data contained in the target data storage model, and generating an index value statistical result corresponding to the target statistical index.

According to the embodiment of the application, a series of statistical indexes (access label data) corresponding to any behaviors can be covered through one data storage model by utilizing the marking characteristic of the RBM, so that when user behavior analysis is carried out, the storage technology is utilized to store the related access label data, the occupation of storage space can be reduced, and the data storage cost is favorably reduced.

Corresponding to the above method embodiment, the present application further provides an embodiment of a user access data processing apparatus, and fig. 5 shows a schematic structural diagram of a user access data processing apparatus provided in an embodiment of the present application. As shown in fig. 5, the apparatus includes:

an obtaining module 502 configured to obtain user access data of a target service in a target dimension;

a compression module 504, configured to compress the access time information included in the user access data according to a data structure corresponding to a target compression bitmap, generate a corresponding compression result, and store the compression result;

a construction module 506 configured to construct a data storage model corresponding to the access time information based on a preset data storage model, the user access data, and the compression result;

a determining module 508 configured to determine an index value statistical result corresponding to the target statistical index in the target dimension according to the compression result and the data storage model.

Optionally, the compression module 504 is further configured to:

Optionally, the building module 506 is further configured to:

determining at least two fields contained in a preset data storage model;

Optionally, the access time information includes an access date;

accordingly, the compression module 504 is further configured to:

Optionally, the compression module 504 is further configured to:

converting access time information contained in the user access data into a target data type according to a data structure of a target compression bitmap, wherein the target data type comprises 32-bit binary numbers;

and indexing to obtain a target bucket index number in the target compression bitmap based on the first access time identifier, and storing the second access time identifier to a container corresponding to the target bucket index number.
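The bucket lookup described above can be sketched as follows. The sketch assumes, in line with the usual Roaring Bitmap design, that the first access time identifier is the high 16 bits of the 32-bit value (used as the bucket index number) and the second access time identifier is the low 16 bits (stored in the container); the text here does not spell out that split, and the class and function names are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple


def split_value(value: int) -> Tuple[int, int]:
    """Split a 32-bit value into (high 16 bits, low 16 bits)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in an unsigned 32-bit integer")
    return value >> 16, value & 0xFFFF


class BucketedBitmap:
    """Toy structure: bucket index -> sorted list of low 16-bit values."""

    def __init__(self) -> None:
        self.containers: Dict[int, List[int]] = {}

    def add(self, value: int) -> None:
        bucket, low = split_value(value)
        container = self.containers.setdefault(bucket, [])
        pos = bisect_left(container, low)
        if pos == len(container) or container[pos] != low:
            container.insert(pos, low)  # keep the container sorted, no duplicates

    def __contains__(self, value: int) -> bool:
        bucket, low = split_value(value)
        container = self.containers.get(bucket, [])
        pos = bisect_left(container, low)
        return pos < len(container) and container[pos] == low


bitmap = BucketedBitmap()
for value in (18262, 18201, 18262, 65536 + 5):
    bitmap.add(value)
print(bitmap.containers)
print(18201 in bitmap, 18202 in bitmap)
```

Because day numbers for current dates are below 65536, they all fall into bucket 0 in this sketch; only values at or above 65536 open a second bucket.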

Optionally, the determining module 508 is further configured to:

determining a target statistical index to be counted in the target dimension, and determining target access time information related to the target statistical index;

Optionally, the user access data processing apparatus further comprises a processing module configured to:

Optionally, the determining module 508 is further configured to:

Optionally, the data storage model includes a first data storage model and a second data storage model, and the first data storage model is constructed based on historical user access data, a compression result of historical access time information of the historical user access data, and the preset data storage model; the second data storage model is constructed on the basis of incremental user access data, a compression result of current access time information of the incremental user access data and the preset data storage model;

accordingly, the user access data processing apparatus further comprises a setup module configured to:

and determining a user identifier contained in the second data storage model, and establishing an association relation between the first data storage model and the second data storage model under the condition that the first data storage model contains the user identifier.

Optionally, the compression module 504 is further configured to:

determining the target position of the access time information in the target compression bitmap according to the compression result;

Optionally, the target statistical index in the target dimension includes: an index to be counted for accesses, within a target time interval, to a service object of the target service based on the target dimension, wherein the index to be counted includes the number of access users, the corresponding access duration and/or the object transaction amount;

accordingly, the index value statistical result includes: a statistical result obtained by counting, within the target time interval, the number of access users accessing the service object of the target service based on the target dimension, the corresponding access duration and/or the corresponding object transaction amount.

The above is an exemplary scheme of the user access data processing apparatus of this embodiment. It should be noted that the technical solution of the user access data processing apparatus and the technical solution of the user access data processing method described above belong to the same concept, and for details that are not described in detail in the technical solution of the user access data processing apparatus, reference may be made to the description of the technical solution of the user access data processing method described above.

FIG. 6 illustrates a block diagram of a computing device 600 provided according to an embodiment of the present application. The components of the computing device 600 include, but are not limited to, a memory 610 and a processor 620. The processor 620 is coupled to the memory 610 via a bus 630 and a database 650 is used to store data.

Computing device 600 also includes access device 640, which enables computing device 600 to communicate via one or more networks 660. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 640 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)), whether wired or wireless, such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.

In one embodiment of the present application, the above-described components of computing device 600, as well as other components not shown in FIG. 6, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 6 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.

Computing device 600 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 600 may also be a mobile or stationary server.

The processor 620 is configured to execute computer-executable instructions, and the steps of the user access data processing method are implemented when the processor executes the computer-executable instructions.

The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the user access data processing method described above belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the user access data processing method described above.

An embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions, which when executed by a processor, implement the steps of the user access data processing method.

The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the user access data processing method, and for details that are not described in detail in the technical solution of the storage medium, reference may be made to the description of the technical solution of the user access data processing method.

The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.

It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of combinations of acts, but it should be understood by those skilled in the art that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that acts and modules referred to are not necessarily required to implement the embodiments of the application.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.

The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the teaching of the embodiments of the present application. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims

1. A method for processing user access data, comprising:

acquiring user access data of a target service in a target dimension;

compressing the access time information contained in the user access data according to a data structure corresponding to the target compression bitmap, generating and storing a corresponding compression result;

2. The method for processing user access data according to claim 1, wherein compressing the access time information included in the user access data according to a data structure corresponding to a target compression bitmap to generate and store a corresponding compression result comprises:

3. The method for processing user access data according to claim 2, wherein the constructing a data storage model corresponding to the access time information based on a preset data storage model, the user access data, and the compression result includes:

determining at least two fields contained in a preset data storage model;

4. The method of claim 2, wherein the access time information includes an access date;

correspondingly, compressing the access time information contained in the user access data according to a data structure corresponding to the target compression bitmap includes:

5. The data processing method of claim 1, wherein compressing the access time information included in the user access data according to a data structure corresponding to a target compression bitmap to generate and store a corresponding compression result, comprises:

6. The method according to claim 2, wherein determining an index value statistical result corresponding to a target statistical index in the target dimension according to the compression result and the data storage model comprises:

7. The method of claim 6, wherein after determining the target access time information associated with the target statistical indicator, further comprising:

8. The method according to claim 6, wherein the determining an index value statistic result corresponding to the target statistic index according to the user access data included in the target data storage model includes:

and calling a pre-created statistical function, processing user access data contained in the target data storage model, and generating an index value statistical result corresponding to the target statistical index.

9. The user access data processing method according to claim 1, wherein the data storage model comprises a first data storage model and a second data storage model, and the first data storage model is constructed based on historical user access data, a compression result of historical access time information of the historical user access data, and the preset data storage model; the second data storage model is constructed on the basis of incremental user access data, a compression result of current access time information of the incremental user access data and the preset data storage model;

correspondingly, the method further comprises:

10. The method for processing user access data according to claim 1, wherein compressing the access time information included in the user access data according to a data structure corresponding to a target compression bitmap to generate and store a corresponding compression result comprises:

11. The method of claim 1, wherein the target statistical indicators in the target dimension comprise: in a target time interval, accessing a corresponding index to be counted to a service object of the target service based on a target dimension, wherein the index to be counted comprises the number of access users, corresponding access duration and/or object transaction amount;

12. A user access data processing apparatus, comprising:

the compression module is configured to compress the access time information contained in the user access data according to a data structure corresponding to the target compression bitmap, generate and store a corresponding compression result;

13. A computing device, comprising:

a memory and a processor;

the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions, wherein the processor implements the steps of the user access data processing method according to any one of claims 1 to 11 when executing the computer-executable instructions.

14. A computer-readable storage medium storing computer instructions which, when executed by a processor, carry out the steps of the method of processing user access data according to any one of claims 1 to 11.