CN110737691B - Method and apparatus for processing access behavior data - Google Patents

Info

Publication number
CN110737691B
Authority
CN
China
Prior art keywords
data
access behavior
user
preset
sampling rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810719951.2A
Other languages
Chinese (zh)
Other versions
CN110737691A (en)
Inventor
余韬
徐然
周振宇
吴名宇
叶峻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiment of the application discloses a method and a device for processing access behavior data. One embodiment of the method comprises: acquiring access behavior data in a preset data acquisition period, wherein the access behavior data comprises user identification of a user accessing a network; performing hash operation on the user identifiers in the acquired access behavior data to obtain hash values of the user identifiers, and sequencing the user identifiers according to the hash values of the user identifiers; and extracting access behavior data meeting the preset sample data size according to the sequencing of the user identification to serve as sample data. The embodiment can effectively control the sampling data volume, thereby effectively controlling the aggregation calculation speed, simultaneously ensuring that the sampling method has good stability and randomness, and obtaining a high-precision data analysis result.

Description

Method and apparatus for processing access behavior data
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to the technical field of data analysis, and particularly relates to a method and a device for processing access behavior data.
Background
With the development of internet technology, big data technology is gradually being applied across industries. The collection and analysis of access behavior data can be applied to product promotion, operation, monitoring, and similar tasks. Analyzing such data often requires multi-dimensional cross analysis: grouping conditions are constructed from multiple dimensions or indexes, and aggregate indexes such as sums and averages are calculated within each group. Because computing resources are limited, the data is generally sampled first and the aggregation is computed on the sampled data, which speeds up the analysis. Current sampling techniques generally use a static sampling rate: a fixed sampling rate is determined in advance, and the data collected each day is sampled at that rate and stored in a database for querying.
Disclosure of Invention
The embodiment of the application provides a method and a device for processing access behavior data.
In a first aspect, an embodiment of the present application provides a method for processing access behavior data, including: acquiring access behavior data in a preset data acquisition period, wherein the access behavior data comprises user identification of a user accessing a network; performing hash operation on the user identifiers in the acquired access behavior data to obtain hash values of the user identifiers, and sequencing the user identifiers according to the hash values of the user identifiers; and extracting access behavior data meeting the preset sample data volume according to the sequence of the user identification to serve as sample data.
In some embodiments, the above method further comprises: determining a sampling rate according to a preset sample data volume and the total data volume of the acquired access behavior data; and performing aggregation operation based on the sample data in at least one preset data acquisition period and the corresponding sampling rate to obtain an analysis result of the access behavior data.
In some embodiments, the performing an aggregation operation based on sample data in at least one preset data acquisition period and a corresponding sampling rate to obtain an analysis result of the access behavior data includes: determining the maximum sampling rate corresponding to each user identifier in sample data in at least one preset data acquisition period; and accumulating the reciprocal of the maximum sampling rate corresponding to each user identifier in the sample data in at least one preset data acquisition period to obtain an aggregation result of the independent visitor number of the page of the access behavior data.
In some embodiments, performing the aggregation operation based on the sample data in at least one preset data acquisition period and the corresponding sampling rate to obtain the analysis result of the access behavior data includes at least one of: accumulating, over the sample data in the at least one preset data acquisition period, the values obtained by dividing the numerical value of a feature characterizing each user's access to a preset site by the corresponding sampling rate, to obtain the sum of the feature values of users accessing the preset site corresponding to the access behavior data; accumulating the reciprocals of the sampling rates of the sample data in the at least one preset data acquisition period to obtain a count statistic of the user behavior data; and dividing the accumulated result of the values obtained by dividing the numerical value of the feature characterizing each user's access to the preset site by the corresponding sampling rate by the accumulated result of the reciprocals of the sampling rates, and taking the quotient as the average of the feature values of users accessing the preset site corresponding to the access behavior data.
In some embodiments, extracting the access behavior data that meets the preset sample data size as sample data according to the sorting of the user identifiers includes: extracting, as sample data, the access behavior data that meets the preset sample data size and corresponds to user identifiers ranked ahead of the user identifiers that are not extracted.
In a second aspect, an embodiment of the present application provides an apparatus for processing access behavior data of a user, including: the access behavior acquisition unit is configured to acquire access behavior data in a preset data acquisition period, wherein the access behavior data comprises user identification of a user accessing a network; the operation unit is configured to perform hash operation on the user identifiers in the acquired access behavior data to obtain hash values of the user identifiers, and sort the user identifiers according to the hash values of the user identifiers; and the sampling unit is configured to extract the access behavior data meeting the preset sample data amount as sample data according to the sorting of the user identification.
In some embodiments, the apparatus further comprises an analysis unit configured to: determining a sampling rate according to a preset sample data volume and the total data volume of the acquired access behavior data; and performing aggregation operation based on the sample data in at least one preset data acquisition period and the corresponding sampling rate to obtain an analysis result of the access behavior data.
In some embodiments, the analysis unit is configured to perform the aggregation operation as follows: determining the maximum sampling rate corresponding to each user identifier in sample data in at least one preset data acquisition period; and accumulating the reciprocal of the maximum sampling rate corresponding to each user identifier in the sample data in at least one preset data acquisition period to obtain an aggregation result of the independent visitor number of the page of the access behavior data.
In some embodiments, the analysis unit is configured to perform the aggregation operation in at least one of the following ways: accumulating, over the sample data in the at least one preset data acquisition period, the values obtained by dividing the numerical value of a feature characterizing each user's access to a preset site by the corresponding sampling rate, to obtain the sum of the feature values of users accessing the preset site corresponding to the access behavior data; accumulating the reciprocals of the sampling rates of the sample data in the at least one preset data acquisition period to obtain a count statistic of the user behavior data; and dividing the accumulated result of the values obtained by dividing the numerical value of the feature characterizing each user's access to the preset site by the corresponding sampling rate by the accumulated result of the reciprocals of the sampling rates, and taking the quotient as the average of the feature values of users accessing the preset site corresponding to the access behavior data.
In some embodiments, the sampling unit is further configured to extract the access behavior data that meets the preset sample data size as sample data as follows: extracting, as sample data, the access behavior data that meets the preset sample data size and corresponds to user identifiers ranked ahead of the user identifiers that are not extracted.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device for storing one or more programs which, when executed by one or more processors, cause the one or more processors to implement a method for processing access behavior data as provided in the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the program, when executed by a processor, implements the method for processing access behavior data provided in the first aspect.
According to the method and the device for processing access behavior data, access behavior data in a preset data acquisition period is acquired, the access behavior data including user identifiers of users accessing the network. A hash operation is performed on the user identifiers in the acquired access behavior data to obtain a hash value for each user identifier, the user identifiers are sorted according to their hash values, and access behavior data meeting the preset sample data size is extracted as sample data according to that sorting. The amount of sampled data, and therefore the speed of aggregation calculation, can thus be effectively controlled, while the sampling method retains good stability and randomness and yields a high-precision data analysis result.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram to which embodiments of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for processing access behavior data according to the present application;
FIG. 3 is a schematic illustration of a data sampling result of a method for processing access behavior data according to the present application;
FIG. 4 is a scene diagram of an application example of a method for processing access behavior data according to the application;
FIG. 5 is a schematic diagram of an apparatus for processing access behavior data according to the present application;
FIG. 6 is a schematic block diagram of a computer system suitable for use to implement the electronic device of an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which the method for processing access behavior data or the apparatus for processing access behavior data of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. Network 104 is the medium used to provide communication links between terminal devices 101, 102, 103 and server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user 110 may use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various information query applications installed thereon, such as a search engine client, a navigation client, instant messaging software, audio-video playing software, and so on.
The terminal devices 101, 102, 103 may be various electronic devices having a display and supporting internet access, including, but not limited to, smart phones, tablets, smart watches, notebook computers, laptop computers, e-book readers, and the like.
The server 105 may be a query server, e.g., a search engine server, that provides query services for terminal devices. The server 105 may parse the query requests sent by the terminal devices 101, 102, and 103, query corresponding data according to the parsing result, and feed the queried data back to the terminal devices 101, 102, and 103 through the network 104. The server 105 may also be a background server that performs statistical analysis on user behavior data from the terminal devices. In that case, the server 105 may record the behaviors of users requesting access to network resources through the terminal devices 101, 102, and 103 and perform statistical analysis on the recorded behavior data at a set period; when a user sends a query request for a statistical analysis result through the terminal devices 101, 102, and 103, the server may feed the result back to those terminal devices.
It should be noted that, the method for processing access behavior data provided in the embodiment of the present application may be executed by the server 105, and accordingly, the apparatus for processing access behavior data may be disposed in the server 105.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple software modules for providing distributed services), or as a single piece of software or software module. No particular limitation is imposed here.
It should be understood that the number of terminal devices, networks, servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for processing access behavior data according to the present application is illustrated. The method for processing access behavior data comprises the following steps:
Step 201, access behavior data in a preset data acquisition period is acquired.
In this embodiment, the execution main body of the method for processing access behavior data may acquire access behavior data in a preset data acquisition period. The preset data acquisition period is a time span set in advance, for example 4 hours, one day, or one week. The access behavior data may be data characterizing a user's behavior of accessing network resources and may include a user identifier of the user accessing the network. The execution main body may record access behavior data when the user accesses a network resource. The recorded data may include the user identifier (such as a user ID) of the user accessing the network resource, the address of the network resource accessed (such as an access link address), and the start time and end time of the access.
Generally, a user can access network resources through various network access portals provided by a mobile phone, a computer, and other terminal devices. For example, a user may enter a web address in a browser to access a designated page, and for example, a user may select the name of an item in an online shopping application to access a detailed description page for the item. The network server or the terminal device may record data related to the access operation of the user, such as user identification, access time, an address of a resource accessed, a device model of the terminal device used in the access, and the like. And the network server or the terminal device can store the access behavior data of the user in a preset data acquisition period in a data set. When data sampling is performed, the execution main body of the access behavior data processing method may call the stored access behavior data in the preset data acquisition period from the memory, or receive the access behavior data in the preset data acquisition period sent by the terminal device.
Step 202, performing hash operation on the user identifiers in the obtained access behavior data to obtain hash values of the user identifiers, and sorting the user identifiers according to the hash values of the user identifiers.
The data volume of the access behavior data in the preset data acquisition period is usually large, and the acquired access behavior data can be sampled to reduce the operation amount of data processing. In this embodiment, the behavior data of a part of the users in the acquired behavior data may be selected as sample data. The hash operation may be specifically performed on the user identifier in the obtained access behavior data to obtain a hash value of each user identifier, and then the hash values of the user identifiers are sorted.
Here, the user id is usually a character string composed of characters such as letters, symbols, numbers, etc., and the hash value of the user id may be a binary value obtained by hashing the user id. The hash values of different user identifications are different. Because the acquired access behavior data contains massive behavior data of the user accessing the network, the hash value can be used as a basis for selecting sample data according to the uniqueness of the hash value of the user identifier.
After the hash value of each user identifier is calculated, the user identifiers may be sorted by the magnitude of their hash values, either in ascending order or in descending order.
Step 203, extracting access behavior data meeting the preset sample data size according to the sorting of the user identifiers to serve as sample data.
In this embodiment, the access behavior data corresponding to the user identifier may be extracted from the acquired access behavior data as sample data according to the ranking of the user identifier from high to low. When extracting sample data, it is necessary to ensure that the data size of the extracted sample data satisfies a preset sample data size. Here, the preset sample data size may be a preset value, and when total data sizes of the access behavior data acquired in different data acquisition periods are different, the sample data size obtained by sampling is kept as the preset sample data size.
In some optional implementations of this embodiment, the access behavior data that meets the preset sample data size and corresponds to user identifiers ranked ahead of the unextracted user identifiers may be extracted as sample data. Specifically, the access behavior data corresponding to each user identifier may be extracted in turn according to the ordering of the user identifiers, and after each extraction it is determined whether the data size of the extracted behavior data has reached the preset sample data size. If so, extraction stops; otherwise, the behavior data corresponding to the next user identifier is extracted, until the data size of the extracted behavior data reaches the preset sample data size. Alternatively, according to the ordering of the user identifiers and the data size of the behavior data corresponding to each user identifier, the access behavior data corresponding to the user identifiers ranked first through n-th may be determined as the sample data. For example, suppose the ordered user identifiers and their corresponding data sizes are: user A, 100 records; user B, 200 records; user C, 200 records; user D, 150 records; user E, 300 records; and so on. When the preset sample data size is 500 records, the access behavior data of user A, user B, and user C, ranked first, second, and third, totals 500 records, so the access behavior data corresponding to user A, user B, and user C is taken as the sample data.
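The sequential extraction described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the record layout (a dict with a `user_id` key), and the use of MD5 as the hash function are all assumptions, since the text does not name a specific hash. Whole users are extracted at a time, so the sample may exceed the preset size when the per-user record counts do not line up exactly.

```python
import hashlib
from collections import defaultdict


def hash_user_id(user_id: str) -> int:
    """Map a user identifier to an integer hash value (MD5 is an assumption)."""
    return int.from_bytes(hashlib.md5(user_id.encode("utf-8")).digest(), "big")


def sample_by_user_hash(records, preset_sample_size, hash_fn=hash_user_id):
    """Extract whole users in ascending hash order until the preset size is reached."""
    by_user = defaultdict(list)
    for record in records:
        by_user[record["user_id"]].append(record)

    sample = []
    for user_id in sorted(by_user, key=hash_fn):
        if len(sample) >= preset_sample_size:
            break  # preset sample data size reached, stop extracting
        sample.extend(by_user[user_id])
    return sample


# The example from the text: users A to E with 100, 200, 200, 150, 300 records.
counts = {"A": 100, "B": 200, "C": 200, "D": 150, "E": 300}
records = [{"user_id": u} for u, n in counts.items() for _ in range(n)]

# Hypothetical ranking that reproduces the ordering A, B, C, D, E of the example.
rank = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}
sample = sample_by_user_hash(records, 500, hash_fn=rank.get)
sampled_users = {r["user_id"] for r in sample}
```

With the fixed ranking, the sample holds the 500 records of users A, B, and C. With the default MD5 ordering, the selected users depend on the hash values, which is what makes the selection effectively random yet repeatable.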
Because the method extracts sample data according to the ordering of the hash values of the user identifiers, a user subject whose access behavior data is sampled at the current sampling rate is necessarily also sampled at the same or any higher sampling rate, and a user subject that is not sampled at the current sampling rate is necessarily not sampled at the same or any lower sampling rate. For example, if user A is sampled into the sample data of the first day, and the total amount of access behavior data on the second day is less than that on the first day (so that the second-day sampling rate is higher), then the access behavior data of user A on the second day is necessarily sampled into the sample data set of the second day as well. Sampling of access behavior data therefore stays consistent across data acquisition periods, which ensures the stability of the sampling method and makes aggregation analysis results based on the sample data reliable.
According to the method for processing access behavior data of this embodiment, access behavior data in a preset data acquisition period is acquired, the access behavior data including user identifiers of users accessing the network. A hash operation is performed on the user identifiers in the acquired access behavior data to obtain a hash value for each user identifier, the user identifiers are sorted according to their hash values, and access behavior data meeting the preset sample data size is extracted as sample data according to that sorting. The amount of sampled data, and therefore the speed of aggregation calculation, can thus be effectively controlled, while the sampling method retains good stability and randomness and yields a high-precision data analysis result.
In some embodiments, the method for processing access behavior data may further include: determining a sampling rate according to a preset sample data volume and the total data volume of the acquired access behavior data; and performing aggregation operation based on the sample data in at least one preset data acquisition period and the corresponding sampling rate to obtain an analysis result of the access behavior data.
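The stability property can be checked on a small synthetic case. The sketch below is illustrative only: the helper names, the MD5 hash, and the synthetic data are assumptions, and the second day is deliberately constructed as a subset of the first day's users with unchanged record counts, which is the situation in which the containment holds exactly.

```python
import hashlib


def hash_rank(user_id: str) -> int:
    """Integer hash used to order users (MD5 is an assumption)."""
    return int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16)


def sampled_users(record_counts, preset_sample_size):
    """Return the users extracted, in hash order, until the preset size is reached."""
    chosen, extracted = set(), 0
    for user_id in sorted(record_counts, key=hash_rank):
        if extracted >= preset_sample_size:
            break
        chosen.add(user_id)
        extracted += record_counts[user_id]
    return chosen


# Day 1: 100 users, 10 records each. Day 2: half of those users return.
day1 = {f"user{i}": 10 for i in range(100)}
day2 = {f"user{i}": 10 for i in range(0, 100, 2)}
budget = 300  # preset sample data size, in records

sample_day1 = sampled_users(day1, budget)  # sampling rate 300 / 1000
sample_day2 = sampled_users(day2, budget)  # sampling rate 300 / 500

# Users sampled on day 1 who are present on day 2 are sampled again on day 2.
carried_over = sample_day1 & set(day2)
stable_across_days = carried_over <= sample_day2

# On the same data, a smaller budget selects a subset of a larger budget's users.
nested = sampled_users(day1, 200) <= sample_day1
```

Both `stable_across_days` and `nested` evaluate to `True`, because in each case the set of users ahead of a given user in hash order can only shrink.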
Specifically, a quotient of a preset sample data size (i.e., a data size of sampled sample data) and a data size of user access behavior data acquired in a corresponding data acquisition period may be calculated as a sampling rate in the data acquisition period.
Referring to FIG. 3, a schematic diagram of a data sampling result of the method for processing access behavior data according to the present application is shown. As shown in FIG. 3, in the data acquisition periods T1, T2, T3, T4, T5, and T6, the total amounts of acquired behavior data are 1,000,000, 500,000, 750,000, 499,000, 600,000, and 700,000, respectively, and the preset sample data amount is 500,000. The sample data amounts in the periods T1 to T6 are then 500,000, 500,000, 500,000, 499,000, 500,000, and 500,000, and the corresponding sampling rates are calculated to be 50%, 100%, 66.67%, 100%, 83.33%, and 71.43%. It can be seen that the sample data amount stays essentially constant as the total data amount changes, while the sampling rate varies. This avoids the problem, which arises under a fixed sampling rate, that a sudden increase in data volume produces an excessively large sample and slows the aggregation calculation based on it.
The sample data of each data acquisition period may then be tagged with its sampling rate. For example, the sample data in the first period is {user A1 data: sampling rate γ11, user B1 data: sampling rate γ12, user C1 data: sampling rate γ13, …}, the sample data in the second period is {user A2 data: sampling rate γ21, user B2 data: sampling rate γ22, user C2 data: sampling rate γ23, …}, and so on. Aggregation calculation can be performed based on the sample data in one or more preset data acquisition periods and the corresponding sampling rates to obtain an analysis result of the access behavior data. The aggregation calculation may include computing indexes such as the sum, mean, maximum, and minimum of preset data items in the sample data, as well as the amount of data satisfying a certain condition. Here, a preset data item may be a data item characterizing user access behavior, such as the number of clicks by a user, the number of users clicking a certain page, or the user access duration.
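The sampling rates of the FIG. 3 example can be reproduced with a few lines of Python. This is a sketch under the assumption, consistent with the 100% rates shown for T2 and T4, that the rate is the quotient of the preset sample data amount and the period total, capped at 100%; the function name is hypothetical.

```python
def sampling_rate(preset_sample_size: int, total_size: int) -> float:
    """Quotient of preset sample size and period total, capped at 1.0 (assumed cap)."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    return min(1.0, preset_sample_size / total_size)


# Period totals from the FIG. 3 example; the preset sample data amount is 500,000.
totals = {
    "T1": 1_000_000,
    "T2": 500_000,
    "T3": 750_000,
    "T4": 499_000,
    "T5": 600_000,
    "T6": 700_000,
}
rates = {period: sampling_rate(500_000, total) for period, total in totals.items()}

for period, rate in rates.items():
    print(f"{period}: {rate:.2%}")
```

The printed rates are 50.00%, 100.00%, 66.67%, 100.00%, 83.33%, and 71.43%, matching the figures quoted for T1 to T6.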
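Once each sampled record carries its sampling rate, the sum, count, and mean aggregations outlined in the summary reduce to inverse-rate weighting. The sketch below is one possible reading, not a definitive implementation; the `SampledRecord` type, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SampledRecord:
    user_id: str
    value: float          # feature value, e.g. clicks or access duration
    sampling_rate: float  # rate of the period the record was sampled in


def estimated_sum(records):
    """Accumulate value / sampling_rate over the sampled records."""
    return sum(r.value / r.sampling_rate for r in records)


def estimated_count(records):
    """Accumulate the reciprocal of the sampling rate over the sampled records."""
    return sum(1.0 / r.sampling_rate for r in records)


def estimated_mean(records):
    """Weighted sum divided by weighted count."""
    return estimated_sum(records) / estimated_count(records)


# Two records from a period sampled at 50% and one from a period sampled at 100%.
sample = [
    SampledRecord("A1", 4.0, 0.5),
    SampledRecord("B1", 6.0, 0.5),
    SampledRecord("A2", 10.0, 1.0),
]
total = estimated_sum(sample)    # 8 + 12 + 10
count = estimated_count(sample)  # 2 + 2 + 1
mean = estimated_mean(sample)
```

Each record sampled at 50% stands in for two records of the full data, so the three sampled records estimate a count of 5, a sum of 30, and a mean of 6.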
In some optional implementation manners, the step of performing an aggregation operation based on sample data in at least one preset data acquisition period and a corresponding sampling rate to obtain an analysis result of the access behavior data may include: determining the maximum sampling rate corresponding to each user identifier in sample data in at least one preset data acquisition period; and accumulating the reciprocal of the maximum sampling rate corresponding to each user identifier in the sample data in at least one preset data acquisition period to obtain an aggregation result of the independent visitor number of the page of the access behavior data.
Specifically, assume that the sampling rate of user subject i accessing the target page is $P_i$. Each sampled user subject with sampling rate $P_i$ can then represent $1/P_i$ user subjects. However, the sampling rate $P_i$ may change across data acquisition periods, so the number $1/P_i$ of subjects represented by each user subject in the sample data may change as well. When access behavior data of multiple data acquisition periods is acquired, the maximum sampling rate of the user subject across those periods can be selected; that is, the maximum sampling rate corresponding to each user identifier in the sample data of the at least one preset data acquisition period is determined. The reciprocal of the maximum sampling rate represents the minimum number of independent user subjects represented by that user identifier. These minimum numbers can then be accumulated to obtain the aggregation result of the number of independent page visitors of the access behavior data.
For example, suppose the sample data includes n independent user subjects, each independent user subject i has $m_i$ records of accessing the target page, and the j-th of these records has sampling rate $P_{ij}$. Then the minimum number of independent user subjects represented by user subject i is $\min_j (1/P_{ij})$, and the total number U of independent visitors of the target page is:

$$U = \sum_{i=1}^{n} \min_{1 \le j \le m_i} \frac{1}{P_{ij}} \qquad (1)$$

Here, the total number U of independent visitors of the target page is the aggregate result of the deduplicated subject count, that is, the total number of user subjects who accessed the target page.
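The independent-visitor estimate of formula (1) can be sketched in Python by keeping, for each user identifier, the largest sampling rate seen and then summing the reciprocals. The function name, the `(user_id, sampling_rate)` record layout, and the sample values are assumptions made for illustration.

```python
def estimated_unique_visitors(records):
    """Sum 1 / (maximum sampling rate) over the distinct users in the sample.

    `records` is an iterable of (user_id, sampling_rate) pairs, one per
    sampled record, possibly spanning several data acquisition periods.
    """
    max_rate = {}
    for user_id, rate in records:
        max_rate[user_id] = max(rate, max_rate.get(user_id, 0.0))
    return sum(1.0 / rate for rate in max_rate.values())


# User A appears in periods sampled at 50% and 100%, user B only at 50%,
# and user C at 25% and 50%.
records = [("A", 0.5), ("A", 1.0), ("B", 0.5), ("C", 0.25), ("C", 0.5)]
unique_visitors = estimated_unique_visitors(records)
```

Here A counts as 1 visitor (maximum rate 100%), while B and C each count as 2 (maximum rate 50%), for an estimate of 5. Taking the maximum rate is equivalent to taking the minimum of the reciprocals, as formula (1) states.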
Taking one day as the preset data acquisition period and aggregating sample data extracted from the access behavior data collected on two consecutive days as an example, the following shows that the above calculation of the number U of independent page visitors represents the total number of user subjects accessing the target page.
Assume that the number of independent user subjects accessing the target page in the access behavior data collected on the first day is U1 and the number of sampled independent user subjects is S1, so the sampling rate on the first day is P1 = S1/U1. The number of independent user subjects in the access behavior data collected on the second day is U2, of which R are retained from the first day, and the number of sampled independent user subjects is S2, so the sampling rate on the second day is P2 = S2/U2. Here, R is the number of users who accessed the target page on both the first day and the second day.
When P1 is greater than or equal to P2, P2 × R of the S2 independent user subjects sampled on the second day are retained from the first day. Since P1 is greater than or equal to P2, these retained user subjects are necessarily among the S1 user subjects sampled on the first day, so each of them is counted once at the rate P1. The total number U' of page independent visitors over the two days, calculated according to formula (1), is then:
$$U' = \frac{S_1}{P_1} + \frac{S_2 - P_2 R}{P_2} = U_1 + U_2 - R \tag{2}$$
When P1 is not greater than P2, P1 × R of the S1 independent user subjects sampled on the first day are retained to the second day. Since P1 is not greater than P2, these retained user subjects are necessarily all among the S2 user subjects sampled on the second day, so each of them is counted once at the rate P2. The total number U' of page independent visitors over the two days, calculated according to formula (1), is then:
$$U' = \frac{S_1 - P_1 R}{P_1} + \frac{S_2}{P_2} = U_1 + U_2 - R \tag{3}$$
It can be seen that, whether P1 is greater than or equal to P2 or P1 is less than or equal to P2, the deduplicated number of user subjects calculated according to formula (1) is correct. Users who visit repeatedly are therefore not counted repeatedly, and a more accurate aggregation analysis result is obtained.
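The two-day argument can be checked numerically. The sketch below follows the reasoning in the text: retained users that appear in both samples are counted only at the higher sampling rate. The function name and the sample figures are assumptions chosen for illustration.

```python
def two_day_unique_estimate(u1, s1, u2, s2, retained):
    """Estimate unique visitors over two days from per-day sample sizes.

    u1, u2: true unique visitors per day; s1, s2: sampled unique visitors;
    retained: users who visited on both days (R in the text).
    """
    p1 = s1 / u1  # first-day sampling rate
    p2 = s2 / u2  # second-day sampling rate
    if p1 >= p2:
        # The p2 * retained users sampled on day 2 are already in the
        # day-1 sample, so they are counted only once, at rate p1.
        return s1 / p1 + (s2 - p2 * retained) / p2
    # Symmetric case: the p1 * retained users sampled on day 1 are
    # counted once, at rate p2, within the day-2 sample.
    return (s1 - p1 * retained) / p1 + s2 / p2


estimate = two_day_unique_estimate(u1=1000, s1=250, u2=2000, s2=250, retained=400)
print(estimate, 1000 + 2000 - 400)
```

With these figures the estimate equals U1 + U2 - R, the true number of distinct visitors over the two days.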
This holds because the data sampling algorithm used in the method for processing access behavior data ensures that a retained user subject sampled at the lower sampling rate is also sampled in the sample data set with the higher sampling rate. After the retained user subjects are removed from the sample data set with the lower sampling rate, the sample data set with the higher sampling rate therefore still contains all of the retained user subjects.
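The nesting property described above can be illustrated with a small sketch. It assumes MD5 as the hash function and a single fixed set of user identifiers, neither of which is specified in the text; `hash_rank_sample` is a hypothetical helper name.

```python
import hashlib


def hash_rank_sample(user_ids, sample_size):
    """Return the `sample_size` user ids with the smallest hash values."""
    ordered = sorted(
        set(user_ids),
        key=lambda uid: hashlib.md5(uid.encode("utf-8")).hexdigest(),
    )
    return ordered[:sample_size]


users = [f"user{i}" for i in range(100)]
low_rate = set(hash_rank_sample(users, 10))   # sampling rate 10/100
high_rate = set(hash_rank_sample(users, 30))  # sampling rate 30/100

# Every user sampled at the lower rate is also sampled at the higher rate.
print(low_rate <= high_rate)
```

Because the order depends only on the hash of the identifier, the same user always occupies the same relative position. In this single-population example, raising the sampling rate can only add users to the sample, never drop them.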
Referring to fig. 4, a scene diagram of an application example for processing access behavior data according to the present application is shown, that is, a schematic diagram of a method adopted in performing aggregation calculation of independent visitor numbers is shown.
As shown in fig. 4, a circle "o" represents a user subject at a first sampling rate (e.g., a user subject acquired in a first data acquisition period), a triangle "Δ" represents a user subject at a second sampling rate (e.g., a user subject acquired in a second data acquisition period), and a star symbol represents a user subject at a third sampling rate (e.g., a user subject acquired in a third data acquisition period), where the first sampling rate < the second sampling rate < the third sampling rate. Each large circle represents the set of all user subjects at the corresponding sampling rate.
As can be seen from fig. 4, for a sample lying in the intersection of two or three of the large circles, only the copy in the set with the highest sampling rate, among the user subject sets to which the sample belongs, is retained, so the other repeated copies are correctly deduplicated. For example, in fig. 4, after deduplication: in the intersection of the user subject set at the first sampling rate and the user subject set at the second sampling rate, only the user subjects at the second sampling rate are retained, and those at the first sampling rate are removed; in the intersection of the sets at the first and third sampling rates, only the user subjects at the third sampling rate are retained, and those at the first sampling rate are removed; in the intersection of the sets at the second and third sampling rates, only the user subjects at the third sampling rate are retained, and those at the second sampling rate are removed; and in the intersection of all three sets, only the user subjects at the third sampling rate are retained, and those at the first and second sampling rates are removed. Because each retained user subject carries the highest sampling rate among all the user subject sets that contain it, the accurate deduplicated count over all the user subject sets can be restored from the retained user subjects.
In some embodiments, the performing the aggregation operation based on the sample data in at least one preset data acquisition period and the corresponding sampling rate to obtain the analysis result of the access behavior data may include, but is not limited to, at least one of the following:
First, over the sample data in at least one preset data acquisition period, the values obtained by dividing the numerical value representing a characteristic of each user's access to the preset site by the corresponding sampling rate are accumulated, to obtain the sum of the characteristic values of user access to the preset site corresponding to the access behavior data. Here, the numerical value of the characteristic may include the PV (Page View, the number of page views), the access duration, and the like of the user accessing the preset site. For example, when calculating the PV of the target site in at least one preset data acquisition period, assume that the sample data includes n records (that is, the access records of n users) and that the sampling rate of the i-th record is P_i. The number of page views PV_i of the i-th user in the sample data is divided by the corresponding sampling rate P_i to obtain the number of page views PV_i/P_i of the target site represented by the i-th record, and these values are accumulated to obtain the total page views PV of users accessing the target site:
$$PV = \sum_{i=1}^{n} \frac{PV_i}{P_i} \tag{4}$$
Second, the reciprocals of the sampling rates of the sample data in at least one preset data acquisition period are accumulated to obtain a count statistic of the user behavior data. Here, the reciprocal 1/P_i of the sampling rate is the number of user subjects represented by a sampled user subject whose sampling rate is P_i, so the count statistic US of the user behavior data, that is, the total number of user subjects, is:
$$US = \sum_{i=1}^{n} \frac{1}{P_i} \tag{5}$$
Third, over the sample data in at least one preset data acquisition period, the accumulated result of the values obtained by dividing the numerical value representing a characteristic of each user's access to the preset site by the corresponding sampling rate is divided by the accumulated result of the reciprocals of the sampling rates, and the resulting value is taken as the average of the characteristic values of user access to the preset site corresponding to the access behavior data. That is, the sum of characteristic values obtained by the first aggregation is divided by the count statistic obtained by the second aggregation. For example, the total page views PV of users accessing the target site is divided by the total number US of user subjects to obtain the average page views AveragePV per user:
$$AveragePV = \frac{PV}{US} = \frac{\sum_{i=1}^{n} PV_i / P_i}{\sum_{i=1}^{n} 1 / P_i} \tag{6}$$
When a corresponding query request is received, the aggregation result can be pushed to the terminal device that sent the query request. For example, when the terminal device, in response to a user request, sends a request to query the total number of user visits to a certain site in the last month, the execution subject of the method for processing access behavior data may take one day as the data acquisition period, collect and sample the user behavior data of the last month day by day, calculate the sampling rate of each day, and then calculate the total number of user visits according to formula (4) and push it to the terminal device. In this way, the analysis result for the overall data can be aggregated quickly from the sample data, and a more accurate query result can be provided.
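The three aggregations described above (weighted sum, weighted count, and their ratio) can be sketched together. The function name `aggregate` and the `(pv, sampling_rate)` record layout are assumptions made for this example.

```python
def aggregate(samples):
    """Compute total PV, user count, and average PV from sampled records.

    `samples` is a list of (pv, sampling_rate) pairs, one per sampled user
    (hypothetical layout). Each record is weighted by 1 / sampling_rate.
    """
    total_pv = sum(pv / rate for pv, rate in samples)        # sum of PV_i / P_i
    user_count = sum(1.0 / rate for _, rate in samples)      # sum of 1 / P_i
    average_pv = total_pv / user_count if user_count else 0.0
    return total_pv, user_count, average_pv


total_pv, user_count, average_pv = aggregate([(3, 0.5), (5, 0.25), (2, 0.5)])
print(total_pv, user_count, average_pv)
```

A record sampled at rate 0.25 stands for four users, so its page views count four times in the total and it adds four to the user count.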
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for processing access behavior data, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices.
As shown in fig. 5, the apparatus 500 for processing access behavior data of the present embodiment includes an acquisition unit 501, an operation unit 502, and a sampling unit 503. The acquisition unit 501 may be configured to acquire access behavior data in a preset data acquisition period, where the access behavior data includes a user identifier of a user accessing a network. The operation unit 502 may be configured to perform a hash operation on the user identifiers in the acquired access behavior data to obtain a hash value of each user identifier, and to sort the user identifiers according to their hash values. The sampling unit 503 may be configured to extract, according to the ordering of the user identifiers, access behavior data satisfying a preset sample data size as sample data.
In some embodiments, the apparatus 500 may further include an analysis unit configured to: determining a sampling rate according to a preset sample data size and the total data amount of the acquired access behavior data; and performing aggregation operation based on the sample data in at least one preset data acquisition period and the corresponding sampling rate to obtain an analysis result of the access behavior data.
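The cooperation of the units for one data acquisition period can be sketched as below. This is only an illustration under stated assumptions: MD5 is used as the hash, the preset sample data size is interpreted as a number of distinct users, and the names `sample_period` and the `(user_id, page)` record layout are hypothetical.

```python
import hashlib
from collections import defaultdict


def sample_period(records, preset_sample_size):
    """Sample one acquisition period and return (sample_records, sampling_rate).

    `records` is a list of (user_id, page) access records. Users are ordered
    by the hash of their identifier, the first `preset_sample_size` users are
    kept, and all of their records form the sample.
    """
    by_user = defaultdict(list)
    for user_id, page in records:
        by_user[user_id].append((user_id, page))
    # Operation unit: hash each user id and sort the ids by hash value.
    ordered = sorted(
        by_user, key=lambda uid: hashlib.md5(uid.encode("utf-8")).hexdigest()
    )
    # Sampling unit: take users from the front of the ordering.
    chosen = ordered[:preset_sample_size]
    # Analysis unit: sampling rate as the quotient of sample size and total.
    rate = len(chosen) / len(ordered) if ordered else 0.0
    sample = [record for uid in chosen for record in by_user[uid]]
    return sample, rate


records = [("u1", "/a"), ("u2", "/a"), ("u1", "/b"), ("u3", "/a"), ("u4", "/c")]
sample, rate = sample_period(records, 2)
print(rate, sorted({uid for uid, _ in sample}))
```

Keeping the sample size fixed bounds the cost of later aggregation, while the per-period sampling rate records how many users each sampled user stands for.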
In some embodiments, the analysis unit may be configured to perform the aggregation operation as follows: determining the maximum sampling rate corresponding to each user identifier in sample data in at least one preset data acquisition period; and accumulating the reciprocal of the maximum sampling rate corresponding to each user identifier in the sample data in at least one preset data acquisition period to obtain an aggregation result of the independent visitor number of the page of the access behavior data.
In some embodiments, the analysis unit may be configured to perform the aggregation operation in at least one of the following ways: accumulating, over the sample data in at least one preset data acquisition period, the values obtained by dividing the numerical value representing a characteristic of each user's access to the preset site by the corresponding sampling rate, to obtain the sum of the characteristic values of user access to the preset site corresponding to the access behavior data; accumulating the reciprocals of the sampling rates of the sample data in at least one preset data acquisition period to obtain a count statistic of the user behavior data; and dividing the accumulated result of the values obtained by dividing the numerical value representing a characteristic of each user's access to the preset site by the corresponding sampling rate by the accumulated result of the reciprocals of the sampling rates, and taking the resulting value as the average of the characteristic values of user access to the preset site corresponding to the access behavior data.
In some embodiments, the sampling unit 503 may further extract the access behavior data satisfying the preset sample data size as sample data in the following manner: extracting, as sample data, the access behavior data that satisfies the preset sample data size and corresponds to user identifiers ordered before the user identifiers that are not extracted.
It should be understood that the elements recited in apparatus 500 correspond to various steps in the methods described with reference to fig. 2 and 4. Thus, the operations and features described above for the method are equally applicable to the apparatus 500 and the units included therein, and are not described in detail here.
In the apparatus 500 for processing access behavior data of the embodiment of the present application, the acquisition unit acquires access behavior data in a preset data acquisition period, where the access behavior data includes user identifiers of users accessing a network. The operation unit performs a hash operation on the user identifiers in the acquired access behavior data to obtain a hash value of each user identifier and sorts the user identifiers according to their hash values. The sampling unit extracts, according to the ordering of the user identifiers, access behavior data satisfying the preset sample data size as sample data. The amount of sampled data can thus be effectively controlled, and with it the speed of aggregation calculation, while the sampling method retains good stability and randomness and yields a high-precision data analysis result.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use in implementing an electronic device of an embodiment of the present application. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608 as needed.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609 and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 601. It should be noted that the computer readable medium of the present application can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, or the like, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including an acquisition unit, an operation unit, and a sampling unit. In some cases the names of the units do not limit the units themselves; for example, the acquisition unit may also be described as a "unit that acquires access behavior data in a preset data acquisition period".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire access behavior data in a preset data acquisition period, where the access behavior data includes user identifiers of users accessing a network; perform a hash operation on the user identifiers in the acquired access behavior data to obtain a hash value of each user identifier, and sort the user identifiers according to their hash values; and extract, according to the ordering of the user identifiers, access behavior data satisfying a preset sample data size as sample data.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A method for processing access behavior data of a user, comprising:
acquiring access behavior data in a preset data acquisition period, wherein the access behavior data comprises user identification of a user accessing a network;
performing hash operation on the user identifiers in the acquired access behavior data to obtain hash values of the user identifiers, and sequencing the user identifiers according to the hash values of the user identifiers;
extracting access behavior data meeting a preset sample data size according to the sequencing of the user identification to serve as sample data; if the access behavior data of the same user subject is sampled at a sampling rate corresponding to the current data acquisition cycle, the access behavior data is also sampled in a data acquisition cycle with the same sampling rate or a higher sampling rate; for the access behavior data of the same user subject, if the access behavior data is not sampled at the sampling rate corresponding to the current data acquisition period, the access behavior data is not sampled at the data acquisition period with the same sampling rate or with a lower sampling rate;
determining a sampling rate according to a preset sample data size and the total data amount of the acquired access behavior data, wherein the sampling rate comprises the following steps: determining a sampling rate according to a quotient of a preset sample data volume and the total data volume of the obtained access behavior data; wherein the preset sample data volume is the same in different preset data acquisition periods;
and performing aggregation operation based on the sample data in at least one preset data acquisition period and the corresponding sampling rate to obtain an analysis result of the access behavior data.
2. The method according to claim 1, wherein the performing an aggregation operation based on sample data and a corresponding sampling rate in at least one preset data acquisition period to obtain an analysis result of the access behavior data comprises:
determining a maximum sampling rate corresponding to each user identifier in the sample data in the at least one preset data acquisition period;
and accumulating the reciprocal of the maximum sampling rate corresponding to each user identifier in the sample data in the at least one preset data acquisition period to obtain an aggregation result of the independent visitor number of the page of the access behavior data.
3. The method according to claim 1 or 2, wherein the performing an aggregation operation based on the sample data and the corresponding sampling rate in at least one preset data acquisition period to obtain an analysis result of the access behavior data includes at least one of:
accumulating values obtained by dividing a numerical value representing the characteristic of each user accessing the preset site and a corresponding sampling rate in sample data in at least one preset data acquisition period to obtain the sum of the characteristic values of the user accessing the preset site corresponding to the access behavior data;
accumulating the reciprocal of the sampling rate of the sample data in at least one preset data acquisition period to obtain the counting statistical result of the access behavior data;
and dividing the accumulated result of the value of the characteristic of each user accessing the preset site in the sample data in at least one preset data acquisition period by the corresponding sampling rate and the accumulated result of the reciprocal of the sampling rate to obtain a value, and taking the value as the average value of the characteristic values of the user accessing the preset site corresponding to the access behavior data.
4. The method according to claim 1, wherein the extracting, as sample data, access behavior data that satisfies a preset sample data size according to the ranking of the user identifier comprises:
and extracting access behavior data which meets the preset sample data size and is corresponding to the user identifier which is sequenced before the user identifier which is not extracted, and taking the access behavior data as sample data.
5. An apparatus for processing access behavior data of a user, comprising:
the access behavior acquisition unit is configured to acquire access behavior data in a preset data acquisition period, wherein the access behavior data comprises user identification of a user accessing a network;
the operation unit is configured to perform hash operation on the user identifiers in the acquired access behavior data to obtain hash values of the user identifiers, and sort the user identifiers according to the hash values of the user identifiers;
the sampling unit is configured to extract access behavior data meeting a preset sample data size according to the sequencing of the user identification as sample data; if the access behavior data of the same user subject is sampled at a sampling rate corresponding to the current data acquisition cycle, the access behavior data is also sampled in a data acquisition cycle with the same sampling rate or a higher sampling rate; for the access behavior data of the same user subject, if the access behavior data is not sampled at the sampling rate corresponding to the current data acquisition period, the access behavior data is not sampled at the data acquisition period with the same sampling rate or with a lower sampling rate;
the apparatus further comprises an analysis unit configured to:
determining a sampling rate according to a preset sample data size and the total data amount of the acquired access behavior data, wherein the sampling rate comprises the following steps: determining a sampling rate according to a quotient of a preset sample data size and the total data amount of the obtained access behavior data; wherein the preset sample data volume is the same in different preset data acquisition periods;
and performing aggregation operation based on the sample data in at least one preset data acquisition period and the corresponding sampling rate to obtain an analysis result of the access behavior data.
6. The apparatus of claim 5, wherein the analysis unit is configured to perform an aggregation operation as follows:
determining a maximum sampling rate corresponding to each user identifier in the sample data in the at least one preset data acquisition period;
and accumulating the reciprocal of the maximum sampling rate corresponding to each user identifier in the sample data in the at least one preset data acquisition period to obtain an aggregation result of the independent visitor number of the page of the access behavior data.
7. The apparatus of claim 5 or 6, wherein the analysis unit is configured to perform the aggregation operation in at least one of the following ways:
accumulating values obtained by dividing the numerical value representing the characteristic of each user accessing the preset site and the corresponding sampling rate in the sample data in at least one preset data acquisition period to obtain the sum of the characteristic values of the users accessing the preset site corresponding to the access behavior data;
accumulating the reciprocal of the sampling rate of the sample data in at least one preset data acquisition period to obtain the counting statistical result of the access behavior data;
and dividing the accumulated result of the value of the characteristic of each user accessing the preset site in the sample data in at least one preset data acquisition period by the corresponding sampling rate and the accumulated result of the reciprocal of the sampling rate to obtain a value, and taking the value as the average value of the characteristic values of the user accessing the preset site corresponding to the access behavior data.
8. The apparatus according to claim 5, wherein the sampling unit further extracts access behavior data satisfying a preset sample data amount as the sample data as follows:
and extracting access behavior data which meets the preset sample data size and is corresponding to the user identifier which is sequenced before the user identifier which is not extracted, and taking the access behavior data as sample data.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.
10. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-4.
CN201810719951.2A 2018-07-03 2018-07-03 Method and apparatus for processing access behavior data Active CN110737691B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810719951.2A CN110737691B (en) 2018-07-03 2018-07-03 Method and apparatus for processing access behavior data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810719951.2A CN110737691B (en) 2018-07-03 2018-07-03 Method and apparatus for processing access behavior data

Publications (2)

Publication Number Publication Date
CN110737691A CN110737691A (en) 2020-01-31
CN110737691B true CN110737691B (en) 2022-11-04

Family

ID=69234346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810719951.2A Active CN110737691B (en) 2018-07-03 2018-07-03 Method and apparatus for processing access behavior data

Country Status (1)

Country Link
CN (1) CN110737691B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111694802B (en) * 2020-06-12 2023-04-28 百度在线网络技术(北京)有限公司 Method and device for obtaining duplicate removal information and electronic equipment
CN112118189B (en) * 2020-08-27 2021-05-25 北京基调网络股份有限公司 Flow sampling method, computer equipment and computer readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101986605A (en) * 2010-11-04 2011-03-16 北京迈朗世讯科技有限公司 Method and system for processing web surfing data of user based on backbone network
CN102855309A (en) * 2012-08-21 2013-01-02 亿赞普(北京)科技有限公司 Information recommendation method and device based on user behavior associated analysis
CN105677844A (en) * 2016-01-06 2016-06-15 北京摩比万思科技有限公司 Mobile advertisement big data directional pushing and user cross-screen recognition method
CN105844107A (en) * 2016-03-31 2016-08-10 百度在线网络技术(北京)有限公司 Data processing method and device
CN108073699A (en) * 2017-12-12 2018-05-25 中国联合网络通信集团有限公司 Big data polymerization analysis method and device
CN108197324A (en) * 2018-02-06 2018-06-22 百度在线网络技术(北京)有限公司 For storing the method and apparatus of data
CN108228873A (en) * 2018-01-17 2018-06-29 腾讯科技(深圳)有限公司 Object recommendation, publication content delivery method, device, storage medium and equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"[Programming Pearls] [Chapter 1] Problems of generating random numbers and random sampling"; weixin_30322405; https://blog.csdn.net/weixin_30322405/article/details/99394012; 2018-01-17; page 5 *

Also Published As

Publication number Publication date
CN110737691A (en) 2020-01-31

Similar Documents

Publication Publication Date Title
CN107679211B (en) Method and device for pushing information
CN110598157B (en) Target information identification method, device, equipment and storage medium
CN107193974B (en) Regional information determination method and device based on artificial intelligence
CN114422267B (en) Flow detection method, device, equipment and medium
CN110019367B (en) Method and device for counting data characteristics
CN112613917A (en) Information pushing method, device and equipment based on user portrait and storage medium
CN110633423A (en) Target account identification method, device, equipment and storage medium
CN110737691B (en) Method and apparatus for processing access behavior data
CN114528269A (en) Method, electronic device and computer program product for processing data
CN108989383B (en) Data processing method and client
CN111488386B (en) Data query method and device
CN113590756A (en) Information sequence generation method and device, terminal equipment and computer readable medium
CN110895587A (en) Method and device for determining target user
CN114840634B (en) Information storage method and device, electronic equipment and computer readable medium
CN113220705A (en) Slow query identification method and device
WO2018223993A1 (en) Application search method, device and server
CN110020166A (en) A kind of data analysing method and relevant device
CN105245380B (en) Message propagation mode identification method and device
CN110557351A (en) Method and apparatus for generating information
CN114154052A (en) Information recommendation method and device, computer equipment and storage medium
CN114417102A (en) Text duplicate removal method and device and electronic equipment
CN110875949A (en) Method and device for pushing information
CN112184370A (en) Method and device for pushing product
CN107368597B (en) Information output method and device
CN112579673A (en) Multi-source data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant