CN114840565A - Sampling query method, device, electronic equipment and computer readable storage medium - Google Patents

Info

Publication number
CN114840565A
Authority
CN
China
Prior art keywords
data
user
query
event information
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210236128.2A
Other languages
Chinese (zh)
Inventor
桑文锋
刘耀洲
曹犟
付力力
张广强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sensors Data Network Technology Beijing Co Ltd
Original Assignee
Sensors Data Network Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sensors Data Network Technology Beijing Co Ltd
Priority to CN202210236128.2A
Publication of CN114840565A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24554 Unary operations; Data partitioning operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/248 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes

Abstract

The application provides a sampling query method, a sampling query device, electronic equipment and a computer readable storage medium. After original event information including a user internal identifier, preset data query information and a sampling proportion are obtained, a user partition identifier is determined according to the user internal identifier and stored in the original event information to obtain partition event information; data is then queried in the partition event information according to the preset data query information and the sampling proportion, and a query result is generated from the queried data. The method constructs a user grouping strategy based on the user internal identifier, so that users in different groups can be stored under different partition storage strategies, which improves the efficiency and accuracy of data analysis.

Description

Sampling query method, device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of data analysis technologies, and in particular, to a sampling query method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
In the big data era, each product has its own core metrics for measuring whether the product is successful, and data analysis is an important way to calculate those metrics.
When the data metrics of a product are examined, accurate data is generally obtained through a full query for analysis. However, when the actual data volume is very large, a full query takes a long time and each individual query is slow, which seriously reduces the efficiency of data analysis and makes it difficult to quickly grasp trends in user behavior on a given metric.
Therefore, to prevent the current query mode from degrading data analysis efficiency when the data volume is very large, a sampling query method is needed to improve the efficiency and accuracy of data analysis.
Disclosure of Invention
The application provides a sampling query method, a sampling query device, electronic equipment and a computer readable storage medium, which are used for improving the efficiency and accuracy of data analysis.
In order to solve the technical problem, the present application provides the following technical solutions:
The application provides a sampling query method, which comprises the following steps:
acquiring original event information, preset data query information and a sampling proportion, wherein the original event information comprises a user internal identifier;
determining a user partition identifier according to the user internal identifier, and storing the user partition identifier into the original event information to obtain partition event information;
and querying data in the partition event information according to the preset data query information and the sampling proportion, and generating a query result according to the queried data.
Correspondingly, the present application also provides a sampling query apparatus, including:
a first acquisition module, configured to acquire original event information, preset data query information and a sampling proportion, wherein the original event information comprises a user internal identifier;
a partition module, configured to determine a user partition identifier according to the user internal identifier and store the user partition identifier into the original event information to obtain partition event information;
and a sampling calculation module, configured to query data in the partition event information according to the preset data query information and the sampling proportion and generate a query result according to the queried data.
Meanwhile, the present application provides an electronic device, which comprises a processor and a memory, wherein the memory is used for storing a computer program, and the processor is used for running the computer program in the memory to execute the steps in the above sampling query method.
In addition, the present application also provides a computer-readable storage medium, where a plurality of instructions are stored, and the instructions are suitable for being loaded by a processor to execute the steps in the above sampling query method.
Beneficial effects: the application provides a sampling query method, a sampling query device, electronic equipment and a computer readable storage medium. Specifically, the method first obtains original event information, preset data query information and a sampling proportion, wherein the original event information comprises a user internal identifier. It then determines a user partition identifier according to the user internal identifier and stores the user partition identifier into the original event information to obtain partition event information. Finally, it queries data in the partition event information according to the preset data query information and the sampling proportion, and generates a query result according to the queried data. The method establishes a user grouping strategy based on the user internal identifier, so that users in different groups can be stored under different partition storage strategies. When a sampling query is carried out, the corresponding groups can be quickly screened out according to the sampling proportion before the data query is performed, which improves the efficiency and accuracy of data analysis.
Drawings
The technical solution and other advantages of the present application will become apparent from the detailed description of the embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a schematic system architecture diagram of a sampling query system according to an embodiment of the present application.
Fig. 2 is a schematic flowchart of a sampling query method according to an embodiment of the present application.
Fig. 3 is another schematic flowchart of a sampling query method according to an embodiment of the present application.
Fig. 4 is a schematic interface diagram of a sampling query according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of a sampling query device according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "including" and "having," and any variations thereof, in the description and claims of this application are intended to cover non-exclusive inclusions; the division of the modules presented in this application is only a logical division, and may be implemented in other ways in practical applications, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed.
In the present application, the original event information is information describing an original event model, such as an event name (event_id), a user internal identifier (user_id), a user original identifier (distinct_id), the date on which an event occurs (month_id, week_id, day, etc.), the specific time at which the event occurs (time), and the like.
In the present application, the preset data query information is information used for data query that is set manually or by system default. It includes data query conditions (e.g., event name, event occurrence date) and data query indicators (e.g., counting indicators such as number of visitors, page views and dwell time, and composite indicators such as bounce rate, visit depth and conversion rate).
In the present application, the sampling proportion ranges from the sampling granularity to 1, where the sampling granularity is determined by the preset number of areas. For example, if the preset number of areas is 64, the sampling granularity is 1/64.
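The relationship between the preset number of areas and the permitted sampling proportions can be sketched as follows. This is a minimal illustration only; the function names `sampling_granularity` and `is_valid_sampling_ratio` are hypothetical and do not appear in the application.

```python
from fractions import Fraction


def sampling_granularity(region_num: int) -> Fraction:
    """Smallest usable sampling proportion: the reciprocal of the number of areas."""
    if region_num <= 0:
        raise ValueError("region_num must be a positive integer")
    return Fraction(1, region_num)


def is_valid_sampling_ratio(ratio: Fraction, region_num: int) -> bool:
    """A sampling proportion is usable if it lies between the granularity and 1."""
    return sampling_granularity(region_num) <= ratio <= 1


# With 64 areas the granularity is 1/64, so 1/2 is allowed but 1/128 is not.
granularity = sampling_granularity(64)
```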
In this application, the user partition identifier, i.e., the number of the partition in which the user is stored, is denoted as sample_group_id.
In this application, the partition event information includes, in addition to the original event information, a user partition identifier and an area identifier (event_packet).
The application provides a sampling query method, a sampling query device, electronic equipment and a computer readable storage medium.
Referring to fig. 1, fig. 1 is a schematic diagram of the system architecture of the sampling query system provided in the present application. As shown in fig. 1, the sampling query system includes at least a query terminal 101 and a data server 102, where:
a communication link is provided between the query terminal 101 and the data server 102 to enable information interaction. The communication link may be a wired link, a wireless link, a fiber optic cable, or the like, and the application is not limited thereto.
The query terminal 101 may be a terminal device having a data query function, such as a smart phone, a tablet computer, and a notebook computer.
The data server 102 may be an independent server, or a server network or server cluster composed of multiple servers. For example, the server described in the present application includes, but is not limited to, a computer, a network host, a database server, an application server, or a cloud server formed by a plurality of servers, where a cloud server is formed by a large number of computers or network servers based on cloud computing.
The application provides a sampling query system, which comprises a query terminal 101 and a data server 102. Specifically, the data server 102 obtains original event information, preset data query information and a sampling proportion from the query terminal 101, where the original event information includes a user internal identifier. It then determines a user partition identifier according to the user internal identifier and stores the user partition identifier into the original event information to obtain partition event information. Finally, it queries data in the partition event information according to the preset data query information and the sampling proportion, generates a query result according to the queried data, and feeds the query result back to the query terminal 101 for display.
In the sampling query process, the sampling query system constructs a user grouping strategy based on the user internal identifier, so that users in different groups can be stored under different partition storage strategies. During a sampling query, the corresponding groups can be quickly screened out according to the sampling proportion before the data query is performed, which improves the efficiency and accuracy of data analysis.
It should be noted that the system architecture diagram shown in fig. 1 is only an example. The server, terminal and scenario described in the embodiments of the present application are intended to illustrate the technical solution more clearly and do not limit it; a person of ordinary skill in the art will recognize that, as the system evolves and new service scenarios appear, the technical solution provided in the embodiments of the present application is equally applicable to similar technical problems. Details are given below. It should be noted that the order in which the following embodiments are described is not intended to indicate a preferred order.
With the system architecture of the sampling query system in mind, the sampling query method of the present application is described in detail below. Please refer to fig. 2, which is a schematic flowchart of the sampling query method according to an embodiment of the present application. The method includes at least the following steps:
S201: Acquire original event information, preset data query information and a sampling proportion, wherein the original event information comprises a user internal identifier.
The data server acquires the original event information, the preset data query information and the sampling proportion from the query terminal. The original event information is information describing an original event model, such as an event name (event_id), a user internal identifier (user_id), a user original identifier (distinct_id), the date on which an event occurs (month_id, week_id, day, etc.), the specific time at which the event occurs (time), and the like. The preset data query information is information used for data query that is set manually or by system default, and includes data query conditions (such as an event name and an event occurrence date) and data query indicators (such as counting indicators like number of visitors, page views and dwell time, and composite indicators like bounce rate, visit depth and conversion rate). The sampling proportion ranges from the sampling granularity to 1, where the sampling granularity is determined by the preset number of areas. For example, if the preset number of areas is 64, the sampling granularity is 1/64.
Specifically, the event model includes two core entities, an event (Event) and a user (User). Put simply, the event entity describes a user completing a specific thing, in some manner, at a certain time and place. A complete event therefore contains the following key factors: the user participating in the event, where in the data interface the user original identifier (distinct_id) is generally used to set a unique ID for the user (for a user who is not logged in, this ID may be an anonymous ID such as a cookie or a device ID, and for a logged-in user, the actual user ID allocated by the backend is generally used); the actual time of the event (month_id, week_id, day, time, etc.), where in the data interface the time field generally records the event occurrence time with millisecond precision, month_id is the month, week_id is the week, and day is the specific date (year XX, month XX, day XX); and the location of the event, the manner in which the user took part in the event, and the specific content of what the user did, which are not described in detail herein. In addition, in the present application, the system first generates a unique internal ID (i.e., the user internal identifier) for each user based on the user's account information, and then calculates the partition to which the user should belong based on that internal ID, as described in detail below.
S202: Determine a user partition identifier according to the user internal identifier, and store the user partition identifier into the original event information to obtain partition event information.
In one embodiment, when a large volume of user behavior data is imported, a user partition storage strategy that stores data in partitions by user is constructed during import, in order to preserve the integrity of each individual user's behavior sequence and the accuracy of user statistics under big data sampling. The strategy is based on the user internal identifier, and its specific steps include: acquiring the preset number of areas; and determining the user partition identifier corresponding to the user internal identifier according to the user internal identifier and the preset number of areas. The number of areas is the total number of areas into which users are partitioned; each user area may also be called a bucket, so the number of areas is the total number of buckets. The preset number of areas can be set manually or by system default.
Further, the specific steps of obtaining the user partition identifier according to the user internal identifier and the preset number of areas include: calculating the storage address of the user internal identifier based on a hash algorithm and the user internal identifier; and performing a remainder operation on the storage address with the preset number of areas to obtain the user partition identifier corresponding to the user internal identifier. The hash algorithm may be MD4, MD5, SHA-1, or the like; MD5 and SHA-1 are widely used hash algorithms, both designed on the basis of MD4. Specifically, the user partition identifier is calculated as shown in Formula 1:
sample_group_id = Hash(user_id) % region_num (Formula 1);
where sample_group_id is the user partition identifier under which the user is stored, user_id is the user internal identifier, region_num is the preset number of areas (for example, 32 or 64), % denotes the remainder operation in computer languages, and Hash(user_id) denotes applying the hash algorithm to the user internal identifier (a hash algorithm is a one-way function that maps information of any length to a fixed-length value and cannot be inverted to recover the input). The storage address calculated from the hash function and user_id is called the hash address (this is an address in the hash table, not an actual physical address). It should be noted that the user internal identifier (user_id) is stored in a hash table.
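Formula 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the applicant's implementation: MD5 is chosen because the text names it as one option, the user internal identifier is assumed to be an integer that is hashed via its decimal string form, and the function name `user_partition_id` is hypothetical.

```python
import hashlib


def user_partition_id(user_id: int, region_num: int = 64) -> int:
    """Compute sample_group_id = Hash(user_id) % region_num (Formula 1).

    Assumption: the hash is MD5 over the decimal string of user_id,
    interpreted as a big-endian unsigned integer.
    """
    if region_num <= 0:
        raise ValueError("region_num must be a positive integer")
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value % region_num


# The same user always lands in the same partition, so all of that
# user's events stay together when a partition is sampled.
group = user_partition_id(12345, region_num=64)
```

Because the partition depends only on `user_id`, every event of a given user falls into one partition, which is what keeps each user's behavior sequence intact under sampling.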
Meanwhile, after the user partition identifier (sample_group_id) is calculated, the system automatically stores it into the original event information as an independent user attribute column, yielding partition event information that contains the user partition identifier column.
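As a rough sketch of this step, the snippet below adds a `sample_group_id` column to each event record. The dictionary-based event layout, the MD5-based hash and the function name `add_partition_column` are assumptions made for illustration; the application does not specify a storage format.

```python
import hashlib


def add_partition_column(raw_events, region_num=64):
    """Return copies of the events with a sample_group_id column added.

    Assumption: each event is a dict with at least a "user_id" key, and
    the partition is MD5(user_id) % region_num as in Formula 1.
    """
    partitioned = []
    for event in raw_events:
        digest = hashlib.md5(str(event["user_id"]).encode("utf-8")).digest()
        group_id = int.from_bytes(digest, "big") % region_num
        # Copy the event so the original event information is left unchanged.
        partitioned.append({**event, "sample_group_id": group_id})
    return partitioned


raw = [
    {"event_id": 1001, "user_id": 7, "day": "d1"},
    {"event_id": 1002, "user_id": 7, "day": "d2"},
    {"event_id": 1001, "user_id": 8, "day": "d1"},
]
partition_events = add_partition_column(raw)
```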
S203: Query data in the partition event information according to the preset data query information and the sampling proportion, and generate a query result according to the queried data.
Based on S202, the original event information has been partitioned according to the user internal identifier, so the data of interest can be queried in the partition event information according to the preset data query information and the sampling proportion. The specific process is set forth below.
In one embodiment, the step of performing a sampling query on the data includes: determining target event information and a data calculation mode according to the preset data query information and the sampling proportion; and querying data in the target event information according to the data calculation mode, and generating a query result according to the queried data. The preset data query information comprises a data query condition and a data query indicator, and the data query condition comprises the name of the event to be queried, the occurrence date of the event to be queried, and the like. The sampling proportion is the sampling parameter of proportional sampling, a method that samples at a uniform ratio without considering the variability of the samples. In the present application, the range of the sampling proportion is related to the preset number of user areas, and common sampling proportions are 1/2, 1/4, and so on. It should be noted that the reciprocal of the preset number of user areas is referred to as the sampling granularity; if the preset number of user areas is 64, the sampling granularity is 1/64.
The specific steps of determining the target event information and the data calculation mode according to the preset data query information and the sampling proportion comprise: parsing the preset data query information to obtain the data query conditions and data query indicators; screening target event information from the partition event information according to the data query conditions; and determining the data calculation mode according to the data query indicator and the sampling proportion. The data query conditions include the name of the event to be queried, the date of the event to be queried, and so on. The data query indicators include: the total count (a common indicator in event analysis, indicating the number of times a certain event is triggered within a selected time range; for example, selecting the page browsing event and viewing by total count yields the page views); the number of triggering users (a common indicator in event analysis, indicating the number of distinct users who trigger a certain event within a selected time range; for example, selecting the registration success event and viewing by number of distinct users yields the number of users who registered successfully within the selected time range); the per-capita count (a common indicator in event analysis, indicating the average number of times a distinct user triggers a certain event within a selected time range; for example, selecting the page browsing event and viewing by per-capita count yields the average page view depth); and so on.
For example, consider querying the number of users who browsed event 1001 on XX year-XX month-XX day, where event 1001 and XX year-XX month-XX day are the data query conditions, and the number of browsing users is the data query indicator. The events whose event name is 1001 and whose occurrence date is XX year-XX month-XX day can therefore be screened from the partition event information according to the data query conditions, and the information of the screened target events is used as the target event information.
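The screening of target event information by data query condition can be sketched as a simple filter. The record layout and the function name `select_target_events` are hypothetical, and the dates in the sample data are made up for illustration.

```python
def select_target_events(partition_events, event_id, day):
    """Keep only the events that match the data query conditions.

    Assumption: each event is a dict with "event_id" and "day" keys.
    """
    return [
        event
        for event in partition_events
        if event["event_id"] == event_id and event["day"] == day
    ]


# Hypothetical partition event information.
partition_events = [
    {"event_id": 1001, "user_id": 1, "day": "2022-01-01", "sample_group_id": 10},
    {"event_id": 1001, "user_id": 2, "day": "2022-01-02", "sample_group_id": 11},
    {"event_id": 1002, "user_id": 3, "day": "2022-01-01", "sample_group_id": 12},
]
target_events = select_target_events(partition_events, 1001, "2022-01-01")
```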
In one embodiment, different data query indicators may have different data calculation modes, so the data calculation mode needs to be determined according to the type of the data query indicator. The specific steps include: determining a proportionality coefficient according to the data query indicator and the sampling proportion; and determining the data calculation mode according to the proportionality coefficient and the sampling proportion. Specifically, data query indicators can be divided into indicators that need to be amplified according to the sampling proportion and indicators that do not. The latter include, for example, the per-capita count (the average number of times per user that a certain event is triggered), the per-capita value (the per-user average of a numeric attribute), the maximum value (the maximum of a numeric attribute), the minimum value (the minimum of a numeric attribute), the per-capita number of Sessions, and the like. If the data query indicator needs to be amplified according to the sampling proportion, the proportionality coefficient is the reciprocal of the sampling proportion, and the data calculation mode is to extract the corresponding event information according to the sampling proportion, run the query, and then multiply by the proportionality coefficient so that the final statistical result is close to the actual data result. If the data query indicator does not need to be amplified according to the sampling proportion, the proportionality coefficient is 1, and the data calculation mode is simply to extract the corresponding event information according to the sampling proportion and run the query.
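The choice of proportionality coefficient can be sketched as follows. The indicator names and the two sets are hypothetical labels introduced here; in particular, placing the total count among the amplified indicators is an assumption, since the text only gives the number of triggering users as a worked example of amplification.

```python
from fractions import Fraction

# Hypothetical indicator names; the grouping of "total_count" is an assumption.
AMPLIFIED_INDICATORS = {"total_count", "user_count"}
NON_AMPLIFIED_INDICATORS = {
    "per_capita_count",
    "per_capita_value",
    "max",
    "min",
    "per_capita_sessions",
}


def scale_coefficient(indicator: str, sampling_ratio: Fraction) -> Fraction:
    """Return 1/ratio for amplified indicators and 1 for the others."""
    if not 0 < sampling_ratio <= 1:
        raise ValueError("sampling_ratio must be in (0, 1]")
    if indicator in AMPLIFIED_INDICATORS:
        return 1 / sampling_ratio
    if indicator in NON_AMPLIFIED_INDICATORS:
        return Fraction(1)
    raise ValueError(f"unknown indicator: {indicator}")


coefficient = scale_coefficient("user_count", Fraction(1, 2))
```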
In one embodiment, two types of data calculation modes and target event information are obtained according to the steps and the description, and the specific steps of querying data based on the target event information comprise: carrying out remainder operation on the user partition identification according to a data calculation mode to obtain the sampled user partition identification; determining sampling event information from the target event information according to the sampled user partition identification, and generating a sampling result according to the sampling event information; and calculating the sampling result according to the data query index and the data calculation mode to obtain a query result.
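The three steps above (remainder operation on the partition identifier, selection of the sampling event information, and calculation of the indicator) can be sketched for the number-of-users indicator as follows. The function name `sampled_user_count` and the record layout are hypothetical, and the sketch only handles sampling proportions of the form 1/a.

```python
from fractions import Fraction


def sampled_user_count(target_events, sampling_ratio: Fraction) -> int:
    """Estimate the number of distinct users from a 1/a sample.

    Assumption: each event is a dict with "user_id" and "sample_group_id".
    """
    if sampling_ratio.numerator != 1:
        raise ValueError("this sketch only handles ratios of the form 1/a")
    divisor = sampling_ratio.denominator
    # Step 1 and 2: keep events whose partition id leaves remainder 0.
    sampled = [e for e in target_events if e["sample_group_id"] % divisor == 0]
    # Step 3: count distinct users, then amplify by the reciprocal of the ratio.
    distinct_users = {e["user_id"] for e in sampled}
    return len(distinct_users) * divisor


# Hypothetical target events: four users spread over partitions 0..3.
events = [
    {"user_id": 1, "sample_group_id": 0},
    {"user_id": 1, "sample_group_id": 0},
    {"user_id": 2, "sample_group_id": 1},
    {"user_id": 3, "sample_group_id": 2},
    {"user_id": 4, "sample_group_id": 3},
]
estimate = sampled_user_count(events, Fraction(1, 2))
```

In this toy data the estimate happens to equal the true count; on real data the amplified result is an approximation whose accuracy depends on how evenly users are spread across partitions.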
For example, to query the number of users who triggered page browsing event 1001 on XX year, XX month, XX day, the SQL statement under non-sampling query logic is: select count(distinct user_id) as uv from events where day = 'XX-XX' and event_id = 1001, where uv is the number of daily active users on the Web side (the number of distinct users, identified by browser cookie, who access a website within one day (00:00-24:00); the same visitor accessing the website several times in a day is counted only once). However, this non-sampling query logic is not suitable for large data volumes, because the slow speed of each query seriously affects analysis efficiency.
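The non-sampling query can be tried out against a small in-memory table. In this sketch SQLite (through Python's standard `sqlite3` module) merely stands in for whatever query engine the system actually uses; the table layout, the helper names and the dates are hypothetical.

```python
import sqlite3


def build_events_table():
    """Create a tiny in-memory events table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (event_id INTEGER, user_id INTEGER, day TEXT)")
    rows = [
        (1001, 1, "2022-01-01"),
        (1001, 1, "2022-01-01"),  # same user again, counted once
        (1001, 2, "2022-01-01"),
        (1002, 3, "2022-01-01"),  # different event
        (1001, 4, "2022-01-02"),  # different day
    ]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    return conn


def full_scan_uv(conn, day, event_id):
    """Exact number of distinct users: scans every matching row."""
    row = conn.execute(
        "SELECT count(DISTINCT user_id) AS uv FROM events "
        "WHERE day = ? AND event_id = ?",
        (day, event_id),
    ).fetchone()
    return row[0]


conn = build_events_table()
uv = full_scan_uv(conn, "2022-01-01", 1001)
```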
When the sampling query mode provided by the application is adopted, consider an index such as the page browsing amount, which needs to be amplified according to the sampling ratio. Assume the sampling ratio is 1/2. Since only half of the users are extracted into the sampling statistical result, the number of users counted for the index is halved after sampling, so the final aggregation result must be multiplied by a proportionality coefficient (namely the reciprocal of the sampling ratio) to keep the final statistical result close to the actual data result. Specifically, to query the number of users who triggered the page-browsing event 1001 on XX year, XX month and XX day, the SQL statement under the sampling query logic provided by the application is: select 2 * count(distinct user_id) as uv from events where day = 'XX-XX' and event_id = 1001 and sample_group_id % 2 = 0. Here, the clause "where day = 'XX-XX' and event_id = 1001" screens out the target event information satisfying the data query condition. The clause "sample_group_id % 2 = 0" performs a remainder operation on the user partition identifier (sample_group_id); because the sampling ratio is 1/2, the divisor in the operation is 2, and similarly, if the sampling ratio is 1/a (a is a positive integer), the divisor is a. Adding this condition to the where clause screens out half of the user data and thereby improves the efficiency of the sampling query. Finally, "count(distinct user_id)" calculates the number of users covered by the sampling event information that meets the data query condition and the sampling ratio; because the counted number of users is reduced after sampling, the result is multiplied by the proportionality coefficient 2 to obtain the final query result.
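The two statements can be compared on a toy table. The following is a minimal sketch only, assuming an in-memory SQLite database, a hypothetical day value '2022-01-01', and a partition identifier set to user_id % 4 as a stand-in for the hash-based partitioning; none of these choices come from the application itself.

```python
import sqlite3

# Hypothetical in-memory event table; column names follow the statements above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "day TEXT, event_id INTEGER, user_id INTEGER, sample_group_id INTEGER)"
)

# Eight users, one page-browsing event (1001) each. The partition identifier
# is user_id % 4 here, an assumption standing in for hash-based partitioning.
DAY = "2022-01-01"  # hypothetical day value
rows = [(DAY, 1001, uid, uid % 4) for uid in range(8)]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# Non-sampling query: exact number of distinct users.
FULL_UV = """
    SELECT COUNT(DISTINCT user_id) AS uv
    FROM events
    WHERE day = ? AND event_id = 1001
"""

# Sampling query at ratio 1/2: keep partitions whose id is divisible by 2,
# then multiply by the proportionality coefficient 2.
SAMPLED_UV = """
    SELECT 2 * COUNT(DISTINCT user_id) AS uv
    FROM events
    WHERE day = ? AND event_id = 1001 AND sample_group_id % 2 = 0
"""

full_uv = conn.execute(FULL_UV, (DAY,)).fetchone()[0]
sampled_uv = conn.execute(SAMPLED_UV, (DAY,)).fetchone()[0]
print(full_uv, sampled_uv)
```

In this toy data the users are spread evenly over the partitions, so the scaled estimate matches the exact count; with real data the two values would generally only be close.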
In addition, when the sampling query mode provided by the application is adopted, consider an index such as the per-user average page browsing amount, which does not need to be amplified according to the sampling ratio. Assume the sampling ratio is 1/2. Although only half of the users are extracted into the sampling statistical result, sampling does not halve the per-user average, because the number of events and the number of users shrink in roughly the same proportion; the proportionality coefficient is therefore 1. Specifically, to query the per-user average number of times the page event 1001 was browsed on XX year, XX month and XX day, the SQL statement under the sampling query logic provided by the application is: select count(*) / count(distinct user_id) from events where day = '201-11-22' and event_id = 1001 and sample_group_id % 2 = 0. Here, the clause "where day = 'XX-XX' and event_id = 1001" screens out the target event information satisfying the data query condition. The clause "sample_group_id % 2 = 0" performs a remainder operation on the user partition identifier (sample_group_id); because the sampling ratio is 1/2, the divisor in the operation is 2, and similarly, if the sampling ratio is 1/a (a is a positive integer), the divisor is a. Adding this condition to the where clause ensures that half of the user data enters the sampling statistical result and lets the sampling query select the user data quickly, which improves the efficiency of data analysis. Finally, "count(*) / count(distinct user_id)" calculates the per-user average over the sampling event information that meets the data query condition and the sampling ratio; because the per-user average browsing amount is an index that does not need to be amplified according to the sampling ratio, the final query result is obtained by multiplying by the proportionality coefficient 1.
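The per-user average can be checked in the same way. This is a sketch under stated assumptions: an in-memory SQLite table, a hypothetical day value, a partition identifier of user_id % 4, and toy view counts chosen so that the sampled half has the same average as the whole population. The factor 1.0 is added only because SQLite would otherwise perform integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "day TEXT, event_id INTEGER, user_id INTEGER, sample_group_id INTEGER)"
)

DAY = "2022-01-01"  # hypothetical day value
rows = []
for uid in range(8):
    # Toy data: users browse the page either 2 or 4 times.
    views = 2 if uid % 4 in (0, 3) else 4
    # Partition identifier is user_id % 4, standing in for the hash step.
    rows.extend((DAY, 1001, uid, uid % 4) for _ in range(views))
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# Events per distinct user; {extra} is filled with the sampling condition.
AVG_SQL = """
    SELECT 1.0 * COUNT(*) / COUNT(DISTINCT user_id)
    FROM events
    WHERE day = ? AND event_id = 1001 {extra}
"""

full_avg = conn.execute(AVG_SQL.format(extra=""), (DAY,)).fetchone()[0]
sampled_avg = conn.execute(
    AVG_SQL.format(extra="AND sample_group_id % 2 = 0"), (DAY,)
).fetchone()[0]
print(full_avg, sampled_avg)
```

No multiplication by the reciprocal of the sampling ratio appears in the sampled statement, since numerator and denominator are both reduced by the sampling.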
It should be noted that the original event information, the partition event information, the target event information, and the sampling event information referred to in this application are stored by using a typical event table structure.
Fig. 3 is another schematic flow chart of the sampling query method according to the embodiment of the present application. As shown in fig. 3, the sampling query method provided by the application can be broadly divided into the following steps:
S301: Event import.
That is, event information such as front-end operations, back-end logs and service data is imported into the data server by the system and stored.
S302: Partition according to the user internal identifier.
When importing a large volume of user behavior data, in order to preserve the integrity of each single user's behavior sequence and the accuracy of user statistics under big-data sampling, a user partition storage strategy that stores data in partitions by user is constructed during import, based on the user internal identifier. A hash algorithm ensures that the same user is assigned to only one group, while users in different groups are stored under different partition storage strategies.
S303: Obtain user partition 0, user partition 1, ..., user partition N-1.
According to S302, the events are partitioned, that is, the users are grouped, to obtain user partition 0, user partition 1, ..., user partition N-1, where N is the preset number of areas (partitions).
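Steps S302 and S303 can be sketched as follows. This is an illustration only: the application does not name a specific hash algorithm, so SHA-256 is an assumption here, as are the function names and the dictionary layout of an event.

```python
import hashlib


def user_partition_id(user_internal_id: str, num_partitions: int) -> int:
    """Map a user internal identifier to a partition in [0, num_partitions)."""
    # Hash the identifier (SHA-256 is an assumed choice), then take the
    # remainder with respect to the preset number of partitions N.
    digest = hashlib.sha256(user_internal_id.encode("utf-8")).digest()
    hashed = int.from_bytes(digest[:8], "big")
    return hashed % num_partitions


def partition_events(events, num_partitions):
    """Return copies of the events with the user partition identifier stored."""
    return [
        dict(e, sample_group_id=user_partition_id(e["user_id"], num_partitions))
        for e in events
    ]


# Hypothetical original event information.
events = [
    {"user_id": "u1", "event_id": 1001},
    {"user_id": "u2", "event_id": 1001},
    {"user_id": "u1", "event_id": 1002},
]
partitioned = partition_events(events, num_partitions=4)
for e in partitioned:
    print(e)
```

Because the partition depends only on the user internal identifier, all events of one user land in the same partition, which is what keeps a single user's behavior sequence intact under sampling.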
S304: Sampling query.
The sampling query is performed on the partitioned event information. The sampling calculation modes fall into two main types: one for indexes that need to be amplified according to the sampling ratio, and one for indexes that do not. For either type of index, the sampling query step first determines the target event information according to the data query condition in the preset data query information, and then determines the data calculation mode according to the data query index and the sampling ratio.
S305: Index restoration.
The two types of data query index differ mainly in the proportionality coefficient used when restoring the index. For an index that needs to be amplified according to the sampling ratio, the data calculation mode is to extract and query the corresponding event information according to the sampling ratio, and then multiply the result by a proportionality coefficient so that the final statistical result is close to the actual data result. For an index that does not need to be amplified according to the sampling ratio, the data calculation mode is to extract and query the corresponding event information according to the sampling ratio, with no further calculation.
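The index restoration in S305 amounts to choosing a proportionality coefficient per index type. The sketch below assumes two hypothetical index names, "page_view_users" (amplified) and "avg_views_per_user" (not amplified); the names and the function signatures are illustrative and do not appear in the application.

```python
from fractions import Fraction

# Hypothetical classification of data query indexes.
AMPLIFIED_INDEXES = {"page_view_users"}
NON_AMPLIFIED_INDEXES = {"avg_views_per_user"}


def proportionality_coefficient(index_name: str, sampling_ratio: Fraction) -> Fraction:
    """Return the reciprocal of the sampling ratio, or 1 for ratio-type indexes."""
    if index_name in AMPLIFIED_INDEXES:
        return 1 / sampling_ratio
    if index_name in NON_AMPLIFIED_INDEXES:
        return Fraction(1)
    raise ValueError(f"unknown data query index: {index_name}")


def restore_index(sampled_value, index_name: str, sampling_ratio: Fraction):
    """Scale a value computed on the sample back to an estimate for all users."""
    return sampled_value * proportionality_coefficient(index_name, sampling_ratio)


half = Fraction(1, 2)
print(restore_index(4, "page_view_users", half))         # user count from the sample
print(restore_index(3.0, "avg_views_per_user", half))    # per-user average
```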
S306: Return the query result.
The event information is processed according to the above procedure to obtain the final query result, which is fed back to the query terminal.
Fig. 4 is a schematic interface diagram of a sampling query at a query terminal according to an embodiment of the present application. As shown in fig. 4, the user can choose the index to query (e.g., the number of users browsing Web pages), select the sampling ratio by dragging the slider of the "sampling setting", and choose to have the query result presented as an "approximate calculation". The lower part of fig. 4 is the display area of the query result, which mainly displays the query result of the data query index; the user can choose the display mode of the query result (such as a line chart, a bar chart or a pie chart).
The sampling query method provided by the application can be applied to all analysis models for big-data user behavior analysis. A typical usage scenario is one where the actual data volume is so large that a single query is very slow; in that case, sampling the data of a small number of users allows a hypothesis to be verified quickly and a trend to be observed. Once the specific indexes to be tracked and assessed have been determined, a full query can be run to obtain exact values. Of course, results derived from a sampling query may differ from the results of a real full query; the larger the user data volume and the more uniform the data distribution, the smaller the difference. If only the trend of the data is of interest, sampling a small part of the user population introduces no significant error.
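The difference between a sampled estimate and the full result can be explored with a small simulation. This is a rough sketch under assumptions that are not part of the application: users are numbered 0..n-1, partitioned by a SHA-256 hash into 64 partitions, and the index is a simple user count.

```python
import hashlib


def partition_of(user_id: int, num_partitions: int = 64) -> int:
    """Assumed hash-based partitioning of a numeric user identifier."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def estimated_user_count(user_ids, divisor: int) -> int:
    """Count users in the partitions kept at ratio 1/divisor, then scale up."""
    sampled = sum(1 for uid in user_ids if partition_of(uid) % divisor == 0)
    return divisor * sampled


def relative_error(num_users: int, divisor: int) -> float:
    """Relative difference between the sampled estimate and the true count."""
    estimate = estimated_user_count(range(num_users), divisor)
    return abs(estimate - num_users) / num_users


for n in (100, 10_000):
    print(n, relative_error(n, divisor=2))
```

The printed errors depend on the hash values, so no particular figures are claimed here; the expectation from the text is only that the relative error tends to shrink as the number of users grows.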
Based on the content of the foregoing embodiments, the present application provides a sampling query apparatus. The sample query apparatus is configured to execute the sample query method provided in the foregoing method embodiment, and specifically, referring to fig. 5, the apparatus includes:
a first obtaining module 501, configured to obtain original event information, preset data query information, and a sampling ratio, where the original event information includes a user internal identifier;
a partition module 502, configured to determine a user partition identifier according to the user internal identifier, and store the user partition identifier in the original event information to obtain partition event information;
and the sampling calculation module 503 is configured to query data in the partition event information according to the preset data query information and the sampling ratio, and generate a query result according to the queried data.
In one embodiment, the partition module 502 includes:
the second acquisition module is used for acquiring the number of the preset areas;
and the partition identification determining module is used for determining the user partition identification corresponding to the user internal identification according to the user internal identification and the preset area number.
In one embodiment, the partition identification determination module comprises:
the address calculation module is used for calculating the storage address of the user internal identification based on a Hash algorithm and the user internal identification;
and the identification calculation module is used for carrying out remainder operation on the storage address based on the preset area number to obtain the user partition identification corresponding to the user internal identification.
In one embodiment, the sample calculation module 503 includes:
the first determining module is used for determining target event information and a data calculation mode according to the preset data query information and the sampling proportion;
and the first query module is used for querying data in the target event information according to the target event information and the data calculation mode and generating a query result according to the queried data.
In one embodiment, the first determining module comprises:
the information analysis module is used for analyzing the preset data query information to obtain a data query condition and a data query index;
the information screening module is used for screening target event information from the subarea event information according to the data query condition;
and the second determining module is used for determining a data calculation mode according to the data query index and the sampling proportion.
In one embodiment, the second determining module comprises:
the coefficient determining module is used for determining a proportional coefficient according to the data query index and the sampling proportion;
and the third determining module is used for determining a data calculation mode according to the proportional coefficient and the sampling proportion.
In one embodiment, the sample query device further comprises:
the partition operation module is used for carrying out remainder operation on the user partition identification according to the data calculation mode to obtain the sampled user partition identification;
the sampling module is used for determining sampling event information from the target event information according to the sampled user partition identification and generating a sampling result according to the sampling event information;
and the query calculation module is used for calculating the sampling result according to the data query index and the data calculation mode to obtain a query result.
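The three modules above can be pictured as three small functions applied in sequence. This is a hedged sketch: the function names, the in-memory list of event dictionaries, and the choice of keeping partitions whose identifier leaves remainder 0 are assumptions modelled on the SQL example earlier in the description.

```python
def sampled_partition_ids(num_partitions: int, divisor: int) -> set:
    """Partition operation: remainder operation on each user partition id."""
    return {p for p in range(num_partitions) if p % divisor == 0}


def sample_events(target_events, kept_partitions):
    """Sampling: keep target events whose partition id was selected."""
    return [e for e in target_events if e["sample_group_id"] in kept_partitions]


def query_user_count(sampling_events, coefficient: int) -> int:
    """Query calculation: distinct users times the proportionality coefficient."""
    return coefficient * len({e["user_id"] for e in sampling_events})


# Hypothetical target event information, already filtered by the query condition.
target_events = [
    {"user_id": "a", "event_id": 1001, "sample_group_id": 0},
    {"user_id": "a", "event_id": 1001, "sample_group_id": 0},
    {"user_id": "b", "event_id": 1001, "sample_group_id": 1},
    {"user_id": "c", "event_id": 1001, "sample_group_id": 2},
    {"user_id": "d", "event_id": 1001, "sample_group_id": 3},
]

kept = sampled_partition_ids(num_partitions=4, divisor=2)  # sampling ratio 1/2
sampling_events = sample_events(target_events, kept)
result = query_user_count(sampling_events, coefficient=2)
print(sorted(kept), len(sampling_events), result)
```

Selecting whole partitions in this way gives a 1/a sample of the partitions only when the preset number of partitions is a multiple of a, which is assumed in this example.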
The sampling query device of the embodiment of the present application may be configured to execute the technical solutions of the foregoing method embodiments, and the implementation principles and technical effects thereof are similar and will not be described herein again.
Different from the prior art, the sampling query device provided by the application is provided with the partition module, a set of user grouping strategies can be established according to the internal identification of the user through the partition module, so that users in different groups can adopt different partition storage strategies to store, and when sampling query is carried out, corresponding groups can be rapidly screened according to the sampling proportion to further carry out data query, so that the efficiency and the accuracy of data analysis are improved.
Accordingly, an electronic device may include, as shown in fig. 6, a processor 601 having one or more processing cores, a Wireless Fidelity (WiFi) module 602, a memory 603 having one or more computer-readable storage media, an audio circuit 604, a display unit 605, an input unit 606, a sensor 607, a power supply 608, and a Radio Frequency (RF) circuit 609. Those skilled in the art will appreciate that the configuration of the electronic device shown in fig. 6 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 601 is a control center of the electronic device, connects various parts of the whole electronic device by using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 603 and calling data stored in the memory 603, thereby performing overall monitoring of the electronic device. In one embodiment, processor 601 may include one or more processing cores; preferably, the processor 601 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 601.
WiFi is a short-range wireless transmission technology. Through the WiFi module 602, the electronic device can help a user to send and receive emails, browse webpages, access streaming media and the like, and it provides the user with wireless broadband internet access. Although fig. 6 shows the WiFi module 602, it is understood that the module is not an essential component of the terminal and may be omitted as needed without changing the essence of the invention.
The memory 603 may be used to store software programs and modules, and the processor 601 executes various functional applications and data processing by running the computer programs and modules stored in the memory 603. The memory 603 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal, etc. Further, the memory 603 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 603 may also include a memory controller to provide the processor 601 and the input unit 606 access to the memory 603.
The audio circuit 604 includes a speaker, which can provide an audio interface between the user and the electronic device. The audio circuit 604 may transmit the electrical signal converted from the received audio data to the speaker, which converts it into a sound signal for output; on the other hand, the speaker converts the collected sound signal into an electrical signal, which is received by the audio circuit 604 and converted into audio data. The audio data is then output to the processor 601 for processing and transmitted to another electronic device through the radio frequency circuit 609, or output to the memory 603 for further processing. The audio circuit 604 may also include an earbud jack to provide communication between a peripheral headset and the electronic device.
The display unit 605 may be used to display information input by or provided to the user and various graphical user interfaces of the terminal, which may be configured by graphics, text, icons, video, and any combination thereof. The Display unit 605 may include a Display panel, and in one embodiment, the Display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is transmitted to the processor 601 to determine the type of the touch event, and then the processor 601 provides a corresponding visual output on the display panel according to the type of the touch event. Although in FIG. 6 the touch-sensitive surface and the display panel are implemented as two separate components for input and output functions, in some embodiments the touch-sensitive surface may be integrated with the display panel for input and output functions.
The input unit 606 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. Specifically, in one particular embodiment, input unit 606 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (e.g., operations by a user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. In one embodiment, the touch sensitive surface may comprise two parts, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 601, and can receive and execute commands sent by the processor 601. In addition, touch sensitive surfaces may be implemented using various types of resistive, capacitive, infrared, and surface acoustic waves. The input unit 606 may include other input devices in addition to a touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The electronic device may also include at least one sensor 607, such as a light sensor, motion sensor, and other sensors. As for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which may be further configured to the electronic device, detailed descriptions thereof are omitted.
The electronic device also includes a power supply 608 (e.g., a battery) for powering the various components, which may be logically coupled to the processor 601 via a power management system to manage charging, discharging, and power consumption management functions via the power management system. The power supply 608 may also include any component including one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The radio frequency circuit 609 may be used for receiving and transmitting signals during information transmission and reception or during a call, and in particular, receives downlink information of a base station and then sends the received downlink information to one or more processors 601 for processing; in addition, data relating to uplink is transmitted to the base station. In general, the radio frequency circuitry 609 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the radio frequency circuit 609 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
Although not shown, the electronic device may further include a camera, a bluetooth module, and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 601 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 603 according to the following instructions, and the processor 601 runs the application program stored in the memory 603, so as to implement the following functions:
acquiring original event information, preset data query information and a sampling proportion, wherein the original event information comprises a user internal identifier;
determining a user partition identifier according to the user internal identifier, and storing the user partition identifier into the original event information to obtain partition event information;
and inquiring data in the partition event information according to the preset data inquiry information and the sampling proportion, and generating an inquiry result according to the inquired data.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer-readable storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to implement the functions of the above sample query method.
Wherein the computer-readable storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
The sampling query method, the sampling query device, the electronic device, and the computer-readable storage medium provided in the embodiments of the present application are described in detail above, and a specific example is applied in the present application to explain the principles and embodiments of the present application, and the description of the above embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A sample query method, comprising:
acquiring original event information, preset data query information and a sampling proportion, wherein the original event information comprises a user internal identifier;
determining a user partition identifier according to the user internal identifier, and storing the user partition identifier into the original event information to obtain partition event information;
and inquiring data in the partition event information according to the preset data inquiry information and the sampling proportion, and generating an inquiry result according to the inquired data.
2. The sample query method of claim 1, wherein the step of determining a user partition identifier according to the user internal identifier and storing the user partition identifier in the original event information to obtain partition event information comprises:
acquiring the number of preset areas;
and determining the user partition identification corresponding to the user internal identification according to the user internal identification and the preset area number.
3. The sample query method of claim 2, wherein the step of determining the user partition identifier corresponding to the user internal identifier according to the user internal identifier and the preset number of areas comprises:
calculating the storage address of the user internal identification based on a Hash algorithm and the user internal identification;
and performing remainder operation on the storage address based on the preset area number to obtain the user partition identification corresponding to the user internal identification.
4. The sampling query method according to claim 1, wherein the step of querying data in the partition event information according to the preset data query information and the sampling ratio and generating a query result according to the queried data comprises:
determining target event information and a data calculation mode according to the preset data query information and the sampling proportion;
and inquiring data in the target event information according to the target event information and the data calculation mode, and generating an inquiry result according to the inquired data.
5. The sample query method of claim 4, wherein the step of determining a target event information and data calculation method according to the preset data query information and the sample ratio comprises:
analyzing the preset data query information to obtain a data query condition and a data query index;
screening target event information from the partition event information according to the data query condition;
and determining a data calculation mode according to the data query index and the sampling proportion.
6. The sample query method of claim 5, wherein the step of determining a data calculation method based on the data query indicator and the sample ratio comprises:
determining a proportionality coefficient according to the data query index and the sampling proportion;
and determining a data calculation mode according to the proportion coefficient and the sampling proportion.
7. The sample query method of claim 6, wherein the step of querying data in the target event information according to the target event information and the data calculation manner and generating query results according to the queried data comprises:
carrying out remainder operation on the user partition identification according to the data calculation mode to obtain the sampled user partition identification;
determining sampling event information from the target event information according to the sampled user partition identification, and generating a sampling result according to the sampling event information;
and calculating the sampling result according to the data query index and the data calculation mode to obtain a query result.
8. A sample query device, comprising:
a first obtaining module, configured to acquire original event information, preset data query information and a sampling proportion, wherein the original event information comprises a user internal identifier;
the partition module is used for determining a user partition identifier according to the internal identifier of the user and storing the user partition identifier into the original event information to obtain partition event information;
and the sampling calculation module is used for inquiring data in the partition event information according to the preset data inquiry information and the sampling proportion and generating an inquiry result according to the inquired data.
9. An electronic device comprising a processor and a memory, the memory storing a computer program, the processor being configured to execute the computer program in the memory to perform the steps of the sample query method of any one of claims 1 to 7.
10. A computer readable storage medium storing instructions adapted to be loaded by a processor to perform the steps of the sample query method of any one of claims 1 to 7.
CN202210236128.2A 2022-03-11 2022-03-11 Sampling query method, device, electronic equipment and computer readable storage medium Pending CN114840565A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210236128.2A CN114840565A (en) 2022-03-11 2022-03-11 Sampling query method, device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114840565A true CN114840565A (en) 2022-08-02

Family

ID=82561737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210236128.2A Pending CN114840565A (en) 2022-03-11 2022-03-11 Sampling query method, device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114840565A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116703248A (en) * 2023-08-07 2023-09-05 腾讯科技(深圳)有限公司 Data auditing method, device, electronic equipment and computer readable storage medium
CN116703248B (en) * 2023-08-07 2024-01-30 腾讯科技(深圳)有限公司 Data auditing method, device, electronic equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN108471376B (en) Data processing method, device and system
CN108345543B (en) Data processing method, device, equipment and storage medium
WO2015081801A1 (en) Method, server, and system for information push
CN107204964B (en) Authority management method, device and system
WO2014169661A1 (en) Method and system for processing report information
CN112540996B (en) Service data verification method and device, electronic equipment and storage medium
CN113420051A (en) Data query method and device, electronic equipment and storage medium
CN112749074B (en) Test case recommending method and device
CN113537685A (en) Data processing method and device
CN108966340B (en) Equipment positioning method and device
EP3582450A1 (en) Message notification method and terminal
CN106294087B (en) Statistical method and device for operation frequency of business execution operation
CN114840565A (en) Sampling query method, device, electronic equipment and computer readable storage medium
CN111090877B (en) Data generation and acquisition methods, corresponding devices and storage medium
CN108632054B (en) Information transmission quantity prediction method and device
CN112131482B (en) Aging determining method and related device
CN114648336A (en) Face payment method and device, electronic equipment and storage medium
CN112214699B (en) Page processing method and related device
CN108616552B (en) Webpage access method, device and system
CN111191998A (en) Item processing method and device
CN114363406B (en) Push message processing method, device, equipment and storage medium
CN110610417B (en) Information display method, device and equipment
CN110390549B (en) Registration small number identification method, device, server and storage medium
US20140310087A1 (en) Method and system for processing report information
CN115543314A (en) Report generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination