WO2018048350A1 - Methods and systems for cohort data analysis and data structures for performing cohort data analysis


Info

Publication number
WO2018048350A1
WO2018048350A1
Authority
WO
WIPO (PCT)
Prior art keywords
activity
cohort
data
user
tuples
Prior art date
Application number
PCT/SG2017/050443
Other languages
English (en)
Inventor
Dawei Jiang
Qingchao CAI
Beng Chin Ooi
Kian Lee Tan
Kum Hoe Anthony TUNG
Original Assignee
National University Of Singapore
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University Of Singapore filed Critical National University Of Singapore
Publication of WO2018048350A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9537: Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/284: Relational databases
    • G06F16/285: Clustering or classification

Definitions

  • the present disclosure relates to data analysis and more specifically to the analysis of user activity data using cohort analysis.
  • Cohort analysis which originates from social science, is a powerful data exploration technique for finding unusual user behavior trends from large activity datasets using the concept of cohorts.
  • Cohort analysis can be beneficial to modern Internet applications, such as online commercial websites and mobile game applications, which normally generate a large volume of user behavioral data.
  • With cohort analysis, such applications can efficiently find abnormal user behavioral patterns and inspect the associated factors to enhance their business.
  • Internet applications often accumulate a huge amount of activity data representing information associated with user actions. Such activity data is often tabulated to provide insights into the behavior of users in order to increase sales and ensure user retention.
  • Cohort analysis is used as an analytical technique for assessing the effects of aging on human behavior in a changing society. In particular, it allows us to tease apart the effect of aging from the effect of social change, and hence can offer insights that are more valuable.
  • Using cohort analytics, social scientists study human behavioral trends in three steps: 1) group users into cohorts; 2) determine the age of user activities; and 3) compute aggregations for each (cohort, age) bucket.
  • the first step employs the so-called cohort operation to capture the effect of social differences.
  • Social scientists choose a particular action e (called the birth action) and divide users into different groups (called cohorts) based on the first time (called birth time) that users performed e.
  • Each cohort is then represented by the time bin (e.g., day, week or month) associated with the birth time.
  • Each activity tuple of a user is then assigned to the same cohort that this user belongs to.
  • social scientists capture the effect of aging by partitioning activity tuples in each cohort into smaller sub-partitions based on age.
  • the age of an activity tuple t is the duration between the birth time of that user and the time that user performed t.
  • aggregated behavioral measure is reported for each (cohort, age) bucket.
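The three steps above can be sketched as a small Python function. This is a minimal illustrative sketch, not the patent's implementation: the dict-based tuple layout and the field names (`user`, `time`, `action`, and the measure key) are our assumptions, and the cohort is taken to be the birth time itself rather than a day/week/month bin.

```python
from collections import defaultdict

def cohort_analysis(tuples, birth_action, measure):
    """tuples: list of dicts with keys 'user', 'time', 'action' and the measure."""
    # Step 1: the birth time is the first time each user performed the birth
    # action; here the cohort key is simply that birth time.
    birth = {}
    for t in sorted(tuples, key=lambda t: t["time"]):
        if t["action"] == birth_action and t["user"] not in birth:
            birth[t["user"]] = t["time"]
    # Steps 2 and 3: age = time since birth; aggregate per (cohort, age) bucket.
    buckets = defaultdict(int)
    for t in tuples:
        if t["user"] in birth:
            age = t["time"] - birth[t["user"]]
            if age >= 0:
                buckets[(birth[t["user"]], age)] += t[measure]
    return dict(buckets)
```

For instance, two players born at times 0 and 1 would land in different cohorts, and their spending at age 1 would be reported in the buckets (0, 1) and (1, 1) respectively.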
  • the classic cohort analysis we presented so far is extremely useful for user retention analysis.
  • Example 2 A doctor might be interested in finding if there exists a correlation between patient readmission and the physical status of patients when they are initially admitted. This can be achieved by grouping patients into cohorts based on their physical status, and then comparing between cohorts the number of readmissions in different time periods to see how this number is related to the physical status of users in the cohort.
  • Example 3 A venture capitalist is eager to find what kinds of startups have the potential to provide a positive investment return. To this end, it can group startups into cohorts based on many aspects, such as products, user acquisition, net revenue, user retention and company structure, at the time when they received investment, and then find the attribute values of the cohorts consisting of more companies that finally survived or survived for a long time.
  • a data analysis method for analyzing user activity data comprises indications of activities for a plurality of users.
  • the method comprises storing the user activity data as an activity table, the activity table comprising a plurality of tuples, each tuple having a plurality of attributes comprising a user identifier, a time stamp and an activity identifier, the activity table being sorted according to a primary key comprising the user identifier, the time stamp and the activity identifier; and evaluating a cohort query by iterating over the activity table, the cohort query specifying a birth condition indicating a division of the plurality of users into a plurality of cohorts and a function of at least one attribute to be evaluated for each cohort of the plurality of cohorts. Storing the user activity data in an activity table sorted in this way allows efficient processing of cohort queries.
  • the cohort queries may comprise birth selection operators and / or age selection operators.
  • If a cohort query comprises both birth selection operators and age selection operators, since the operators are commutative, the birth selection operator can be moved in the query plan to optimize the evaluation of the cohort query.
  • the cohort query may also comprise an aggregation operator which assigns each user to a cohort and aggregates tuples for each cohort.
  • the storage of the activity table may comprise dividing the activity table into a plurality of data chunks such that all tuples for any one user are included in one data chunk.
  • the chunks may be compressed by various different compression schemes.
  • Run-length encoding may be applied to a column of the data chunks comprising the user identifier. This run length encoded data may be used when advancing through the activity table.
  • a column of the data chunks comprising the time stamp and / or a column of the data chunks comprising an integer attribute may be compressed using delta encoding.
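The two column compression schemes just mentioned can be sketched as follows. These are generic textbook encodings, not code from the patent; the function names are ours, and RLE here assumes the user column is already sorted so that equal values are adjacent.

```python
def rle_encode(column):
    """Run-length encode a column as triples (value, first_pos, count)."""
    runs = []
    for i, v in enumerate(column):
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1], runs[-1][2] + 1)
        else:
            runs.append((v, i, 1))
    return runs

def delta_encode(column):
    """Store the first value, then successive differences.

    Effective for monotonically increasing columns such as timestamps."""
    return [column[0]] + [b - a for a, b in zip(column, column[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out
```

A sorted user column like `["u1", "u1", "u2"]` becomes `[("u1", 0, 2), ("u2", 2, 1)]`, and a timestamp column `[100, 103, 110]` becomes `[100, 3, 7]`.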
  • a dictionary encoding scheme may be applied to the data chunks as follows. For an attribute of the plurality of attributes, constructing a global dictionary which associates each unique value of the attribute with a global identifier; for each data chunk constructing a chunk dictionary which maps a chunk specific identifier for each unique value of the attribute occurring in the chunk to the global identifier corresponding to that unique value for the attribute; and replacing the attribute values for that attribute with respective chunk specific identifiers. Such encoding may be applied to the activity identifier.
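The two-level dictionary scheme described above can be sketched like this. It is an illustrative reading of the text, with names of our choosing: the global dictionary is kept sorted so that its keys form a binary-searchable index, and each chunk dictionary maps chunk-ids (positions) to global-ids.

```python
def build_global_dict(values):
    """Map each unique value to a global-id, assigned in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

def encode_chunk(chunk_values, global_dict):
    """Return (chunk_dict, chunk_ids).

    chunk_dict is a sorted list where position (chunk-id) -> global-id;
    chunk_ids replaces each attribute value by its chunk-specific id."""
    uniq = sorted(set(chunk_values), key=lambda v: global_dict[v])
    chunk_dict = [global_dict[v] for v in uniq]   # chunk-id -> global-id
    pos = {v: i for i, v in enumerate(uniq)}      # value -> chunk-id
    return chunk_dict, [pos[v] for v in chunk_values]
```

With actions `launch` (global-id 0) and `shop` (global-id 1), a chunk holding `["shop", "shop", "launch"]` is stored as the chunk dictionary `[0, 1]` plus the chunk-ids `[1, 1, 0]`.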
  • a data structure for storing user activity data comprising indications of activities for a plurality of users.
  • The data structure comprises an activity table comprising a plurality of tuples, each tuple having a plurality of attributes comprising a user identifier, a time stamp and an activity identifier, the activity table being sorted according to a primary key comprising the user identifier, the time stamp and the activity identifier.
  • the activity table may comprise a plurality of data chunks wherein all tuples for any one user are included in one data chunk. Columns of the chunks may be encoded by run-length encoding, delta encoding and / or by a global dictionary in combination with a chunk specific dictionary.
  • a data processing system comprising a storage management module and a query processing module.
  • the storage management module is operable to store user activity data comprising indications of activities for a plurality of users as an activity table, the activity table comprising a plurality of tuples, each tuple having a plurality of attributes comprising a user identifier, a time stamp and an activity identifier, the activity table being sorted according to a primary key comprising the user identifier, the time stamp and the activity identifier.
  • the query processing module is operable to evaluate a cohort query by iterating over the activity table, the cohort query specifying a birth condition indicating a division of the plurality of users into a plurality of cohorts and a function of at least one attribute to be evaluated for each cohort of the plurality of cohorts.
  • A data processing method of evaluating a query on a data structure is provided. The data structure comprises an activity table comprising a plurality of tuples, each tuple having a plurality of attributes comprising a user identifier, a time stamp and an activity identifier.
  • The activity table is sorted according to a primary key comprising the user identifier, the time stamp and the activity identifier.
  • the query comprises a birth selection operator which retrieves tuples for a user that satisfies a condition for inclusion in a cohort and an age selection operator which retrieves tuples which satisfy a specified condition.
  • The method comprises: applying the birth selection operator to the activity table and then applying the age selection operator to the activity table.
  • a non-transitory computer- readable medium has stored thereon program instructions for causing at least one processor to perform operations of a method disclosed above.
  • Figure 1 shows the corresponding SQL query Q s for an example cohort analytics task
  • Figure 2 shows the correspondence between the three proposed cohort operators and the equivalent SQL statements
  • Figure 3 is a block diagram showing a technical architecture 100 of a data processing system according to an embodiment of the present invention
  • Figure 4 is a flowchart showing a method of analysing user data according to an embodiment of the present invention.
  • Figure 5 shows a query plan for the cohort query used in an embodiment of the present invention
  • Figures 6a to 6d show the query performance for queries Q1 to Q4 respectively with different chunk sizes using an embodiment of the present invention
  • Figure 7 shows the storage size requirements for different chunk sizes using an embodiment of the present invention
  • Figure 8 is a graph of processing times of queries in an embodiment of the present invention showing the effect of birth selection
  • Figure 9 is a graph of processing times of queries in an embodiment of the present invention showing the effect of age selection
  • Figure 10 shows a comparison of time for generating materialized view for COHANA (which is an embodiment of the present invention) and with alternative methods MonetDB and Postgres (PG); and
  • Figures 11a to 11d are graphs showing a comparison of the performance of different evaluation schemes.
  • Table 1 below provides an example, which is used in the following description.
  • players tend to buy more weapons in their initial game sessions than they do in later game sessions - this is the effect of aging.
  • social changes may also affect the players' shopping behavior, e.g., with new weapons being introduced by iterative game development, players may start to spend again in order to acquire these weapons.
  • Table 1 shows some samples of a real dataset containing user activities collected from a mobile game. Each tuple in this table represents a user action and its associated information. For example, tuple t1 represents that player 001 launched the game on 2013/01/19 in Australia in a dwarf role.
  • Table 1 Player activity samples of a mobile game
  • Using cohort analytics, we can study the trend of human behavior in three steps.
  • users are divided into different cohorts (each cohort is therefore a group of users) based on the time that users first perform an action.
  • This step is of vital importance in cohort analysis since its purpose is to isolate the effect of social changes. Social scientists think that people who are born in the same period will exhibit similar behavioral patterns.
  • the activity tuples of interest are also partitioned accordingly so that tuples of a user are assigned to the same cohort as the user.
  • Tuples t2-t4 are assigned to the 2013/01/19 launch cohort to which player 001 belongs.
  • the tuples of each cohort are further split into smaller sub-partitions based on age (time).
  • the desired aggregate is then applied on each such sub-partition.
  • Here, the age is normalized by a week's duration.
  • Table 3 presents results of the cohort analysis of the shopping trend in our game example.
  • Each row of this table represents the shopping trend of a cohort, and each column captures the aging effect of the average expenditures of that cohort since its "birth" with respect to the time the game is launched for the first time.
  • The results show the aging effect, i.e., players spent more gold on buying weapons in their initial game sessions than in their later game sessions.
  • the drop-off trend becomes better. This suggests that the iterative game development indeed improves the players' gaming experience as they seemed to be spending more gold on buying weapons to win battles, an insight which cannot be drawn from OLAP style results in Table 2.
  • the analysis result can be used as a training dataset for other analytical techniques.
  • The least intrusive approach to supporting cohort analytics is to use an existing relational DBMS and express the cohort analysis task as a SQL query. We illustrate such an approach using the following cohort analysis task:
  • Example 4 Given the launch birth action and the activity table as shown in Table 1 (denoted as D), for players who play the dwarf role at their birth time, cohort those players based on their birth countries and report the total gold that country launch cohorts spent since they were born.
  • Figure 1 shows the corresponding SQL query Q s for this task.
  • p, a, t and c are used as abbreviations for the player, action, time and country attributes in Table 1, respectively.
  • Qs employs four sub-queries (i.e., Figure 1(a) to Figure 1(d)) and one outer query (i.e., Figure 1(e)) to produce the results.
  • this SQL approach performs poorly for three reasons:
  • MV denotes a materialized view.
  • Figure 1(c) only includes a single calculated birth attribute, birthRole, as it is the only attribute appearing in the birth selection condition (i.e., the condition of playing as the dwarf role at birth time) of the analysis task. However, if other calculated birth attributes are also involved in the birth selection condition, we need to include those attributes in the MV as well. In the extreme case, every possible birth attribute shall be included in the MV, doubling the storage space as compared to the original activity table.
  • the MV only answers cohort queries introduced by launch birth action. If another birth action (e.g., shop) is used, one more MV is required. Obviously, this per birth action per MV approach does not scale even for a small number of birth actions due to the cost of MV construction and maintenance.
  • Cohort analysis is a data exploration technique that examines longitudinal behavioral trends of different cohorts since they were born.

3.1. Data Model
  • An activity relation is a special relation where each tuple represents the information associated with a single user activity.
  • We also call an activity relation an activity table; the two terms are used interchangeably.
  • An activity table D is a relation with attributes (Au, At, Ae, A1, ..., An), where n ≥ 1.
  • Au is a string uniquely identifying a user;
  • Ae is also a string, representing an action chosen from a pre-defined collection of actions; and
  • At records the time at which Au performed Ae.
  • Every other attribute in D is a standard relational attribute.
  • An activity table has a primary key constraint on (Au, At, Ae). That is, each user u can only perform a specific action e once at each time instant.
  • the first three columns correspond to the user (A u ), timestamp (A t ) and action (A e ) attribute, respectively.
  • Role and Country are dimension attributes, which respectively specify the role and the country of player A u when performing A e at A t . Following the two dimension attributes is gold, a measure attribute representing the virtual currency that player A u spent for this action.
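The schema just described, together with the primary-key sort on (Au, At, Ae), can be sketched as follows. The `NamedTuple` layout is our illustrative choice; the attribute roles follow the text and Table 1.

```python
from typing import NamedTuple

class Activity(NamedTuple):
    user: str      # Au: uniquely identifies a user
    time: int      # At: time at which the user performed the action
    action: str    # Ae: action from a pre-defined collection
    role: str      # dimension attribute
    country: str   # dimension attribute
    gold: int      # measure attribute

def sort_activity_table(rows):
    # Sorting by the primary key (Au, At, Ae) clusters each user's tuples
    # together (clustering property) and stores them chronologically
    # (time ordering property).
    return sorted(rows, key=lambda r: (r.user, r.time, r.action))
```

After sorting, all tuples of player 001 precede those of player 002, and within each player the tuples are in time order.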
  • An action e is called a birth action if e is used to define the birth time of users.
  • π and σ are the standard projection and selection operators.
  • An age activity tuple is an activity tuple with a positive age.
  • the age g is normalized by a certain time unit such as a day, week or month. Without loss of generality, we assume that the granularity of g is a day.
  • the activity tuple is the birth activity tuple of player 001 , and the birth time is 2013/01/19:1000.
  • the activity tuple t 2 is an age tuple of player 001 produced at age 1.
  • the birth selection operator is used to retrieve activity tuples of qualified users whose birth activity tuples satisfy a specific condition C.
  • C is a propositional formula and e is a birth action.
  • the age selection operator is used to generate an activity table from D which retains all birth activity tuples in D but only a subset of age activity tuples which satisfy a condition C.
  • C is a propositional formula and e is a birth action.
  • shop is the birth action, and we want to derive an activity table which retains all birth activity tuples in Table 1 but only includes age activity tuples which indicate users performing in-game shopping in all countries but China.
  • the following expression can be used to obtain the desired activity table.
  • In the result set of the above selection operation, t2 is the birth activity tuple of player 001, and t3 and t4 are the qualified age activity tuples of player 001.
  • The activity tuples t7 and t8 are the birth activity tuple and the qualified age activity tuple of player 002, respectively.
  • A common requirement in specifying operations is that we often want to reference the attribute values of birth activity tuples in C. For example, given the birth action shop, we may want to select age activity tuples whose users perform in-game shopping at the same location as their country of birth. We introduce a Birth() function for this purpose. Given an attribute A, for any activity tuple d, Birth(A) returns the value of attribute A in d[Au]'s birth activity tuple.
  • Here, t2 and t7 are the birth activity tuples of player 001 and player 002, respectively, and t3 and t8 are the qualified age activity tuples.
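The birth selection and age selection operators, including the Birth() reference, can be sketched as follows. This is a hedged reading of the operator semantics, not the patent's code: the predicate-based interface and the helper names are ours, and the age-selection predicate receives the user's birth tuple so that conditions like Birth(country) can be expressed.

```python
def birth_info(rows, e):
    """Return (birth_time, birth_tuple) per user for birth action e."""
    bt, births = {}, {}
    for r in sorted(rows, key=lambda r: r["time"]):
        if r["action"] == e and r["user"] not in bt:
            bt[r["user"]] = r["time"]
            births[r["user"]] = r
    return bt, births

def birth_select(rows, e, cond):
    """Keep all tuples of users whose birth tuple satisfies cond."""
    _, births = birth_info(rows, e)
    ok = {u for u, b in births.items() if cond(b)}
    return [r for r in rows if r["user"] in ok]

def age_select(rows, e, cond):
    """Keep every birth tuple, plus age tuples (time > birth) where
    cond(tuple, birth_tuple) holds; cond's second argument supports Birth(A)."""
    bt, births = birth_info(rows, e)
    out = []
    for r in rows:
        u = r["user"]
        if u not in bt:
            continue
        if r["time"] == bt[u] and r["action"] == e:
            out.append(r)                       # birth tuple: always retained
        elif r["time"] > bt[u] and cond(r, births[u]):
            out.append(r)
    return out
```

For example, `age_select(rows, "launch", lambda r, b: r["country"] == b["country"])` retains age tuples whose country equals the user's birth country.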
  • The cohort aggregation operator produces cohort aggregates in two steps: 1) cohort users and 2) aggregate activity tuples.
  • L is a set of cohort attributes;
  • e is a birth action;
  • fA is a standard aggregation function with respect to the attribute A.
  • the cohort aggregation operator takes an activity table D as input and produces a normal relational table R as output.
  • Each row in the output table R consists of four parts (dL, g, s, m), where dL is the projection of birth activity tuples onto the cohort attributes set L and identifies the cohort; g is the age, i.e., the time point at which we report the aggregates; s is the size of the cohort, i.e., the number of users in the cohort specified by dL; and m is the aggregated measure produced by the aggregate function fA. Note that we only apply fA on age activity tuples with g > 0.
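A minimal sketch of the cohort aggregation operator producing rows (dL, g, s, m) follows. The assumptions are ours: a single cohort attribute stands in for the set L, and the aggregate fA is fixed to SUM; only age tuples with g > 0 are aggregated, matching the text.

```python
from collections import defaultdict

def cohort_aggregate(rows, e, cohort_attr, measure):
    # Step 1: cohort users via their birth tuples for birth action e.
    bt, births = {}, {}
    for r in sorted(rows, key=lambda r: r["time"]):
        if r["action"] == e and r["user"] not in bt:
            bt[r["user"]] = r["time"]
            births[r["user"]] = r
    # Cohort size s: users whose birth tuple projects to the same dL.
    members = defaultdict(set)
    for u, b in births.items():
        members[b[cohort_attr]].add(u)
    # Step 2: aggregate the measure per (dL, g), only for age tuples g > 0.
    m = defaultdict(int)
    for r in rows:
        u = r["user"]
        if u in bt:
            g = r["time"] - bt[u]
            if g > 0:
                m[(births[u][cohort_attr], g)] += r[measure]
    return [(dl, g, len(members[dl]), v) for (dl, g), v in sorted(m.items())]
```

With two Australian-born players and one US-born player, the output contains one row per (country cohort, age) pair, carrying the cohort size and the summed measure.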
  • A cohort query Q : D → R can be expressed as a composition of cohort operators that takes D as input and produces a relation R as output, with the constraint that the same birth action e is used for all cohort operators in Q.
  • e is the birth action that is specified by the data analyst for the whole cohort query.
  • The order of the BIRTH FROM and AGE ACTIVITIES IN clauses is irrelevant, and the birth selection and age selection clauses are optional.
  • Example 4 With the newly developed cohort operators, the cohort analysis task presented in Example 4 can be expressed by the following query:
  • cohort queries can be mixed with SQL queries in a single query. For example, one may want to use a SQL query to retrieve specific cohort trends produced by a cohort query for further analysis.
  • This mixed querying requirement can be achieved by applying the standard SQL WITH clause to encapsulate a cohort query as a sub-query that can be processed by an outer SQL query.
  • the following example demonstrates how to use a mixed query to retrieve specific cohort spent trends reported by Qi for further analysis.
  • Another extension is to introduce binary cohort operators (e.g., join, intersection etc.) for analyzing multiple activity tables.
  • The MV approach stores each activity tuple of user i along with i's birth attributes.
  • For the birth selection operator, one can use a SQL SELECT statement with a WHERE clause specifying the birth selection condition on the materialized birth attributes.
  • age selection operator can be simulated by a SQL SELECT statement with a WHERE clause specifying the age selection condition along with an additional predicate to include birth activity tuples.
  • the cohort aggregation operator can be implemented by applying a SQL GROUP BY aggregation operation on the joined results between the cohortSize table and the qualified age activity tuples.
  • Figure 2 demonstrates for Q1 of Example 1 the correspondence between the three proposed cohort operators and the equivalent SQL statements posed on the MV built for the launch birth action.
  • the player, action and time attributes are respectively abbreviated to p, a, and t. be, br, bt and age are four attributes additionally materialized along with the original activity table.
  • the first three attributes, be, br and bt, respectively represent the birth attributes for country, role and time.
  • The SQL statements of Figure 2 are separated out for ease of exposition: one can optimize them by combining Figure 2(a) and 2(b) into a single SQL statement, as we do in all experiments.

4. Implementation of a Cohort Query Engine
  • Figure 3 is a block diagram showing a technical architecture 100 of a data processing system according to an embodiment of the present invention. The block diagram illustrates a technical architecture 100 of a computer which is suitable for implementing one or more embodiments herein.
  • the technical architecture 100 includes a processor 122 (which may be referred to as a central processor unit or CPU) that is in communication with memory devices including secondary storage 124 (such as disk drives), read only memory (ROM) 126, random access memory (RAM) 128.
  • the processor 122 may be implemented as one or more CPU chips.
  • The technical architecture 100 may further comprise input/output (I/O) devices 130, and network connectivity devices 132.
  • the technical architecture 100 further comprises activity table storage 140 which may be implemented as a hard disk drive or other type of storage device.
  • The secondary storage 124 is typically comprised of one or more disk drives or tape drives and is used for non-volatile storage of data and as an overflow data storage device if RAM 128 is not large enough to hold all working data. Secondary storage 124 may be used to store programs which are loaded into RAM 128 when such programs are selected for execution.
  • the secondary storage 124 has a storage management module 124a, and a query processing module 124b comprising non-transitory instructions operative by the processor 122 to perform various operations of the method of the present disclosure. As depicted in Figure 3, the modules 124a-124b are distinct modules which perform respective functions implemented by the data processing system.
  • modules 124a-124b may alternatively be implemented as one or more hardware modules (such as field-programmable gate array(s) or application-specific integrated circuit(s)) comprising circuitry which implements equivalent functionality to that implemented in software.
  • the ROM 126 is used to store instructions and perhaps data which are read during program execution.
  • the secondary storage 124, the RAM 128, and/or the ROM 126 may be referred to in some contexts as computer readable storage media and/or non-transitory computer readable media.
  • the functionality of storage management module 124a controls the writing and modification of user activity data to the activity table storage 140.
  • the I/O devices may include printers, video monitors, liquid crystal displays (LCDs), plasma displays, touch screen displays, keyboards, keypads, switches, dials, mice, track balls, voice recognizers, card readers, paper tape readers, or other well-known input devices.
  • the network connectivity devices 132 may take the form of modems, modem banks, Ethernet cards, universal serial bus (USB) interface cards, serial interfaces, token ring cards, fiber distributed data interface (FDDI) cards, wireless local area network (WLAN) cards, radio transceiver cards that promote radio communications using protocols such as code division multiple access (CDMA), global system for mobile communications (GSM), long-term evolution (LTE), worldwide interoperability for microwave access (WiMAX), near field communications (NFC), radio frequency identity (RFID), and/or other air interface protocol radio transceiver cards, and other well-known network devices. These network connectivity devices 132 may enable the processor 122 to communicate with the Internet or one or more intranets.
  • the processor 122 might receive information from the network, or might output information to the network in the course of performing the method operations described herein.
  • Such information which is often represented as a sequence of instructions to be executed using processor 122, may be received from and outputted to the network, for example, in the form of a computer data signal embodied in a carrier wave.
  • The processor 122 executes instructions, codes, computer programs, scripts which it accesses from hard disk, floppy disk, optical disk (these various disk based systems may all be considered secondary storage 124), flash drive, ROM 126, RAM 128, or the network connectivity devices 132. While only one processor 122 is shown, multiple processors may be present. Thus, while instructions may be discussed as executed by a processor, the instructions may be executed simultaneously, serially, or otherwise executed by one or multiple processors.
  • the technical architecture 100 is described with reference to a computer, it should be appreciated that the technical architecture may be formed by two or more computers in communication with each other that collaborate to perform a task.
  • an application may be partitioned in such a way as to permit concurrent and/or parallel processing of the instructions of the application.
  • the data processed by the application may be partitioned in such a way as to permit concurrent and/or parallel processing of different portions of a data set by the two or more computers.
  • virtualization software may be employed by the technical architecture 100 to provide the functionality of a number of servers that is not directly bound to the number of computers in the technical architecture 100.
  • the functionality disclosed above may be provided by executing the application and/or applications in a cloud computing environment.
  • Cloud computing may comprise providing computing services via a network connection using dynamically scalable computing resources.
  • a cloud computing environment may be established by an enterprise and/or may be hired on an as-needed basis from a third party provider.
  • FIG. 4 is a flowchart showing a method 400 of analysing user data according to an embodiment of the present invention.
  • the storage management module 124a stores user data in the activity table storage 140 as an activity table.
  • Step 402 may comprise sorting and compressing the user data as described in more detail below.
  • The activity table is stored in a fine-tuned hierarchical storage format for persisting activity tables.
  • The query processing module 124b evaluates a cohort analysis query on the activity table stored in the activity table storage 140.
  • Step 404 may comprise scanning the activity table with a modified scan operator, implementing cohort operations and optimizing queries as described in more detail below.
  • the modified table scan operator is capable of skipping age activity tuples of unqualified users.
  • the method may involve a native efficient implementation of cohort operators;
  • The queries may be optimised by a query planner capable of utilizing the cohort operator property (i.e., Equation (1)) for optimization.
  • This storage layout has two desirable properties: 1 ) activity tuples of the same user are clustered together; we refer to this as the clustering property; 2) The activity tuples of each user are stored in a chronological order; this is called the time ordering property.
  • With Run-Length Encoding (RLE), the values in Au are stored as a sequence of triples (u, f, n), where u is the user in Au, f is the position of the first appearance of u in the column, and n is the number of appearances of u in the column.
  • a modified table scan operator can directly process these triples and efficiently skip to the activity tuples of the next user if the birth activity tuple of the current user is not qualified with respect to the birth selection condition.
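The skipping behavior of the modified TableScan can be sketched over the RLE triples. This is an illustrative sketch under our own assumptions: columns are plain Python lists, and the `birth_ok` predicate (a name we chose) inspects a user's run to decide whether the birth tuple qualifies; here the example treats the first tuple of a run as the birth tuple, which holds when the run's first action is the birth action.

```python
def skip_scan(runs, columns, birth_ok):
    """runs: [(user, first_pos, count)] from RLE on the user column;
    columns: dict of name -> columnar array;
    birth_ok(user, positions): qualification test on the user's tuples."""
    for user, first, count in runs:
        positions = range(first, first + count)
        if not birth_ok(user, positions):
            continue  # jump straight to the next user's run of tuples
        for pos in positions:
            yield {name: col[pos] for name, col in columns.items()}
```

Because an unqualified user's entire run is skipped in one step, the cost of the scan is proportional to the tuples of qualified users plus one check per user, rather than the full table size.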
  • each value of A in that chunk can be represented as a chunk-id, which is the position of the global-id of that value in the chunk dictionary.
  • the chunk-ids are then persisted immediately after the chunk dictionary in the same order as the respective values appearing in A.
  • This two-level encoding scheme enables efficient pruning of chunks in which no users perform the birth action. For a given birth action e, we first perform a binary search on the global index to find its global-id g_e. Then, for each data chunk, we perform a binary search for g_e in the chunk dictionary. If g_e is not found, we can safely skip the current data chunk since no users in the data chunk perform e.
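The pruning step can be sketched as follows (a simplification assuming the global index is a sorted list of action names, so an action's global-id is its position in that list, and the chunk dictionary is a sorted list of the global-ids present in the chunk):

```python
import bisect

def chunk_may_contain(global_index, chunk_dictionary, action):
    """Two-level pruning: binary-search the global index for the
    action's global-id, then binary-search the chunk dictionary for
    that id.  A miss means no user in the chunk performs the action,
    so the whole chunk can be skipped."""
    i = bisect.bisect_left(global_index, action)
    if i == len(global_index) or global_index[i] != action:
        return False  # action does not occur anywhere in the table
    g = i  # global-id = position in the sorted global index
    j = bisect.bisect_left(chunk_dictionary, g)
    return j < len(chunk_dictionary) and chunk_dictionary[j] == g
```

With `global_index = ["click", "launch", "shop"]`, a chunk whose dictionary is `[0, 2]` (click and shop only) is safely skipped for the birth action "launch".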
  • This section presents how to evaluate a cohort query over the activity table compressed with the techniques proposed in Section 4.1.
  • the overall query processing strategy is as follows. We first generate a logical query plan, and then optimize it by pushing down the birth selections along the plan. Next, the optimized query plan is executed against each data chunk. Finally, all partial results produced by the third step are merged together to produce the final result.
  • the final merging step is trivial and we shall only present the first three steps.
  • The cohort query plan we introduce in this paper is a tree of physical operators consisting of four operators: TableScan, birth selection, age selection, and cohort aggregation. As in other columnar databases, the projection operation is handled by the TableScan operator, which reads only the columns referenced in the query.
  • the root node is the cohort aggregation operator and the only leaf node is the TableScan operator
  • Figure 5 shows the query plan for the cohort query of Qi.
  • the root node of the query plan 500 is the cohort aggregation operator 502.
  • the leaf node of the query plan is the TableScan operator 504.
  • the birth selection operator 508 and the age selection operator 506 may be swapped.
  • a specially designed TableScan implementation can efficiently skip age activity tuples without further processing for users whose birth activity tuples do not satisfy the birth selection condition. Therefore, the cost of evaluating birth selection operators before age selection operators is always less than the cost incurred from the reverse evaluation sequence in terms of the number of activity tuples processed.
  • the resulting query plan will be executed against each data chunk.
  • the concrete processing strategy is presented in Section 4.1. In practice, we find that this intermediate filtering step is particularly useful if the birth action is highly selective (i.e., only a few users performed that birth action). We will present the implementation of the physical operators in the rest of this section.
  • the modified TableScan operator is implemented as follows. For each data chunk, in the query initialization stage, the TableScan operator collects all (compressed) chunk columns referenced in the query and maintains for each chunk column a file pointer which is initialized to point to the beginning of that chunk column.
  • The implementation of the GetNext() function is identical to that of the standard TableScan operator of a columnar database.
  • GetNextUser() is implemented by first retrieving the next triple (u, f, n) of the Au column and then advancing the file pointer of each other referenced column to the beginning of the column segment corresponding to user u.
  • SkipCurUser() is implemented in a similar way: when called, it first calculates the number of remaining activity tuples of the current user, and then advances the file pointers of all columns by that number.
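A minimal in-memory sketch of this interface (class and method names mirror the text; the single shared cursor and the dict-of-lists column layout are simplifying assumptions of the sketch, not the patent's on-disk format):

```python
class TableScan:
    """Scan over columns whose user column is RLE-compressed into
    (u, f, n) triples.  A single cursor stands in for the per-column
    file pointers, since all columns advance in lockstep."""

    def __init__(self, user_triples, columns):
        self.triples = user_triples  # [(u, f, n), ...]
        self.columns = columns       # {column name: list of values}
        self.block = -1              # index of the current user's triple
        self.pos = 0                 # shared cursor into the columns

    def get_next_user(self):
        """Advance to the next user's segment; return the user or None."""
        self.block += 1
        if self.block >= len(self.triples):
            return None
        u, f, n = self.triples[self.block]
        self.pos = f                 # jump the cursor to u's first tuple
        return u

    def get_next(self):
        """Return the current user's next tuple, or None if exhausted."""
        u, f, n = self.triples[self.block]
        if self.pos >= f + n:
            return None
        row = {name: col[self.pos] for name, col in self.columns.items()}
        self.pos += 1
        return row

    def skip_cur_user(self):
        """Skip the current user's remaining tuples in one step."""
        u, f, n = self.triples[self.block]
        self.pos = f + n
```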
  • the GetBirthTuple() function finds a user's birth activity tuple by iterating over each next tuple d in D and checking whether d belongs to the user's block and whether d[Ae] is the birth action e (line 3). The first activity tuple d matching the condition is the required birth activity tuple.
  • A cursor uc (line 7 - line 8) points to the user currently being processed.
  • In the GetNext() function, we return the next activity tuple d of uc if uc is qualified with respect to the birth selection condition (line 11). If uc's activity tuples are exhausted, we retrieve the next user block by calling the GetNextUser() function of the TableScan operator (line 13). Then, we find the birth activity tuple of the new user and check whether it satisfies the birth selection condition (line 16 - line 17).
  • If the new user is qualified, its birth activity tuple will be returned; otherwise all the activity tuples of this user will be skipped using the SkipCurUser() function so that the next user is ready for processing. Therefore, one can continuously call the GetNext() function to retrieve the activity tuples of users that are qualified with respect to the birth selection condition.
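The loop just described can be sketched over a plain sorted list of (user, time, action) tuples (a simplification of the operator pipeline; tuples preceding the birth tuple are dropped, consistent with cohort semantics in which only tuples from birth onward carry an age):

```python
from itertools import groupby

def birth_selected_tuples(activity_tuples, birth_action, condition):
    """For each user block: find the birth activity tuple (the first
    tuple performing birth_action); if it is missing or fails the
    birth selection condition, skip the user entirely (SkipCurUser);
    otherwise emit the birth tuple and all subsequent tuples."""
    for user, block in groupby(activity_tuples, key=lambda t: t[0]):
        block = list(block)
        birth_idx = next(
            (i for i, t in enumerate(block) if t[2] == birth_action), None)
        if birth_idx is None or not condition(block[birth_idx]):
            continue  # unqualified user: all of its tuples are skipped
        yield from block[birth_idx:]
```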
  • Algorithm 2 presents the implementation of the cohort aggregation operator.
  • the main logic is implemented in the Open() function.
  • The Open() function first initializes two hash tables, which respectively store the cohort size and the per-data-chunk aggregation result for each (cohort, age) partition (line 2 - line 6). Then, the Open() function iterates over each user block and updates the hash tables for each qualified user (as determined by the birth selection condition).
  • cohort analysis involves counting the distinct number of users for each (cohort, age) combination. This computation is very costly in terms of memory for fields with a large cardinality, such as Au. Fortunately, our proposed storage format has a nice property: the activity tuples of any user are included in only one chunk. We therefore implement a UserCount() aggregation function for the efficient counting of distinct users, which performs the counting against each chunk and returns the sum of the obtained numbers as the final result.

4.6. Analysis of Query Performance
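The per-chunk distinct counting behind UserCount() can be sketched as follows (tuples are represented as (user, ...) records; the sketch relies on the stated property that a user's tuples never span two chunks, which makes the per-chunk counts disjoint):

```python
def user_count(chunks):
    """Count distinct users across chunks by counting within each chunk
    and summing.  Correct only because any user's activity tuples are
    contained in a single chunk, so no user is counted twice."""
    total = 0
    for chunk in chunks:
        total += len({t[0] for t in chunk})  # distinct users in this chunk
    return total
```

This keeps only one chunk's user set in memory at a time, instead of one global hash set over a high-cardinality user field.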
  • Performance Study. This section presents a performance study to evaluate the effectiveness of our proposed COHANA engine. We mainly perform two sets of experiments. First, we study the effectiveness of COHANA and its optimization techniques. In the second set of experiments, we compare the performance of different query evaluation schemes.
  • the first query Q1 evaluates a single cohort aggregation operator.
  • the second query Q2 evaluates a combination of birth selection and cohort aggregation.
  • the third query Q3 evaluates a combination of age selection and cohort aggregation.
  • the fourth query Q4 evaluates a combination of all three cohort operators. For each query, we report the average execution time of five runs for each system.
  • We also derive variants of Q1 and Q3 by adding to them a birth selection condition (resulting in Q5 and Q6) or an age selection condition (resulting in Q7 and Q8).
  • Figures 6 and 7 present the storage space COHANA requires for the activity table compressed with different chunk sizes, and the corresponding query performance.
  • Figures 6a to 6d show the query performance for queries Q1 to Q4 respectively with different chunk sizes.
  • Figure 7 shows the storage size requirements for different chunk sizes.
  • a small chunk size causes a larger portion of the neighboring columns to be read simultaneously when a compressed chunk column is read, and hence results in a longer disk read time and lower memory efficiency due to memory contention between the useful columns and their unused neighbors within the same chunk.
  • Figure 8 presents the processing times of Q5 and Q6, normalized by those of Q1 and Q3 respectively.
  • The cumulative distribution function (CDF) of user births is also given in this figure.
  • the processing time of Q5 highly coincides with the birth distribution.
  • the processing time of Q6, however, is not very sensitive to the birth distribution. This is because in Q6, users are born with respect to the shop action, and there is a cost in finding the birth activity tuple for each user. This cost is avoided in Q5 as the first activity tuple of each user is the birth activity tuple of this user (recall that the first action each user performed is launch).
  • Figure 9 shows the effect of age selection.
  • We run Q7 and Q8, a further pair of variants of Q1 and Q3, on the dataset of scale 1, varying the age selection parameter g from 1 day to 14 days, to study the query performance of COHANA under different age selection conditions.
  • Figure 9 presents the result of this experiment.
  • The processing times of Q7 and Q8 are also normalized by those of Q1 and Q3 respectively. It can be seen from this figure that the processing times of Q7 and Q8 exhibit different trends.
  • Figure 10 shows a comparison of time for generating materialized view for COHANA, MonetDB and Postgres (PG).
  • MonetDB needs more than 60,000 seconds (16.7 hours) to generate the materialized view from the original activity table. This time cost is even more expensive in Postgres, which needs more than 100,000 seconds (27.8 hours) at scale 32.
  • the result for Postgres at scale 64 is not available as Postgres is not able to generate the materialized view before using up all free disk space, which also implies a high storage cost during the generation of the materialized view.
  • COHANA only needs 1.25 hours to compress the activity table of scale 64.
  • Figure 11 shows for each scale (factor) the execution time that each system takes to execute the four queries.
  • the results of the Postgres and MonetDB databases are respectively shown in the lines labelled "PG-S/M" and in those labelled "MONET-S/M", where "S" and "M" respectively denote the SQL and the materialized view approaches.
  • the SQL based approach is the slowest as it needs multiple joins for processing cohort queries. With the elimination of joins, the materialized view based approach can reduce the query processing time by an order of magnitude.
  • This figure also shows the power of columnar storage in terms of cohort query processing.
  • MonetDB, a state-of-the-art columnar database, can be up to two orders of magnitude faster than Postgres.
  • COHANA is able to perform extremely well not only on small datasets, but also on large datasets.
  • COHANA is able to perform better than MonetDB equipped with the materialized view at any scale.
  • the performance gap between them is one to two orders of magnitude in most cases, and can be up to three orders of magnitude (Q4 at scale 32).
  • Q1 and Q2 enjoy a larger performance gain than Q3 does; we attribute this in part to the optimization presented in Section 4.5 for user retention analysis.
  • the generation of the materialized view is much more expensive than COHANA.
  • Cohort analysis is a powerful tool for finding unusual user behavioral trends in large activity tables.
  • Embodiments of the present invention provide support for cohort analysis. Consequently, we have introduced an extended relation for modeling activity data and extended SQL with three new operators for composing cohort queries.
  • Our experimental results showed that COHANA can achieve query performance two orders of magnitude faster than simply running SQL queries over conventional database systems, demonstrating the possible benefit of extending a database system for cohort queries over implementing cohort queries on top of it.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Methods, systems and data structures for performing cohort data analysis are disclosed. User activity data is stored as an activity table comprising a plurality of tuples. Each of the tuples has a plurality of attributes comprising a user identifier, a timestamp and an activity identifier. The activity table is sorted by a primary key comprising the user identifier, the timestamp and the activity identifier. A cohort query specifying a birth condition, which indicates a division of the plurality of users into a plurality of cohorts, and a function of at least one attribute to be evaluated for each cohort of the plurality of cohorts, may be evaluated by iterating over the activity table.
PCT/SG2017/050443 2016-09-06 2017-09-06 Procédés et systèmes d'analyse de données de cohorte et structures de données pour effectuer une analyse de données de cohorte WO2018048350A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10201607401Q 2016-09-06
SG10201607401Q 2016-09-06

Publications (1)

Publication Number Publication Date
WO2018048350A1 true WO2018048350A1 (fr) 2018-03-15

Family

ID=61562708

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2017/050443 WO2018048350A1 (fr) 2016-09-06 2017-09-06 Procédés et systèmes d'analyse de données de cohorte et structures de données pour effectuer une analyse de données de cohorte

Country Status (1)

Country Link
WO (1) WO2018048350A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112929185A (zh) * 2019-12-05 2021-06-08 瑞达凯特科技(加拿大)有限公司 一种信息回溯方法及设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090319549A1 (en) * 2008-06-20 2009-12-24 Perfect Search Corporation Index compression
US20100332459A1 (en) * 2009-06-30 2010-12-30 Hitachi, Ltd. Data processing device and method
US20120005227A1 (en) * 2009-03-23 2012-01-05 Fujitsu Limited Content recommending method, recommendation information creating method, content recommendation program, content recommendation server, and content providing system
WO2015178697A1 (fr) * 2014-05-22 2015-11-26 주식회사 밸류포션 Procédé et dispositif de publicité utilisant une plate-forme d'analyse d'utilisateur et une plate-forme de commercialisation en fonction d'une cohorte


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BARBOSA S. ET AL.: "Averaging Gone Wrong: Using Time-Aware Analyses to Better Understand Behavior", PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB, 15 April 2016 (2016-04-15), pages 829 - 841, XP058080297, [retrieved on 20170929] *
JIANG D. ET AL.: "Cohort Query Processing", Proceedings of the VLDB Endowment, vol. 10, no. 1, 1 September 2016 (2016-09-01), pages 1 - 12, XP058308893, [retrieved on 20170929] *


Similar Documents

Publication Publication Date Title
US11762882B2 (en) System and method for analysis and management of data distribution in a distributed database environment
Yang et al. Qd-tree: Learning data layouts for big data analytics
CN108292315B (zh) 储存和检索数据立方体中的数据
US20200050968A1 (en) Interactive interfaces for machine learning model evaluations
US8065326B2 (en) System and method for building decision trees in a database
Chaudhuri et al. An overview of business intelligence technology
US10380269B2 (en) Sideways information passing
US10152502B2 (en) Systems, apparatuses, methods, and computer readable media for processing and analyzing big data using columnar index data format
CN102915365A (zh) 基于Hadoop的分布式搜索引擎构建方法
US20160098448A1 (en) Dynamic database query efficiency improvement
US20150149441A1 (en) Data Statistics in Data Management Systems
Jiang et al. Cohort query processing
US20180096006A1 (en) Workload-driven recommendations for columnstore and rowstore indexes in relational databases
WO2022021710A1 (fr) Procédé et appareil de vidage de données, dispositif, et support de stockage
Duda Business intelligence and NoSQL databases
Chambi et al. Optimizing druid with roaring bitmaps
Visheratin et al. Peregreen–modular database for efficient storage of historical time series in cloud environments
WO2018048350A1 (fr) Procédés et systèmes d'analyse de données de cohorte et structures de données pour effectuer une analyse de données de cohorte
Sinthong et al. AFrame: Extending DataFrames for large-scale modern data analysis (Extended Version)
Chądzyńska-Krasowska et al. A metadata diagnostic framework for a new approximate query engine working with granulated data summaries
US11531665B2 (en) Automated database index management
Li et al. Kddlog: Performance and scalability in knowledge discovery by declarative queries with aggregates
Zheng et al. BigData oriented open scalable relational data model
CN116010420B (zh) 一种基于位图存储的数据查询方法、系统、设备及介质
KR101795564B1 (ko) 컬럼-인덱스 데이터 포맷을 이용하여 빅데이터를 효율적으로 처리 및 분석하는 시스템 및 방법

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17849207

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17849207

Country of ref document: EP

Kind code of ref document: A1