CN111090708B - User characteristic output method and system based on data warehouse - Google Patents

User characteristic output method and system based on data warehouse

Info

Publication number
CN111090708B
Authority
CN
China
Prior art keywords
data
subset
dimension
dimensions
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910962259.7A
Other languages
Chinese (zh)
Other versions
CN111090708A (en)
Inventor
周翱
胡研
党孟光
张一丁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AlipayCom Co ltd
Original Assignee
AlipayCom Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AlipayCom Co ltd filed Critical AlipayCom Co ltd
Priority to CN201910962259.7A priority Critical patent/CN111090708B/en
Publication of CN111090708A publication Critical patent/CN111090708A/en
Application granted granted Critical
Publication of CN111090708B publication Critical patent/CN111090708B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases

Abstract

The present disclosure provides a data warehouse-based user feature yield method, comprising: acquiring transaction detail data of a plurality of users and dividing the transaction detail data into a plurality of subsets according to different dimensions; aggregating data in the first dimension subset per user; acquiring a second dimension for data screening; correlating the data in the second subset of dimensions to the data in the aggregated first subset of dimensions to generate data to be screened; screening the data to be screened to obtain user characteristics; and outputting at least one of the user characteristics, the data in the first subset of dimensions, and the data in the second subset of dimensions.

Description

User characteristic output method and system based on data warehouse
Technical Field
The present disclosure relates generally to data processing, and more particularly to data processing based on data aggregation.
Background
In the big data era, association and tracing techniques are generally used to mine massive transaction data in depth, and abnormal transaction networks are identified through transaction network structure discovery and abnormal transaction discovery.
In the field of big data, because the volume of transaction data is huge and complex, grouping operations and data aggregation are often used during data mining to obtain results. Such a process is relatively simple and fast, but it has a limitation: when the final result for a certain feature is output, intermediate evidence information is often missing, so the evidence has to be collected manually afterwards, which takes considerable effort and is prone to omissions.
Operators in the big data field therefore have a strong demand for a more powerful tool that provides intermediate evidence information together with the quickly output final result of a feature, thereby improving the efficiency and pertinence of data processing.
Disclosure of Invention
In order to solve the technical problems, the present disclosure provides a user feature output scheme based on a data warehouse. The scheme can provide necessary intermediate evidence information when the final result of the user characteristic is output quickly, so that the efficiency and pertinence of data processing are improved.
In one embodiment of the present disclosure, there is provided a user feature yield method based on a data warehouse, including: acquiring transaction detail data of a plurality of users and dividing the transaction detail data into a plurality of subsets according to different dimensions; aggregating data in the first dimension subset per user; acquiring a second dimension for data screening; correlating the data in the second subset of dimensions to the data in the aggregated first subset of dimensions to generate data to be screened; screening the data to be screened to obtain user characteristics; and outputting at least one of the user characteristics, the data in the first subset of dimensions, and the data in the second subset of dimensions.
In another embodiment of the present disclosure, the plurality of subsets each include key-value data pairs.
In yet another embodiment of the present disclosure, the data warehouse is based on Spark, Hadoop, MapReduce, Hive, or SQL.
In another embodiment of the present disclosure, splitting transaction detail data into subsets in different dimensions may employ a slicing or dicing operation.
In another embodiment of the present disclosure, outputting at least one of the user characteristics, the data in the first subset of dimensions, and the data in the second subset of dimensions comprises: when evidence is needed, user features, data in the first subset of dimensions, or data in the second subset of dimensions are output.
In yet another embodiment of the present disclosure, outputting at least one of the user characteristics, the data in the first subset of dimensions, and the data in the second subset of dimensions comprises: when evidence is needed, the user features, the key data in the first subset of dimensions, and the key data in the second subset of dimensions are output.
In one embodiment of the present disclosure, there is provided a data warehouse-based user feature yield system, comprising: an acquisition module that acquires transaction detail data of a plurality of users, divides the transaction detail data into a plurality of subsets according to different dimensions, and acquires a second dimension for data screening; an aggregation and association module that aggregates data in the first dimension subset per user and associates the data in the second dimension subset with the aggregated data in the first dimension subset to generate data to be screened; a screening module that screens the data to be screened to obtain user characteristics; and an output module that outputs at least one of the user characteristics, the data in the first subset of dimensions, and the data in the second subset of dimensions.
In another embodiment of the present disclosure, the plurality of subsets each include key-value data pairs.
In yet another embodiment of the present disclosure, the data warehouse is based on Spark, Hadoop, MapReduce, Hive, or SQL.
In another embodiment of the present disclosure, the splitting of the transaction detail data into subsets by the acquisition module in different dimensions may employ a slicing or dicing operation.
In another embodiment of the present disclosure, the outputting of at least one of the user characteristics, the data in the first subset of dimensions, and the data in the second subset of dimensions by the output module comprises: when evidence is needed, the output module outputs the user characteristics, the data in the first dimension subset, or the data in the second dimension subset.
In yet another embodiment of the present disclosure, the outputting module outputting at least one of the user characteristic, the data in the first subset of dimensions, and the data in the second subset of dimensions comprises: when evidence is needed, the output module outputs the user characteristics, the key data in the first dimension subset, and the key data in the second dimension subset.
In one embodiment of the present disclosure, a computer-readable storage medium is provided having stored thereon instructions that, when executed, cause a machine to perform a method as previously described.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
The foregoing summary of the disclosure and the following detailed description will be better understood when read in conjunction with the accompanying drawings. It is to be noted that the drawings are merely examples of the claimed invention. In the drawings, like reference numbers indicate identical or similar elements.
FIG. 1 illustrates a flow chart of a Spark-based user feature yield method according to an embodiment of the present disclosure;
FIG. 2 illustrates a schematic diagram of a Spark-based user feature yield method according to an embodiment of the present disclosure;
FIG. 3 illustrates a flow chart of a data warehouse-based user feature yield method in accordance with an embodiment of the present disclosure;
FIG. 4 illustrates a schematic diagram of an application scenario of a data warehouse-based user feature yield method, according to an embodiment of the present disclosure;
FIG. 5 illustrates a block diagram of a data warehouse-based user feature yield system, in accordance with an embodiment of the present disclosure.
Detailed Description
In order to make the above objects, features and advantages of the present disclosure more comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced otherwise than as described herein, and thus the present disclosure is not limited to the specific embodiments disclosed below.
In the big data era, processing and analyzing massive amounts of data depends on efficient storage. The Hadoop Distributed File System (HDFS) currently in use can effectively store very large data sets. Statistics and analysis are essentially computation over efficiently stored data. Common computing tools in the big data field include MapReduce and Spark. Hadoop-style data querying can also be adopted to manage big data effectively. The data warehouse provides a foundation for data mining: models are built through classification, prediction, and correlation analysis to construct expert systems for pattern recognition and machine learning.
Hadoop provides a distributed file system (HDFS) for storing massive data, while MapReduce is responsible for computing over the data; together they form a distributed framework for processing massive data. HDFS allows data to span different machines and devices and uses paths to manage data on different platforms.
As a computation engine, MapReduce is responsible for allocating data resources when processing different types of data, which solves the problems of data exchange and communication between devices. Spark is a second-generation computation engine. Computing statistics requires partitioning the data: the finer the partitioning, the more machines can take part in the computation and the shorter the computation time. In big data computation, hundreds or thousands of machines each read a part of the target file at the same time and then compute statistics for that part. The data computation process is therefore one of classifying and summarizing on top of HDFS.
Spark is a big data processing framework oriented mainly toward computation; data exchange and disk reads and writes are handled conveniently in it, which further improves data processing capacity.
Hive is a data warehouse tool built on HDFS. It maps structured data in HDFS to tables in a database, so that MapReduce computation results can be queried with SQL statements alone and data in the file system can be modified through SQL, which improves working efficiency.
In the following description of the present disclosure, a data warehouse-based user feature yield method and system will be described taking Spark as an example. Those skilled in the art will appreciate that the disclosed technique is not limited to use with Spark, but is applicable to virtually any data warehouse. More advanced data warehouse techniques may also incorporate the techniques of this disclosure and are not described in detail herein.
User feature yield methods and systems based on a data warehouse according to various embodiments of the present disclosure will be described in detail below based on the accompanying drawings.
In a data warehouse, splitting or grouping (split) of data sets and applying (apply) different functions to each subset or group, as described above, is an important element in the data analysis effort. The data in the dataset is split into subsets or groupings according to the selected one or more keys. The splitting operation is performed on a particular dimension or axis of the data object. For example, a DataFrame may be split on its row (axis=0) or column (axis=1). A function is then applied to each subset or group and a new value is generated. Finally, the results of the execution of all these functions are aggregated or merged into the final feature result.
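The split, apply, and merge steps described above can be sketched in plain Python. This is only an illustrative sketch, not the implementation of the disclosure; the rows, the amounts, and the helper name `split_apply_combine` are hypothetical.

```python
from collections import defaultdict

# Hypothetical (key, value) rows; the amounts are made up for illustration.
rows = [("A111", 30.0), ("A111", 12.5), ("A112", 8.0)]

def split_apply_combine(rows, func):
    """Split rows by key, apply func to each group, merge the results."""
    groups = defaultdict(list)
    for key, value in rows:          # split: one group per key
        groups[key].append(value)
    # apply + combine: one new value per group, merged into a single result
    return {key: func(values) for key, values in groups.items()}

totals = split_apply_combine(rows, sum)   # total amount per key
counts = split_apply_combine(rows, len)   # number of rows per key
```

Note that only the per-group result survives in `totals` and `counts`; the individual rows that produced it are discarded, which is the limitation the following paragraph discusses.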
In such an aggregate computation, only the final feature is typically output, without any intermediate data. In big data scenarios, when a user is determined to have the final user feature "suspicious transaction user," evidence information related to that feature often needs to be provided with it. Searching for the evidence information after the computation has completed is often inefficient. It is therefore desirable in the art to output evidence information on demand at the same time as the final user characteristics.
Fig. 1 is a flow chart of a Spark-based user feature yield method 100 according to an embodiment of the present disclosure. Fig. 2 shows a schematic diagram of a Spark-based user feature yield method according to an embodiment of the present disclosure. A Spark-based user feature yield method according to an embodiment of the present disclosure will be described with reference to fig. 1 and 2.
In one embodiment of the present disclosure, the target user characteristic to be produced by the Spark-based user feature yield method 100 is the number of payments for which the customer's transaction location is Chengdu.
At 102, transaction detail data for a plurality of users is obtained.
Transaction detail data for a plurality of users is obtained from a data store. The data store may be a data warehouse based on Spark, Hadoop, MapReduce, Hive, or SQL, or the like.
The transaction detail data is a data set comprising data of a plurality of dimensions. As shown in fig. 2, subsets split along different dimensions include different data. For example, subset 1 has a number of fields centered on the customer ID and transaction UUID, including the customer ID, transaction UUID, transaction amount, payment IP, and transaction time. Subset 2 has two fields centered on the payment IP: the payment IP and the transaction location.
For example, subset 1 shows that the user with customer ID A111 has two transactions (transaction UUIDs T100 and T101, with payment IPs 192.1.2.3 and 192.1.2.4, respectively). Subset 1 also shows that the user with customer ID A112 has two transactions (transaction UUIDs T102 and T103, with payment IPs 192.1.2.5 and 192.1.2.6, respectively).
Subset 2 shows that the transaction locations of the two transactions with payment IPs 192.1.2.3 and 192.1.2.4 are both Chengdu. The transaction location of the transaction with payment IP 192.1.2.5 is also Chengdu, while the transaction location of the transaction with payment IP 192.1.2.6 is Hangzhou.
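For orientation, the two subsets of fig. 2 can be written out as plain Python data. This is a sketch under assumptions: the field names are hypothetical, the place names are read as Chengdu and Hangzhou, and the transaction amount and time are omitted because the text gives no values for them.

```python
# Subset 1: centered on customer ID and transaction UUID.
# Amount and time are left out because no values are given in the text.
subset_1 = [
    {"customer_id": "A111", "txn_uuid": "T100", "payment_ip": "192.1.2.3"},
    {"customer_id": "A111", "txn_uuid": "T101", "payment_ip": "192.1.2.4"},
    {"customer_id": "A112", "txn_uuid": "T102", "payment_ip": "192.1.2.5"},
    {"customer_id": "A112", "txn_uuid": "T103", "payment_ip": "192.1.2.6"},
]

# Subset 2: address mapping table, payment IP -> transaction location.
subset_2 = {
    "192.1.2.3": "Chengdu",
    "192.1.2.4": "Chengdu",
    "192.1.2.5": "Chengdu",
    "192.1.2.6": "Hangzhou",
}

# Subset 1 alone cannot say where a transaction took place;
# that information only exists in subset 2, keyed by payment IP.
location_of_t103 = subset_2[subset_1[3]["payment_ip"]]
```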
At 104, data is aggregated in Spark by user dimension and contains transaction UUIDs.
When aggregating data, a key-value pair is often constructed by taking the data in a certain dimension as the key.
When aggregating the data in subset 1, the customer ID may be used as the key and a value set (transaction UUID, payment IP, ...) constructed for it. In the embodiment shown in FIG. 2, the transactions aggregated under user A111 are (transaction UUID: T100, payment IP: 192.1.2.3) and (transaction UUID: T101, payment IP: 192.1.2.4). The transactions aggregated under user A112 are (transaction UUID: T102, payment IP: 192.1.2.5) and (transaction UUID: T103, payment IP: 192.1.2.6).
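The aggregation at 104 can be sketched as building one key-value pair per customer, with the transaction UUID kept inside the value so that it remains available as evidence later. The function name `aggregate_by_user` and the tuple layout are assumptions made for this sketch.

```python
from collections import defaultdict

# (customer_id, txn_uuid, payment_ip) rows taken from the example in fig. 2.
subset_1 = [
    ("A111", "T100", "192.1.2.3"),
    ("A111", "T101", "192.1.2.4"),
    ("A112", "T102", "192.1.2.5"),
    ("A112", "T103", "192.1.2.6"),
]

def aggregate_by_user(rows):
    """Key: customer ID. Value: list of (txn_uuid, payment_ip) pairs."""
    aggregated = defaultdict(list)
    for customer_id, txn_uuid, payment_ip in rows:
        # Keep the UUID in the value instead of reducing to a bare count.
        aggregated[customer_id].append((txn_uuid, payment_ip))
    return dict(aggregated)

aggregated = aggregate_by_user(subset_1)
```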
At 106, data of another dimension is associated.
Since the target user characteristic to be produced is the number of payments for which the user's transaction location is Chengdu, aggregating subset 1 by the user dimension alone cannot produce it. In this case, data of another dimension must be associated, namely the address mapping table (i.e., subset 2, which maps payment IP to transaction location).
Which other dimension needs to be associated depends on the target user characteristic to be produced. When the screening condition is "the user's transaction location is Chengdu", the transaction detail data of the user dimension (subset 1) does not contain transaction location data, so an association with the relevant subset must be established in order to yield the target user characteristic.
Subset 2 is associated with the aggregated data of subset 1 to form the data to be screened. For user A111, the transactions to be screened are (transaction UUID: T100, payment IP: 192.1.2.3, location: Chengdu) and (transaction UUID: T101, payment IP: 192.1.2.4, location: Chengdu). For user A112, the transactions to be screened are (transaction UUID: T102, payment IP: 192.1.2.5, location: Chengdu) and (transaction UUID: T103, payment IP: 192.1.2.6, location: Hangzhou).
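The association at 106 amounts to a lookup join on the payment IP. The following is a minimal sketch, assuming the aggregated data is a dict of lists and the address mapping table is a dict; the function name `associate` is hypothetical, and the place names are read as Chengdu and Hangzhou.

```python
# Aggregated subset 1: customer ID -> [(txn_uuid, payment_ip), ...]
aggregated = {
    "A111": [("T100", "192.1.2.3"), ("T101", "192.1.2.4")],
    "A112": [("T102", "192.1.2.5"), ("T103", "192.1.2.6")],
}
# Subset 2: payment IP -> transaction location
ip_to_location = {
    "192.1.2.3": "Chengdu",
    "192.1.2.4": "Chengdu",
    "192.1.2.5": "Chengdu",
    "192.1.2.6": "Hangzhou",
}

def associate(aggregated, ip_to_location):
    """Attach the location to every aggregated transaction via its payment IP."""
    return {
        customer: [(uuid, ip, ip_to_location.get(ip)) for uuid, ip in txns]
        for customer, txns in aggregated.items()
    }

to_screen = associate(aggregated, ip_to_location)
```

Using `dict.get` leaves the location as `None` for an IP that is missing from the mapping table, so such a transaction is kept but cannot match any location condition.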
At 108, the association data is filtered.
The associated data is screened with the screening condition "the user's transaction location is Chengdu". The screened data is as follows. For user A111, the screened transactions are (transaction UUID: T100, payment IP: 192.1.2.3, location: Chengdu) and (transaction UUID: T101, payment IP: 192.1.2.4, location: Chengdu). For user A112, the screened transaction is (transaction UUID: T102, payment IP: 192.1.2.5, location: Chengdu).
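The screening at 108 can be sketched as a per-customer filter on the associated location. The function name `screen` is hypothetical, and the place names are read as Chengdu and Hangzhou.

```python
# Data to be screened: customer ID -> [(txn_uuid, payment_ip, location), ...]
to_screen = {
    "A111": [("T100", "192.1.2.3", "Chengdu"), ("T101", "192.1.2.4", "Chengdu")],
    "A112": [("T102", "192.1.2.5", "Chengdu"), ("T103", "192.1.2.6", "Hangzhou")],
}

def screen(to_screen, wanted_location):
    """Keep only the transactions whose location matches the screening condition."""
    return {
        customer: [t for t in txns if t[2] == wanted_location]
        for customer, txns in to_screen.items()
    }

screened = screen(to_screen, "Chengdu")
```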
At 110, the user characteristics and transaction UUID are output.
For the transaction data screened at 108, the number of payments for which the user's transaction location is Chengdu is calculated: user A111 has 2 transactions located in Chengdu, and user A112 has 1 transaction located in Chengdu.
In a big data scenario, the transaction UUID may be output as evidence, namely:
user A111 has 2 transactions located in Chengdu, with transaction UUIDs T100 and T101; user A112 has 1 transaction located in Chengdu, with transaction UUID T102.
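The output at 110 can be sketched as emitting the feature value and its evidence from the same screened data, so that the two cannot drift apart. The output field names `payment_count` and `evidence_txn_uuids` are assumptions made for this sketch.

```python
# Screened transactions: customer ID -> [(txn_uuid, payment_ip, location), ...]
screened = {
    "A111": [("T100", "192.1.2.3", "Chengdu"), ("T101", "192.1.2.4", "Chengdu")],
    "A112": [("T102", "192.1.2.5", "Chengdu")],
}

def yield_feature_with_evidence(screened):
    """Return the payment count per customer together with the UUIDs behind it."""
    return {
        customer: {
            "payment_count": len(txns),                   # the user feature
            "evidence_txn_uuids": [t[0] for t in txns],   # the evidence
        }
        for customer, txns in screened.items()
    }

result = yield_feature_with_evidence(screened)
```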
Of course, one skilled in the art will appreciate that for different big data scenarios, different relevant transaction details may be output as evidence, and multiple relevant detail data may be output as evidence as desired.
In the big data field, abnormal transfer structures can be discovered by identifying different transaction structures. Connectivity is the most common feature of today's networks and systems; from social networks to transaction networks, connections are neither evenly distributed nor static. Studying connectivity and its dynamics makes it possible to describe, and even predict, behavior in connected systems.
Different transaction structures found in the abnormal transfer structure discovery include chain transaction structures, nested loop transaction structures, centralized in-and-out transaction structures, decentralized in-and-out transaction structures, and the like. For such abnormal transaction structures, abnormal user features can be mined and corresponding evidence can be provided by adopting the user feature output scheme based on the data warehouse.
FIG. 3 is a flow chart of a data warehouse-based user feature yield method 300, according to an embodiment of the present disclosure. Fig. 4 is a schematic diagram of an application scenario of a data warehouse-based user feature yield method according to an embodiment of the present disclosure. A data warehouse-based user feature yield method according to an embodiment of the present disclosure will be described with reference to fig. 3 and 4.
In a data warehouse, multidimensional analysis can apply operations such as roll-up, drill-down, slicing, dicing, and pivoting (rotation) to data organized in a multidimensional manner, in order to analyze the data and yield features, so that analysts, managers, and decision makers can view the data in the database from multiple angles and gain insight into the information it contains. Multidimensional analysis matches the way people think, which reduces confusion and the likelihood of misinterpretation.
At 302, transaction detail data for a plurality of users is obtained and the transaction detail data is split into a plurality of subsets in different dimensions.
For a multi-dimensional dataset stored in a data warehouse, the dataset must be split before multi-dimensional analysis is performed on it. The data can be split along one dimension, which is slicing (slice); the result of a slice is usually two-dimensional planar data. The data can also be split along several dimensions, which is dicing (dice); the result is typically a sub-cube of the data.
In one embodiment of the present disclosure, the transaction details of a plurality of users are split by the user dimension into subset 1, which comprises a plurality of fields including customer ID, transaction UUID, transaction amount, payment IP, and transaction time, and by the payment IP dimension into subset 2, which comprises payment IP and transaction location.
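The difference between slicing and dicing can be illustrated on a small three-dimensional cube stored as a dict. This is a sketch under assumptions: the cube contents, the month dimension, and the helper names `slice_cube` and `dice_cube` are hypothetical and not part of the disclosure.

```python
# Hypothetical cube: (customer_id, location, month) -> payment count
DIMS = ("customer_id", "location", "month")
cube = {
    ("A111", "Chengdu", "2019-09"): 2,
    ("A112", "Chengdu", "2019-09"): 1,
    ("A112", "Hangzhou", "2019-09"): 1,
    ("A111", "Chengdu", "2019-10"): 3,
}

def slice_cube(cube, dim, value):
    """Fix one dimension to a single value; the result has one dimension fewer."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}

def dice_cube(cube, **allowed):
    """Restrict several dimensions to sets of values; the result is a sub-cube."""
    wanted = {DIMS.index(d): set(vals) for d, vals in allowed.items()}
    return {
        k: v for k, v in cube.items()
        if all(k[i] in vals for i, vals in wanted.items())
    }

chengdu_plane = slice_cube(cube, "location", "Chengdu")
sub_cube = dice_cube(cube, customer_id={"A112"}, location={"Chengdu", "Hangzhou"})
```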
At 304, data in the first subset of dimensions is aggregated per user.
When aggregating the data in subset 1, the customer ID may be used as the key and a value set (transaction UUID, payment IP, ...) constructed for it. For example, the transactions aggregated under user A111 are (transaction UUID: T100, payment IP: 192.1.2.3) and (transaction UUID: T101, payment IP: 192.1.2.4).
In a data warehouse, the dimensions of the data are hierarchical, e.g., the time dimension may be made up of years, months, days, and the hierarchy of dimensions actually reflects the degree of integration of the data. The higher the level of the dimension, the higher the degree of integration of the represented data, the less the details, and the less the data volume; the lower the level of dimension, the lower the degree of data synthesis represented, the more complete the detail, and the greater the amount of data. Data aggregation may be referred to as roll-up, with more generalized data being observed by rising in the dimension level or by eliminating one or more dimensions.
As an inverse operation to data aggregation, drill-down (drill-down) is also known as data drilling, with data being observed more carefully by dropping a hierarchy of dimensions or by introducing a dimension or dimensions. In addition, data from different perspectives can be obtained by data rotation (pivot). The data rotation operation corresponds to rotating the coordinate axes based on the plane data. For example, rotation may involve swapping rows and columns, or rotating one dimension into another dimension.
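Roll-up, drill-down, and pivot can be sketched on a small table with a day-to-month time hierarchy. The daily counts, dates, and helper names below are hypothetical and serve only to illustrate the three operations.

```python
from collections import defaultdict

# Hypothetical daily payment counts: (customer_id, "YYYY-MM-DD") -> count
daily = {
    ("A111", "2019-09-01"): 1,
    ("A111", "2019-09-15"): 1,
    ("A111", "2019-10-02"): 3,
    ("A112", "2019-09-01"): 1,
}

def roll_up_to_month(daily):
    """Rise one level in the time hierarchy: day -> month (less detail)."""
    monthly = defaultdict(int)
    for (customer, day), n in daily.items():
        monthly[(customer, day[:7])] += n
    return dict(monthly)

def drill_down(daily, customer, month):
    """Descend again: show the daily detail behind one monthly cell."""
    return {
        day: n for (c, day), n in daily.items()
        if c == customer and day.startswith(month)
    }

def pivot(table):
    """Rotate a two-dimensional table by swapping its two key dimensions."""
    return {(b, a): v for (a, b), v in table.items()}

monthly = roll_up_to_month(daily)
detail = drill_down(daily, "A111", "2019-09")
by_month_first = pivot(monthly)
```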
Those skilled in the art will appreciate that the data warehouse-based user feature yield method of the present disclosure should not be limited to data aggregation operations, but may also be extended to other analysis operations, which are not described in detail herein.
At 306, a second dimension for data screening is acquired.
Since the dataset is split into multiple subsets, the data is inevitably scattered across multiple dimensions, so the other dimension or dimensions related to the screening condition need to be acquired when data screening is performed.
In one embodiment of the present disclosure, the screening condition is "the user's transaction location is Chengdu", so the dimension associated with the screening condition must contain transaction location data. Thus, the second dimension obtained is the address mapping table (i.e., subset 2, which maps payment IP to transaction location).
At 308, the data in the second subset of dimensions is correlated to the data in the aggregated first subset of dimensions to generate data to be screened.
For the acquired second dimension, the data in the second subset of dimensions is associated with the data in the aggregated first subset of dimensions, which generates a data set that can be screened. In this embodiment, the data to be screened is as follows. For user A111, the transactions to be screened are (transaction UUID: T100, payment IP: 192.1.2.3, location: Chengdu) and (transaction UUID: T101, payment IP: 192.1.2.4, location: Chengdu). For user A112, the transactions to be screened are (transaction UUID: T102, payment IP: 192.1.2.5, location: Chengdu) and (transaction UUID: T103, payment IP: 192.1.2.6, location: Hangzhou).
At 310, data to be screened is screened for user characteristics.
In this embodiment, the screened data is as follows. For user A111, the screened transactions are (transaction UUID: T100, payment IP: 192.1.2.3, location: Chengdu) and (transaction UUID: T101, payment IP: 192.1.2.4, location: Chengdu). For user A112, the screened transaction is (transaction UUID: T102, payment IP: 192.1.2.5, location: Chengdu).
For the transaction data screened at 310, the number of payments for which the user's transaction location is Chengdu is calculated: user A111 has 2 transactions located in Chengdu, and user A112 has 1 transaction located in Chengdu.
At 312, at least one of the user characteristics, the data in the first subset of dimensions, and the data in the second subset of dimensions is output.
In big data scenarios, there is a need to output relevant evidence in addition to user features.
Thus, in this embodiment, the data that can be output is: user A111 has 2 transactions located in Chengdu, with transaction UUIDs T100 and T101; user A112 has 1 transaction located in Chengdu, with transaction UUID T102.
In a scenario where data is stored in key-value pairs, relevant evidence may be output in the form of key-values in the corresponding dimensions.
Those skilled in the art will appreciate that user characteristics and evidence data may be output as desired in different scenarios. That is, only the user characteristics may be output; user characteristics and necessary evidence can be output; user characteristics, necessary evidence, and auxiliary evidence may be output when important transactions are involved.
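The three output options listed above can be sketched as a single function with a level switch. The level names (`"feature"`, `"evidence"`, `"full"`), the output field names, and the choice of the payment IP as auxiliary evidence are assumptions made for illustration; the disclosure does not fix them.

```python
def build_output(customer, screened_txns, level="feature"):
    """Build one output record.

    screened_txns: list of (txn_uuid, payment_ip, location) tuples.
    level: "feature"  -> user feature only
           "evidence" -> feature plus necessary evidence (transaction UUIDs)
           "full"     -> feature, necessary evidence, and auxiliary evidence
    """
    if level not in ("feature", "evidence", "full"):
        raise ValueError(f"unknown output level: {level}")
    record = {"customer_id": customer, "payment_count": len(screened_txns)}
    if level in ("evidence", "full"):
        record["txn_uuids"] = [t[0] for t in screened_txns]
    if level == "full":
        # Auxiliary evidence: here, hypothetically, the payment IPs.
        record["payment_ips"] = [t[1] for t in screened_txns]
    return record

txns = [("T100", "192.1.2.3", "Chengdu"), ("T101", "192.1.2.4", "Chengdu")]
feature_only = build_output("A111", txns)
with_evidence = build_output("A111", txns, level="evidence")
full = build_output("A111", txns, level="full")
```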
The user characteristic output method based on the data warehouse can provide necessary intermediate evidence information when the final result of the user characteristic is output rapidly, so that the efficiency and pertinence of data processing are improved. In the method, the user characteristics are produced and the evidence related to the characteristics is produced, so that the data is not required to be retrieved manually during reporting, and the efficiency and the integrity of the evidence are greatly improved.
FIG. 5 illustrates a block diagram of a data warehouse-based user feature yield system 500, in accordance with an embodiment of the present disclosure.
The data warehouse-based user feature yield system 500 includes an acquisition module 502, a screening module 504, an aggregation and association module 506, and an output module 508.
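One way to picture how the four modules of system 500 fit together is as four functions chained in order. This is a loose sketch, not the structure of the disclosed system: the function names, the in-memory `store` dict, and the tuple layouts are all hypothetical, and the place names are read as Chengdu and Hangzhou.

```python
from collections import defaultdict

def acquisition_module(store):
    """Acquire the detail data already split into subset 1 and subset 2."""
    return store["subset_1"], store["subset_2"]

def aggregation_and_association_module(subset_1, subset_2):
    """Aggregate subset 1 per customer and attach the location from subset 2."""
    joined = defaultdict(list)
    for customer, uuid, ip in subset_1:
        joined[customer].append((uuid, ip, subset_2.get(ip)))
    return dict(joined)

def screening_module(to_screen, location):
    """Keep only transactions that satisfy the screening condition."""
    return {c: [t for t in txns if t[2] == location] for c, txns in to_screen.items()}

def output_module(screened):
    """Output the user feature together with the transaction UUIDs as evidence."""
    return {
        c: {"payment_count": len(txns), "txn_uuids": [t[0] for t in txns]}
        for c, txns in screened.items()
    }

store = {
    "subset_1": [
        ("A111", "T100", "192.1.2.3"), ("A111", "T101", "192.1.2.4"),
        ("A112", "T102", "192.1.2.5"), ("A112", "T103", "192.1.2.6"),
    ],
    "subset_2": {
        "192.1.2.3": "Chengdu", "192.1.2.4": "Chengdu",
        "192.1.2.5": "Chengdu", "192.1.2.6": "Hangzhou",
    },
}
s1, s2 = acquisition_module(store)
report = output_module(screening_module(aggregation_and_association_module(s1, s2), "Chengdu"))
```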
The acquisition module 502 acquires transaction detail data for a plurality of users and splits the transaction detail data into a plurality of subsets in different dimensions.
For a multi-dimensional dataset stored in a data warehouse, the dataset must be split before multi-dimensional analysis is performed on it. The data can be split along one dimension, which is slicing (slice); the result of a slice is usually two-dimensional planar data. The data can also be split along several dimensions, which is dicing (dice); the result is typically a sub-cube of the data.
In one embodiment of the present disclosure, the transaction details of a plurality of users are split by the user dimension into subset 1, which comprises a plurality of fields including customer ID, transaction UUID, transaction amount, payment IP, and transaction time, and by the payment IP dimension into subset 2, which comprises payment IP and transaction location.
Further, the acquisition module 502 acquires a second dimension for data screening. Since the dataset is split into multiple subsets, the data is inevitably scattered across multiple dimensions, so the other dimension or dimensions related to the screening condition need to be acquired when data screening is performed.
In one embodiment of the present disclosure, the screening condition is "the user's transaction location is Chengdu", so the dimension associated with the screening condition must contain transaction location data. Thus, the second dimension obtained is the address mapping table (i.e., subset 2, which maps payment IP to transaction location).
The aggregation and association module 506 aggregates data in the first subset of dimensions by user and associates data in the second subset of dimensions to data in the aggregated first subset of dimensions to generate data to be filtered.
In a data warehouse, the dimensions of data are hierarchical, and the level of a dimension reflects how highly the data is summarized. The higher the level of the dimension, the more summarized the data, the fewer the details, and the smaller the data volume; the lower the level of the dimension, the less summarized the data, the more complete the details, and the larger the data volume. Data aggregation yields a more generalized view of the data by moving up the dimension hierarchy or by eliminating one or more dimensions.
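Aggregation by eliminating a dimension can be sketched as follows. The counts are keyed by (user, payment IP) and then rolled up to the user level by dropping the payment IP. The function name `roll_up` and the key layout are assumptions chosen for illustration.

```python
from collections import Counter

# Transactions per (user, payment IP); one transaction per pair, as in the
# embodiment's example data.
by_user_and_ip = Counter({
    ("A111", "192.1.2.3"): 1,
    ("A111", "192.1.2.4"): 1,
    ("A112", "192.1.2.5"): 1,
    ("A112", "192.1.2.6"): 1,
})


def roll_up(counts, keep):
    """Aggregate by keeping only the key positions listed in `keep`."""
    out = Counter()
    for key, n in counts.items():
        out[tuple(key[i] for i in keep)] += n
    return out


# Eliminate the payment IP dimension: fewer rows, less detail.
by_user = roll_up(by_user_and_ip, keep=[0])
```

The rolled-up result has two entries instead of four, which mirrors the trade-off between detail and data volume described above.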
When aggregating the data in subset 1, the aggregation and association module 506 constructs, for each user, a value set (transaction UUID, payment IP, ...) keyed by the user ID. For example, for user A111, the aggregated transactions are (transaction UUID: T100, payment IP: 192.1.2.3) and (transaction UUID: T101, payment IP: 192.1.2.4).
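The per-user aggregation can be sketched with a dictionary keyed by the user ID. The field names and the function `aggregate_by_user` are assumptions; only the identifiers and IP addresses come from the embodiment.

```python
from collections import defaultdict

# Rows of subset 1, reduced to the fields used in the example.
subset_1 = [
    {"client_id": "A111", "txn_uuid": "T100", "pay_ip": "192.1.2.3"},
    {"client_id": "A111", "txn_uuid": "T101", "pay_ip": "192.1.2.4"},
    {"client_id": "A112", "txn_uuid": "T102", "pay_ip": "192.1.2.5"},
    {"client_id": "A112", "txn_uuid": "T103", "pay_ip": "192.1.2.6"},
]


def aggregate_by_user(rows):
    """Group rows under the user ID, keeping the remaining fields as the value set."""
    grouped = defaultdict(list)
    for row in rows:
        value = {k: v for k, v in row.items() if k != "client_id"}
        grouped[row["client_id"]].append(value)
    return dict(grouped)


aggregated = aggregate_by_user(subset_1)
```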
For the acquired second dimension, the aggregation and association module 506 associates the data in the second dimension subset to the aggregated data in the first dimension subset, thereby generating a data set that can be screened. In this embodiment, the data to be screened is as follows. For user A111, the transactions to be screened are (transaction UUID: T100, payment IP: 192.1.2.3, location: Chengdu) and (transaction UUID: T101, payment IP: 192.1.2.4, location: Chengdu). For user A112, the transactions to be screened are (transaction UUID: T102, payment IP: 192.1.2.5, location: Chengdu) and (transaction UUID: T103, payment IP: 192.1.2.6, location: Hangzhou).
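The association step behaves like a lookup join on the shared payment IP field. The sketch below assumes that subset 2 can be held as a mapping from payment IP to location; the names `address_map` and `associate`, the field names, and the spelling "Chengdu" for the screening location are assumptions.

```python
# Aggregated subset 1, keyed by user ID.
aggregated = {
    "A111": [
        {"txn_uuid": "T100", "pay_ip": "192.1.2.3"},
        {"txn_uuid": "T101", "pay_ip": "192.1.2.4"},
    ],
    "A112": [
        {"txn_uuid": "T102", "pay_ip": "192.1.2.5"},
        {"txn_uuid": "T103", "pay_ip": "192.1.2.6"},
    ],
}

# Subset 2 (address mapping table) as payment IP -> transaction location.
address_map = {
    "192.1.2.3": "Chengdu",
    "192.1.2.4": "Chengdu",
    "192.1.2.5": "Chengdu",
    "192.1.2.6": "Hangzhou",
}


def associate(per_user, ip_to_location):
    """Attach the location of each payment IP to every aggregated transaction.
    An IP missing from the mapping yields a location of None."""
    return {
        user: [{**txn, "location": ip_to_location.get(txn["pay_ip"])} for txn in txns]
        for user, txns in per_user.items()
    }


to_screen = associate(aggregated, address_map)
```

The original transaction fields are carried through unchanged, so they remain available as evidence after screening.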
The screening module 504 screens the data to be screened for user characteristics.
In this embodiment, the data screened by the screening module 504 is as follows. For user A111, the transactions screened are (transaction UUID: T100, payment IP: 192.1.2.3, location: Chengdu) and (transaction UUID: T101, payment IP: 192.1.2.4, location: Chengdu). For user A112, the transaction screened is (transaction UUID: T102, payment IP: 192.1.2.5, location: Chengdu).
For the transaction data screened at 108, the screening module 504 counts the payments whose transaction location is Chengdu, namely: for user A111, 2 transactions have a transaction location in Chengdu; for user A112, 1 transaction has a transaction location in Chengdu.
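Screening and counting can be sketched as a filter followed by a per-user count. The function `screen`, the field names, and the spelling "Chengdu" for the screening location are assumptions made for this illustration.

```python
# Data to be screened: aggregated transactions with the location attached.
to_screen = {
    "A111": [
        {"txn_uuid": "T100", "pay_ip": "192.1.2.3", "location": "Chengdu"},
        {"txn_uuid": "T101", "pay_ip": "192.1.2.4", "location": "Chengdu"},
    ],
    "A112": [
        {"txn_uuid": "T102", "pay_ip": "192.1.2.5", "location": "Chengdu"},
        {"txn_uuid": "T103", "pay_ip": "192.1.2.6", "location": "Hangzhou"},
    ],
}


def screen(data, city):
    """Keep, for each user, only the transactions located in the given city."""
    return {
        user: [txn for txn in txns if txn["location"] == city]
        for user, txns in data.items()
    }


screened = screen(to_screen, "Chengdu")

# User characteristic: number of payments located in the screening city.
features = {user: len(txns) for user, txns in screened.items()}
```

Because the filter keeps whole transaction records rather than only a count, the transaction UUIDs that justify each count are still at hand.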
The output module 508 outputs at least one of the user characteristics, the data in the first subset of dimensions, and the data in the second subset of dimensions.
Thus, in this embodiment, the data that can be output is: for user A111, 2 transactions have a transaction location in Chengdu, with transaction UUIDs T100 and T101; for user A112, 1 transaction has a transaction location in Chengdu, with transaction UUID T102.
In a scenario where data is stored in key-value pairs, relevant evidence may be output in the form of key-values in the corresponding dimensions.
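A possible shape for such an output is sketched below: each user's characteristic is emitted together with the key-value evidence drawn from the corresponding dimensions. The output layout, the key names such as `txn_count_in_city`, and the function `build_output` are assumptions; the disclosure does not fix a format.

```python
import json

# Screened transactions per user (location spelling is an assumption).
screened = {
    "A111": [
        {"txn_uuid": "T100", "pay_ip": "192.1.2.3", "location": "Chengdu"},
        {"txn_uuid": "T101", "pay_ip": "192.1.2.4", "location": "Chengdu"},
    ],
    "A112": [
        {"txn_uuid": "T102", "pay_ip": "192.1.2.5", "location": "Chengdu"},
    ],
}


def build_output(screened_txns, with_evidence):
    """Build the per-user output; evidence is attached only when requested."""
    result = {}
    for user, txns in screened_txns.items():
        record = {"txn_count_in_city": len(txns)}
        if with_evidence:
            record["evidence"] = {
                "txn_uuid": [t["txn_uuid"] for t in txns],
                "pay_ip": [t["pay_ip"] for t in txns],
            }
        result[user] = record
    return result


report = build_output(screened, with_evidence=True)
print(json.dumps(report, indent=2))
```

Making the evidence optional reflects the case in which only the characteristic itself is needed.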
The data warehouse-based user characteristic output system can provide the necessary intermediate evidence information while rapidly outputting the final user characteristic result, thereby improving the efficiency and pertinence of data processing. Because the system produces the evidence related to a characteristic together with the characteristic itself, the data need not be retrieved manually at reporting time, which improves both efficiency and the completeness of the evidence.
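For a warehouse that is queried through SQL, the characteristic and its evidence might be produced by a single query. The sketch below uses an in-memory SQLite database purely as a self-contained stand-in for a real warehouse; the table names, column names, and the spelling "Chengdu" are assumptions. Note that the query joins before grouping, which is a simplification rather than the aggregate-then-associate order described for module 506.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE txn (client_id TEXT, txn_uuid TEXT, pay_ip TEXT);
    CREATE TABLE ip_location (pay_ip TEXT PRIMARY KEY, location TEXT);
""")
conn.executemany("INSERT INTO txn VALUES (?, ?, ?)", [
    ("A111", "T100", "192.1.2.3"),
    ("A111", "T101", "192.1.2.4"),
    ("A112", "T102", "192.1.2.5"),
    ("A112", "T103", "192.1.2.6"),
])
conn.executemany("INSERT INTO ip_location VALUES (?, ?)", [
    ("192.1.2.3", "Chengdu"),
    ("192.1.2.4", "Chengdu"),
    ("192.1.2.5", "Chengdu"),
    ("192.1.2.6", "Hangzhou"),
])

# Count matching payments per user and collect their UUIDs as evidence.
QUERY = """
    SELECT t.client_id,
           COUNT(*) AS txn_count,
           GROUP_CONCAT(t.txn_uuid) AS evidence
    FROM txn AS t
    JOIN ip_location AS a ON a.pay_ip = t.pay_ip
    WHERE a.location = ?
    GROUP BY t.client_id
    ORDER BY t.client_id
"""

# GROUP_CONCAT does not guarantee an order, so the UUIDs are sorted here.
features = {
    user: {"txn_count": count, "evidence": sorted(uuids.split(","))}
    for user, count, uuids in conn.execute(QUERY, ("Chengdu",))
}
conn.close()
```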
The various steps and modules of the data warehouse-based user characteristic output method and system described above may be implemented in hardware, software, or a combination thereof. If implemented in hardware, the various illustrative steps, modules, and circuits described in connection with the invention may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic component, a hardware component, or any combination thereof. A general purpose processor may be a processor, microprocessor, controller, microcontroller, state machine, or the like. If implemented in software, the various illustrative steps and modules described in connection with the invention may be stored on or transmitted as one or more instructions or code on a computer readable medium. Software modules implementing various operations of the invention may reside in storage media such as RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable disk, a CD-ROM, cloud storage, etc. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium, as well as execute the corresponding program modules to implement the various steps of the present invention. Moreover, software-based embodiments may be uploaded, downloaded, or accessed remotely via suitable communication means. Such suitable communication means include, for example, the internet, the world wide web, an intranet, software applications, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF, microwave, and infrared communications), electronic communications, or other such communication means.
It is also noted that the embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. Additionally, the order of the operations may be rearranged.
The disclosed methods, apparatus, and systems should not be limited in any way. Rather, the invention encompasses all novel and non-obvious features and aspects of the various disclosed embodiments (both alone and in various combinations and subcombinations with one another). The disclosed methods, apparatus and systems are not limited to any specific aspect or feature or combination thereof, nor do any of the disclosed embodiments require that any one or more specific advantages be present or that certain or all technical problems be solved.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Those of ordinary skill in the art may make many modifications without departing from the spirit of the present invention and the scope of the appended claims, and all such modifications fall within the scope of the present invention.

Claims (13)

1. A data warehouse-based user characteristic output method, comprising:
acquiring transaction detail data of a plurality of users and dividing the transaction detail data into a plurality of subsets according to different dimensions;
aggregating data in the first dimension subset per user;
acquiring a second dimension for data screening, wherein the second dimension subset corresponds to dimensions that are the same as, and dimensions that are different from, those of the first dimension subset, the field values of the dimensions of the second dimension subset that differ from the first dimension subset are used for determining a final result, and the field values of the dimensions of the first dimension subset that differ from the second dimension subset are used as intermediate evidence information of the final result;
associating data in the second dimension subset to the aggregated data in the first dimension subset, based on field values of the dimensions that the second dimension subset shares with the first dimension subset, to generate data to be screened;
screening the data to be screened, based on field values of the dimensions of the second dimension subset that differ from those of the first dimension subset, to obtain user characteristics; and
outputting at least one of the user characteristics, the data in the first subset of dimensions, and the data in the second subset of dimensions.
2. The method of claim 1, wherein the plurality of subsets each comprise key-value data pairs.
3. The method of claim 1, wherein the data warehouse is based on Spark, Hadoop, MapReduce, Hive, or SQL.
4. The method of claim 1, wherein splitting the transaction detail data into subsets in different dimensions may employ a slicing or dicing operation.
5. The method of claim 1, wherein outputting at least one of the user characteristic, the data in the first subset of dimensions, and the data in the second subset of dimensions comprises: outputting the user features, the data in the first subset of dimensions, or the data in the second subset of dimensions when evidence is required.
6. The method of claim 2, wherein outputting at least one of the user characteristic, the data in the first subset of dimensions, and the data in the second subset of dimensions comprises: outputting the user characteristics, the key data in the first dimension subset, and the key data in the second dimension subset when evidence is needed.
7. A data warehouse-based user characteristic output system, comprising:
an acquisition module that acquires transaction detail data of a plurality of users, splits the transaction detail data into a plurality of subsets according to different dimensions, and acquires a second dimension for data screening;
an aggregation and association module that aggregates data in a first dimension subset per user and associates data in a second dimension subset to the aggregated data in the first dimension subset, based on field values of the dimensions that the second dimension subset shares with the first dimension subset, to generate data to be screened, wherein the second dimension subset corresponds to dimensions that are the same as, and dimensions that are different from, those of the first dimension subset, the field values of the dimensions that differ from the first dimension subset are used for determining a final result, and the field values of the dimensions that differ from the second dimension subset are used as intermediate evidence information of the final result;
a screening module that screens the data to be screened, based on field values of the dimensions of the second dimension subset that differ from those of the first dimension subset, to obtain user characteristics; and
an output module that outputs at least one of the user characteristics, the data in the first subset of dimensions, and the data in the second subset of dimensions.
8. The system of claim 7, wherein the plurality of subsets each comprise key-value data pairs.
9. The system of claim 7, wherein the data warehouse is based on Spark, Hadoop, MapReduce, Hive, or SQL.
10. The system of claim 7, wherein the acquisition module splits the transaction detail data into subsets in different dimensions using a slicing or dicing operation.
11. The system of claim 7, wherein the output module outputting at least one of the user characteristic, the data in the first subset of dimensions, and the data in the second subset of dimensions comprises: the output module outputs the user characteristic, the data in the first subset of dimensions, or the data in the second subset of dimensions when evidence is required.
12. The system of claim 7, wherein the output module outputting at least one of the user characteristic, the data in the first subset of dimensions, and the data in the second subset of dimensions comprises: when evidence is needed, the output module outputs the user feature, the key data in the first dimension subset, and the key data in the second dimension subset.
13. A computer readable storage medium storing instructions that, when executed, cause a machine to perform the method of any of claims 1-6.
CN201910962259.7A 2019-10-11 2019-10-11 User characteristic output method and system based on data warehouse Active CN111090708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910962259.7A CN111090708B (en) 2019-10-11 2019-10-11 User characteristic output method and system based on data warehouse

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910962259.7A CN111090708B (en) 2019-10-11 2019-10-11 User characteristic output method and system based on data warehouse

Publications (2)

Publication Number Publication Date
CN111090708A CN111090708A (en) 2020-05-01
CN111090708B true CN111090708B (en) 2023-07-14

Family

ID=70393007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910962259.7A Active CN111090708B (en) 2019-10-11 2019-10-11 User characteristic output method and system based on data warehouse

Country Status (1)

Country Link
CN (1) CN111090708B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344104A (en) * 2021-06-23 2021-09-03 支付宝(杭州)信息技术有限公司 Data processing method, device, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6647383B1 (en) * 2000-09-01 2003-11-11 Lucent Technologies Inc. System and method for providing interactive dialogue and iterative search functions to find information
CN108205768A (en) * 2016-12-20 2018-06-26 百度在线网络技术(北京)有限公司 Database building method and data recommendation method and device, equipment and storage medium
CN110134722A (en) * 2019-05-22 2019-08-16 北京小度信息科技有限公司 Target user determines method, apparatus, equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7089266B2 (en) * 2003-06-02 2006-08-08 The Board Of Trustees Of The Leland Stanford Jr. University Computer systems and methods for the query and visualization of multidimensional databases
US8190555B2 (en) * 2009-01-30 2012-05-29 Hewlett-Packard Development Company, L.P. Method and system for collecting and distributing user-created content within a data-warehouse-based computational system
US9235633B2 (en) * 2011-12-06 2016-01-12 International Business Machines Corporation Processing data in a data warehouse
US9411874B2 (en) * 2012-06-14 2016-08-09 Melaleuca, Inc. Simplified interaction with complex database
CN106682173B (en) * 2016-12-28 2019-10-18 华南理工大学 A kind of social security big data OLAP preprocess method and on-line analysis querying method
CN107861981B (en) * 2017-09-28 2020-09-01 北京奇艺世纪科技有限公司 Data processing method and device
US10936627B2 (en) * 2017-10-27 2021-03-02 Intuit, Inc. Systems and methods for intelligently grouping financial product users into cohesive cohorts
CN109299199A (en) * 2018-10-15 2019-02-01 河北师范大学 Precursor chemicals dimensional analytic system and implementation method based on data warehouse

Also Published As

Publication number Publication date
CN111090708A (en) 2020-05-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230112

Address after: 200120 Floor 15, No. 447, Nanquan North Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai

Applicant after: Alipay.com Co.,Ltd.

Address before: 310000 801-11 section B, 8th floor, 556 Xixi Road, Xihu District, Hangzhou City, Zhejiang Province

Applicant before: Alipay (Hangzhou) Information Technology Co.,Ltd.

GR01 Patent grant