CN107798021B - Data association processing method and system and electronic equipment - Google Patents

Info

Publication number
CN107798021B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610807903.XA
Other languages
Chinese (zh)
Other versions
CN107798021A (en
Inventor
刘俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201610807903.XA
Publication of CN107798021A
Application granted
Publication of CN107798021B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a data association processing method, a data association processing system and electronic equipment. The data association processing method comprises the following steps: generating a first information table from basic log data, wherein each record in the first information table is uniquely identified by a first key field and a second key field; within a preset time period, self-associating (self-joining) the second key fields corresponding to each first key field of the first information table to generate a second information table; counting, in the second information table, the number of first key fields corresponding to at least two self-associated second key fields to generate a third information table; and generating a fourth information table from the first information table, the second information table and the third information table, wherein the fourth information table comprises an index representing the degree of association between at least two second key fields.

Description

Data association processing method and system and electronic equipment
Technical Field
The present invention relates to data association processing technologies, and in particular, to a data association processing method, a data association processing system, and an electronic device.
Background
In big data environments, many companies adopt Hadoop as their big data processing architecture, and Hive is commonly used as the data query language on Hadoop. When exploring large volumes of data, simple query logic sometimes cannot meet the in-depth data exploration requirements of real scenarios, for example complex relational operations such as association rules in data mining. In many cases, a user needs to query basic data through Hadoop, export or download the data to a local machine, and then import it into dedicated data mining software for further processing.
With this approach, the data is separated from the processing software, too much manual work is required, data security is low, exporting the data consumes substantial resources, and the advantages of Hadoop's parallel computation cannot be fully exploited.
Therefore, a new data association processing method, system and electronic device are needed.
The above information disclosed in this background section is only for enhancement of understanding of the background of the invention and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
The invention provides a data association processing method, a data association processing system and electronic equipment, which are used for at least partially or completely solving the problems in the prior art.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to an aspect of the present disclosure, there is provided a data association processing method, including: generating a first information table from basic log data, wherein each record in the first information table is uniquely identified by a first key field and a second key field; within a preset time period, self-associating the second key fields corresponding to each first key field of the first information table to generate a second information table; counting, in the second information table, the number of first key fields corresponding to at least two self-associated second key fields to generate a third information table; and generating a fourth information table from the first information table, the second information table and the third information table, wherein the fourth information table comprises an index representing the degree of association between at least two second key fields.
In an exemplary embodiment of the present disclosure, the generating of the first information table from the base log data includes: cleaning the basic log data, and eliminating risk data and/or invalid data and/or repeated data in the basic log data; and integrating the cleaned basic log data to generate the first information table.
In an exemplary embodiment of the present disclosure, the first key field includes a user code and the second key field includes an item code, and each combination of a user code and an item code corresponds to one record in the first information table; the user codes and the item codes are in a one-to-one or one-to-many relationship.
In an exemplary embodiment of the present disclosure, self-associating the second key fields corresponding to the first key field of the first information table to generate the second information table includes: self-associating the first information table to obtain at least two associated items among the item codes corresponding to each user code.
In an exemplary embodiment of the present disclosure, counting the number of first key fields corresponding to at least two self-associated second key fields in the second information table to generate the third information table includes: counting the number of common purchasing users of the at least two associated items in the second information table; and excluding identical items and duplicate items among the at least two associated items to generate the third information table.
In an exemplary embodiment of the present disclosure, generating the fourth information table according to the first information table, the second information table, and the third information table, where the fourth information table includes an index representing the degree of association between at least two of the second key fields, includes: obtaining the total number of purchasing users according to the first key field in the first information table and/or the second information table; obtaining the number of users purchasing a first article and the number of users purchasing a second article, respectively, according to the first key field and the second key field in the first information table; and obtaining a promotion degree (lift) according to the total number of purchasing users, the number of users purchasing the first article, and the number of users purchasing the second article, wherein the promotion degree is used as the index representing the degree of association between the first article and the second article.
In an exemplary embodiment of the present disclosure, the method further comprises: presetting a support threshold; obtaining a support according to the total number of purchasing users and the number of common purchasing users in the third information table; and filtering the articles corresponding to the article codes according to the support threshold and the support.
In an exemplary embodiment of the present disclosure, the method further comprises: presetting a confidence threshold; obtaining a confidence according to the number of users purchasing the first article or the second article and the number of common purchasing users in the third information table; and filtering the first article or the second article according to the confidence threshold and the confidence.
In an exemplary embodiment of the present disclosure, the method further comprises: sorting the promotion degrees (lifts) by their numerical values.
According to an aspect of the present disclosure, there is provided a data association processing system including: a data preprocessing module for generating a first information table from basic log data, wherein each record in the first information table is uniquely identified by a first key field and a second key field; a self-association module for self-associating, within a preset time period, the second key fields corresponding to each first key field of the first information table to generate a second information table; a statistics module for counting, in the second information table, the number of first key fields corresponding to at least two self-associated second key fields to generate a third information table; and an association degree index calculation module for generating a fourth information table from the first information table, the second information table and the third information table, wherein the fourth information table comprises an index representing the degree of association between at least two second key fields.
In an exemplary embodiment of the present disclosure, the data preprocessing module includes: the data cleaning unit is used for cleaning the basic log data and eliminating risk data and/or invalid data and/or repeated data in the basic log data; and the data integration unit is used for integrating the cleaned basic log data to generate the first information table.
In an exemplary embodiment of the present disclosure, the first key field includes a user code and the second key field includes an item code, and each combination of a user code and an item code corresponds to one record in the first information table; the user codes and the item codes are in a one-to-one or one-to-many relationship.
In an exemplary embodiment of the present disclosure, the self-association module includes: and the associated article obtaining unit is used for self-associating the first information table to obtain at least two associated articles between the article codes corresponding to the user codes.
In an exemplary embodiment of the present disclosure, the statistics module includes: the summarizing unit is used for counting the number of common purchasing users of the at least two associated articles in the second information table; an excluding unit for excluding identical and duplicate articles of the at least two associated articles.
In an exemplary embodiment of the present disclosure, the association degree index calculation module includes: a first calculation unit for obtaining the total number of purchasing users according to the first key field in the first information table and/or the second information table; a second calculation unit for obtaining the number of users purchasing a first article and the number of users purchasing a second article, respectively, according to the first key field and the second key field in the first information table; and a third calculation unit for obtaining a promotion degree (lift) according to the total number of purchasing users, the number of users purchasing the first article, and the number of users purchasing the second article, wherein the promotion degree is used as the index representing the degree of association between the first article and the second article.
In an exemplary embodiment of the present disclosure, the system further comprises a first filtering module, the first filtering module comprising: a first preset unit for presetting a support threshold; a support calculation unit for obtaining a support according to the total number of purchasing users and the number of common purchasing users in the third information table; and a first filtering unit for filtering the articles corresponding to the article codes according to the support threshold and the support.
In an exemplary embodiment of the present disclosure, the system further comprises a second filtering module, the second filtering module comprising: a second preset unit for presetting a confidence threshold; a confidence calculation unit for obtaining a confidence according to the number of users purchasing the first article or the second article and the number of common purchasing users in the third information table; and a second filtering unit for filtering the first article or the second article according to the confidence threshold and the confidence.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to: generate a first information table from basic log data, wherein each record in the first information table is uniquely identified by a first key field and a second key field; within a preset time period, self-associate the second key fields corresponding to each first key field of the first information table to generate a second information table; count, in the second information table, the number of first key fields corresponding to at least two self-associated second key fields to generate a third information table; and generate a fourth information table from the first information table, the second information table and the third information table, wherein the fourth information table comprises an index representing the degree of association between at least two second key fields.
According to the data association processing method, the data association processing system and the electronic equipment of the present disclosure, the calculation and iteration of the promotion degree (lift), a data mining evaluation index, are realized by designing a reasonable data table structure.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 schematically illustrates a flow chart of a data association processing method according to an example embodiment of the invention;
FIG. 2 schematically shows a flowchart of a method for obtaining the association degree index in a data association processing method according to an example embodiment of the invention;
FIG. 3 schematically shows a flow chart of a data association processing method according to another example embodiment of the invention;
FIG. 4 schematically shows a block diagram of a data association processing system according to an example embodiment of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The drawings are merely schematic illustrations of the invention and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, systems, steps, and so forth. In other instances, well-known structures, methods, systems, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor systems and/or microcontroller systems.
The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of systems and methods consistent with certain aspects of the invention, as detailed in the appended claims.
First, a brief explanation of several concepts involved in the present invention will be provided.
Hadoop is a software framework for the distributed processing of large amounts of data. Users can develop distributed programs without knowing the underlying details of the distribution, making full use of the cluster for high-speed computation and storage. The core components of the Hadoop framework are HDFS (Hadoop Distributed File System) and MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over it.
Hadoop is efficient because it processes data in parallel, which speeds up processing. The Hadoop framework is written in Java, so it is well suited to running on Linux production platforms. Applications on Hadoop may also be written in other languages, such as C++.
Hive is a data warehouse tool built on Hadoop. It can map structured data files to database tables, provides a simple SQL query capability, and converts SQL statements into MapReduce jobs for execution. Its advantages are a low learning cost and the ability to implement simple MapReduce statistics quickly through SQL-like statements without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.
An association rule is an implication of the form X → Y, where X and Y are called the antecedent or left-hand side (LHS) and the consequent or right-hand side (RHS) of the association rule, respectively. An association rule X → Y has a support and a confidence.
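For orientation, the three measures that appear in this description can be written as below. These are the standard textbook definitions for a rule X → Y, given here as a reference sketch and not quoted from the patent, whose own formulas may be stated differently.

```latex
\begin{align}
\mathrm{Support}(X \rightarrow Y) &= P(X \cap Y) \\
\mathrm{Confidence}(X \rightarrow Y) &= P(Y \mid X) = \frac{P(X \cap Y)}{P(X)} \\
\mathrm{Lift}(X \rightarrow Y) &= \frac{P(X \cap Y)}{P(X)\,P(Y)}
\end{align}
```

Under these definitions, a lift greater than 1 indicates that X and Y occur together more often than they would if they were independent.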
A data association processing method according to an exemplary embodiment of the present invention is described with reference to fig. 1 and 2.
As shown in fig. 1, in step S11, a first information table is generated from the base log data, and one record in the first information table is uniquely identified by a first key field and a second key field.
The base log data may come from a base log database. In the field of e-commerce, the base log database may include a plurality of database tables, for example table 1: order detail data; table 2: commodity attribute data; table 3: item attribute data; table 4: user attribute data; table 5: risk attribute data; and so on up to table n, where n is a positive integer greater than or equal to 1.
It should be noted that, although the following embodiments are illustrated based on the field of electronic commerce, the data association processing method of the present invention can be applied to any field, such as financial industry, shopping mall, and the present invention is not limited thereto.
In an exemplary embodiment, the generating the first information table from the base log data includes: cleaning the basic log data, and eliminating risk data and/or invalid data and/or repeated data in the basic log data; and integrating the cleaned basic log data to generate the first information table.
Because users' purchase information is voluminous and scattered, the purchase data often contains a large amount of invalid information. A data model with good data quality therefore needs to be integrated for each user.
For example, data cleansing is performed on the user information data in the order table, and the cleansing items may include:
(1) Removing risk data, such as large orders (500000 RMB or more) and orders from risk users. Whether a user is a risk user can be judged from the user's historical behavior records.
(2) Removing irrelevant data, such as invalid order data, because cancelled orders do not properly reflect the correlation between purchased items. Users who are not bronze, silver, gold, or diamond members, such as enterprise users, may also be removed, because enterprise users purchase in bulk and therefore do not reflect normal member purchasing behavior.
(3) Merging the member information of default accounts and bound accounts. Because the electronic shopping website may cooperate with third parties, such accounts have a corresponding business background, and the same user may have purchased with a default account before registration and with a registered account afterwards. The purchasing behaviors of the two accounts therefore need to be merged and unified in order to describe the user's overall purchasing behavior completely.
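The three cleansing rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the field names (`user`, `amount`, `status`, `is_enterprise`), the `RISK_USERS` set, and the `ACCOUNT_ALIAS` mapping are hypothetical, and only the 500000 RMB threshold comes from the text.

```python
# Hypothetical order records; every field name here is an illustrative assumption.
LARGE_ORDER_RMB = 500_000                  # threshold stated in rule (1)
RISK_USERS = {"risky01"}                   # assumed result of a separate risk check
ACCOUNT_ALIAS = {"guest_77": "qcw12320"}   # assumed default account -> bound account map


def clean_orders(orders):
    """Apply rules (1)-(3): drop risk and irrelevant orders, merge accounts."""
    cleaned = []
    for order in orders:
        # Rule (3): attribute default-account purchases to the bound account.
        user = ACCOUNT_ALIAS.get(order["user"], order["user"])
        # Rule (1): drop large orders and orders from risk users.
        if order["amount"] >= LARGE_ORDER_RMB or user in RISK_USERS:
            continue
        # Rule (2): drop invalid (e.g. cancelled) orders and enterprise users.
        if order["status"] != "valid" or order["is_enterprise"]:
            continue
        cleaned.append({**order, "user": user})
    return cleaned


orders = [
    {"user": "qcw12320", "amount": 120.0, "status": "valid", "is_enterprise": False},
    {"user": "guest_77", "amount": 80.0, "status": "valid", "is_enterprise": False},
    {"user": "sunnyay", "amount": 600000.0, "status": "valid", "is_enterprise": False},
    {"user": "sunnyay", "amount": 50.0, "status": "cancelled", "is_enterprise": False},
    {"user": "corp_01", "amount": 900.0, "status": "valid", "is_enterprise": True},
    {"user": "risky01", "amount": 30.0, "status": "valid", "is_enterprise": False},
]
cleaned = clean_orders(orders)
print([(o["user"], o["amount"]) for o in cleaned])
```

In this sketch the account merge runs first, so that risk checks and later statistics see a single identity per user.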
In an exemplary embodiment, the first key field includes a user code and the second key field includes an item code, and each combination of a user code and an item code corresponds to one record in the first information table; the user codes and the item codes are in a one-to-one or one-to-many relationship.
Historical order detail data is updated, cleaned, and integrated according to the commodity attribute data, user attribute data, risk attribute data, and so on in the base log database, and a per-item user purchase statistical model is established. Key fields capable of supporting the association model are set in the user purchase statistical model.
In an exemplary embodiment, the key fields may include: a user code, which is the user's unique account identifier on the electronic shopping website and may be any characters, numbers, letters, symbols, or combination thereof that uniquely identifies the user, such as a user login name, a user ID, or a mobile phone number; an item code (item id), assigned according to the items defined by the business; first purchase order information, which may include the order number and order time of the first order; most recent purchase order information, which may include the order number and order time of the most recent order; the number of valid orders between the first order time and the most recent order time; the number of valid items; and the valid order amount. Some embodiments may include some or all of the above key fields, and other key fields may be added as needed.
Table 1 shows a first information table generated based on the set key fields.
[Table image: Figure BDA0001110980130000081]
Table 1. First information table
Each user has one and only one record under an item if there is a valid purchase. If the same user has made valid purchases of several items, there is one record for each of those items.
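The "one record per user and item" rule can be sketched as follows. This is a minimal illustration, assuming hypothetical order lines of the form (user code, item code, order date); the record fields `first`, `latest`, and `orders` are illustrative stand-ins for the key fields described above.

```python
# Hypothetical cleaned order lines: (user_code, item_code, order_date as ISO string).
order_lines = [
    ("qcw12320", "A", "2016-01-05"),
    ("qcw12320", "A", "2016-03-10"),
    ("qcw12320", "B", "2016-02-01"),
    ("sunnyay", "A", "2016-04-20"),
]


def build_first_table(lines):
    """Collapse order lines to exactly one record per (user_code, item_code)."""
    table = {}
    for user, item, day in lines:
        rec = table.setdefault((user, item), {"first": day, "latest": day, "orders": 0})
        rec["first"] = min(rec["first"], day)     # ISO dates compare correctly as strings
        rec["latest"] = max(rec["latest"], day)
        rec["orders"] += 1
    return table


first_table = build_first_table(order_lines)
for key, rec in first_table.items():
    print(key, rec)
```

Using the (user code, item code) pair as the dictionary key enforces the uniqueness that the first information table requires.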
In some embodiments, the item id is divided into three levels. For example, item A has primary code 911, secondary code 918, and tertiary code 1954, so the actual item id of item A is 9119181954. Similarly, item B has primary code 1713, secondary code 3262, and tertiary code 10009, and item C has primary code 670, secondary code 686, and tertiary code 1047. Many other item codes may also be included, such as 6706861048, 1315134612023, 171332719275, 6707294837, and 501950208792. Of course, the present invention is not limited thereto, and the specific item id may be set independently for each business.
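The three-level item id in this example is the plain concatenation of the level codes, which can be sketched as below. The helper name `item_id` is hypothetical.

```python
def item_id(primary, secondary, tertiary):
    """Concatenate the three category-level codes into one item id string."""
    return f"{primary}{secondary}{tertiary}"


print(item_id(911, 918, 1954))     # item A
print(item_id(1713, 3262, 10009))  # item B
print(item_id(670, 686, 1047))     # item C
```

Because the level codes have different lengths, a concatenated id cannot always be split back into its levels unambiguously; a real system would likely keep the three codes as separate columns as well, which is an assumption here and not something the source states.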
It should be noted that the above embodiment takes three-level item codes as an example: all items (the minimum unit being the SKU) are assigned to categories at three levels (primary, secondary, and tertiary), and the degree of association between the tertiary categories purchased by users is found. In other embodiments, the method described in the embodiments of the present invention may also be applied to find the degree of association between specific items purchased by users, in which case the item code is the SKU of the item, or even the degree of association between secondary categories or between primary categories.
In some embodiments, the user's purchasing activity may be obtained from the number of valid orders in the period between the first order time and the most recent order time.
In step S12, within a preset time period, self-associating the second key fields corresponding to the first key fields of the first information table, so as to generate a second information table.
In an exemplary embodiment, self-associating the second key fields corresponding to the first key field of the first information table to generate the second information table includes: self-associating the first information table to obtain at least two associated items among the item codes corresponding to each user code. For example, the first information table may be self-associated with a Hive join statement.
In an exemplary embodiment, when the association model is computed with Hive, users and the items they purchased can be self-associated using Hive's built-in join statement. The time period over which purchases are selected can be set freely by counting backward from the most recent order time in the first information table; for example, a one-year or two-year statistical period may be selected.
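Selecting the statistical period by counting backward from the most recent order time can be sketched as follows. The rows, the helper name `within_period`, and the use of 365 and 730 days for one and two years are assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical (user_code, item_code, latest_order_date) rows from the first table.
records = [
    ("qcw12320", "A", date(2016, 8, 30)),
    ("qcw12320", "B", date(2015, 12, 1)),
    ("sunnyay", "A", date(2015, 3, 15)),
]


def within_period(rows, period_days=365):
    """Keep rows whose latest order lies in the period ending at the newest order."""
    end = max(day for _, _, day in rows)
    start = end - timedelta(days=period_days)
    return [row for row in rows if start < row[2] <= end]


one_year = within_period(records)
two_years = within_period(records, period_days=730)
print(len(one_year), len(two_years))
```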
After the first information table is self-associated using a join statement, the second information table is generated. The following embodiments are all described using two associated items as an example, but three or any other number of associated items may also be computed.
For example, a second information table for pairs of items is shown in Table 2 below.
[Table image: Figures BDA0001110980130000091 and BDA0001110980130000101]
Table 2. Second information table
The first group's number of valid orders refers to the number of valid orders for associated item 1 within, for example, a one-year statistical period; the second group's number of valid orders refers to the number of valid orders for associated item 2 within, for example, a one-year statistical period.
As shown in Table 2 above, assume that user qcw12320 purchased three items A, B, and C (only these three items are used for illustration; more or fewer items may actually be included). After the items are self-associated pairwise by a join statement, there are 9 combinations: AA, AB, AC, BB, BA, BC, CC, CA, CB. Similarly, assuming that user sunnyay purchased two items A and B, after the pairwise self-association there are 4 combinations: AA, AB, BA, BB.
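The pairwise self-association in this example can be reproduced with a small sketch. It mimics the effect of joining the first information table to itself on the user code; it is plain Python standing in for the Hive join, and the function name `self_join` is hypothetical.

```python
from itertools import product

# Items purchased per user, taken from the example in the text.
purchases = {"qcw12320": ["A", "B", "C"], "sunnyay": ["A", "B"]}


def self_join(purchases):
    """Pair every item of a user with every item of the same user (self-join on user)."""
    rows = []
    for user, items in purchases.items():
        for item1, item2 in product(items, repeat=2):
            rows.append((user, item1, item2))
    return rows


second_table = self_join(purchases)
for row in second_table:
    print(row)
```

A user with k items contributes k × k rows, which gives the 9 and 4 combinations counted above.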
Although the above embodiment and the following embodiments use pairwise combinations of items as examples, several items purchased by the same user may also be combined three at a time, four at a time, and so on. For example, if a user purchases four items A, B, C, and D, the following combinations may arise in the self-association: AA, AB, AC, AD, BA, BB, BC, BD, CA, CB, CC, CD, DA, DB, DC, DD; ABC, ABD, ACD, BCD; ABCD.
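The larger groups listed for items A, B, C, and D can be enumerated with a short sketch. It uses unordered combinations from the Python standard library, which is an illustrative choice and not the join-based procedure the text describes.

```python
from itertools import combinations

items = ["A", "B", "C", "D"]

# Unordered groups of three and of four distinct items.
triples = ["".join(group) for group in combinations(items, 3)]
quads = ["".join(group) for group in combinations(items, 4)]

print(triples)
print(quads)
```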
In step S13, the number of the first key fields corresponding to at least two self-associated second key fields in the second information table is counted to generate a third information table.
In an exemplary embodiment, the counting the number of the first key fields corresponding to at least two self-associated second key fields in the second information table, and the generating a third information table includes: and counting the number of the common purchasing users of the at least two associated items in the second information table.
In an exemplary embodiment, the method further comprises: excluding identical and duplicate ones of the at least two associated items. This step may be performed between step S12 and step S13, that is, the second information table is generated first, and after the data in the second information table is integrated through this step, the step of statistically generating the third information table is performed.
As can be seen from the second information table, the pairs of associated items include identical items and items that are in substance duplicates. A pair in which associated item 1-id equals associated item 2-id, such as AA, BB, or CC, is an association of an item with itself and should be removed, because an association rule looks for associations between an item and items other than itself. Duplicate pairs such as AB and BA appear on the surface to be two different groups of associated items but are in substance identical; to avoid double counting, duplicates need to be culled so that only AB or only BA is counted.
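One simple way to drop both identical pairs and mirrored duplicates is to keep a pair only when the first item id sorts before the second. The text only requires that AB or BA be counted once; keeping the ordered one is an assumed convention, and the function name is hypothetical.

```python
def drop_identical_and_duplicates(pairs):
    """Keep each unordered pair once: (item1, item2) survives only if item1 < item2."""
    return [(item1, item2) for item1, item2 in pairs if item1 < item2]


# All 9 self-joined pairs for a user who bought A, B and C.
pairs = [(item1, item2) for item1 in "ABC" for item2 in "ABC"]
kept = drop_identical_and_duplicates(pairs)
print(kept)
```

The strict comparison removes AA, BB, and CC because an item is never less than itself, and removes BA, CA, and CB because their mirrored forms are kept instead.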
Then, hive is used to sum the number of common purchasing users for each pair of associated items, generating the third information table shown in Table 3 below:
[Table 3: image not reproduced]
Table 3: Third information table
The number of common purchasing users in Table 3 is the number of users who purchased both associated item 1 and associated item 2. For example, in the second information table, the user qcw12320 and the user sunnyay both purchased associated item A and associated item B; counting all such users in the second information table gives the number of common purchasing users for items A and B. The other pairs are handled by analogy.
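The self-association and the count of common purchasing users can be illustrated with a small SQL query. The sketch below uses Python's built-in SQLite as a stand-in for hive, so the syntax is only an approximation of what a hive job would run; the table name `first_info`, the column names, and the user `u3` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE first_info (user_code TEXT, item_code TEXT)")
conn.executemany(
    "INSERT INTO first_info VALUES (?, ?)",
    [
        ("qcw12320", "A"), ("qcw12320", "B"),
        ("sunnyay", "A"), ("sunnyay", "B"),
        ("u3", "A"),  # hypothetical user who bought only item A
    ],
)

# Self-join on the user code pairs up every two items bought by the same user.
# The WHERE clause drops identical pairs (AA) and mirrored duplicates (BA).
rows = conn.execute(
    """
    SELECT t1.item_code, t2.item_code, COUNT(DISTINCT t1.user_code)
    FROM first_info AS t1
    JOIN first_info AS t2 ON t1.user_code = t2.user_code
    WHERE t1.item_code < t2.item_code
    GROUP BY t1.item_code, t2.item_code
    ORDER BY t1.item_code, t2.item_code
    """
).fetchall()

print(rows)  # [('A', 'B', 2)]
```

Here the exclusion is folded into the same query as the count, whereas the embodiment describes it as a separate step on the second information table; the resulting counts are the same.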
In step S14, a fourth information table is generated according to the first information table, the second information table, and the third information table, where the fourth information table includes an index representing a degree of association between at least two of the second key fields.
In an exemplary embodiment, the method further comprises: sorting the lift values according to the numerical value of each lift.
In an exemplary embodiment, further comprising: presetting a support degree threshold; obtaining a support degree according to the total number of purchasing users and the number of the common purchasing users in the third information table; and filtering the articles corresponding to the article codes according to the support threshold and the support.
The support refers to the probability P(A ∩ B) that A and B both occur; all frequent item sets can be found according to the support:
$$\mathrm{support}(A \Rightarrow B) = P(A \cap B) = \frac{n(A \cap B)}{n} \tag{1}$$
In the above scenario, A is the associated item 1-id and B is the associated item 2-id; n(A ∩ B) is then the number of events in which items 1 and 2 were both purchased, i.e. the number of common purchasing users in the third information table, e.g. 238; n is the total number of users who made a valid purchase on the electronic shopping website during the statistical period. Both quantities can be calculated directly in hive: for example, the number of common purchasing users can be obtained from Table 3, and n can be obtained with count(distinct user code) over the second information table.
In some embodiments, an initial value of the support threshold may be set to 0.001, and the support threshold may then be adjusted according to the calculated lift results. Here, the support calculated by equation (1) must satisfy: support > support threshold.
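As a worked illustration of the support filter, the sketch below applies the 0.001 initial threshold to two pairs. The count 238 is the example value from the text; the total of 100,000 purchasing users and the second pair are hypothetical.

```python
SUPPORT_THRESHOLD = 0.001  # initial value suggested in the text


def support(n_ab, n):
    """Estimate P(A ∩ B) as common purchasing users / total purchasing users."""
    if n <= 0:
        raise ValueError("total number of purchasing users must be positive")
    return n_ab / n


n = 100_000  # hypothetical total number of purchasing users
common_users = {("A", "B"): 238, ("A", "C"): 40}  # second pair is hypothetical

# Keep only pairs whose support strictly exceeds the threshold.
kept = {
    pair: support(n_ab, n)
    for pair, n_ab in common_users.items()
    if support(n_ab, n) > SUPPORT_THRESHOLD
}

print(kept)
```

With these numbers the pair (A, B) has support 0.00238 and is kept, while (A, C) has support 0.0004 and is filtered out.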
In an exemplary embodiment, further comprising: presetting a confidence threshold; obtaining a confidence coefficient according to the number of purchasing users of the first article or the second article and the number of the common purchasing users in the third information table; filtering the first item or the second item according to the confidence threshold and the confidence.
The confidence refers to the probability P(B | A) that B also occurs given that A occurs:
$$\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{n(A \cap B)}{n(A)} \tag{2}$$
In the above scenario, the confidence is the number of common purchasing users n(A ∩ B) divided by the number of users n(A) who purchased the associated item 1-id. n(A ∩ B) is calculated as described for the support, and n(A) can be obtained by counting the distinct user codes in the first information table that purchased item A.
In some embodiments, an initial value of the confidence threshold may be set to 0.01, and the confidence threshold may then be adjusted according to the calculated lift results. Here, the confidence calculated by formula (2) must satisfy: confidence > confidence threshold. The confidence threshold is greater than the support threshold.
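Unlike the support, the confidence depends on which item is treated as the antecedent. The following sketch shows both directions for one pair; the count 238 is the example value from the text, while the per-item user counts are hypothetical.

```python
CONFIDENCE_THRESHOLD = 0.01  # initial value suggested in the text


def confidence(n_ab, n_antecedent):
    """Estimate P(consequent | antecedent) as n(A ∩ B) / n(antecedent)."""
    if n_antecedent <= 0:
        raise ValueError("antecedent user count must be positive")
    return n_ab / n_antecedent


n_ab = 238      # common purchasing users of A and B
n_a = 5_000     # hypothetical number of users who purchased A
n_b = 50_000    # hypothetical number of users who purchased B

conf_a_to_b = confidence(n_ab, n_a)  # P(B | A)
conf_b_to_a = confidence(n_ab, n_b)  # P(A | B)

keep_a_to_b = conf_a_to_b > CONFIDENCE_THRESHOLD
keep_b_to_a = conf_b_to_a > CONFIDENCE_THRESHOLD

print(conf_a_to_b, keep_a_to_b)
print(conf_b_to_a, keep_b_to_a)
```

With these assumed counts the rule A -> B passes the filter (0.0476) while B -> A does not (0.00476), which is why the text speaks of filtering "the first item or the second item".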
In other embodiments, an antecedent (front) support n(A)/n and/or a consequent (rear) support n(B)/n may also be set. Limits on these indexes (confidence, support, antecedent support, consequent support) can be imposed through a where condition in hive. Limiting these indexes eliminates niche articles that have a high lift but a very small total count; such articles may have little practical value in application.
The following is a method of obtaining the association index.
As shown in fig. 2, in step S141, a total number of purchased users is obtained according to the first key field in the first information table and/or the second information table.
For example, the total number n of users can be obtained by counting the total number of different user codes in the first information table.
In step S142, the number of users purchasing a first article and the number of users purchasing a second article are obtained according to the first key field and the second key field in the first information table, respectively.
For example, the number n(A) of users who purchased the first item may be obtained by counting the number of distinct user codes that purchased item A in the first information table, and the number n(B) of users who purchased the second item may be obtained by counting the number of distinct user codes that purchased item B in the second information table. Of course, the first item is not limited to item A, the second item is not limited to item B, and the first item may be any item in the first information table.
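The three counts used in steps S141 and S142 are all distinct-user counts. A minimal sketch, assuming the first information table is available as (user code, item code) tuples; the sample rows are made up:

```python
from collections import defaultdict

# Hypothetical first-information-table rows; the repeated ("u1", "A") row
# shows that distinct counting is unaffected by duplicates.
first_table = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"),
    ("u3", "B"), ("u3", "C"),
    ("u1", "A"),
]

# Map each item code to the set of distinct users who purchased it.
buyers = defaultdict(set)
for user, item in first_table:
    buyers[item].add(user)

n = len({user for user, _ in first_table})  # total purchasing users
n_a = len(buyers["A"])                      # users who purchased item A
n_b = len(buyers["B"])                      # users who purchased item B

print(n, n_a, n_b)  # 3 2 2
```

Using sets mirrors a count(distinct ...) aggregate: each user contributes at most once per item regardless of how many matching rows exist.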
In step S143, a lift is obtained according to the total number n of purchasing users, the number n(A ∩ B) of common purchasing users, the number n(A) of users who purchased the first article, and the number n(B) of users who purchased the second article, and the lift is used as an index representing the degree of association between the first article A and the second article B.
The lift is defined as:
$$\mathrm{lift}(A \Rightarrow B) = \frac{P(B \mid A)}{P(B)} \tag{3}$$
where P(B) = n(B)/n is the probability of B occurring.
Combining equations (1) to (3), the lift can be calculated as:
$$\mathrm{lift}(A \Rightarrow B) = \frac{n(A \cap B)/n(A)}{n(B)/n} = \frac{n(A \cap B) \cdot n}{n(A) \cdot n(B)} \tag{4}$$
When the lift is 1, there is no correlation between A and B. Therefore, what is significant for the purposes of the present invention is the association rule between associated items with a lift greater than 1.
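The lift can be computed directly from the four counts named in step S143. The sketch below uses hypothetical counts chosen so that one case is exactly independent and the other is positively associated.

```python
def lift(n_ab, n, n_a, n_b):
    """Lift of the rule A -> B from raw user counts.

    n_ab: users who purchased both A and B
    n:    total purchasing users
    n_a:  users who purchased A
    n_b:  users who purchased B
    Equals P(B | A) / P(B) = (n_ab * n) / (n_a * n_b).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("item user counts must be positive")
    return (n_ab * n) / (n_a * n_b)


# Hypothetical counts: 100 users, 20 bought A, 50 bought B.
independent = lift(10, 100, 20, 50)  # 10 common buyers is what chance predicts
associated = lift(15, 100, 20, 50)   # 15 common buyers exceeds chance

print(independent, associated)  # 1.0 1.5
```

Because the expression is symmetric in A and B, the lift of A -> B equals the lift of B -> A; direction only enters through the confidence filter.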
The fourth information table is obtained by joining the above tables together in hive, and is shown in Table 4:
[Table 4: image not reproduced]
Table 4: Fourth information table
Using the calculation functions of hive, the minimum confidence and minimum support limits are applied, the value of the lift is calculated, and the lift values are sorted according to a preset rule, for example in ascending or descending order. The aim is to find the strongly associated articles A and B of the rule A -> B.
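The filtering and sorting just described can be sketched as a single ranking function. The `Rule` record, the function name, and all counts except 238 are hypothetical; the default thresholds are the initial values given earlier in the text.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    item_a: str
    item_b: str
    n_ab: int  # common purchasing users
    n_a: int   # users who purchased item_a
    n_b: int   # users who purchased item_b


def rank_rules(rules, n, min_support=0.001, min_confidence=0.01):
    """Apply the support and confidence limits, then sort by lift, largest first."""
    ranked = []
    for r in rules:
        sup = r.n_ab / n
        conf = r.n_ab / r.n_a
        if sup > min_support and conf > min_confidence:
            ranked.append((r.item_a, r.item_b, r.n_ab * n / (r.n_a * r.n_b)))
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked


rules = [
    Rule("A", "B", 238, 1_000, 2_000),
    Rule("A", "C", 5, 1_000, 50),      # support 0.0005, filtered out
    Rule("B", "D", 300, 2_000, 500),
]
ranked = rank_rules(rules, n=10_000)
print(ranked)
```

Note that the filtered pair (A, C) would have had a lift of 1.0 on only five common buyers; the support limit removes such low-volume pairs before ranking.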
In an exemplary embodiment, since the item association is realized based on hive, the result can also be synchronized to a Data Mart (Data Mart) in real time, and the hive-based Data result presentation can enable common business personnel to easily use and operate.
A data mart, also called a data market, is a repository of data collected from operational and other data sources that serves a particular group of professionals. In scope, the data is extracted from enterprise-wide databases, data warehouses, or more specialized data warehouses.
The data association processing method provided by the embodiment of the invention performs association degree analysis directly in the hive environment of hadoop, making full use of the parallel computing power of hadoop and achieving a high response speed. In addition, the association rules among articles can be thoroughly explored by quantifying historical data, which gives the business data support beyond experience in article cooperation.
Fig. 3 schematically shows a flow chart of a data association processing method according to another exemplary embodiment of the present invention.
As shown in fig. 3, in step S21, the basic log data is cleaned.
In step S22, the cleaned basic log data is integrated to generate a first information table.
In step S23, the first information table is self-associated through a hive join statement, and pairs of associated items among the item codes corresponding to each user code are obtained.
In step S24, the number of users who have purchased the same two related items in the second information table is counted.
In step S25, the same item and duplicate items of the two associated items are excluded.
In step S26, a support degree threshold is preset.
In step S27, a support degree is obtained based on the total number of purchased users and the number of commonly purchased users.
In step S28, the items corresponding to the item codes are filtered according to the support threshold and the support.
In step S29, a confidence threshold is preset.
In step S210, a confidence is obtained according to the number of users who purchased the first item or the second item and the number of common purchasing users.
In step S211, the first item or the second item is filtered according to the confidence threshold and the confidence.
In step S212, the degree of lift between the associated items is calculated.
In step S213, the lift values are sorted.
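Steps S21 to S213 can be strung together in a short in-memory sketch. This is not the hive implementation of the embodiment: it is plain Python, the function name and sample log are hypothetical, and the exclusion of identical and duplicate pairs is folded into pair generation rather than performed after counting.

```python
from collections import defaultdict
from itertools import combinations


def association_table(log_rows, min_support=0.001, min_confidence=0.01):
    # S21-S22: drop incomplete rows and collapse repeats into (user, item) records.
    first_table = {(user, item) for user, item in log_rows if user and item}

    items_by_user = defaultdict(set)
    users_by_item = defaultdict(set)
    for user, item in first_table:
        items_by_user[user].add(item)
        users_by_item[item].add(user)
    n = len(items_by_user)  # total purchasing users

    # S23-S25: pair up each user's items. Sorted combinations yield every
    # unordered pair once, so AA-style and BA-style rows never appear.
    common_users = defaultdict(int)
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            common_users[(a, b)] += 1

    # S26-S212: filter on support and confidence, then compute the lift.
    result = []
    for (a, b), n_ab in common_users.items():
        n_a, n_b = len(users_by_item[a]), len(users_by_item[b])
        sup, conf = n_ab / n, n_ab / n_a
        if sup <= min_support or conf <= min_confidence:
            continue
        result.append((a, b, sup, conf, n_ab * n / (n_a * n_b)))

    # S213: sort by lift, largest first.
    result.sort(key=lambda row: row[4], reverse=True)
    return result


log = [
    ("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "B"),
    ("u3", "A"), ("u3", "C"), ("u4", "C"),
    ("u1", "A"),  # repeated row
    ("", "B"),    # invalid row without a user code
]
table = association_table(log)
for row in table:
    print(row)
```

In this toy log, the pair (A, B) ends up with a lift of 4/3 and (A, C) with a lift of 2/3, so only the first would count as a positive association.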
FIG. 4 schematically shows a block diagram of a data association processing system according to an example embodiment of the present invention.
As shown in fig. 4, the data association processing system includes a data preprocessing module 11, a self-association module 12, a statistic module 13, and an association degree index calculation module 14.
The data preprocessing module 11 is configured to generate a first information table according to the basic log data, where one record in the first information table is uniquely identified by a first key field and a second key field.
In an exemplary embodiment, the data preprocessing module 11 includes: the data cleaning unit is used for cleaning the basic log data and eliminating risk data and/or invalid data and/or repeated data in the basic log data; and the data integration unit is used for integrating the cleaned basic log data to generate the first information table.
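The cleaning and integration units can be pictured as one pass over the raw log. The sketch below is an assumption-laden illustration: the field names `user_code`, `item_code`, and `risk_flag` are hypothetical, and how risk data is actually identified is not specified in the text.

```python
def clean_log(rows):
    """Return (user_code, item_code) records with invalid, risk, and repeated rows removed."""
    seen = set()
    cleaned = []
    for row in rows:
        user, item = row.get("user_code"), row.get("item_code")
        if not user or not item:    # invalid data: a key field is missing
            continue
        if row.get("risk_flag"):    # risk data, marked by a hypothetical flag
            continue
        if (user, item) in seen:    # repeated data
            continue
        seen.add((user, item))
        cleaned.append((user, item))
    return cleaned


raw = [
    {"user_code": "u1", "item_code": "A"},
    {"user_code": "u1", "item_code": "A"},                     # repeat
    {"user_code": "u2", "item_code": None},                    # invalid
    {"user_code": "u3", "item_code": "B", "risk_flag": True},  # risk
    {"user_code": "u2", "item_code": "B"},
]
print(clean_log(raw))  # [('u1', 'A'), ('u2', 'B')]
```

After this pass each remaining record is uniquely identified by its user code and item code, which is the property the first information table requires.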
In an exemplary embodiment, the first key field includes a user code, the second key field includes an item code, and a user code and an item code correspond to a record in the first information table; wherein, the user code and the article code are in one-to-one or one-to-many relationship.
The self-association module 12 is configured to perform self-association on the second key fields corresponding to the first key fields of the first information table within a preset time period, so as to generate a second information table.
In an exemplary embodiment, the self-association module 12 includes: and the associated article obtaining unit is used for self-associating the first information table through a hive join statement to obtain at least two associated articles between the article codes corresponding to the user codes.
The counting module 13 is configured to count the number of the first key fields corresponding to at least two self-associated second key fields in the second information table, and generate a third information table.
In an exemplary embodiment, the statistics module 13 includes: and the summarizing unit is used for counting the number of the common purchasing users of the at least two associated articles in the second information table.
In an exemplary embodiment, the statistics module 13 includes: an excluding unit configured to exclude a same item and a duplicate item from the at least two associated items to generate the third information table.
The relevance index calculating module 14 is configured to generate a fourth information table according to the first information table, the second information table, and the third information table, where the fourth information table includes a relevance index representing at least two of the second key fields.
In an exemplary embodiment, the relevancy index calculation module 14 includes: a first calculation unit, configured to obtain the total number of purchasing users according to the first key field in the first information table and/or the second information table; a second calculation unit, configured to obtain the number of users who purchased a first article and the number of users who purchased a second article, respectively, according to the first key field and the second key field in the first information table; and a third calculation unit, configured to obtain a lift according to the total number of purchasing users, the number of common purchasing users, the number of users who purchased the first article, and the number of users who purchased the second article, wherein the lift is used as an index representing the degree of association between the first article and the second article.
In an exemplary embodiment, the system further comprises a first filtering module, the first filtering module comprising: the first preset unit is used for presetting a support degree threshold; a support degree calculation unit for obtaining a support degree according to the total number of purchased users and the number of common purchased users in the third information table; and the first filtering unit is used for filtering the articles corresponding to the article codes according to the support threshold and the support.
In an exemplary embodiment, the system further comprises a second filtering module, the second filtering module comprising: a second preset unit, configured to preset a confidence threshold; a confidence calculation unit, configured to obtain a confidence according to the number of users who purchased the first item or the second item and the number of common purchasing users in the third information table; and a second filtering unit, configured to filter the first article or the second article according to the confidence threshold and the confidence.
The modules in the embodiments of the present invention correspond to the contents in the above-described method embodiments, and are not described in detail herein.
An embodiment of the present invention further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to: generating a first information table according to basic log data, wherein one record in the first information table is uniquely identified by a first key field and a second key field; within a preset time period, self-associating the second key field corresponding to the first key field of the first information table to generate a second information table; counting the number of the first key fields corresponding to at least two self-associated second key fields in the second information table to generate a third information table; and generating a fourth information table according to the first information table, the second information table and the third information table, wherein the fourth information table comprises an association degree index representing at least two second key fields.
The electronic device may be a computer system or server, in the form of a general purpose computing device. Components of the computer system/server may include, but are not limited to: one or more processors or processing units, a system memory, and a bus connecting the various system components (including the system memory and the processing units).
A bus represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
The computer system/server typically includes a variety of computer system readable media. Such media may be any available media that is accessible by the computer system/server and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) and/or cache memory. The computer system/server may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, the storage system may be used to read from and write to non-removable, nonvolatile magnetic media (commonly referred to as "hard disk drives"). A magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus by one or more data media interfaces. The memory may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility having a set (at least one) of program modules may be stored, for example, in memory, such program modules including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination may comprise an implementation of a network environment. The program modules generally perform the functions and/or methodologies of the described embodiments of the invention.
The computer system/server may also communicate with one or more external devices (e.g., keyboard, pointing device, display, etc.), with one or more devices that enable a user to interact with the computer system/server, and/or with any devices (e.g., network card, modem, etc.) that enable the computer system/server to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface. Also, the computer system/server may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via a network adapter. As shown, the network adapter communicates with the other modules of the computer system/server via a bus. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer system/server, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Figs. 1, 2 and 3 show flowcharts of a data association processing method according to an exemplary embodiment of the present invention. The method may be implemented, for example, using a data association processing system as shown in fig. 4, although the invention is not limited thereto. It is noted that figs. 1, 2 and 3 are merely schematic illustrations of processes involved in methods according to example embodiments of the invention, and are not intended to be limiting. It will be readily appreciated that the processes shown in figs. 1, 2 and 3 do not indicate or limit the chronological order of these processes. In addition, it will also be readily appreciated that such processing may be performed, for example, synchronously or asynchronously across multiple modules/processes/threads.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions that cause a computing device (which can be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiment of the present invention.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
Exemplary embodiments of the present invention are specifically illustrated and described above. It is to be understood that the invention is not limited to the precise construction, arrangements, or instrumentalities described herein; on the contrary, the invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (7)

1. A data association processing method is characterized by comprising the following steps:
generating a first information table according to basic log data, wherein one record in the first information table is uniquely identified by a first key field and a second key field; the first key field comprises a user code, the second key field comprises an article code, and the user code and the article code correspond to one record in the first information table;
within a preset time period, self-associating the second key field corresponding to the first key field of the first information table to generate a second information table; self-associating the second key field corresponding to the first key field of the first information table, and generating a second information table includes: performing self-association on the first information table to obtain at least two associated articles between the article codes corresponding to the user codes;
counting the number of the first key fields corresponding to at least two self-associated second key fields in the second information table to generate a third information table; counting the number of the first key fields corresponding to at least two self-associated second key fields in the second information table, and generating a third information table comprises: counting the number of common purchasing users of the at least two associated items in the second information table; excluding the same item and the duplicate item from the at least two associated items to generate the third information table;
generating a fourth information table according to the first information table, the second information table and the third information table, wherein the fourth information table comprises an association degree index representing at least two second key fields;
generating a fourth information table according to the first information table, the second information table, and the third information table, where the fourth information table includes an index representing a degree of association between at least two of the second key fields, and includes:
acquiring a total number of purchasing users according to the first key field in the first information table and/or the second information table;
respectively acquiring the number of users purchasing a first article and the number of users purchasing a second article according to the first key field and the second key field in the first information table;
and obtaining a promotion degree according to the total number of the users who buy the first article, the number of the users who buy the first article and the number of the users who buy the second article, wherein the promotion degree is used as an index for representing the association degree between the first article and the second article.
2. The method of claim 1, wherein the user code and the item code are in a one-to-one or one-to-many relationship.
3. The method of claim 1, further comprising:
presetting a support degree threshold;
obtaining a support degree according to the total number of purchasing users and the number of the common purchasing users in the third information table;
filtering the articles corresponding to the article codes according to the support degree threshold value and the support degree; and/or
Presetting a confidence threshold;
obtaining a confidence coefficient according to the number of purchasing users of the first article or the second article and the number of the common purchasing users in the third information table;
filtering the first item or the second item according to the confidence threshold and the confidence.
4. A data association processing system, comprising:
the data preprocessing module is used for generating a first information table according to basic log data, and one record in the first information table is uniquely identified by a first key field and a second key field; the first key field comprises a user code, the second key field comprises an article code, and the user code and the article code correspond to one record in the first information table;
the self-correlation module is used for performing self-correlation on the second key field corresponding to the first key field of the first information table within a preset time period to generate a second information table; the self-association module comprises: the related article obtaining unit is used for performing self-correlation on the first information table to obtain at least two related articles between the article codes corresponding to the user codes;
the counting module is used for counting the number of the first key fields corresponding to at least two self-associated second key fields in the second information table to generate a third information table; the statistic module comprises: the summarizing unit is used for counting the number of common purchasing users of the at least two associated articles in the second information table; an excluding unit configured to exclude a same item and a duplicate item from the at least two associated items to generate the third information table;
a relevancy index calculation module, configured to generate a fourth information table according to the first information table, the second information table, and the third information table, where the fourth information table includes a relevancy index representing at least two second key fields;
wherein the correlation index calculation module comprises:
the first calculation unit is used for obtaining the total number of purchasing users according to the first key field in the first information table and/or the second information table;
the second calculating unit is used for respectively obtaining the number of users purchasing a first article and the number of users purchasing a second article according to the first key field and the second key field in the first information table;
and the third calculating unit is used for obtaining a promotion degree according to the total number of the users who buy the first article, the number of the users who buy the first article and the number of the users who buy the second article, wherein the promotion degree is used as an index for representing the degree of association between the first article and the second article.
5. The system of claim 4, wherein the user code and the item code are in a one-to-one or one-to-many relationship.
6. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
generating a first information table according to basic log data, wherein one record in the first information table is uniquely identified by a first key field and a second key field; the first key field comprises a user code, the second key field comprises an article code, and the user code and the article code correspond to one record in the first information table;
within a preset time period, self-associating the second key field corresponding to the first key field of the first information table to generate a second information table; self-associating the second key field corresponding to the first key field of the first information table, and generating a second information table includes: performing self-association on the first information table to obtain at least two associated articles between the article codes corresponding to the user codes;
counting the number of the first key fields corresponding to at least two self-associated second key fields in the second information table to generate a third information table; counting the number of the first key fields corresponding to at least two self-associated second key fields in the second information table, and generating a third information table comprises: counting the number of common purchasing users of the at least two associated items in the second information table; excluding the same item and the duplicate item from the at least two associated items to generate the third information table;
generating a fourth information table according to the first information table, the second information table and the third information table, wherein the fourth information table comprises an association degree index representing at least two second key fields;
generating a fourth information table according to the first information table, the second information table, and the third information table, where the fourth information table includes an index representing a degree of association between at least two of the second key fields, and includes:
acquiring a total number of purchasing users according to the first key field in the first information table and/or the second information table;
respectively acquiring the number of users purchasing a first article and the number of users purchasing a second article according to the first key field and the second key field in the first information table;
and obtaining a promotion degree according to the total number of the users who buy the first article, the number of the users who buy the first article and the number of the users who buy the second article, wherein the promotion degree is used as an index for representing the association degree between the first article and the second article.
7. A computer-readable medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1-3.
CN201610807903.XA 2016-09-07 2016-09-07 Data association processing method and system and electronic equipment Active CN107798021B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610807903.XA CN107798021B (en) 2016-09-07 2016-09-07 Data association processing method and system and electronic equipment

Publications (2)

Publication Number Publication Date
CN107798021A (en) 2018-03-13
CN107798021B (en) 2021-04-30

Family

ID=61530878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610807903.XA Active CN107798021B (en) 2016-09-07 2016-09-07 Data association processing method and system and electronic equipment

Country Status (1)

Country Link
CN (1) CN107798021B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111078680B (en) * 2018-10-18 2023-09-26 杭州海康威视数字技术股份有限公司 Table information processing method, apparatus, electronic device and readable storage medium
CN109977139B (en) * 2019-03-18 2022-12-02 京东科技控股股份有限公司 Data processing method and device based on class structured query statement
CN116843394B (en) * 2023-09-01 2023-11-21 星河视效科技(北京)有限公司 AI-based advertisement pushing method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1542642A (en) * 2003-04-30 2004-11-03 明基电通股份有限公司 Data associative analysis system and method and computer readable storage medium
US7451155B2 (en) * 2005-10-05 2008-11-11 At&T Intellectual Property I, L.P. Statistical methods and apparatus for records management
CN105589900A (en) * 2014-11-21 2016-05-18 中国银联股份有限公司 Data mining method based on multi-dimensional analysis

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298626A (en) * 2011-08-24 2011-12-28 中国有色矿业集团有限公司 Data association processing system
CN103208073B (en) * 2012-01-17 2016-09-28 阿里巴巴集团控股有限公司 Obtain Recommendations information and the method for merchandise news, device are provided
EP3117347B1 (en) * 2014-03-10 2020-09-23 Interana, Inc. Systems and methods for rapid data analysis
CN105159952A (en) * 2015-08-14 2015-12-16 安徽新华博信息技术股份有限公司 Data processing method based on frequent item set mining
CN105512210A (en) * 2015-11-27 2016-04-20 网神信息技术(北京)股份有限公司 Correlated event type detection method and device
CN105893421A (en) * 2015-12-02 2016-08-24 乐视网信息技术(北京)股份有限公司 UV calculation method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
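An empirical study of statistical data mining in the big data era (大数据时代之统计数据挖掘实证); Cui Dongmei (崔冬梅); Statistics & Decision (《统计与决策》); 2016-04-30 (No. 4); full text *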

Also Published As

Publication number Publication date
CN107798021A (en) 2018-03-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant