CN108268586B - Data processing method, device, medium and computing equipment across multiple data tables - Google Patents

Info

Publication number
CN108268586B
Authority
CN
China
Prior art keywords
data
data table
tables
target
data tables
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710866877.2A
Other languages
Chinese (zh)
Other versions
CN108268586A (en)
Inventor
李光明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN201710866877.2A
Publication of CN108268586A
Priority to PCT/CN2018/105090 (WO2019056964A1)
Application granted
Publication of CN108268586B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data processing method, device, medium and computing equipment across multiple data tables. The method comprises the following steps: acquiring a plurality of first data tables; converting each row of each first data table in the plurality of first data tables into a sub data table, wherein each row of the sub data table comprises the object identifier and characteristic data of the object identified by the object identifier, and the sub data tables corresponding to a first data table form a second data table; and taking the characteristic data in the second data tables as connection keys, and performing table connection on the second data tables corresponding to each first data table to obtain a target data table. The method provided by the application shifts perspective according to actual service requirements: it takes the characteristic data, rather than the objects, as the foothold and associates the objects in the data tables without introducing redundant data. Because the resulting table has a connection key, the processing can be distributed to a plurality of Reducers for execution, which gives higher data processing capacity and efficiency and makes the method well suited to big data processing.

Description

Data processing method, device, medium and computing equipment across multiple data tables
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, medium, and computing device across multiple data tables.
Background
With the rapid development of the internet and big data technology, data mining and analysis have an increasingly significant influence on human activities. Big data can be used to perform correlation analysis between objects and determine the intrinsic associations between different objects, and the quality of life of users can then be improved through means such as interest-based recommendation.
When performing correlation analysis or recommendation, it is often necessary to analyze correlations between users across multiple data tables. A commonly used method at present is to combine the multiple data tables into one data table by a Cartesian product and then analyze that data table with MapReduce. In practical applications, however, as the data volume grows, operating on massive data tables by Cartesian product becomes less and less efficient. For example, to count the number of identical articles read by users in two groups, this method requires a Cartesian product of the data tables corresponding to the two groups, followed by analysis in a Reducer (the task node that executes a reduce task). Because a Cartesian product has no connection key, only one Reducer can be used to complete the analysis task. When the data size is large, the processing capability of that single Reducer becomes the bottleneck, which easily leads to incorrect results from the Reducer task, or even to the task failing to complete.
To address the above problems, there are currently two solutions:
One solution is as follows: if one of the two tables to be subjected to the Cartesian product operation is a small table (its data volume is much smaller than that of the other table), the data of the small table can be loaded into memory, which speeds up the Cartesian product. An important limitation of this solution is that, because memory capacity is limited, it only works when one table is much smaller than the other. This solution is therefore clearly inadequate for processing large data tables.
The other solution is as follows: join keys (connection keys) are constructed additionally, and the Cartesian product operation is replaced by a table connection operation. Specifically, a join key column is added to the small table and the entries of the small table are copied multiple times, each copy with a different join key; a join key column is also added to the large table, whose values are random numbers within the range of join key values used in the expanded small table. For example, assume that the small table has only 1 row and the large table has 1000 rows. A join key column is added to the small table with the value 1, and the row is copied four more times with join key values 2 to 5. The join keys in the large table are random numbers between 1 and 5. If there are 200 rows each with key 1, 2, 3, 4 and 5, then joining the two tables on the join key generates 5 Reducers, and the data of the large table is randomly divided into 5 parts (the expansion multiple of the small table), which solves the above problem. However, this solution is in essence still a Cartesian product operation: it introduces redundant data, and it is relatively cumbersome and inefficient to implement in practice.
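The second prior-art scheme can be illustrated with a minimal Python sketch. The function name `salted_join`, the `factor` parameter and the in-memory lists are assumptions made for illustration only; they do not appear in the application, and a real deployment would run the grouping step as a distributed join.

```python
import random
from collections import defaultdict

def salted_join(small, large, factor, seed=0):
    """Emulate a Cartesian product of `small` and `large` via an artificial join key."""
    rng = random.Random(seed)
    # Copy every small-table row once per join key value 1..factor (redundant data).
    small_expanded = [(key, row) for row in small for key in range(1, factor + 1)]
    # Tag every large-table row with one random join key in the same range.
    large_keyed = [(rng.randint(1, factor), row) for row in large]

    # Group both sides by join key; each key could be handled by its own Reducer.
    buckets = defaultdict(lambda: ([], []))
    for key, row in small_expanded:
        buckets[key][0].append(row)
    for key, row in large_keyed:
        buckets[key][1].append(row)

    # Within each bucket, pair every small-table copy with every large-table row.
    joined = [(s, l)
              for key in sorted(buckets)
              for s in buckets[key][0]
              for l in buckets[key][1]]
    return small_expanded, joined

small = ["s1"]                          # 1 row, as in the example above
large = [f"r{i}" for i in range(1000)]  # 1000 rows
small_expanded, joined = salted_join(small, large, factor=5)
print(len(small_expanded), len(joined))  # 5 1000
```

The output size is still the full Cartesian product (1 x 1000 pairs), and the small table had to be stored five times, which is the redundancy the application objects to.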
In summary, there is an urgent need for an efficient data processing method across multiple data tables, which is capable of handling large data.
Disclosure of Invention
The application provides a data processing method, device, medium and computing equipment spanning multiple data tables, so that big data can be efficiently processed.
In one aspect, the present application provides a data processing method across multiple data tables, including:
obtaining a plurality of first data tables, wherein each row of each first data table in the plurality of first data tables comprises an object identifier and a plurality of characteristic data of an object identified by the object identifier;
converting each row of each first data table in a plurality of first data tables into a sub data table, wherein each row of the sub data table comprises the object identifier and characteristic data of the object identified by the object identifier, and the sub data table corresponding to the first data table forms a second data table;
and performing table connection on the second data tables corresponding to each first data table by taking the feature data in the second data tables as a connection key to obtain a target data table, wherein each row in the target data table comprises one feature data and at least one object identifier corresponding to the feature data.
In some possible embodiments, the converting each row of each of the plurality of first data tables into one sub data table includes:
according to the plurality of characteristic data included in each row of the first data table, splitting each row of the first data table into a sub data table comprising a plurality of rows, wherein the number of rows of the sub data table is the same as the number of the plurality of characteristic data.
In some possible embodiments, the performing table join on the second data table corresponding to each first data table by using the feature data in the second data table as a join key to obtain a target data table includes:
and selecting one second data table from the second data tables corresponding to each first data table as a main table, using the rest second data tables as auxiliary tables, using the characteristic data in each second data table as a connecting key, and connecting the auxiliary table to the main table to obtain a target data table.
In some possible embodiments, the method further comprises:
and determining the association relation between the objects identified by the object identifications from different first data tables in the target data table by taking the characteristic data in the target data table as a basis.
In some possible embodiments, the determining the association relationship between the objects identified by the object identifiers from the different first data tables includes:
determining the number of the same characteristic data among the objects identified by a plurality of object identifications in the target data table, wherein the plurality of object identifications are respectively from a plurality of different first data tables.
In some possible embodiments, the determining the association relationship between the objects identified by the object identifiers from the different first data tables includes:
determining the objects, identified by object identifiers from different first data tables, that correspond to target feature data in the target data table; or
determining, in the target data table, the object identifiers of other objects that have the same characteristic data as a target object.
In some possible embodiments, the determining, based on the feature data in the target data table, an association relationship between objects identified by object identifiers from different first data tables in the target data table includes:
obtaining a target task, wherein the target task comprises: determining an association relation between objects identified by object identifications from different first data tables in the target data table;
and according to the target task, mapping the target data table into a plurality of reduce tasks by taking the characteristic data as the primary key, and completing the target task by running the plurality of reduce tasks in a distributed manner.
In another aspect, the present application provides a data processing apparatus across multiple data tables, comprising:
the acquisition module is used for acquiring a plurality of first data tables, wherein each row of each first data table in the plurality of first data tables comprises an object identifier and a plurality of characteristic data of an object identified by the object identifier;
the conversion module is used for converting each row of each first data table in a plurality of first data tables into a sub data table, each row of the sub data table comprises the object identifier and characteristic data of the object identified by the object identifier, and the sub data tables corresponding to the first data tables form a second data table;
and the connection module is used for performing table connection on the second data tables corresponding to each first data table by taking the feature data in the second data tables as connection keys to obtain a target data table, wherein each row in the target data table comprises one feature data and at least one object identifier corresponding to the feature data.
In some possible embodiments, the conversion module includes:
the data table splitting unit is configured to split each row of the first data table into a sub data table comprising a plurality of rows according to the plurality of feature data included in each row of the first data table, where the number of rows of the sub data table is the same as the number of the plurality of feature data.
In some possible embodiments, the connection module includes:
and the left connecting unit is used for selecting one second data table from the second data tables corresponding to each first data table as a main table, using the rest second data tables as auxiliary tables, using the characteristic data in each second data table as a connecting key, and connecting the auxiliary tables to the main table to obtain a target data table.
In some possible embodiments, the apparatus further comprises:
and the association relation determining module is used for determining the association relation between the objects identified by the object identifications from different first data tables in the target data table according to the characteristic data in the target data table.
In some possible embodiments, the association relation determining module includes:
and the same characteristic quantity determining unit is used for determining the quantity of the same characteristic data which is contained between the objects identified by a plurality of object identifications in the target data table, wherein the plurality of object identifications are respectively from a plurality of different first data tables.
In some possible embodiments, the association relation determining module includes:
the target characteristic association query unit is used for determining the objects, identified by object identifiers from different first data tables, that correspond to target feature data in the target data table; or
the target object association query unit is used for determining, in the target data table, the object identifiers of other objects that have the same characteristic data as a target object.
In some possible embodiments, the association relation determining module includes:
a task obtaining unit, configured to obtain a target task, where the target task includes: determining an association relation between objects identified by object identifications from different first data tables in the target data table;
and the task running unit is used for mapping the target data table into a plurality of reduce tasks by taking the characteristic data as the primary key according to the target task, and completing the target task by running the plurality of reduce tasks in a distributed manner.
In yet another aspect, the present application provides a computer-readable storage medium having a computer program stored therein, which when executed by a processor performs the data processing method across multiple data tables provided by the present application.
In yet another aspect, the present application provides a computing device comprising: the data processing system comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the computer program and executes the data processing method across multiple data tables.
According to the data processing method, device, medium and computing equipment provided by the application, each row of each first data table is split into a sub data table, yielding a plurality of second data tables in which object identifiers and feature data are in one-to-one correspondence. The feature data in the second data tables are then used as connection keys to perform table connection on the plurality of second data tables, yielding a target data table that takes the feature data as the thread and integrates and associates the object identifiers from different first data tables. Based on the target data table, the association relation among the objects identified by object identifiers from different first data tables can be determined with the feature data as the foothold, which facilitates further data analysis. Compared with the prior art, the application abandons the traditional approach of taking the object as the foothold and either applying a Cartesian product or introducing redundant data such as new connection keys. Instead, it shifts perspective according to actual business requirements and takes the characteristic data as the foothold, associating objects in a plurality of data tables through row splitting and table connection without introducing redundant data. Because the generated target data table has a connection key, it can be distributed to a plurality of Reducers for execution, which gives higher data processing capacity and efficiency. The data processing method, device, medium and computing equipment across multiple data tables provided by the application are therefore well suited to big data processing.
Drawings
FIG. 1 is a flowchart of a data processing method across multiple data tables according to an embodiment of the present application;
FIG. 2 is a block diagram of a data processing apparatus across multiple data tables according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a computer-readable storage medium provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of an exemplary computing device according to an embodiment of the present application.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are merely for illustrating the technical solutions of the present application more clearly, and therefore are only examples, and the protection scope of the present application is not limited thereby.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.
In addition, the terms "first" and "second" are used to distinguish different objects, and are not used to describe a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Hereinafter, some terms in the present application are explained to facilitate understanding by those skilled in the art.
Data table: a table stored in a database. It is one of the most important components of a database and stores data organized in rows and columns.
Object: an entity that exists in the objective world, such as a person or a thing; for example, users and products may all be referred to as objects.
Object identifier: identification data that represents an object in a data table, such as a user name, a user ID (unique identifier), or a product code.
Characteristic data: data describing one or more characteristics of an object. For a user, for example, height, hobbies, things the user has done, and things related to the user can all be regarded as characteristic data.
Table connection: an operation that connects a plurality of data tables horizontally into a new data table. It is divided into equivalent connection and non-equivalent connection. Equivalent connection uses the data in a given field as the connection key and connects the rows whose key values are equal by comparing the key values; non-equivalent connection, as the term is used here, connects rows directly without comparing data.
Left connection: a form of equivalent connection in which one of the data tables is selected as the main table and the remaining data tables serve as auxiliary tables; the main table is placed leftmost, and the auxiliary tables are connected by equivalent connection to the right of the main table.
Cartesian product: also called the direct product. In database processing it is an operation that pairs every row of one data table with every row of the others to generate a new data table, and it belongs to non-equivalent connection. The number of rows of the new data table obtained by a Cartesian product is the product of the numbers of rows of the original data tables.
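The difference between a Cartesian product and an equivalent connection, as defined above, can be seen in a small Python sketch. The two toy tables and their contents are hypothetical and chosen only to make the row counts easy to check.

```python
from itertools import product

# Hypothetical toy tables: each row is (object identifier, key value).
table_a = [("A1", "x"), ("A2", "y"), ("A3", "y")]
table_b = [("B1", "y"), ("B2", "z")]

# Cartesian product: every row of table_a is paired with every row of table_b.
cartesian = [row_a + row_b for row_a, row_b in product(table_a, table_b)]

# Equivalent connection: keep only the pairs whose key values are equal.
equi_join = [row_a + row_b
             for row_a, row_b in product(table_a, table_b)
             if row_a[1] == row_b[1]]

print(len(cartesian))  # 6, the product of the row counts 3 and 2
print(equi_join)       # [('A2', 'y', 'B1', 'y'), ('A3', 'y', 'B1', 'y')]
```

The equivalent connection is written here as a filtered product only for brevity; because it has a key, it can instead be evaluated by grouping rows per key value, which is what allows the work to be spread over several Reducers.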
"plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Embodiments of the present application are described below with reference to the drawings.
Fig. 1 is a flowchart of a data processing method across multiple data tables according to an embodiment of the present application. As shown in fig. 1, the data processing method across multiple data tables includes the following steps:
step S101: obtaining a plurality of first data tables, wherein each row of each first data table in the plurality of first data tables comprises an object identifier and a plurality of characteristic data of an object identified by the object identifier.
Each line of data of the first data table describes a corresponding relationship between an object identifier of an object and a plurality of feature data of the object, and the plurality of feature data may be separated by using separators in the same column or may be located in different columns.
Step S102: converting each row of each first data table in the plurality of first data tables into a sub data table, wherein each row of the sub data table comprises the object identifier and characteristic data of the object identified by the object identifier, and the sub data table corresponding to the first data table forms a second data table.
In each row of the first data table, there is a one-to-many correspondence between the object identifier and the feature data, and this step S102 processes the first data table, so that there is a one-to-one correspondence between the object identifier and the feature data in each row of the processed second data table.
When each row of the first data table is converted into one sub data table, each row of the first data table may be split into a sub data table comprising a plurality of rows according to the plurality of feature data included in that row, and the number of rows of the sub data table is the same as the number of the plurality of feature data. The split sub data tables are then arranged according to the order of the corresponding rows in the first data table to form a second data table.
In the specific implementation of step S102, each row of the first data table may be converted into one sub data table by a row-column conversion method for data tables. For example, in Hive (a Hadoop-based data warehouse tool that maps structured data files to database tables, provides simple SQL querying, and converts SQL statements into MapReduce tasks to run), the explode function may be used to split the plurality of feature data located in a row A into a plurality of rows, used together with LATERAL VIEW to copy the object identifier of row A into each split row so that it corresponds one-to-one to each feature data, thereby obtaining the sub data table.
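The row splitting of step S102 can be sketched in plain Python as follows. This is an illustration of the idea rather than the Hive implementation; the function name `explode_rows` and the sample rows are assumptions, and each first-table row is assumed to already hold its feature data as a list.

```python
def explode_rows(first_table):
    """Split each (object_id, [features]) row into one (object_id, feature) row per feature."""
    second_table = []
    for object_id, features in first_table:
        # Sub data table for this row: the identifier is copied onto every split row.
        sub_table = [(object_id, feature) for feature in features]
        # Sub data tables are appended in the order of the original rows.
        second_table.extend(sub_table)
    return second_table

# Hypothetical first data table: users and the articles they have read.
first_table = [
    ("A1", ["article1", "article2"]),
    ("A2", ["article2", "article3", "article4"]),
]
second_table = explode_rows(first_table)
for row in second_table:
    print(row)
```

Each first-table row with n feature data yields exactly n rows in the second data table, so the one-to-many correspondence becomes one-to-one.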
It should be noted that different row-column conversion algorithms may place different requirements on the first data table to be processed. It may therefore be necessary to pre-process the first data table according to the requirements of the chosen row-column conversion algorithm before applying that algorithm. For example, the explode function mentioned above requires that, in the first data table to be processed, the plurality of feature data of each row be stored in the same cell in array form. Therefore, if the plurality of feature data of each row are located in different cells in the initial state of the first data table, they need to be merged into the same cell and stored in array form through pre-processing; if the feature data of each row are stored in the same cell but not in array form, they need to be converted into array form through pre-processing.
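The pre-processing described above can be sketched as a small normalization helper. The name `to_array_form`, the comma separator and the tuple-based row layout are assumptions for illustration; the application does not prescribe a separator or a storage format.

```python
def to_array_form(row, separator=","):
    """Return (object_id, [features]) for a row in either of the two layouts."""
    object_id, *cells = row
    if len(cells) == 1 and isinstance(cells[0], str):
        # Layout 1: all feature data in one cell, separated by a delimiter.
        features = [part.strip() for part in cells[0].split(separator)]
    else:
        # Layout 2: one feature data per cell (column).
        features = list(cells)
    # Drop empty entries such as blank cells or trailing separators.
    return object_id, [feature for feature in features if feature]

print(to_array_form(("A1", "article1, article2,article3")))
print(to_array_form(("A2", "article2", "article4")))
```

After this step every row has the same shape, an identifier plus an array of feature data, which is the input form the row-splitting step expects.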
Step S103: and performing table connection on the second data tables corresponding to each first data table by taking the feature data in the second data tables as a connection key to obtain a target data table, wherein each row in the target data table comprises one feature data and at least one object identifier corresponding to the feature data.
Compared with the original first data table, each row of the second data table generated in step S102 has only one feature data, so that the feature data can be used as a connection key to connect the plurality of second data tables in an equivalent manner to obtain a target data table, where each row of the target data table includes one feature data and at least one object identifier corresponding to the feature data, and the object identifiers are from different first data tables.
It is easy to understand that the target data table integrates and associates object identifiers from different first data tables together by taking feature data as a clue and a foothold, so that the association relationship between the object identifiers is clearer, and correlation problems between the objects can be analyzed more easily based on the target data table.
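Step S103 can be sketched for two second data tables as an equivalent connection keyed on the feature data. The function name `join_on_feature` and the sample rows are hypothetical; the index built on one side stands in for the per-key grouping a distributed join would perform.

```python
from collections import defaultdict

def join_on_feature(second_a, second_b):
    """Join two second data tables of (object_id, feature) rows on the feature."""
    # Index the second table by feature so matching rows are found per key.
    index_b = defaultdict(list)
    for object_id, feature in second_b:
        index_b[feature].append(object_id)

    target = []
    for id_a, feature in second_a:
        for id_b in index_b.get(feature, []):
            # Each target row: one feature plus one identifier from each source table.
            target.append((feature, id_a, id_b))
    return target

second_a = [("A1", "x"), ("A1", "y"), ("A2", "y")]
second_b = [("B1", "y"), ("B2", "z")]
target_table = join_on_feature(second_a, second_b)
print(target_table)  # [('y', 'A1', 'B1'), ('y', 'A2', 'B1')]
```

Only pairs of objects that actually share a feature appear in the result, in contrast to a Cartesian product, which would produce every pair of objects regardless of their features.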
In practical applications, not every feature data is necessarily recorded in every first data table. For example, given 3 first data tables, feature data A may be recorded in only one of them, in which case feature data A cannot be used to derive any correlation between objects identified by object identifiers from different first data tables. Feature data of this kind may therefore be filtered out, as required, during the execution of steps S101-S103. For example, on the basis of any implementation manner of the first embodiment of the present application, step S103 may include: selecting one second data table from the second data tables corresponding to each first data table as a main table, using the remaining second data tables as auxiliary tables, using the characteristic data in each second data table as the connection key, and connecting the auxiliary tables to the main table to obtain the target data table. Because the table connection in this embodiment is performed as a left connection, the target data table retains only the feature data in the main table and the object identifiers corresponding to them; feature data that are recorded only in an auxiliary table, together with their object identifiers, are filtered out. On the one hand this avoids interference from unnecessary feature data in further analysis; on the other hand it reduces the data volume of the target data table and improves overall data processing efficiency.
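The filtering effect of the left connection can be sketched as follows. The function name `left_join_on_feature`, the use of `None` for a missing match and the sample rows are assumptions for illustration.

```python
from collections import defaultdict

def left_join_on_feature(main, auxiliary):
    """Left-join (object_id, feature) rows of `auxiliary` onto `main` by feature."""
    index_aux = defaultdict(list)
    for object_id, feature in auxiliary:
        index_aux[feature].append(object_id)

    target = []
    for id_main, feature in main:
        matches = index_aux.get(feature)
        if matches:
            for id_aux in matches:
                target.append((feature, id_main, id_aux))
        else:
            # Main-table rows are always kept; the auxiliary side is left empty.
            target.append((feature, id_main, None))
    return target

main = [("A1", "x"), ("A1", "y")]
auxiliary = [("B1", "y"), ("B2", "z")]
target_table = left_join_on_feature(main, auxiliary)
print(target_table)  # [('x', 'A1', None), ('y', 'A1', 'B1')]
```

Feature `z`, which occurs only in the auxiliary table, does not appear in the result, while feature `x` from the main table is kept even though it has no match.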
In the data processing method across multiple data tables provided in the embodiment of the application, each row of each first data table is split into a sub data table, yielding a plurality of second data tables in which object identifiers and feature data are in one-to-one correspondence. The feature data in the second data tables are then used as connection keys to perform table connection on the plurality of second data tables, yielding a target data table that takes the feature data as the thread and integrates and associates the object identifiers from different first data tables. Based on the target data table, the association relationship between the objects identified by object identifiers from different first data tables can be determined with the feature data as the foothold, which facilitates further data analysis. Compared with the prior art, the embodiment abandons the traditional approach of taking the object as the foothold and either applying a Cartesian product or introducing redundant data such as new connection keys. Instead, it shifts perspective according to actual business requirements and takes the characteristic data as the foothold, associating objects in a plurality of data tables through row splitting and table connection without introducing redundant data. Because the generated target data table has a connection key, it can be distributed to a plurality of Reducers for execution, which gives higher data processing capacity and efficiency. The data processing method across multiple data tables provided by the embodiment of the application is therefore well suited to big data processing.
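Why a connection key allows the work to be spread over several Reducers can be sketched with a simple hash partitioner. This is an assumption-laden illustration, not the partitioning logic of any particular MapReduce framework: the function name `partition_by_feature`, the choice of `zlib.crc32` as the hash and the sample rows are all hypothetical.

```python
import zlib
from collections import defaultdict

def partition_by_feature(target_table, num_reducers):
    """Assign each target-table row to a reducer index derived from its feature key."""
    partitions = defaultdict(list)
    for row in target_table:
        feature = row[0]
        # A stable hash of the key decides the reducer, so equal keys stay together.
        reducer = zlib.crc32(feature.encode("utf-8")) % num_reducers
        partitions[reducer].append(row)
    return dict(partitions)

target_table = [
    ("x", "A1", "B1"),
    ("y", "A1", "B2"),
    ("y", "A2", "B1"),
    ("z", "A3", "B3"),
]
partitions = partition_by_feature(target_table, num_reducers=3)
for reducer, rows in sorted(partitions.items()):
    print(reducer, rows)
```

All rows that share a feature land on the same reducer, so each reducer can analyze its features independently; a Cartesian product offers no such key to split on.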
As described above, the target data table obtained in step S103 integrates and associates the object identifiers from different first data tables with the feature data as a clue and a point of interest, so that the association relationship between the object identifiers is more clear, and the correlation problem between the objects can be more easily analyzed based on the target data table. Therefore, on the basis of the embodiment shown in fig. 1, the data processing method across multiple data tables may further include: and determining the association relation between the objects identified by the object identifications from different first data tables in the target data table by taking the characteristic data in the target data table as a basis. The association relationship includes a statistical result of correlation between objects determined by using the feature data as a medium, which may be represented by a quantity, or may be represented in a form of a list, a text, or the like, and the embodiment of the present application is not limited to a specific form thereof.
Correspondingly, determining the association relationship between the objects identified by the object identifiers from different first data tables in the target data table may include determining the number of identical feature data shared by the objects identified by a plurality of object identifiers in the target data table, where the object identifiers come from different first data tables. An example is determining the number of identical articles read by each pair of users drawn from two groups.
As another example, determining the association relationship between objects identified by object identifiers from different first data tables may further include: determining the objects, identified by object identifiers from different first data tables, that correspond to target feature data in the target data table; or determining the object identifiers of other objects in the target data table that have the same feature data as a target object. Examples are finding the users in two groups who read the same article, or obtaining a list of the other users who read the same article as user A.
Because the target data table is obtained by an equi-join (equivalent connection) on a connection key, MapReduce may be used to perform the data analysis through distributed operation. On the basis of any implementation manner of the embodiment of the present application, determining the association relationship between objects identified by object identifiers from different first data tables in the target data table based on the feature data in the target data table may include:
obtaining a target task, wherein the target task comprises: determining an association relation between objects identified by object identifications from different first data tables in the target data table;
and, according to the target task, mapping the target data table into a plurality of reduce tasks by taking the feature data as the key, and completing the target task through distributed operation of the plurality of reduce tasks.
Through distributed operation, this embodiment can effectively improve data processing efficiency, and can avoid the problem that a single task processor stalls because the computation load overflows when it processes big data.
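The idea of keying the target data table on feature data so that several reduce tasks can share the work can be simulated in-process with a short Python sketch. This is only an illustration under stated assumptions: it does not use a real MapReduce framework, the partitioning by `zlib.crc32` is an arbitrary deterministic choice, and the function names are hypothetical.

```python
import zlib
from collections import Counter


def partition_by_feature(rows, num_reducers):
    """Map step: route each (feature, left_id, right_id) row to a reducer
    chosen from the feature value, so equal features land together."""
    parts = [[] for _ in range(num_reducers)]
    for row in rows:
        feature = row[0]
        parts[zlib.crc32(feature.encode("utf-8")) % num_reducers].append(row)
    return parts


def reduce_partition(rows):
    """Reduce step: count shared features per pair of object identifiers."""
    return Counter((left, right) for _, left, right in rows if right is not None)


def run_distributed_count(rows, num_reducers=3):
    """Merge the partial counts produced by every simulated reducer."""
    total = Counter()
    for part in partition_by_feature(rows, num_reducers):
        total += reduce_partition(part)
    return total


rows = [
    ("C1", "A1", "B1"),
    ("C1", "A1", "B3"),
    ("C3", "A1", "B3"),
    ("C5", "A1", None),   # no partner in the other table
]
pair_counts = run_distributed_count(rows, num_reducers=3)
```

Because each row carries its own key, the partial counts can be computed independently and summed afterwards, which is the property that a Cartesian product without a key lacks.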
Next, on the basis of the embodiment shown in fig. 1, the data processing method across multiple data tables provided in the present application is explained with reference to a specific example as follows:
In real life, there are many requirements for analyzing relevance. For example, a content publisher may analyze the common interests of users and recommend to associated users the articles, videos and other content that each of them likes; as another example, a friend-making website may analyze the common interests of users and recommend user B, who shares the interests of user A, to user A as a potential friend; and so on.
These requirements give rise to a large number of correlation problems, such as counting the number of identical articles read by each pair of users across two groups. This example is described as follows.
In the prior art, on the one hand, a common method for performing correlation analysis is to combine multiple data tables into one data table through a Cartesian product and then analyze it with MapReduce; on the other hand, Hive supports distributed operation with MapReduce well, and has thus become an important tool for big data analysis in the industry. However, because a Cartesian product is a non-equi join, it is very difficult to convert into a MapReduce task for execution, and Hive has weak support for Cartesian product operations: Hive can use only 1 Reducer to complete a task that has no connection key, and in practical applications, changing the number of Reducers through Hive settings has no effect.
Take counting the number of identical articles read by each pair of users across two groups as an example. Assume that the two groups of users are stored in two data tables, seed_user and all_user. Both data tables include the fields user_id and item_ids: the field user_id stores the unique identifier of a user (i.e., the object identifier), and the field item_ids stores the list of IDs of the articles the user has read (i.e., the feature data). item_ids is string-type data in which multiple article IDs are separated by commas.
For ease of understanding, examples of the data tables seed_user and all_user are given in Table 1 and Table 2 below, respectively, where each row describes the correspondence between a user_id and its item_ids:
TABLE 1
user_id item_ids
A1 C1,C3,C5
A2 C2,C4,C5
A3 C3,C6,C7
TABLE 2
user_id item_ids
B1 C1,C2,C6
B2 C3,C4,C7
B3 C1,C3,C4
With the correlation analysis method based on the Cartesian product, the processing is as follows:
1. Extract user_id, item_ids and the number length1 of article IDs in item_ids from the data table seed_user to generate data table t1;
2. Extract user_id, item_ids and the number length2 of article IDs in item_ids from the data table all_user to generate data table t2;
3. Compute the Cartesian product of data table t1 and data table t2 to obtain a Cartesian product table, shown in Table 3 below:
TABLE 3
seed_user_id seed_item_ids length1 all_user_id all_item_ids length2
A1 C1,C3,C5 3 B1 C1,C2,C6 3
A1 C1,C3,C5 3 B2 C3,C4,C7 3
A1 C1,C3,C5 3 B3 C1,C3,C4 3
A2 C2,C4,C5 3 B1 C1,C2,C6 3
A2 C2,C4,C5 3 B2 C3,C4,C7 3
A2 C2,C4,C5 3 B3 C1,C3,C4 3
A3 C3,C6,C7 3 B1 C1,C2,C6 3
A3 C3,C6,C7 3 B2 C3,C4,C7 3
A3 C3,C6,C7 3 B3 C1,C3,C4 3
4. Extract seed_user_id, all_user_id, the de-duplicated union a_item_ids of seed_item_ids and all_item_ids, length1 and length2 from the Cartesian product table to generate data table a, shown in Table 4 below:
TABLE 4
seed_user_id all_user_id a_item_ids length1 length2
A1 B1 C1,C2,C3,C5,C6 3 3
A1 B2 C1,C3,C4,C5,C7 3 3
A1 B3 C1,C3,C4,C5 3 3
A2 B1 C1,C2,C4,C5,C6 3 3
A2 B2 C2,C3,C4,C5,C7 3 3
A2 B3 C1,C2,C3,C4,C5 3 3
A3 B1 C1,C2,C3,C6,C7 3 3
A3 B2 C3,C4,C6,C7 3 3
A3 B3 C1,C3,C4,C6,C7 3 3
5. Extract seed_user_id, all_user_id, the number length of article IDs in a_item_ids, length1 and length2 from data table a to generate data table t, shown in Table 5 below:
TABLE 5
seed_user_id all_user_id length length1 length2
A1 B1 5 3 3
A1 B2 5 3 3
A1 B3 4 3 3
A2 B1 5 3 3
A2 B2 5 3 3
A2 B3 5 3 3
A3 B1 5 3 3
A3 B2 4 3 3
A3 B3 5 3 3
6. Extract the three fields seed_user_id, all_user_id and length1 + length2 - length from data table t and group by them to obtain the number of identical articles read by each pair of seed_user_id and all_user_id, namely length1 + length2 - length, as shown in Table 6 below:
TABLE 6
seed_user_id all_user_id length1+length2-length
A1 B1 1
A1 B2 1
A1 B3 2
A2 B1 1
A2 B2 1
A2 B3 1
A3 B1 1
A3 B2 2
A3 B3 1
In Hive, this can be executed with reference to the following execution statements:
[Figures BDA0001416319030000141 and BDA0001416319030000151: Hive execution statements, images not reproduced here]
It should be noted that the above execution statements are not completely consistent with the above example; those skilled in the art may implement the execution statements accordingly.
From the above description of the correlation analysis method based on the Cartesian product, it can be seen that, on the one hand, redundant data such as length1, length2 and length are introduced during processing, which increases the computation load of the processor and reduces efficiency; on the other hand, the table generated by the Cartesian product is large, and for big data in particular the Cartesian product table may be too large to process. Moreover, because the Cartesian product is a non-equi join, it is difficult to run as a distributed MapReduce operation. In summary, this solution is inefficient and poorly suited to correlation analysis of big data.
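For comparison, the six Cartesian-product steps above can be condensed into a small Python sketch using the data of Table 1 and Table 2. It is an in-memory illustration only, not the Hive statements of the application; the function name `common_counts_cartesian` is hypothetical.

```python
from itertools import product

# Data from Table 1 (seed_user) and Table 2 (all_user).
seed_user = {"A1": "C1,C3,C5", "A2": "C2,C4,C5", "A3": "C3,C6,C7"}
all_user = {"B1": "C1,C2,C6", "B2": "C3,C4,C7", "B3": "C1,C3,C4"}


def common_counts_cartesian(seed, others):
    """Pair every seed user with every other user (Cartesian product) and
    apply length1 + length2 - length, where length is the size of the
    de-duplicated union of the two article lists."""
    counts = {}
    for (s_id, s_items), (o_id, o_items) in product(seed.items(), others.items()):
        s_set = set(s_items.split(","))
        o_set = set(o_items.split(","))
        length1, length2 = len(s_set), len(o_set)
        length = len(s_set | o_set)
        counts[(s_id, o_id)] = length1 + length2 - length
    return counts


cartesian_counts = common_counts_cartesian(seed_user, all_user)
```

The number of pairs examined is the product of the two table sizes (9 here), regardless of how few articles the users actually share, which is the cost the paragraph above describes.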
By adopting the method provided by the embodiment of the application, the processing process is as follows:
1. The data tables seed_user and all_user are the first data tables. Each is split by item_ids to obtain the corresponding second data tables t1 and t2, shown in Table 7 and Table 8 below, respectively:
TABLE 7
seed_user_id item_ids
A1 C1
A1 C3
A1 C5
A2 C2
A2 C4
A2 C5
A3 C3
A3 C6
A3 C7
TABLE 8
all_user_id item_ids
B1 C1
B1 C2
B1 C6
B2 C3
B2 C4
B2 C7
B3 C1
B3 C3
B3 C4
2. Left-join the second data tables t1 and t2, using the field item_ids as the connection key, to obtain the target data table a, shown in Table 9 below:
TABLE 9
item_ids seed_user_id all_user_id
C1 A1 B1
C1 A1 B3
C2 A2 B1
C3 A1 B2
C3 A1 B3
C3 A3 B2
C3 A3 B3
C4 A2 B2
C4 A2 B3
C5 A1
C5 A2
C6 A3 B1
C7 A3 B2
3. Count item_ids for each distinct combination of seed_user_id and all_user_id to obtain count_num, the number of identical articles read by each pair of seed_user_id and all_user_id, as shown in Table 10 below:
TABLE 10
seed_user_id all_user_id count_num
A1 B1 1
A1 B2 1
A1 B3 2
A2 B1 1
A2 B2 1
A2 B3 1
A3 B1 1
A3 B2 2
A3 B3 1
In Hive, this can be executed with reference to the following execution statements:
[Figures BDA0001416319030000171 and BDA0001416319030000181: Hive execution statements, images not reproduced here]
It should be noted that the above execution statements are not completely consistent with the above example, and further involve preprocessing of the first data tables. For example, the split(item_ids, ',') function converts the item_ids field into an array, which facilitates row-to-column conversion with the explode function. As another example, left outer join, where and similar clauses filter out unnecessary item_ids (articles that no user in the seed_user table has read are not considered), which speeds up task execution. Those skilled in the art may implement the execution statements accordingly.
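The three steps of the method (row splitting, left join on item_ids, counting per user pair) can be reproduced on the data of Table 1 and Table 2 with the following Python sketch. It mirrors the logic only; it is not the Hive code of the application, and the helper names `explode_rows`, `left_join_on_item` and `count_common` are hypothetical.

```python
from collections import Counter, defaultdict

seed_user = [("A1", "C1,C3,C5"), ("A2", "C2,C4,C5"), ("A3", "C3,C6,C7")]
all_user = [("B1", "C1,C2,C6"), ("B2", "C3,C4,C7"), ("B3", "C1,C3,C4")]


def explode_rows(table):
    """Step 1: turn each (user_id, 'C1,C3') row into one row per article."""
    return [(uid, item) for uid, items in table for item in items.split(",")]


def left_join_on_item(t1, t2):
    """Step 2: left-join t2 onto t1 with the article ID as connection key."""
    index = defaultdict(list)
    for uid, item in t2:
        index[item].append(uid)
    rows = []
    for seed_id, item in t1:
        for all_id in index.get(item) or [None]:
            rows.append((item, seed_id, all_id))
    return rows


def count_common(target):
    """Step 3: count shared articles per (seed_user_id, all_user_id) pair."""
    return Counter((s, a) for _, s, a in target if a is not None)


t1 = explode_rows(seed_user)         # corresponds to Table 7
t2 = explode_rows(all_user)          # corresponds to Table 8
target = left_join_on_item(t1, t2)   # corresponds to Table 9 (row order differs)
count_num = count_common(target)     # corresponds to Table 10
```

The join produces 13 rows, against the 9 full-width rows plus three derived tables of the Cartesian approach, and every row already carries the key on which work can be partitioned.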
In addition, it should be noted that the above process may give rise to data skew: a popular article may be read by a very large number of users. This can be addressed by filtering out popular articles, for example by removing any item_id whose audience exceeds a certain size, or the top-k item_ids by audience. Deleting this very small number of item_ids does not materially affect the final statistics, i.e., the correlation analysis result is not greatly affected.
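One way to apply the audience-size filter mentioned above is sketched below in Python. The threshold value and the function name `drop_hot_items` are assumptions made for illustration; the application does not specify how the cut-off is chosen.

```python
from collections import Counter


def drop_hot_items(rows, max_audience):
    """Remove every (user_id, item_id) row whose item_id was read by more
    than max_audience distinct users, to limit skew on hot join keys."""
    # De-duplicate first so a user who appears twice is counted once.
    audience = Counter(item for _, item in set(rows))
    return [(user, item) for user, item in rows if audience[item] <= max_audience]


rows = [("B1", "C1"), ("B2", "C1"), ("B3", "C1"), ("B1", "C2")]
filtered = drop_hot_items(rows, max_audience=2)   # C1 has 3 readers and is dropped
```

A top-k variant would instead take `audience.most_common(k)` and drop those item_ids; either filter would be applied to the second data tables before the join.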
From the above description, it can be seen that one of the technical ideas of the method provided by the embodiments of the present application is this: when processing field B would require a Cartesian product operation keyed on field A, the operation is converted into a join operation on field B. Based on this idea, on the one hand, the method does not need to introduce extra redundant data, and avoids the increased computation load and reduced efficiency that redundant data causes in the prior art. On the other hand, the embodiment changes the angle from which the problem is solved and analyzes the correlation between two groups of users with the article ID (i.e., the feature data), rather than the user ID, as the base point, so that the overall processing has simpler steps and a smaller data volume, and data processing efficiency is effectively improved. In addition, because item_ids is used as the connection key of an equi-join, MapReduce can be used for distributed operation in a specific implementation, so the method is not limited by the capacity and efficiency of a single processor and can perform correlation analysis of big data efficiently.
It is easily understood that the target data table a generated in the above processing lists the detailed association among seed_user_id, all_user_id and item_ids. Based on the target data table, besides counting the number of identical articles read by each pair of users, other correlation analyses can be performed, for example finding the users B1 and B3 who read the same article C1 as user A1, or finding the users A2 and B1 who both read article C2, and so on. The target data table obtained by the embodiment of the present application can therefore be used for more kinds of correlation analysis and has wider application.
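The two lookups named above can be run directly against the rows of Table 9, as in the following Python sketch. The function names `co_readers` and `readers_of` are hypothetical and serve only to illustrate the queries.

```python
# Rows of Table 9: (item_ids, seed_user_id, all_user_id); None marks no match.
TABLE_9 = [
    ("C1", "A1", "B1"), ("C1", "A1", "B3"), ("C2", "A2", "B1"),
    ("C3", "A1", "B2"), ("C3", "A1", "B3"), ("C3", "A3", "B2"),
    ("C3", "A3", "B3"), ("C4", "A2", "B2"), ("C4", "A2", "B3"),
    ("C5", "A1", None), ("C5", "A2", None), ("C6", "A3", "B1"),
    ("C7", "A3", "B2"),
]


def co_readers(target, seed_id, item_id):
    """Users of the other table who read item_id, as seed_id also did."""
    return sorted({a for i, s, a in target
                   if i == item_id and s == seed_id and a is not None})


def readers_of(target, item_id):
    """All users from both tables who read item_id."""
    seeds = sorted({s for i, s, _ in target if i == item_id})
    others = sorted({a for i, _, a in target if i == item_id and a is not None})
    return seeds, others


same_as_a1_on_c1 = co_readers(TABLE_9, "A1", "C1")
c2_readers = readers_of(TABLE_9, "C2")
```

Both queries are simple filters on the target data table, with no further join needed.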
Fig. 2 is a block diagram of a data processing apparatus across multiple data tables according to an embodiment of the present application. The data processing apparatus for multiple data tables provided in this embodiment and the data processing method for multiple data tables provided in the foregoing embodiment have the same inventive concept, and therefore, some contents are not described again, and please refer to the foregoing embodiment for understanding.
As shown in fig. 2, an embodiment of the present application provides a data processing apparatus 2 across multiple data tables, including:
an obtaining module 21, configured to obtain a plurality of first data tables, where each row of each first data table in the plurality of first data tables includes an object identifier and a plurality of feature data of an object identified by the object identifier;
a conversion module 22, configured to convert each row of each of the plurality of first data tables into a sub data table, where each row of the sub data table includes the object identifier and a feature data of the object identified by the object identifier, and the sub data table corresponding to the first data table forms a second data table;
the connection module 23 is configured to perform table connection on the second data tables corresponding to each first data table by using the feature data in the second data tables as a connection key to obtain a target data table, where each row in the target data table includes one feature data and at least one object identifier corresponding to the feature data.
Optionally, on the basis of any implementation manner of the embodiment of the present application, the conversion module 22 includes:
the data table splitting unit is configured to split each line of the first data table into sub data tables including multiple lines according to multiple pieces of feature data included in each line of the first data table, where the number of lines of the sub data tables is the same as the number of the multiple pieces of feature data.
In practical applications, not every feature datum is necessarily recorded in every first data table. For example, given 3 first data tables, feature data A may be recorded in only one of them, in which case feature data A cannot reveal any correlation between objects identified by object identifiers from different first data tables. Such feature data may therefore be filtered out according to practical requirements. On the basis of any embodiment of the present application, the connection module 23 includes:
a left connecting unit, configured to select one of the second data tables corresponding to the first data tables as a main table, use the remaining second data tables as auxiliary tables, and connect the auxiliary tables to the main table by a left join, with the feature data in each second data table as the connection key, to obtain a target data table.
In this embodiment, because the tables are connected by a left join, the target data table stores only the feature data in the main table and the object identifiers corresponding to that feature data. Feature data recorded only in an auxiliary table, together with its object identifiers, is thereby filtered out. On the one hand, this avoids interference from unnecessary feature data in further analysis; on the other hand, it reduces the data volume of the target data table and improves overall data processing efficiency.
The target data table output by the connection module 23 integrates and associates object identifiers from different first data tables together by using feature data as a clue and a foothold, so that the association relationship between the object identifiers is clearer, and based on the target data table, the correlation problem between the objects can be analyzed more easily, and on the basis of any implementation manner of the embodiment of the present application, the apparatus further includes:
and the association relation determining module is used for determining the association relation between the objects identified by the object identifications from different first data tables in the target data table according to the characteristic data in the target data table.
On the basis of any implementation manner of the embodiment of the present application, the association relation determining module includes:
and the same characteristic quantity determining unit is used for determining the quantity of the same characteristic data which is contained between the objects identified by a plurality of object identifications in the target data table, wherein the plurality of object identifications are respectively from a plurality of different first data tables.
Optionally, on the basis of any implementation manner of the embodiment of the present application, the association relationship determining module may further include:
the target characteristic association query unit is used for determining an object identified by data identification, corresponding to target characteristic data, from different first data tables in the target data table; or
And the target object association query unit is used for determining the object identifiers of other objects which correspond to the target object and have the same characteristic data in the target data table.
Because the target data table is obtained by an equi-join (equivalent connection) on a connection key, data analysis may be performed through distributed operation using MapReduce. On the basis of any implementation manner of the embodiment of the present application, the association relationship determining module includes:
a task obtaining unit, configured to obtain a target task, where the target task includes: determining an association relation between objects identified by object identifications from different first data tables in the target data table;
and a task running unit, configured to map the target data table, according to the target task, into a plurality of reduce tasks by taking the feature data as the key, and to complete the target task through distributed operation of the plurality of reduce tasks.
Through distributed operation, this embodiment can effectively improve data processing efficiency, and can avoid the problem that a single task processor stalls because the computation load overflows when it processes big data.
The data processing device of the cross-multiple data tables provided by the embodiment of the application and the data processing method of the cross-multiple data tables provided by the embodiment of the application have the same inventive concept and have the same beneficial effects.
Fig. 3 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present application. As shown in fig. 3, a computer program is stored in the computer-readable storage medium 3, and when being executed by a processor, the computer program can implement the data processing method across multiple data tables provided in the present application.
The computer-readable storage medium 3 includes permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
The computer-readable storage medium provided by the embodiment of the application and the data processing method across multiple data tables provided by the embodiment of the application have the same beneficial effects from the same inventive concept.
Fig. 4 is a schematic diagram of a computing device according to an embodiment of the present application. As shown in fig. 4, the computing device 4 includes: a processor 40, a memory 41, a bus 42 and a communication interface 43, wherein the processor 40, the communication interface 43 and the memory 41 are connected through the bus 42; the memory 41 stores a computer program that can be executed on the processor 40, and the processor 40 executes the data processing method across multiple data tables provided in the present application when executing the computer program.
The Memory 41 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 43 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, etc. may be used.
The bus 42 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The memory 41 is configured to store a program, and the processor 40 executes the program after receiving an execution instruction, and the data processing method across multiple data tables disclosed in any embodiment of the present application may be applied to the processor 40, or implemented by the processor 40.
The processor 40 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 40. The processor 40 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed by the processor. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or registers. The storage medium is located in the memory 41, and the processor 40 reads the information in the memory 41 and completes the steps of the method in combination with its hardware.
The computing device provided by the embodiment of the application and the data processing method across multiple data tables provided by the embodiment of the application have the same beneficial effects due to the same inventive concept.
It should be noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed computing device, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, or the portion thereof that substantially contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program code.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present disclosure, and the present disclosure should be construed as being covered by the claims and the specification.

Claims (16)

1. A method for processing data across multiple data tables, comprising:
obtaining a plurality of first data tables, wherein each row of each first data table in the plurality of first data tables comprises an object identifier and a plurality of characteristic data of an object identified by the object identifier;
converting each row of each first data table in a plurality of first data tables into a sub data table, wherein each row of the sub data table comprises the object identifier and characteristic data of the object identified by the object identifier, and the sub data table corresponding to the first data table forms a second data table;
and performing table connection on the second data tables corresponding to each first data table by taking the feature data in the second data tables as a connection key to obtain a target data table, wherein each row in the target data table comprises one feature data and at least one object identifier corresponding to the feature data.
2. The method of claim 1, wherein converting each row of each first data table in the plurality of first data tables into a sub data table comprises:
splitting each row of the first data table, according to the plurality of pieces of feature data included in that row, into a sub data table comprising a plurality of rows, wherein the number of rows of the sub data table equals the number of pieces of feature data in that row.
3. The method of claim 1, wherein performing the table join on the second data tables corresponding to the respective first data tables, using the feature data in the second data tables as the join key, to obtain the target data table comprises:
selecting one of the second data tables corresponding to the respective first data tables as a main table, using the remaining second data tables as auxiliary tables, and joining the auxiliary tables to the main table, using the feature data in each second data table as the join key, to obtain the target data table.
4. The method of claim 1, further comprising:
determining, based on the feature data in the target data table, an association relationship between objects identified by object identifiers that come from different first data tables in the target data table.
5. The method of claim 4, wherein determining the association relationship between objects identified by object identifiers from different first data tables comprises:
determining the number of identical pieces of feature data shared among the objects identified by a plurality of object identifiers in the target data table, wherein the plurality of object identifiers come from a plurality of different first data tables, respectively.
6. The method of claim 4, wherein determining the association relationship between objects identified by object identifiers from different first data tables comprises:
determining, in the target data table, the objects that correspond to target feature data and are identified by object identifiers from different first data tables; or
determining, in the target data table, the object identifiers of other objects that have the same feature data as a target object.
7. The method according to any one of claims 4 to 6, wherein determining, based on the feature data in the target data table, the association relationship between objects identified by object identifiers from different first data tables in the target data table comprises:
obtaining a target task, wherein the target task comprises determining the association relationship between objects identified by object identifiers from different first data tables in the target data table; and
mapping, according to the target task, the target data table into a plurality of reduce tasks with the feature data as the primary key, and completing the target task by running the plurality of reduce tasks in a distributed manner.
8. A data processing apparatus across multiple data tables, comprising:
an acquisition module configured to obtain a plurality of first data tables, wherein each row of each first data table comprises an object identifier and a plurality of pieces of feature data of the object identified by the object identifier;
a conversion module configured to convert each row of each first data table into a sub data table, wherein each row of the sub data table comprises the object identifier and feature data of the object identified by the object identifier, and the sub data tables corresponding to a first data table form a second data table; and
a join module configured to perform a table join on the second data tables corresponding to the respective first data tables, using the feature data in the second data tables as the join key, to obtain a target data table, wherein each row of the target data table comprises one piece of feature data and at least one object identifier corresponding to that feature data.
9. The apparatus of claim 8, wherein the conversion module comprises:
a data table splitting unit configured to split each row of the first data table, according to the plurality of pieces of feature data included in that row, into a sub data table comprising a plurality of rows, wherein the number of rows of the sub data table equals the number of pieces of feature data in that row.
10. The apparatus of claim 8, wherein the join module comprises:
a left join unit configured to select one of the second data tables corresponding to the respective first data tables as a main table, use the remaining second data tables as auxiliary tables, and join the auxiliary tables to the main table, using the feature data in each second data table as the join key, to obtain the target data table.
11. The apparatus of claim 8, further comprising:
an association relationship determination module configured to determine, based on the feature data in the target data table, an association relationship between objects identified by object identifiers that come from different first data tables in the target data table.
12. The apparatus of claim 11, wherein the association relationship determination module comprises:
a shared feature count determination unit configured to determine the number of identical pieces of feature data shared among the objects identified by a plurality of object identifiers in the target data table, wherein the plurality of object identifiers come from a plurality of different first data tables, respectively.
13. The apparatus of claim 11, wherein the association relationship determination module comprises:
a target feature association query unit configured to determine, in the target data table, the objects that correspond to target feature data and are identified by object identifiers from different first data tables; or
a target object association query unit configured to determine, in the target data table, the object identifiers of other objects that have the same feature data as a target object.
14. The apparatus according to any one of claims 11-13, wherein the association relationship determination module comprises:
a task obtaining unit configured to obtain a target task, wherein the target task comprises determining the association relationship between objects identified by object identifiers from different first data tables in the target data table; and
a task running unit configured to map, according to the target task, the target data table into a plurality of reduce tasks with the feature data as the primary key, and to complete the target task by running the plurality of reduce tasks in a distributed manner.
15. A computer-readable storage medium having a computer program stored therein, wherein the computer program, when executed by a processor, implements the data processing method across multiple data tables of any one of claims 1-7.
16. A computing device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, performs the data processing method across multiple data tables of any one of claims 1-7.
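
As an illustration only, the method claims can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the patented implementation: the function names (explode, join_on_feature, shared_feature_counts), the table names "users" and "devices", and the sample rows are all hypothetical. The join here keeps every feature value from every table, which is one possible reading of claim 1; the main-table variant of claim 3 would keep only feature values present in the chosen main table.

```python
from collections import defaultdict
from itertools import combinations


def explode(first_table):
    """Split each (object_id, [features]) row into one (object_id, feature) row per feature."""
    return [(obj_id, feat) for obj_id, feats in first_table for feat in feats]


def join_on_feature(second_tables):
    """Group the rows of all second tables by feature value (the join key).

    second_tables maps a (hypothetical) table name to its (object_id, feature) rows.
    Returns {feature: {table_name: [object_id, ...]}}, one entry per feature value.
    """
    target = defaultdict(lambda: defaultdict(list))
    for name, rows in second_tables.items():
        for obj_id, feat in rows:
            target[feat][name].append(obj_id)
    return {feat: dict(by_table) for feat, by_table in target.items()}


def shared_feature_counts(target):
    """Count the features shared by each pair of objects that come from different tables."""
    counts = defaultdict(int)
    for by_table in target.values():
        # Pair up object identifiers across every two distinct source tables.
        for (t1, ids1), (t2, ids2) in combinations(sorted(by_table.items()), 2):
            for a in ids1:
                for b in ids2:
                    counts[((t1, a), (t2, b))] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical first data tables: one object identifier plus several features per row.
    users = [("u1", ["13800000000", "a@example.com"]), ("u2", ["b@example.com"])]
    devices = [("d1", ["13800000000", "a@example.com"]),
               ("d2", ["b@example.com", "c@example.com"])]

    target = join_on_feature({"users": explode(users), "devices": explode(devices)})
    for feature, ids_by_table in sorted(target.items()):
        print(feature, ids_by_table)
    print(shared_feature_counts(target))
```

Looking up a single key in the result of join_on_feature corresponds to the first alternative of claim 6, and shared_feature_counts corresponds to claim 5. Because every row of the target table is keyed by one feature value, the same grouping could in principle be partitioned by that key and run as distributed tasks, in the spirit of claim 7; the sketch itself runs in a single process.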
CN201710866877.2A 2017-09-22 2017-09-22 Data processing method, device, medium and computing equipment across multiple data tables Active CN108268586B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710866877.2A CN108268586B (en) 2017-09-22 2017-09-22 Data processing method, device, medium and computing equipment across multiple data tables
PCT/CN2018/105090 WO2019056964A1 (en) 2017-09-22 2018-09-11 Cross-multiple-data table data processing method, device, medium and computing apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710866877.2A CN108268586B (en) 2017-09-22 2017-09-22 Data processing method, device, medium and computing equipment across multiple data tables

Publications (2)

Publication Number Publication Date
CN108268586A CN108268586A (en) 2018-07-10
CN108268586B true CN108268586B (en) 2020-06-16

Family

ID=62770935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710866877.2A Active CN108268586B (en) 2017-09-22 2017-09-22 Data processing method, device, medium and computing equipment across multiple data tables

Country Status (2)

Country Link
CN (1) CN108268586B (en)
WO (1) WO2019056964A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268586B (en) * 2017-09-22 2020-06-16 阿里巴巴(中国)有限公司 Data processing method, device, medium and computing equipment across multiple data tables
CN111221839A (en) * 2018-11-23 2020-06-02 北京京东金融科技控股有限公司 Data processing method, system, electronic device and computer readable storage medium
CN109558578A (en) * 2018-11-26 2019-04-02 成都四方伟业软件股份有限公司 Report conversion method and device
CN110457593B (en) * 2019-07-29 2022-03-04 平安科技(深圳)有限公司 Method and system for analyzing friend data of user and related equipment
CN111367914B (en) * 2020-03-04 2023-09-12 网易(杭州)网络有限公司 Data processing method, device, equipment and storage medium
CN111459937B (en) * 2020-03-27 2024-06-07 中国平安人寿保险股份有限公司 Data table association method, device, server and storage medium
CN111767265B (en) * 2020-05-14 2021-03-19 中邮消费金融有限公司 Data tilting method and system in connection operation and computer equipment
CN111708809B (en) * 2020-06-23 2024-05-03 中国平安财产保险股份有限公司 Associated query method, device, equipment and storage medium based on data inclination
CN112860955B (en) * 2021-03-01 2022-03-08 杨皓淳 Business data management system and method based on cloud computing and big data

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521303B (en) * 2011-11-30 2016-08-10 北京人大金仓信息技术股份有限公司 A kind of single-table multi-column sequence storage method for a column database
IN2013CH05424A (en) * 2013-11-26 2015-05-29 Inmobi Pte Ltd
US9436672B2 (en) * 2013-12-11 2016-09-06 Power Modes Pty. Ltd. Representing and manipulating hierarchical data
CN104090954B (en) * 2014-07-04 2019-02-05 用友网络科技股份有限公司 The connection method of meter reading and the connection system of meter reading
CN105094707B (en) * 2015-08-18 2018-03-13 华为技术有限公司 A kind of data storage, read method and device
CN106484740B (en) * 2015-09-01 2019-08-30 北京国双科技有限公司 A kind of tables of data connection method and device
CN106933919B (en) * 2015-12-31 2020-03-03 北京国双科技有限公司 Data table connection method and device
CN105701215B (en) * 2016-01-13 2019-03-22 北京中交兴路信息科技有限公司 Data connecting method and device based on Hadoop MapReduce
CN106021386B (en) * 2016-05-12 2019-02-05 西北工业大学 Non-equivalent connection method towards magnanimity distributed data
CN108268586B (en) * 2017-09-22 2020-06-16 阿里巴巴(中国)有限公司 Data processing method, device, medium and computing equipment across multiple data tables

Also Published As

Publication number Publication date
WO2019056964A1 (en) 2019-03-28
CN108268586A (en) 2018-07-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200417

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Alibaba (China) Co.,Ltd.

Address before: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping square B radio tower 13 layer self unit 01

Applicant before: GUANGZHOU SHENMA MOBILE INFORMATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant