CN108268586B

CN108268586B - Data processing method, device, medium and computing equipment across multiple data tables

Info

Publication number: CN108268586B
Application number: CN201710866877.2A
Authority: CN
Inventors: 李光明
Original assignee: Alibaba China Co Ltd
Current assignee: Alibaba China Co Ltd
Priority date: 2017-09-22
Filing date: 2017-09-22
Publication date: 2020-06-16
Anticipated expiration: 2037-09-22
Also published as: WO2019056964A1; CN108268586A

Abstract

The application provides a data processing method, device, medium and computing device across multiple data tables. The method comprises: acquiring a plurality of first data tables; converting each row of each of the first data tables into a sub data table, wherein each row of the sub data table comprises the object identifier and characteristic data of the object identified by the object identifier, and the sub data tables corresponding to a first data table form a second data table; and performing table connection on the second data tables corresponding to the first data tables, using the characteristic data in the second data tables as the connection key, to obtain a target data table. The method changes perspective according to actual service requirements and takes the characteristic data as the foothold: it associates the objects in the multiple data tables without introducing redundant data, and the work can be distributed to a plurality of Reducers for execution. It therefore offers higher data processing capacity and efficiency and is well suited to big data processing.

Description

Data processing method, device, medium and computing equipment across multiple data tables

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, medium, and computing device across multiple data tables.

Background

With the rapid development of the Internet and big data technology, data mining and analysis increasingly influence human activity. Big data makes it possible to analyze correlations between objects and determine the intrinsic associations between them, which in turn can improve users' quality of life through means such as interest-based recommendation.

When performing correlation analysis or recommendation, it is often necessary to analyze correlations between users across multiple data tables. A commonly used method at present is to combine the data tables into one data table by a Cartesian product and then analyze that table with MapReduce. In practice, however, as data volumes grow, operating on massive data tables through a Cartesian product becomes increasingly inefficient. For example, to count the number of identical articles read by users in two groups, this method requires taking the Cartesian product of the two groups' data tables and then analyzing the result with a Reducer (a worker that executes a reduce task). Because a Cartesian product has no connection key, only one Reducer can be used for the analysis task. When the data volume is large, the processing capacity of that single Reducer becomes the bottleneck, which can easily cause the Reducer task to produce incorrect results or even fail to complete.
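As a rough illustration of why the Cartesian-product baseline scales poorly, the following Python sketch pairs every user of one group with every user of the other and counts shared articles. The dictionary layout and the name `shared_counts_cartesian` are assumptions made for this sketch; they do not come from the application.

```python
from itertools import product

def shared_counts_cartesian(group_a, group_b):
    """Pair every user of group_a with every user of group_b and count shared articles.

    group_a, group_b: dict mapping user id -> set of article ids (illustrative layout).
    """
    counts = {}
    for (user_a, items_a), (user_b, items_b) in product(group_a.items(), group_b.items()):
        counts[(user_a, user_b)] = len(items_a & items_b)
    return counts

group_a = {"a1": {"x", "y"}, "a2": {"z"}}
group_b = {"b1": {"y", "z"}, "b2": {"x", "y"}, "b3": {"w"}}
counts = shared_counts_cartesian(group_a, group_b)
```

The number of pairs examined is the product of the two group sizes (here 2 x 3 = 6), including pairs that share nothing, and no key is available for dividing the work.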

To address the above problems, there are currently two solutions:

One solution: if one of the two tables in the Cartesian product is a small table (its data volume is much smaller than that of the other table), the small table can be loaded into memory to speed up the Cartesian product. An important limitation is that, because memory capacity is finite, the solution works only when one table is much smaller than the other. It is therefore clearly insufficient for processing large data tables.

The other solution: construct additional join keys (connection keys) so that a table join replaces the Cartesian product. Specifically, a join-key column is added to the small table and its entries are copied several times, each copy with a different join key; a join-key column is also added to the large table and filled with random numbers within the range of join-key values of the expanded small table. For example, suppose the small table holds 1 row and the large table holds 1000 rows. A join-key column is added to the small table with value 1, and the row is copied four more times with join keys 2 to 5. The join keys in the large table are random numbers between 1 and 5; suppose 200 rows receive each of the values 1, 2, 3, 4 and 5. When the two tables are then joined on the join key, 5 Reducers are generated and the large table's data is randomly divided into 5 parts (the expansion multiple of the small table), which solves the above problem. In essence, however, this solution is still a Cartesian product: it introduces redundant data and is relatively cumbersome and inefficient to implement in practice.
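The join-key construction described in the second solution can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `replicate_small`, `salt_large` and `salted_join`, the in-memory lists, and the fixed random seed are all hypothetical and are not taken from the prior-art solution itself.

```python
import random
from collections import defaultdict

def replicate_small(small_rows, factor):
    """Copy every small-table row `factor` times, one copy per join key 1..factor."""
    return [(key, row) for row in small_rows for key in range(1, factor + 1)]

def salt_large(large_rows, factor, rng):
    """Attach a random join key in 1..factor to every large-table row."""
    return [(rng.randint(1, factor), row) for row in large_rows]

def salted_join(small_rows, large_rows, factor, seed=0):
    """Equi-join on the artificial key; each key value could go to its own Reducer."""
    rng = random.Random(seed)
    by_key = defaultdict(list)
    for key, row in replicate_small(small_rows, factor):
        by_key[key].append(row)
    partitions = defaultdict(list)  # join key -> joined pairs handled by one Reducer
    for key, row in salt_large(large_rows, factor, rng):
        for small_row in by_key[key]:
            partitions[key].append((small_row, row))
    return partitions

partitions = salted_join(["s1"], [f"r{i}" for i in range(1000)], factor=5)
total_pairs = sum(len(pairs) for pairs in partitions.values())
```

The join still produces every pair of the 1 x 1000 Cartesian product, and the small table is stored five times, which is the redundancy the text criticizes.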

In summary, there is an urgent need for an efficient data processing method across multiple data tables, which is capable of handling large data.

Disclosure of Invention

The application provides a data processing method, device, medium and computing equipment spanning multiple data tables, so that big data can be efficiently processed.

In one aspect, the present application provides a data processing method across multiple data tables, including:

obtaining a plurality of first data tables, wherein each row of each first data table in the plurality of first data tables comprises an object identifier and a plurality of characteristic data of an object identified by the object identifier;

converting each row of each first data table in a plurality of first data tables into a sub data table, wherein each row of the sub data table comprises the object identifier and characteristic data of the object identified by the object identifier, and the sub data table corresponding to the first data table forms a second data table;

and performing table connection on the second data tables corresponding to each first data table by taking the feature data in the second data tables as a connection key to obtain a target data table, wherein each row in the target data table comprises one feature data and at least one object identifier corresponding to the feature data.

In some possible embodiments, the converting each row of each of the plurality of first data tables into one sub data table includes:

according to a plurality of characteristic data included in each line of the first data table, dividing each line of the first data table into sub data tables including a plurality of lines, wherein the number of the lines of the sub data tables is the same as the number of the plurality of characteristic data.

In some possible embodiments, the performing table join on the second data table corresponding to each first data table by using the feature data in the second data table as a join key to obtain a target data table includes:

and selecting one second data table from the second data tables corresponding to each first data table as a main table, using the rest second data tables as auxiliary tables, using the characteristic data in each second data table as a connecting key, and connecting the auxiliary table to the main table to obtain a target data table.

In some possible embodiments, the method further comprises:

and determining the association relation between the objects identified by the object identifications from different first data tables in the target data table by taking the characteristic data in the target data table as a basis.

In some possible embodiments, the determining the association relationship between the objects identified by the object identifiers from the different first data tables includes:

determining the number of the same characteristic data among the objects identified by a plurality of object identifications in the target data table, wherein the plurality of object identifications are respectively from a plurality of different first data tables; or

determining the objects, identified by object identifications from different first data tables, that correspond to target feature data in the target data table; or

determining the object identifications of other objects in the target data table that have the same characteristic data as a target object.

In some possible embodiments, the determining, based on the feature data in the target data table, an association relationship between objects identified by object identifiers from different first data tables in the target data table includes:

obtaining a target task, wherein the target task comprises: determining an association relation between objects identified by object identifications from different first data tables in the target data table;

and according to the target task, mapping the target data table into a plurality of reduce tasks by taking the characteristic data as the key, the plurality of reduce tasks completing the target task through distributed operation.

In another aspect, the present application provides a data processing apparatus across multiple data tables, comprising:

the acquisition module is used for acquiring a plurality of first data tables, wherein each row of each first data table in the plurality of first data tables comprises an object identifier and a plurality of characteristic data of an object identified by the object identifier;

the conversion module is used for converting each row of each first data table in a plurality of first data tables into a sub data table, each row of the sub data table comprises the object identifier and characteristic data of the object identified by the object identifier, and the sub data tables corresponding to the first data tables form a second data table;

and the connection module is used for performing table connection on the second data tables corresponding to each first data table by taking the feature data in the second data tables as connection keys to obtain a target data table, wherein each row in the target data table comprises one feature data and at least one object identifier corresponding to the feature data.

In some possible embodiments, the conversion module includes:

the data table splitting unit is configured to split each line of the first data table into sub data tables including multiple lines according to multiple pieces of feature data included in each line of the first data table, where the number of lines of the sub data tables is the same as the number of the multiple pieces of feature data.

In some possible embodiments, the connection module includes:

and the left connecting unit is used for selecting one second data table from the second data tables corresponding to each first data table as a main table, using the rest second data tables as auxiliary tables, using the characteristic data in each second data table as a connecting key, and connecting the auxiliary tables to the main table to obtain a target data table.

In some possible embodiments, the apparatus further comprises:

and the association relation determining module is used for determining the association relation between the objects identified by the object identifications from different first data tables in the target data table according to the characteristic data in the target data table.

In some possible embodiments, the association relation determining module includes:

the same characteristic quantity determining unit is used for determining the number of the same characteristic data among the objects identified by a plurality of object identifications in the target data table, wherein the plurality of object identifications are respectively from a plurality of different first data tables; or

the target characteristic association query unit is used for determining the objects, identified by object identifications from different first data tables, that correspond to target characteristic data in the target data table; or

the target object association query unit is used for determining the object identifications of other objects in the target data table that have the same characteristic data as a target object.

a task obtaining unit, configured to obtain a target task, where the target task includes: determining an association relation between objects identified by object identifications from different first data tables in the target data table;

and the task running unit is used for mapping, according to the target task, the target data table into a plurality of reduce tasks by taking the characteristic data as the key, the plurality of reduce tasks completing the target task through distributed operation.

In yet another aspect, the present application provides a computer-readable storage medium having a computer program stored therein, which when executed by a processor performs the data processing method across multiple data tables provided by the present application.

In yet another aspect, the present application provides a computing device comprising: a memory, a processor, and a computer program that is stored on the memory and can run on the processor, wherein the processor, when executing the computer program, performs the data processing method across multiple data tables provided by the present application.

According to the data processing method, device, medium and computing equipment provided by the present application, each row of each first data table is split into a sub data table, yielding a plurality of second data tables in which object identifications and feature data are in one-to-one correspondence. The second data tables are then joined using their feature data as the connection key, yielding a target data table in which the object identifications from different first data tables are integrated and associated with the feature data as the clue. Based on the target data table, the association relation among the objects identified by the object identifications from different first data tables can be determined with the feature data as the foothold, which facilitates further data analysis. Compared with the prior art, the application abandons the traditional approach that takes the object as the foothold and either applies a Cartesian product or introduces redundant data such as new connection keys. Instead, it changes perspective according to actual business requirements and takes the feature data as the foothold: the objects in the multiple data tables are associated through row splitting and table connection without introducing redundant data, and because the generated target data table has a connection key, it can be distributed to a plurality of Reducers for execution, giving higher data processing capacity and efficiency. The data processing method, apparatus, medium and computing device across multiple data tables provided by the present application are therefore well suited to big data processing.

Drawings

FIG. 1 is a flowchart of a data processing method across multiple data tables according to an embodiment of the present application;

FIG. 2 is a block diagram of a data processing apparatus across multiple data tables according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a computer-readable storage medium provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of an exemplary computing device according to an embodiment of the present application.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are merely for illustrating the technical solutions of the present application more clearly, and therefore are only examples, and the protection scope of the present application is not limited thereby.

It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.

In addition, the terms "first" and "second" are used to distinguish different objects, and are not used to describe a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

Hereinafter, some terms in the present application are explained to facilitate understanding by those skilled in the art.

Data table: also called a database table, a table stored in a database. It is one of the most important components of a database and stores data arranged in rows and columns.

Object: an entity that exists in the objective world, such as a person, thing or item; for example, users and products may all be referred to as objects.

Object identifier: identification data that represents an object in a data table, such as a user name, a user ID (identifier), or a product code.

Characteristic data: data describing one or more characteristics of an object. For a user, for example, height, hobbies, things the user has done, and things related to the user can all be regarded as characteristic data.

Table connection (join): the operation of horizontally joining multiple data tables into a new data table. Joins are divided into equi-joins and non-equi-joins. An equi-join uses the data in designated fields as connection keys and joins the rows whose key values are equal by comparing those values; a non-equi-join, as the term is used here, joins rows directly without comparing data.

Left connection (left join): a form of equi-join. One of the data tables is selected as the main table and the remaining data tables serve as slave tables; the main table is placed leftmost, and the slave tables are equi-joined to its right.

Cartesian product: also called the direct product. In database processing, it is the operation of pairing every row of each data table with every row of the others to generate a new data table. It is a non-equi-join, and the number of rows in the resulting table is the product of the row counts of the original tables.

"Plurality" means two or more. "And/or" describes the association relationship of the associated objects and indicates that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.

Embodiments of the present application are described below with reference to the drawings.

Fig. 1 is a flowchart of a data processing method across multiple data tables according to an embodiment of the present application. As shown in fig. 1, the data processing method across multiple data tables includes the following steps:

Step S101: obtaining a plurality of first data tables, wherein each row of each first data table in the plurality of first data tables comprises an object identifier and a plurality of characteristic data of an object identified by the object identifier.

Each line of data of the first data table describes a corresponding relationship between an object identifier of an object and a plurality of feature data of the object, and the plurality of feature data may be separated by using separators in the same column or may be located in different columns.

Step S102: converting each row of each first data table in the plurality of first data tables into a sub data table, wherein each row of the sub data table comprises the object identifier and characteristic data of the object identified by the object identifier, and the sub data table corresponding to the first data table forms a second data table.

In each row of the first data table, there is a one-to-many correspondence between the object identifier and the feature data, and this step S102 processes the first data table, so that there is a one-to-one correspondence between the object identifier and the feature data in each row of the processed second data table.

When each line of the first data table is converted into one sub data table, each line of the first data table may be split into sub data tables including a plurality of lines according to a plurality of feature data included in each line of the first data table, and the number of lines of the sub data tables is the same as the number of the plurality of feature data. And arranging the split sub data tables according to the sequence of the rows in the corresponding first data table to form a second data table.

In a specific implementation of step S102, each row of the first data table may be converted into a sub data table by row-column conversion of the data table. For example, in Hive (a Hadoop-based data warehouse tool that can map structured data files to database tables, provides simple SQL query functionality, and can convert SQL statements into MapReduce tasks for execution), the explode function may be used to split the plurality of feature data located in a row A into a plurality of rows, while LATERAL VIEW is used to copy the object identifier of row A into each split row so that it corresponds one-to-one with each feature data, thereby obtaining the sub data table.

It should be noted that different row-column conversion algorithms may place different requirements on the first data table to be processed, so the first data table may need to be preprocessed to meet the requirements of the chosen algorithm before the conversion is applied. For example, the explode function mentioned above requires that the plurality of feature data of each row be stored in the same cell in array form. Therefore, if the feature data of each row initially lie in different cells, preprocessing must merge them into the same cell and store them as an array; if they are stored in the same cell but not in array form, preprocessing must convert them into an array.
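The row splitting of step S102, together with the array preprocessing it may require, can be approximated in plain Python as below. This is a sketch of the idea rather than Hive code; the helper names `to_array` and `split_rows`, the tuple layout, and the comma delimiter are assumptions made for illustration.

```python
def to_array(item_ids):
    """Preprocessing: turn a delimiter-separated string into a list of feature values."""
    return [item.strip() for item in item_ids.split(",") if item.strip()]

def split_rows(first_table):
    """Turn each (object_id, "f1,f2,...") row into one (object_id, feature) row per feature."""
    second_table = []
    for object_id, item_ids in first_table:
        # Sub data table for this row: the object identifier is copied onto every split row.
        sub_table = [(object_id, feature) for feature in to_array(item_ids)]
        second_table.extend(sub_table)  # keeps the row order of the first data table
    return second_table

first_table = [("u1", "a1,a2,a3"), ("u2", "a2")]
second_table = split_rows(first_table)
```

Each row of `first_table` yields as many rows as it has feature values, so `second_table` holds one object identifier and one feature per row.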

Step S103: and performing table connection on the second data tables corresponding to each first data table by taking the feature data in the second data tables as a connection key to obtain a target data table, wherein each row in the target data table comprises one feature data and at least one object identifier corresponding to the feature data.

Compared with the original first data table, each row of the second data table generated in step S102 has only one feature data, so that the feature data can be used as a connection key to connect the plurality of second data tables in an equivalent manner to obtain a target data table, where each row of the target data table includes one feature data and at least one object identifier corresponding to the feature data, and the object identifiers are from different first data tables.

It is easy to understand that the target data table integrates and associates object identifiers from different first data tables together by taking feature data as a clue and a foothold, so that the association relationship between the object identifiers is clearer, and correlation problems between the objects can be analyzed more easily based on the target data table.

In practical applications, not every feature data is necessarily recorded in every first data table. For example, with 3 first data tables, feature data A may be recorded in only one of them, in which case feature data A reveals no correlation between objects identified by object identifiers from different first data tables. Feature data of this kind may therefore be filtered out, as required, during steps S101-S103. For example, on the basis of any implementation of the first embodiment of the present application, step S103 may include: selecting one second data table from the second data tables corresponding to the first data tables as a main table, using the remaining second data tables as auxiliary tables, and joining the auxiliary tables to the main table with the feature data in each second data table as the connection key, to obtain the target data table. Because the tables are joined by a left join in this embodiment, only the feature data in the main table and their corresponding object identifiers are kept in the target data table; feature data recorded only in an auxiliary table, together with their corresponding object identifiers, are filtered out. This avoids interference from unnecessary feature data in further analysis and also reduces the data volume of the target data table, improving overall data processing efficiency.

In the data processing method across multiple data tables provided in the embodiment of the application, each row of each first data table is split into a sub data table, yielding multiple second data tables in which the object identifiers and the feature data are in one-to-one correspondence. The second data tables are then joined using their feature data as the connection key, yielding a target data table in which the object identifiers from different first data tables are integrated and associated with the feature data as the clue. Based on the target data table, the association relationship between the objects identified by the object identifiers from different first data tables can be determined with the feature data as the foothold, which facilitates further data analysis. Compared with the prior art, the application abandons the traditional approach that takes the object as the foothold and either applies a Cartesian product or introduces redundant data such as new connection keys. Instead, it changes perspective according to actual business requirements and takes the feature data as the foothold: the objects in the multiple data tables are associated through row splitting and table connection without introducing redundant data, and because the generated target data table has a connection key, it can be distributed to a plurality of Reducers for execution, giving higher data processing capacity and efficiency. The data processing method across multiple data tables provided by the embodiment of the application is therefore well suited to big data processing.
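The left-join variant of step S103 can be sketched in Python as follows, for one main table and one auxiliary table. The function name `left_join_on_feature` and the `(object_id, feature)` tuple layout are assumptions made for this illustration, not part of the application.

```python
from collections import defaultdict

def left_join_on_feature(main_table, auxiliary_table):
    """Left join two second data tables of (object_id, feature) rows on the feature value.

    Returns rows (feature, main_object_id, auxiliary_object_id); the last field is
    None when no auxiliary row carries that feature.
    """
    aux_by_feature = defaultdict(list)
    for object_id, feature in auxiliary_table:
        aux_by_feature[feature].append(object_id)

    target_table = []
    for object_id, feature in main_table:
        matches = aux_by_feature.get(feature) or [None]
        for aux_id in matches:
            target_table.append((feature, object_id, aux_id))
    return target_table

main = [("u1", "a1"), ("u1", "a2")]
aux = [("v1", "a2"), ("v2", "a2"), ("v3", "a9")]
target = left_join_on_feature(main, aux)
```

In this sample, feature `a9` occurs only in the auxiliary table and therefore never reaches `target`, which mirrors the filtering effect described above.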

As described above, the target data table obtained in step S103 integrates and associates the object identifiers from different first data tables with the feature data as the clue and foothold, so that the association relationship between the object identifiers is clearer and correlation problems between the objects can be analyzed more easily. Therefore, on the basis of the embodiment shown in fig. 1, the data processing method across multiple data tables may further include: determining, on the basis of the feature data in the target data table, the association relationship between the objects identified by object identifiers from different first data tables in the target data table. The association relationship includes a statistical result of the correlation between objects determined through the feature data; it may be expressed as a quantity or in a form such as a list or text, and the embodiment of the present application does not limit its specific form.

Correspondingly, determining the association relationship between the objects identified by object identifiers from different first data tables in the target data table may include: determining the number of the same feature data among the objects identified by a plurality of object identifiers in the target data table, where the object identifiers come from different first data tables. An example is determining the number of identical articles read by pairs of users drawn from two groups.

As another example, determining the association relationship between objects identified by object identifiers from different first data tables may further include: determining the objects, identified by object identifiers from different first data tables, that correspond to target feature data in the target data table; or determining the object identifiers of other objects in the target data table that have the same feature data as a target object. Examples are finding the users in two groups who read the same article, or finding the list of other users who read the same articles as user A.
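The three kinds of association query described above can be illustrated on a small in-memory target data table. The row layout `(feature, object_id_from_table_1, object_id_from_table_2)`, the sample values, and the function names are assumptions made for this sketch; it also assumes each row of the target data table is unique.

```python
from collections import Counter

# Illustrative target data table: (feature, object_id_from_table_1, object_id_from_table_2)
target_table = [
    ("a1", "u1", "v1"),
    ("a2", "u1", "v1"),
    ("a2", "u1", "v2"),
    ("a3", "u2", "v2"),
]

def shared_feature_counts(rows):
    """Number of identical features for each pair of objects from the two tables."""
    return Counter((left_id, right_id) for _, left_id, right_id in rows)

def objects_for_feature(rows, feature):
    """Object identifiers from both tables that correspond to one target feature."""
    left = {left_id for f, left_id, _ in rows if f == feature}
    right = {right_id for f, _, right_id in rows if f == feature}
    return left, right

def peers_of(rows, object_id):
    """Objects in the other table sharing at least one feature with `object_id`."""
    return {right_id for _, left_id, right_id in rows if left_id == object_id}
```

For instance, `shared_feature_counts(target_table)` reports how many articles each user pair has in common, while `peers_of(target_table, "u1")` lists the users of the second table who read at least one article that `u1` read.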

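Because the target data table is obtained by an equi-join on a connection key, MapReduce may be used to analyze it through distributed operation. On the basis of any implementation of the embodiment of the present application, determining the association relationship between the objects identified by object identifiers from different first data tables in the target data table, on the basis of the feature data in the target data table, may include:

obtaining a target task, where the target task includes: determining the association relationship between the objects identified by object identifiers from different first data tables in the target data table;

and, according to the target task, mapping the target data table into a plurality of reduce tasks with the feature data as the key, the plurality of reduce tasks completing the target task through distributed operation.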

This embodiment can effectively improve data processing efficiency through distributed operation, and avoids the problem of a single task processor stalling because the computation exceeds its capacity when operating on big data.
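The idea of dividing the target data table among several reduce tasks by feature value can be simulated on a single machine as below. This is a minimal sketch, not a MapReduce program: the partitioning by `zlib.crc32`, the function names, and the choice of three reducers are assumptions made for illustration.

```python
import zlib
from collections import Counter, defaultdict

def partition_by_feature(target_table, num_reducers):
    """Shuffle step: rows with the same feature value land in the same partition."""
    partitions = defaultdict(list)
    for row in target_table:
        feature = row[0]
        reducer_index = zlib.crc32(feature.encode("utf-8")) % num_reducers
        partitions[reducer_index].append(row)
    return partitions

def reduce_partition(rows):
    """One reduce task: count shared features per object pair within its partition."""
    return Counter((left_id, right_id) for _, left_id, right_id in rows)

def run_target_task(target_table, num_reducers=3):
    """Run every reduce task independently, then merge the partial counts."""
    total = Counter()
    for rows in partition_by_feature(target_table, num_reducers).values():
        total.update(reduce_partition(rows))
    return total

target_table = [("a1", "u1", "v1"), ("a2", "u1", "v1"), ("a2", "u1", "v2"), ("a3", "u2", "v2")]
result = run_target_task(target_table)
```

Because each partition is processed without reference to the others, the per-partition work could run on separate Reducers; the merged counts do not depend on how the feature values happen to be distributed.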

Next, on the basis of the embodiment shown in fig. 1, the data processing method across multiple data tables provided in the present application is explained with reference to a specific example as follows:

In real life, there are many needs for correlation analysis. For example, a content publisher may analyze the common interests among users in order to recommend to associated users the articles, videos and other content that each of them likes; as another example, a friend-making website may analyze the common interests among users in order to recommend user B, who shares interests with user A, to user A as a potential friend; and so on.

Such correlation analysis requirements give rise to many correlation problems, for example counting the number of identical articles read by each pair of users across two groups of users. The following description takes this problem as an example.

In the prior art, on the one hand, a common method for performing correlation analysis is to combine multiple data tables into one data table through a Cartesian product and then analyze that table with MapReduce; on the other hand, Hive supports distributed operation with MapReduce well and has therefore become an important tool for big data analysis in the industry. However, because the Cartesian product is a non-equivalent connection (non-equi join), it is very difficult to convert into a MapReduce task for execution, and Hive's support for Cartesian product operations is weak: Hive can use only 1 Reducer to complete a task that has no connection key, and in practical applications changing the number of Reducers through Hive settings has no effect.

Take counting the number of identical articles read by each pair of users across two groups as an example. Assume that the two groups of users are stored in two data tables, seed_user and all_user, and that both tables include the fields user_id and item_ids. The field user_id stores the unique identifier of the user (i.e., the object identifier), and the field item_ids stores the list of IDs of the articles read by the user (i.e., the feature data); item_ids is string-type data in which multiple article IDs are separated by commas.

For ease of understanding, examples of the data tables seed_user and all_user are given in Table 1 and Table 2 below, respectively; each row of Table 1 and Table 2 describes the correspondence between a user_id and its item_ids:

TABLE 1

user_id	item_ids
A1	C1,C3,C5
A2	C2,C4,C5
A3	C3,C6,C7

TABLE 2

user_id	item_ids
B1	C1,C2,C6
B2	C3,C4,C7
B3	C1,C3,C4

With the correlation analysis method based on the Cartesian product, the processing is as follows:

1. Extract user_id, item_ids and the number length1 of article IDs in item_ids from the data table seed_user to generate data table t1;

2. Extract user_id, item_ids and the number length2 of article IDs in item_ids from the data table all_user to generate data table t2;

3. Compute the Cartesian product of data table t1 and data table t2 to obtain a Cartesian product table, shown in Table 3 below:

TABLE 3

seed_user_id	seed_item_ids	length1	all_user_id	all_item_ids	length2
A1	C1,C3,C5	3	B1	C1,C2,C6	3
A1	C1,C3,C5	3	B2	C3,C4,C7	3
A1	C1,C3,C5	3	B3	C1,C3,C4	3
A2	C2,C4,C5	3	B1	C1,C2,C6	3
A2	C2,C4,C5	3	B2	C3,C4,C7	3
A2	C2,C4,C5	3	B3	C1,C3,C4	3
A3	C3,C6,C7	3	B1	C1,C2,C6	3
A3	C3,C6,C7	3	B2	C3,C4,C7	3
A3	C3,C6,C7	3	B3	C1,C3,C4	3

4. Extract seed_user_id, all_user_id, the de-duplicated union a_item_ids of seed_item_ids and all_item_ids, length1 and length2 from the Cartesian product table to generate data table a, shown in Table 4 below:

TABLE 4

seed_user_id	all_user_id	a_item_ids	length1	length2
A1	B1	C1,C2,C3,C5,C6	3	3
A1	B2	C1,C3,C4,C5,C7	3	3
A1	B3	C1,C3,C4,C5	3	3
A2	B1	C1,C2,C4,C5,C6	3	3
A2	B2	C2,C3,C4,C5,C7	3	3
A2	B3	C1,C2,C3,C4,C5	3	3
A3	B1	C1,C2,C3,C6,C7	3	3
A3	B2	C3,C4,C6,C7	3	3
A3	B3	C1,C3,C4,C6,C7	3	3

5. Extract seed_user_id, all_user_id, the number length of article IDs in a_item_ids, length1 and length2 from data table a to generate data table t, shown in Table 5 below:

TABLE 5

seed_user_id	all_user_id	length	length1	length2
A1	B1	5	3	3
A1	B2	5	3	3
A1	B3	4	3	3
A2	B1	5	3	3
A2	B2	5	3	3
A2	B3	5	3	3
A3	B1	5	3	3
A3	B2	4	3	3
A3	B3	5	3	3

6. Extract the three fields seed_user_id, all_user_id and length1+length2-length from data table t and group by the two user IDs to obtain the number of identical articles read by each pair of seed_user_id and all_user_id, which equals length1+length2-length, as shown in Table 6 below:

TABLE 6

seed_user_id	all_user_id	length1+length2-length
A1	B1	1
A1	B2	1
A1	B3	2
A2	B1	1
A2	B2	1
A2	B3	1
A3	B1	1
A3	B2	2
A3	B3	1

In Hive, this can be executed with reference to the following execution statement:

It should be noted that the above execution statements are not completely consistent with the above example; those skilled in the art can implement the above execution statements.
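For readers who want to trace steps 1 to 6 by hand, the following Python sketch reproduces the arithmetic of the Cartesian-product approach on the sample data of Table 1 and Table 2. It is not the Hive statement referred to above; the function name cartesian_common_counts and the dictionary layout are assumptions made for illustration. The last step relies on the identity that the size of the intersection of two sets equals length1 + length2 minus the size of their union.

```python
from itertools import product

# Table 1 and Table 2 as {user_id: item_ids}; item_ids is a comma-separated string.
seed_user = {"A1": "C1,C3,C5", "A2": "C2,C4,C5", "A3": "C3,C6,C7"}
all_user = {"B1": "C1,C2,C6", "B2": "C3,C4,C7", "B3": "C1,C3,C4"}

def cartesian_common_counts(seed, others):
    """Pair every seed user with every other user and count shared articles."""
    result = {}
    for (sid, s_items), (oid, o_items) in product(seed.items(), others.items()):
        s_set = set(s_items.split(","))
        o_set = set(o_items.split(","))
        length1, length2 = len(s_set), len(o_set)   # steps 1 and 2
        length = len(s_set | o_set)                 # steps 4 and 5: size of the union
        result[(sid, oid)] = length1 + length2 - length  # step 6
    return result

counts = cartesian_common_counts(seed_user, all_user)
```

Every one of the 3 x 3 user pairs is materialized regardless of whether the two users share any article, which is the source of the data volume concern discussed in this description.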

From the above description of the correlation analysis method based on the Cartesian product, it can be seen that, on the one hand, redundant data such as length1, length2 and length are introduced during processing, which increases the computation load of the processor and reduces efficiency; on the other hand, the table generated by the Cartesian product is large, and for big data in particular it can be too large to process. Moreover, since the Cartesian product is a non-equivalent connection, it is difficult to execute as a distributed operation with MapReduce. In summary, this solution is inefficient and poorly suited to correlation analysis of big data.

With the method provided by the embodiment of the application, the processing is as follows:

1. The data tables seed_user and all_user are the first data tables. Each is split by item_ids to obtain the corresponding second data tables t1 and t2, shown in Table 7 and Table 8 below, respectively:

TABLE 7

seed_user_id	item_ids
A1	C1
A1	C3
A1	C5
A2	C2
A2	C4
A2	C5
A3	C3
A3	C6
A3	C7

TABLE 8

all_user_id	item_ids
B1	C1
B1	C2
B1	C6
B2	C3
B2	C4
B2	C7
B3	C1
B3	C3
B3	C4

2. Left-join the second data tables t1 and t2 using the field item_ids as the connection key to obtain target data table a, shown in Table 9 below:

TABLE 9

item_ids	seed_user_id	all_user_id
C1	A1	B1
C1	A1	B3
C2	A2	B1
C3	A1	B2
C3	A1	B3
C3	A3	B2
C3	A3	B3
C4	A2	B2
C4	A2	B3
C5	A1
C5	A2
C6	A3	B1
C7	A3	B2

3. Count item_ids for each combination of seed_user_id and all_user_id (i.e., group by both fields) to obtain count_num, the number of identical articles read by each pair of seed_user_id and all_user_id, as shown in Table 10 below:

TABLE 10

seed_user_id	all_user_id	count_num
A1	B1	1
A1	B2	1
A1	B3	2
A2	B1	1
A2	B2	1
A2	B3	1
A3	B1	1
A3	B2	2
A3	B3	1

In Hive, this can be executed with reference to the following execution statement:

It should be noted that the above execution statements are not completely consistent with the above example; they further involve preprocessing of the first data tables, for example using the split(item_ids, ',') function to convert the item_ids field into an array, which makes the row-to-column conversion with the explode function convenient. As another example, left outer join, where and similar clauses filter out unnecessary item_ids (articles not read by any user in the seed_user table are not considered), which speeds up task execution. Those skilled in the art may implement the above-described execution statements.
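The three steps above (splitting, left connection on item_ids, and grouped counting) can be traced with the following Python sketch on the sample data of Table 1 and Table 2. It is not the Hive statement referred to above; the function names explode, left_join_on_item and count_pairs are assumptions made for illustration, and None stands in for the empty all_user_id cells of Table 9.

```python
from collections import Counter, defaultdict

# First data tables as (user_id, item_ids) rows.
seed_user = [("A1", "C1,C3,C5"), ("A2", "C2,C4,C5"), ("A3", "C3,C6,C7")]
all_user = [("B1", "C1,C2,C6"), ("B2", "C3,C4,C7"), ("B3", "C1,C3,C4")]

def explode(table):
    """Step 1: one (user_id, item_id) row per article, as in Tables 7 and 8."""
    return [(uid, item) for uid, items in table for item in items.split(",")]

def left_join_on_item(t1, t2):
    """Step 2: left connection of t1 with t2 using item_id as the connection key."""
    index = defaultdict(list)
    for uid, item in t2:
        index[item].append(uid)
    joined = []
    for seed_uid, item in t1:
        # Keep every t1 row; use None when no t2 row shares the article.
        for all_uid in index.get(item) or [None]:
            joined.append((item, seed_uid, all_uid))
    return joined

def count_pairs(target):
    """Step 3: count articles per (seed_user_id, all_user_id) pair."""
    return Counter((s, a) for _, s, a in target if a is not None)

target = left_join_on_item(explode(seed_user), explode(all_user))
count_num = count_pairs(target)
```

The sketch emits rows in the order of t1 rather than sorted by item_ids, so the row order differs from Table 9 although the set of rows is the same.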

In addition, it should be noted that the above process may cause a data skew problem: a popular article may be read by many people, so its item_id carries a disproportionate number of rows. This problem can be solved by filtering out popular articles, for example, in the above example, filtering out item_ids whose audience exceeds a certain size, or the item_ids with the top-k audiences. Deleting this very small number of item_ids does not greatly affect the final statistical result, that is, the correlation analysis result.
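One way to picture the filtering described above is the following Python sketch, which drops popular articles from a second data table before the connection. The function name drop_hot_items, the parameters max_readers and top_k, and the sample rows are hypothetical; the thresholds would have to be chosen for the actual data.

```python
from collections import Counter

def drop_hot_items(rows, max_readers=None, top_k=None):
    """Remove (user_id, item_id) rows whose item_id is considered too popular.

    max_readers: drop items read by more than this many users (assumed threshold).
    top_k: drop the k most-read items (assumed cut-off).
    """
    readers = Counter(item for _, item in rows)
    hot = set()
    if max_readers is not None:
        hot |= {item for item, n in readers.items() if n > max_readers}
    if top_k is not None:
        hot |= {item for item, _ in readers.most_common(top_k)}
    return [(uid, item) for uid, item in rows if item not in hot]

# Hypothetical rows in which article C1 is read by every user.
rows = [("A1", "C1"), ("A2", "C1"), ("A3", "C1"), ("A1", "C2"), ("A2", "C3")]
by_threshold = drop_hot_items(rows, max_readers=2)
by_top_k = drop_hot_items(rows, top_k=1)
```

Applying the filter before the connection keeps any single connection key from producing an outsized share of the joined rows.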

From the above description, it can be seen that one of the technical ideas of the method provided by the embodiments of the present application is: when field B would otherwise have to be processed through a Cartesian product operation over field A, the operation is converted into a connection operation on field B. Based on this idea, on the one hand, the method provided by the embodiment of the application does not need to introduce redundant data, and avoids the increased computation load and reduced efficiency that redundant data causes in the prior art; on the other hand, the embodiment of the application changes the angle from which the problem is solved and analyzes the correlation between two groups of users with the article ID (i.e., the feature data), rather than the user ID, as the base point, so that the overall data processing has simpler steps and a smaller data volume, and data processing efficiency is effectively improved. In addition, in the embodiment of the application, item_ids is used as the connection key for an equivalent connection, so that MapReduce can be used for distributed operation in a specific implementation, the processing is not limited by the capability and efficiency of a single processor, and correlation analysis of big data can be performed efficiently.

It is easily understood that the target data table a generated in the above processing lists the detailed associations among seed_user_id, all_user_id and item_ids. Based on the target data table, besides counting the number of identical articles read by each pair of users, other correlation analyses can be performed, for example finding the users B1 and B3 who read the same article C1 as user A1, or finding the users A2 and B1 who both read article C2, and so on. The target data table obtained by the embodiment of the present application can therefore be used for more kinds of correlation analysis and has wider application.
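The two lookups mentioned above can be expressed directly against the rows of Table 9, as in the following Python sketch. The function names co_readers and readers_of are assumptions made for illustration, and None stands in for the empty all_user_id cells.

```python
# Target data table a (Table 9) as (item_id, seed_user_id, all_user_id) rows.
TARGET = [
    ("C1", "A1", "B1"), ("C1", "A1", "B3"), ("C2", "A2", "B1"),
    ("C3", "A1", "B2"), ("C3", "A1", "B3"), ("C3", "A3", "B2"),
    ("C3", "A3", "B3"), ("C4", "A2", "B2"), ("C4", "A2", "B3"),
    ("C5", "A1", None), ("C5", "A2", None),
    ("C6", "A3", "B1"), ("C7", "A3", "B2"),
]

def co_readers(target, seed_user_id, item_id):
    """all_user ids that read item_id together with the given seed user."""
    return sorted({a for item, s, a in target
                   if item == item_id and s == seed_user_id and a is not None})

def readers_of(target, item_id):
    """Every user id, from either first data table, that read item_id."""
    users = set()
    for item, s, a in target:
        if item == item_id:
            users.add(s)
            if a is not None:
                users.add(a)
    return sorted(users)

same_as_a1_on_c1 = co_readers(TARGET, "A1", "C1")
readers_of_c2 = readers_of(TARGET, "C2")
```

Both lookups are simple filters on the connection key, which is what makes the target data table reusable for analyses other than the pairwise count.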

Fig. 2 is a block diagram of a data processing apparatus across multiple data tables according to an embodiment of the present application. The data processing apparatus for multiple data tables provided in this embodiment and the data processing method for multiple data tables provided in the foregoing embodiment have the same inventive concept, and therefore, some contents are not described again, and please refer to the foregoing embodiment for understanding.

As shown in fig. 2, an embodiment of the present application provides a data processing apparatus 2 across multiple data tables, including:

an obtaining module 21, configured to obtain a plurality of first data tables, where each row of each first data table in the plurality of first data tables includes an object identifier and a plurality of feature data of an object identified by the object identifier;

a conversion module 22, configured to convert each row of each of the plurality of first data tables into a sub data table, where each row of the sub data table includes the object identifier and a feature data of the object identified by the object identifier, and the sub data table corresponding to the first data table forms a second data table;

the connection module 23 is configured to perform table connection on the second data tables corresponding to each first data table by using the feature data in the second data tables as a connection key to obtain a target data table, where each row in the target data table includes one feature data and at least one object identifier corresponding to the feature data.

Optionally, on the basis of any implementation manner of the embodiment of the present application, the conversion module 22 includes:

Considering that in practical applications not every item of feature data is necessarily recorded in every first data table (for example, with 3 first data tables, feature data a may be recorded in only one of them, so that feature data a reveals no correlation between objects identified by object identifiers from different first data tables), the feature data may also be filtered according to actual requirements. On the basis of any embodiment of the present application, the connection module 23 includes:

In the embodiment, because the table connection is performed in a left connection mode, only the feature data in the main table and the object identifiers corresponding to the feature data can be stored in the target data table, so that part of the feature data which are not recorded in the main table but are only recorded in the auxiliary table and the object identifiers corresponding to the feature data can be filtered, on one hand, the interference of unnecessary feature data on further analysis can be avoided, on the other hand, the data volume of the target data table is reduced, and the overall data processing efficiency is improved.

The target data table output by the connection module 23 integrates and associates object identifiers from different first data tables together by using feature data as a clue and a foothold, so that the association relationship between the object identifiers is clearer, and based on the target data table, the correlation problem between the objects can be analyzed more easily, and on the basis of any implementation manner of the embodiment of the present application, the apparatus further includes:

On the basis of any implementation manner of the embodiment of the present application, the association relation determining module includes:

Optionally, on the basis of any implementation manner of the embodiment of the present application, the association relationship determining module may further include:

Because the target data table is obtained by performing equivalence connection through a connection key, data analysis may be performed through distributed operation using MapReduce, and on the basis of any implementation manner of the embodiment of the present application, the association relationship determining module includes:

Based on this embodiment, data processing efficiency can be effectively improved through distributed operation, and the problem of a single task processor stalling because the computation load overflows when processing large data can be avoided.

The data processing device of the cross-multiple data tables provided by the embodiment of the application and the data processing method of the cross-multiple data tables provided by the embodiment of the application have the same inventive concept and have the same beneficial effects.

Fig. 3 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present application. As shown in fig. 3, a computer program is stored in the computer-readable storage medium 3, and when being executed by a processor, the computer program can implement the data processing method across multiple data tables provided in the present application.

The computer-readable storage medium 3 includes permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

The computer-readable storage medium provided by the embodiment of the application and the data processing method across multiple data tables provided by the embodiment of the application have the same beneficial effects from the same inventive concept.

Fig. 4 is a schematic diagram of a computing device according to an embodiment of the present application. As shown in fig. 4, the computing device 4 includes: a processor 40, a memory 41, a bus 42 and a communication interface 43, wherein the processor 40, the communication interface 43 and the memory 41 are connected through the bus 42; the memory 41 stores a computer program that can be executed on the processor 40, and the processor 40 executes the data processing method across multiple data tables provided in the present application when executing the computer program.

The Memory 41 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 43 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, etc. may be used.

The bus 42 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The memory 41 is configured to store a program, and the processor 40 executes the program after receiving an execution instruction, and the data processing method across multiple data tables disclosed in any embodiment of the present application may be applied to the processor 40, or implemented by the processor 40.

The processor 40 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 40. The Processor 40 may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. It may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory 41, and the processor 40 reads the information in the memory 41 and completes the steps of the method in combination with its hardware.

The computing device provided by the embodiment of the application and the data processing method across multiple data tables provided by the embodiment of the application have the same beneficial effects due to the same inventive concept.

It should be noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed computing device, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program code.

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present disclosure, and the present disclosure should be construed as being covered by the claims and the specification.

Claims

1. A method for processing data across multiple data tables, comprising:

2. The method of claim 1, wherein converting each row of each of the plurality of first data tables into a sub-data table comprises:

3. The method according to claim 1, wherein performing table join on the second data tables corresponding to each first data table by using the feature data in the second data tables as a join key to obtain a target data table, includes:

4. The method of claim 1, further comprising:

5. The method of claim 4, wherein determining the association between objects identified by object identifications from different first data tables comprises:

6. The method of claim 4, wherein determining the association between objects identified by object identifications from different first data tables comprises:

7. The method according to any one of claims 4 to 6, wherein the determining the association relationship between the objects identified by the object identifiers from different first data tables in the target data table based on the feature data in the target data table comprises:

8. A data processing apparatus across multiple data tables, comprising:

9. The apparatus of claim 8, wherein the conversion module comprises:

10. The apparatus of claim 8, wherein the connection module comprises:

11. The apparatus of claim 8, further comprising:

12. The apparatus of claim 11, wherein the association determination module comprises:

13. The apparatus of claim 11, wherein the association determination module comprises:

14. The apparatus according to any one of claims 11-13, wherein the association determination module comprises:

15. A computer-readable storage medium having a computer program stored therein, the computer program, when executed by a processor, performing the method of data processing across multiple data tables of any of claims 1-7.

16. A computing device, comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor executes the computer program to perform the data processing method across multiple data tables according to any of claims 1-7.