CN111694876A - Method and device for realizing ID mapping based on Spark framework - Google Patents

Info

Publication number
CN111694876A
Authority
CN
China
Prior art keywords
aggregation
initial number
subset
subsets
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910199055.2A
Other languages
Chinese (zh)
Inventor
赵林
马征
王斌峰
李晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd
Priority to CN201910199055.2A
Publication of CN111694876A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0251 Targeted advertisements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0273 Determination of fees for advertising
    • G06Q30/0275 Auctions

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Theoretical Computer Science (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for implementing ID mapping based on the Spark framework. The method mainly comprises the following steps: preprocessing a two-dimensional ID relation table to obtain an initial number-ID pair relation table; taking the ID as a key, splitting and aggregating the initial number-ID pair relation table to obtain a plurality of initial number primary aggregation subsets; taking the initial number as a key, splitting and aggregating the plurality of initial number primary aggregation subsets to obtain an initial number aggregation subset result; numbering the initial number aggregation subset result with uniform identifiers to obtain a uniform identifier-initial number aggregation subset relation table; and obtaining the correspondence between the uniform identifiers and the IDs from the uniform identifier-initial number aggregation subset relation table and the initial number-ID pair relation table, thereby achieving a uniform representation of the IDs. The invention implements storage, filtering, splitting, aggregation and similar operations on massive user data sets, and improves the efficiency, accuracy and reliability of the ID mapping algorithm.

Description

Method and device for realizing ID mapping based on Spark framework
Technical Field
The invention relates to the technical field of big data, in particular to a method and a device for realizing ID mapping based on a Spark framework, a computer storage medium and computing equipment.
Background
ID Mapping is a basic and critical technology in the field of big data. In short, ID Mapping uses technical means to identify several pieces of data from different sources as belonging to the same user or subject. For example, a user, Zhang San, uses the AA mobile phone assistant on a first mobile phone, uses Baidu Maps on a second mobile phone, watches iQiyi videos on a tablet computer, and uses the AA browser on a personal computer. The first mobile phone, the second mobile phone and the tablet computer often share the same Wi-Fi, and the second mobile phone is often connected to the personal computer through a data cable. How to determine, from the behavior on these 4 devices and the relations among them, that the 4 objects are the same user is the main problem that ID Mapping has to solve.
ID Mapping has wide application scenarios and commercial value. The behavior information and attribute data of a user are scattered across data from many different sources, and analyzing data from a single source reveals only part of the user's characteristics. ID Mapping can connect these fragmented partial features to provide a complete user profile. For example, based on Zhang San's behavior of watching iQiyi videos on the tablet computer in the example above, the iQiyi app and related movies can be recommended to the user on the mobile phone. As another example, an important step in a programmatic advertising transaction is to match the user currently requesting an advertisement with that user's historical interest data in the first-party DMP (Data Management Platform). Without ID Mapping, the programmatic transaction is blind, and real-time bidding and accurate delivery cannot be achieved.
At present, the main technical bottleneck of ID Mapping is that a mapping result cannot be obtained quickly and accurately when massive user data is processed. Because the number of user IDs is large and certain relations exist among different IDs, the IDs and their relations form a complex ID network. How to extract sub-networks from this complex ID network, and then separate or process them effectively so as to obtain reliable sub-networks, is the main engineering problem in ID Mapping. Therefore, a method for implementing ID Mapping efficiently and reliably is needed.
Disclosure of Invention
In view of the above, the present invention has been made to provide a method and apparatus, a computer storage medium, and a computing device for implementing ID mapping based on a Spark framework, which overcome or at least partially solve the above-mentioned problems.
According to an aspect of the embodiments of the present invention, a method for implementing ID mapping based on a Spark framework is provided, including:
step S1: acquiring a two-dimensional ID relation table comprising a plurality of ID pairs, numbering each ID pair, and acquiring an initial number-ID pair relation table;
step S2: taking the ID as a key, splitting and aggregating the initial number-ID pair relation table to obtain a plurality of initial number primary aggregation subsets, wherein each initial number primary aggregation subset is composed of the initial numbers;
step S3: taking the initial number as a key, splitting and aggregating the plurality of initial number primary aggregation subsets to obtain an initial number aggregation subset result, wherein no two initial number aggregation subsets in the initial number aggregation subset result intersect; numbering the initial number aggregation subset result with uniform identifiers to obtain a uniform identifier-initial number aggregation subset relation table;
step S4: obtaining the correspondence between the uniform identifiers and the IDs from the uniform identifier-initial number aggregation subset relation table and the initial number-ID pair relation table, thereby achieving a uniform representation of the IDs.
Optionally, taking the initial number as a key, splitting and aggregating the plurality of initial number primary aggregation subsets to obtain an initial number aggregation subset result includes:
taking the initial number as a key and the plurality of initial number primary aggregation subsets as the splitting and aggregation objects of the first iteration, performing iterative splitting and aggregation to obtain the initial number aggregation subset result; in each iteration, outputting the initial number aggregation subsets that are obtained after aggregation and have no intersection with other initial number aggregation subsets, and taking the remaining initial number aggregation subsets as the splitting and aggregation objects of the next iteration; when, in some iteration, the remaining initial number aggregation subsets can no longer be aggregated with one another, outputting the remaining initial number aggregation subsets and terminating the iteration; and integrating the initial number aggregation subsets output in each iteration to obtain the initial number aggregation subset result.
Optionally, step S2 specifically includes:
splitting the initial number-ID pair relation table into an initial number-ID relation table;
aggregating the initial number-ID relation table by taking the ID as a key to obtain a plurality of initial number primary aggregation subsets, wherein each initial number primary aggregation subset is composed of the initial numbers;
and renumbering the initial number primary aggregation subsets to obtain a secondary number-initial number primary aggregation subset relation table.
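The three sub-steps of step S2 above can be illustrated with a minimal sketch. It uses plain Python collections in place of Spark RDDs, and the function names and sample IDs are hypothetical, introduced only for illustration.

```python
from collections import defaultdict

def split_pairs(numbered_pairs):
    """Split (initial number, (id_a, id_b)) rows into (ID, initial number) rows."""
    rows = []
    for number, (id_a, id_b) in numbered_pairs:
        rows.append((id_a, number))
        rows.append((id_b, number))
    return rows

def aggregate_by_id(rows):
    """Group initial numbers by ID; each group is one primary aggregation subset."""
    groups = defaultdict(set)
    for id_value, number in rows:
        groups[id_value].add(number)
    return dict(groups)

def renumber(subsets):
    """Attach a secondary number to each primary aggregation subset."""
    return list(enumerate(subsets))

# Hypothetical sample data: initial number -> ID pair.
numbered_pairs = [
    (0, ("phone_1", "wifi_A")),
    (1, ("phone_2", "wifi_A")),
    (2, ("tablet", "pc")),
]
primary = aggregate_by_id(split_pairs(numbered_pairs))
# Sort only to make the secondary numbering deterministic in this sketch.
secondary_table = renumber(sorted(primary.values(), key=sorted))
```

Here the shared ID `wifi_A` yields the subset `{0, 1}`, which is what later links pairs 0 and 1 into one group.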
Optionally, after the aggregation is performed to obtain a plurality of initial number primary aggregation subsets, step S2 further includes:
judging whether each of the plurality of initial number primary aggregation subsets is an isolated subset that has no intersection with the other initial number primary aggregation subsets;
if so, outputting that initial number primary aggregation subset as a target initial number aggregation subset, and renumbering the remaining initial number primary aggregation subsets.
Optionally, judging whether each of the plurality of initial number primary aggregation subsets is an isolated subset that has no intersection with the other initial number primary aggregation subsets includes:
counting the number of occurrences and the number of contained elements of each initial number primary aggregation subset;
judging an initial number primary aggregation subset whose number of occurrences is 2 and whose number of elements is 1 to be an isolated subset.
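One way to read the counting rule: each ID pair contributes its initial number to the subsets of both of its IDs, so if neither ID occurs in any other pair, the single-element subset for that number appears exactly twice. The following is a minimal plain-Python sketch under that reading; `find_isolated` and the sample subsets are hypothetical.

```python
from collections import Counter

def find_isolated(primary_subsets):
    """Return the single-element subsets that occur exactly twice."""
    counts = Counter(frozenset(s) for s in primary_subsets)
    return {s for s, n in counts.items() if n == 2 and len(s) == 1}

# Hypothetical primary aggregation subsets, one per distinct ID.
primary_subsets = [{0}, {0, 1}, {1}, {2}, {2}]
isolated = find_isolated(primary_subsets)
# Subsets that still need number relation aggregation in step S3.
remaining = [s for s in primary_subsets if frozenset(s) not in isolated]
```

In this sample, `{2}` is output directly, while `{0}` and `{1}` each occur only once and stay with `{0, 1}` for further aggregation.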
Optionally, step S3 specifically includes:
step S31: splitting the secondary number-initial number primary aggregation subset relation table into a secondary number-initial number-primary aggregation subset relation table;
step S32: aggregating the secondary number-initial number-primary aggregation subset relation table by taking the initial number as a key to obtain one or more initial number secondary aggregation subsets;
step S33: filtering and outputting the initial number secondary aggregation subsets without intersection with other initial number secondary aggregation subsets as target initial number aggregation subsets;
step S34: deduplicating the remaining initial number secondary aggregation subsets and renumbering them to obtain a tertiary number-initial number secondary aggregation subset relation table;
step S35: repeating steps S31 to S34 for n iterations until the number of remaining initial number (n+2)-th aggregation subsets is 0 or 1, and outputting the remaining initial number (n+2)-th aggregation subsets as target initial number aggregation subsets, wherein n is a natural number;
step S36: integrating the target initial number aggregation subsets output in the preceding steps to obtain the initial number aggregation subset result, and numbering the initial number aggregation subset result with uniform identifiers to obtain a uniform identifier-initial number aggregation subset relation table.
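The iteration in steps S31 to S36 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the subsets themselves serve as keys, so explicit secondary and tertiary numbers are omitted, and the isolation test is a conservative check chosen for this sketch (a merged subset is output only when every initial number in it maps to that same merged subset) rather than the element-count rule described in the text. All function names are hypothetical.

```python
def aggregate_round(subsets):
    """One split-and-aggregate round keyed by initial number."""
    # S31/S32: split each subset by initial number, then merge all
    # subsets that share that initial number.
    merged_by_number = {}
    for subset in subsets:
        for number in subset:
            merged_by_number[number] = merged_by_number.get(number, frozenset()) | subset
    # S34 (deduplication): identical merged subsets collapse in a set.
    candidates = set(merged_by_number.values())
    # S33 (assumed check): a candidate is isolated when all of its
    # members agree on it, so no other subset can intersect it.
    isolated = {c for c in candidates
                if all(merged_by_number[n] == c for n in c)}
    return isolated, candidates - isolated

def aggregate_numbers(primary_subsets):
    """Iterate rounds (S35) and number the results (S36)."""
    remaining = {frozenset(s) for s in primary_subsets}
    results = set()
    rounds = 0
    while remaining:
        isolated, remaining = aggregate_round(remaining)
        results |= isolated
        rounds += 1
    # Uniform identifiers are assigned in order of smallest initial number.
    return list(enumerate(sorted(results, key=min))), rounds

# Hypothetical primary subsets for two chains of ID pairs.
primary_subsets = [{0}, {0, 1}, {1, 2}, {2, 3}, {3}, {4}, {4, 5}, {5}]
uid_table, rounds = aggregate_numbers(primary_subsets)
```

In this sample the short chain `{4, 5}` is output after the first round, so only the longer chain is carried into later rounds, which mirrors the reduction of remaining subsets that the iteration is meant to achieve.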
Optionally, step S32 specifically includes:
aggregating the secondary number-initial number-primary aggregation subset relation table by taking the initial number as a key to obtain one or more initial number secondary aggregation subsets and corresponding secondary number aggregation subsets, wherein each secondary number aggregation subset is composed of secondary numbers, and each initial number secondary aggregation subset is formed by merging the initial number primary aggregation subsets corresponding to the secondary numbers in the corresponding secondary number aggregation subset;
step S33 specifically includes:
judging whether each of the one or more secondary number aggregation subsets is an isolated subset that has no intersection with the other secondary number aggregation subsets, and if so, deduplicating the initial number secondary aggregation subset corresponding to that secondary number aggregation subset and outputting it as a target initial number aggregation subset.
Optionally, determining whether each subset of the one or more secondary numbered aggregated subsets is an isolated subset that does not intersect with other secondary numbered aggregated subsets comprises:
counting the number of elements contained in each secondary number aggregation subset;
and judging a secondary number aggregation subset whose number of elements is 1 to be an isolated subset.
Optionally, after counting the number of elements contained in each secondary number aggregation subset, step S33 further includes:
judging whether the number of elements contained in each secondary number aggregation subset is greater than a given threshold value or not;
and if so, performing duplicate removal on the initial number secondary aggregation subset corresponding to the secondary number aggregation subset and outputting the initial number secondary aggregation subset as a target initial number aggregation subset.
Optionally, step S4 specifically includes:
splitting the uniform identifier-initial number aggregation subset relation table into an initial number-uniform identifier relation table;
obtaining a uniform identifier-ID relation table according to the initial number-uniform identifier relation table and the initial number-ID pair relation table;
and aggregating the uniform identifier-ID relation table by taking the uniform identifier as a key to obtain a uniform representation table of the IDs.
Optionally, obtaining a uniform identifier-ID relation table according to the initial number-uniform identifier relation table and the initial number-ID pair relation table includes:
executing a leftOuterJoin command on the initial number-uniform identifier relation table and the initial number-ID pair relation table, and separating the uniform identifier and the IDs through a map command to obtain the uniform identifier-ID relation table.
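The split, join, map and aggregate sequence of step S4 can be sketched as below. The sketch emulates a left outer join with plain Python lists rather than calling Spark, with the initial number-uniform identifier table as the left side; the function names and sample data are hypothetical.

```python
from collections import defaultdict

def split_by_number(uid_table):
    """(uniform id, subset) rows -> (initial number, uniform id) rows."""
    return [(number, uid) for uid, subset in uid_table for number in sorted(subset)]

def left_outer_join(left_rows, right_rows):
    """Join on the key; left rows without a match get None on the right."""
    right = defaultdict(list)
    for key, value in right_rows:
        right[key].append(value)
    joined = []
    for key, left_value in left_rows:
        for right_value in right.get(key) or [None]:
            joined.append((key, (left_value, right_value)))
    return joined

def to_uid_id_rows(joined):
    """Map step: drop the initial number, emit one (uniform id, ID) row per ID."""
    return [(uid, id_value)
            for _, (uid, pair) in joined if pair is not None
            for id_value in pair]

def aggregate_by_uid(rows):
    """Group IDs by uniform identifier."""
    groups = defaultdict(set)
    for uid, id_value in rows:
        groups[uid].add(id_value)
    return dict(groups)

# Hypothetical inputs from the earlier steps.
uid_table = [(0, {0, 1}), (1, {2})]
numbered_pairs = [
    (0, ("phone_1", "wifi_A")),
    (1, ("phone_2", "wifi_A")),
    (2, ("tablet", "pc")),
]
joined = left_outer_join(split_by_number(uid_table), numbered_pairs)
uniform_table = aggregate_by_uid(to_uid_id_rows(joined))
```

The resulting table maps each uniform identifier to the full set of IDs judged to belong to the same user.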
Optionally, obtaining a two-dimensional ID relationship table including a plurality of ID pairs includes:
and integrating a plurality of two-dimensional ID relation source data tables into the two-dimensional ID relation table comprising a plurality of ID pairs.
Optionally, the numbering operation is performed by a zipWithUniqueId command.
Optionally, the aggregation operation is performed by a reduceByKey command.
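To make the behavior of these two commands concrete, the following plain-Python sketch emulates them on in-memory lists. It assumes the numbering scheme documented for Spark's zipWithUniqueId, in which items in the k-th of n partitions receive the ids k, n+k, 2n+k and so on, and the usual reduceByKey semantics of merging the values of each key with a binary function. The helper names are hypothetical and nothing here calls Spark itself.

```python
def zip_with_unique_id(partitions):
    """Emulate zipWithUniqueId: item i of partition k gets id k + i * n."""
    n = len(partitions)
    return [[(item, k + i * n) for i, item in enumerate(part)]
            for k, part in enumerate(partitions)]

def reduce_by_key(rows, func):
    """Emulate reduceByKey: merge the values of each key with func."""
    out = {}
    for key, value in rows:
        out[key] = func(out[key], value) if key in out else value
    return out

# Hypothetical ID pairs spread over two partitions.
partitions = [[("a", "b"), ("b", "c")], [("d", "e")]]
numbered = zip_with_unique_id(partitions)

# Split each numbered pair into (ID, {initial number}) rows, then aggregate.
rows = [(id_value, {number})
        for part in numbered
        for pair, number in part
        for id_value in pair]
subsets = reduce_by_key(rows, lambda left, right: left | right)
```

The ids are unique but not contiguous across partitions in general, which is sufficient here because the initial numbers only need to distinguish ID pairs.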
According to another aspect of the embodiments of the present invention, there is also provided an apparatus for implementing ID mapping based on a Spark framework, including:
the data preprocessing module is suitable for acquiring a two-dimensional ID relation table comprising a plurality of ID pairs, numbering each ID pair and acquiring an initial number-ID pair relation table;
the ID relation aggregation module is suitable for splitting and aggregating the initial number-ID pair relation table by taking the ID as a key to obtain a plurality of initial number primary aggregation subsets, wherein each initial number primary aggregation subset is composed of the initial numbers;
the number relation aggregation module is suitable for splitting and aggregating the plurality of initial number primary aggregation subsets by taking the initial number as a key to obtain an initial number aggregation subset result, wherein no two initial number aggregation subsets in the initial number aggregation subset result intersect; and numbering the initial number aggregation subset result with uniform identifiers to obtain a uniform identifier-initial number aggregation subset relation table; and
and the ID unified representation module is suitable for obtaining the corresponding relation between the unified identifier and the ID according to the unified identifier-initial number aggregation subset relation table and the initial number-ID pair relation table, so as to realize the unified representation of the ID.
Optionally, the numbering relationship aggregation module is further adapted to:
taking the initial number as a key and the plurality of initial number primary aggregation subsets as the splitting and aggregation objects of the first iteration, performing iterative splitting and aggregation to obtain the initial number aggregation subset result; in each iteration, outputting the initial number aggregation subsets that are obtained after aggregation and have no intersection with other initial number aggregation subsets, and taking the remaining initial number aggregation subsets as the splitting and aggregation objects of the next iteration; when, in some iteration, the remaining initial number aggregation subsets can no longer be aggregated with one another, outputting the remaining initial number aggregation subsets and terminating the iteration; and integrating the initial number aggregation subsets output in each iteration to obtain the initial number aggregation subset result.
Optionally, the ID relationship aggregation module includes:
a first splitting unit adapted to split the initial number-ID pair relationship table into an initial number-ID relationship table;
the first aggregation unit is suitable for aggregating the initial number-ID relation table by taking the ID as a key to obtain a plurality of initial number primary aggregation subsets, wherein each initial number primary aggregation subset is composed of the initial numbers; and
the first numbering unit is suitable for renumbering the initial number primary aggregation subsets to obtain a secondary number-initial number primary aggregation subset relation table.
Optionally, the ID relationship aggregation module further includes:
the first filtering output unit is suitable for judging, after the first aggregation unit performs aggregation to obtain the plurality of initial number primary aggregation subsets, whether each of the plurality of initial number primary aggregation subsets is an isolated subset that has no intersection with the other initial number primary aggregation subsets;
if so, outputting that initial number primary aggregation subset as a target initial number aggregation subset, and triggering the first numbering unit to renumber the remaining initial number primary aggregation subsets.
Optionally, the first filtering output unit is further adapted to:
count the number of occurrences and the number of contained elements of each initial number primary aggregation subset;
and judge an initial number primary aggregation subset whose number of occurrences is 2 and whose number of elements is 1 to be an isolated subset.
Optionally, the numbering relationship aggregation module includes:
the second splitting unit is suitable for splitting the secondary number-initial number primary aggregation subset relation table into a secondary number-initial number-primary aggregation subset relation table;
the second aggregation unit is suitable for aggregating the secondary number-initial number primary aggregation subset relation table by taking the initial number as a key to obtain one or more initial number secondary aggregation subsets;
the second filtering output unit is suitable for filtering and outputting the initial number secondary aggregation subsets without intersection with other initial number secondary aggregation subsets as target initial number aggregation subsets;
the second numbering unit is suitable for renumbering the remaining initial number secondary aggregation subsets to obtain a tertiary number-initial number secondary aggregation subset relation table;
the iteration unit is suitable for triggering the second splitting unit, the second aggregation unit, the second filtering output unit and the second numbering unit to perform n iterations in the same manner until the number of remaining initial number (n+2)-th aggregation subsets is 0 or 1, and outputting the remaining initial number (n+2)-th aggregation subsets as target initial number aggregation subsets, wherein n is a natural number; and
and the output result integration unit is suitable for integrating the output target initial number aggregation subset to obtain an initial number aggregation subset result, and numbering the initial number aggregation subset result by using the uniform identifier to obtain a relationship table of uniform identifier-initial number aggregation subset.
Optionally, the second aggregation unit is further adapted to:
aggregate the secondary number-initial number-primary aggregation subset relation table by taking the initial number as a key to obtain one or more initial number secondary aggregation subsets and corresponding secondary number aggregation subsets, wherein each secondary number aggregation subset is composed of secondary numbers, and each initial number secondary aggregation subset is formed by merging the initial number primary aggregation subsets corresponding to the secondary numbers in the corresponding secondary number aggregation subset;
the second filtering output unit is further adapted to:
judge whether each of the one or more secondary number aggregation subsets is an isolated subset that has no intersection with the other secondary number aggregation subsets, and if so, deduplicate the initial number secondary aggregation subset corresponding to that secondary number aggregation subset and output it as a target initial number aggregation subset.
Optionally, the second filtering output unit is further adapted to:
counting the number of elements contained in each secondary number aggregation subset;
and judging the secondary number aggregation subset with the element number of 1 as an isolated subset.
Optionally, the second filtering output unit is further adapted to:
after counting the number of elements contained in each secondary number aggregation subset, judge whether the number of elements contained in each secondary number aggregation subset is greater than a given threshold;
and if so, performing duplicate removal on the initial number secondary aggregation subset corresponding to the secondary number aggregation subset and outputting the initial number secondary aggregation subset as a target initial number aggregation subset.
Optionally, the ID uniform representation module includes:
a third splitting unit adapted to split the unified identifier-initial number aggregation subset relation table into an initial number-unified identifier relation table;
the relation connection unit is suitable for obtaining a uniform identifier-ID relation table according to the initial number-uniform identifier relation table and the initial number-ID pair relation table; and
and the third aggregation unit is suitable for aggregating the uniform identifier-ID relation table by taking the uniform identifier as a key to obtain a uniform representation table of the ID.
Optionally, the relationship connection unit is further adapted to:
execute a leftOuterJoin command on the initial number-uniform identifier relation table and the initial number-ID pair relation table, and separate the uniform identifier and the IDs through a map command to obtain the uniform identifier-ID relation table.
Optionally, the data preprocessing module is further adapted to:
and integrating a plurality of two-dimensional ID relation source data tables into the two-dimensional ID relation table comprising a plurality of ID pairs.
Optionally, the numbering operation is performed by a zipWithUniqueId command.
Optionally, the aggregation operation is performed by a reduceByKey command.
According to yet another aspect of the embodiments of the present invention, there is also provided a computer storage medium storing computer program code, which, when run on a computing device, causes the computing device to execute the method for implementing ID mapping based on Spark framework according to any one of the above.
According to still another aspect of the embodiments of the present invention, there is also provided a computing device including:
a processor; and
a memory storing computer program code;
the computer program code, when executed by the processor, causes the computing device to perform a method of implementing ID mapping based on a Spark framework according to any of the above.
In the method and device for implementing ID mapping based on the Spark framework provided by the embodiments of the present invention, each ID pair in the acquired two-dimensional ID relation table is numbered to obtain an initial number-ID pair relation table. The initial number-ID pair relation table is then split, and ID relation aggregation is performed with the ID as a key to obtain a plurality of initial number primary aggregation subsets. Next, with the initial number as a key, the plurality of initial number primary aggregation subsets are split and number relation aggregation is performed to obtain an initial number aggregation subset result, which is numbered with uniform identifiers to obtain a uniform identifier-initial number aggregation subset relation table. Finally, the correspondence between the uniform identifiers and the IDs is obtained from the uniform identifier-initial number aggregation subset relation table and the initial number-ID pair relation table, thereby achieving a uniform representation of user IDs. The method implements the ID mapping algorithm on the Spark distributed computing framework and uses ideas from set theory to implement storage, filtering, splitting, aggregation and similar operations on massive user data sets, thereby improving the efficiency, accuracy and reliability of the ID mapping algorithm.
Further, the number relation aggregation with the initial number as a key is implemented through iteration. In each iteration, the initial number aggregation subsets that are obtained after aggregation and have no intersection with other initial number aggregation subsets (i.e., that need no further aggregation) are output, so the number of remaining subsets (i.e., the subsets to be split and aggregated in the next iteration) keeps decreasing, which further reduces memory overhead and improves operating efficiency.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart illustrating a method for implementing ID mapping based on Spark framework according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an ID relationship aggregation step of a method for implementing ID mapping based on Spark framework according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating an ID relationship aggregation step of a method for implementing ID mapping based on a Spark framework according to another embodiment of the present invention;
FIG. 4 is a flowchart illustrating a numbering relationship aggregation step of a method for implementing ID mapping based on Spark framework according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a numbering relationship aggregation step of a method for implementing ID mapping based on Spark framework according to another embodiment of the present invention;
FIG. 6 is a flowchart illustrating a unified ID representation step of a method for implementing ID mapping based on a Spark framework according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram illustrating an apparatus for implementing ID mapping based on Spark framework according to an embodiment of the present invention; and
FIG. 8 is a schematic structural diagram illustrating an apparatus for implementing ID mapping based on a Spark framework according to another embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
At present, when user data is massive, the types and number of IDs are large, and the relationships among IDs are complex, an ID relationship network cannot be extracted effectively, which makes practical engineering implementation difficult.
The Spark framework is a fast, general-purpose cluster computing platform designed for large-scale data processing. It supports in-memory distributed datasets and has great advantages in processing massive data. Therefore, by using the distributed computing system of the Spark framework to store, filter, split, and merge massive datasets, the ID mapping process can be implemented efficiently.
In order to solve the above technical problem, an embodiment of the present invention provides a method for implementing ID mapping based on a Spark framework. Fig. 1 is a flowchart illustrating a method for implementing ID mapping based on a Spark framework according to an embodiment of the present invention. Referring to fig. 1, the method may include at least the following steps S1 to S4.
Step S1, data preprocessing: acquire a two-dimensional ID relationship table comprising a plurality of ID pairs, and number each ID pair to obtain an initial number-ID pair relationship table.
Step S2, ID relationship aggregation: taking the ID as the key, split and aggregate the initial number-ID pair relationship table to obtain a plurality of initial number primary aggregation subsets, each of which consists of initial numbers.
Step S3, numbering relationship aggregation: taking the initial number as the key, split and aggregate the plurality of initial number primary aggregation subsets to obtain an initial number aggregation subset result in which no two initial number aggregation subsets intersect; then number the initial number aggregation subset result with uniform identifiers to obtain a uniform identifier-initial number aggregation subset relationship table.
Step S4, unified ID representation: obtain the correspondence between the uniform identifiers and the IDs from the uniform identifier-initial number aggregation subset relationship table and the initial number-ID pair relationship table, thereby realizing unified representation of the IDs.
The method for implementing ID mapping based on the Spark framework provided by the embodiment of the invention applies ideas from mathematical set theory to the storage, filtering, splitting, and aggregation of a massive user data set, thereby improving the efficiency, accuracy, and reliability of the ID mapping algorithm.
In step S1 above, the two-dimensional ID relationship table comprising a plurality of ID pairs may be obtained by integrating a plurality of two-dimensional ID relationship source data tables acquired from the respective channels. Each two-dimensional ID relationship source data table may contain a plurality of user IDs that appear in pairs.
After the two-dimensional ID relationship table is obtained, a corresponding RDD (Resilient Distributed Dataset) is created for it and serves as the object operated on by the Spark framework.
Preferably, each ID pair in the two-dimensional ID relationship table may be numbered using the zipWithUniqueId command of the Spark framework, thereby generating a unique initial number for each ID pair.
In the above step S2, the initial number-ID pair relationship table is split and the ID relationship is aggregated with the ID as key.
In an alternative embodiment, as shown in fig. 2, step S2 may include the following steps:
Step S21: split the initial number-ID pair relationship table into an initial number-ID relationship table.
Specifically, the initial number-ID pair relationship table is split so that each ID in an ID pair corresponds individually to the initial number of that ID pair.
Step S22: taking the ID as the key, aggregate the initial number-ID relationship table to obtain a plurality of initial number primary aggregation subsets, each of which consists of initial numbers.
Preferably, the initial number-ID relationship table is aggregated using the reduceByKey command of the Spark framework with the ID as the key.
Step S23: renumber the plurality of initial number primary aggregation subsets to obtain a secondary number-initial number primary aggregation subset relationship table.
Preferably, the plurality of initial number primary aggregation subsets may be renumbered using the zipWithUniqueId command of the Spark framework.
In a preferred embodiment, as shown in FIG. 3, after step S22 has been performed to obtain the plurality of initial number primary aggregation subsets, step S2 may further include:
Step S24: determine whether each of the plurality of initial number primary aggregation subsets is an isolated subset, that is, one that has no intersection with any other initial number primary aggregation subset;
if so, output that initial number primary aggregation subset as a target initial number aggregation subset.
In this case, step S23 is adjusted accordingly:
renumber the remaining initial number primary aggregation subsets to obtain the secondary number-initial number primary aggregation subset relationship table.
Further, the determination in step S24 of whether each of the plurality of initial number primary aggregation subsets is an isolated subset may be implemented as follows:
count the number of occurrences of each of the plurality of initial number primary aggregation subsets and the number of elements it contains;
if an initial number primary aggregation subset occurs 2 times and contains 1 element, determine that it is an isolated subset.
Since an isolated subset does not intersect with other subsets, it needs no further aggregation and can be output directly. Filtering out and outputting the isolated initial number primary aggregation subsets reduces the amount of subsequent splitting and aggregation and lowers memory overhead.
In step S3 above, the plurality of initial number primary aggregation subsets may be aggregated using the reduceByKey command of the Spark framework with the initial number as the key.
Preferably, the splitting and numbering relationship aggregation of the initial number aggregation subsets, keyed by the initial number, are performed iteratively. In this case, step S3 may be implemented as follows:
Taking the initial number as the key, the plurality of initial number primary aggregation subsets are used as the objects to be split and aggregated in the first iteration, and splitting and aggregation are performed iteratively to obtain the initial number aggregation subset result. In each iteration, the initial number aggregation subsets obtained after aggregation that have no intersection with any other initial number aggregation subset are output, and the remaining initial number aggregation subsets become the objects to be split and aggregated in the next iteration. When, in some iteration, the remaining initial number aggregation subsets can no longer be aggregated with one another, they are output and the iteration terminates. The initial number aggregation subsets output in all iterations are then integrated to obtain the initial number aggregation subset result. Consequently, no two initial number aggregation subsets in the result intersect.
By outputting the initial number aggregation subset which is obtained after aggregation and has no intersection with other initial number aggregation subsets (i.e. aggregation is not required) in each iteration operation, the number of the remaining subsets (i.e. the subsets to be split and aggregated in the next iteration operation) in each iteration operation can be reduced, thereby further reducing the memory overhead and improving the operation efficiency.
In an alternative embodiment, as shown in fig. 4, step S3 may specifically include the following steps:
Step S31: split the secondary number-initial number primary aggregation subset relationship table obtained in step S23 into a secondary number-initial number-primary aggregation subset relationship table.
Specifically, the correspondence between each initial number primary aggregation subset and its secondary number is split into correspondences between each initial number in that subset and the secondary number of the subset.
Step S32: taking the initial number as the key, aggregate the secondary number-initial number-primary aggregation subset relationship table to obtain one or more initial number secondary aggregation subsets.
Preferably, the secondary number-initial number-primary aggregation subset relationship table may be aggregated using the reduceByKey command of the Spark framework with the initial number as the key.
Step S33: filter out the initial number secondary aggregation subsets that have no intersection with any other initial number secondary aggregation subset, and output them as target initial number aggregation subsets.
Step S34: deduplicate the remaining initial number secondary aggregation subsets and renumber them to obtain a tertiary number-initial number secondary aggregation subset relationship table.
Preferably, the deduplicated remaining initial number secondary aggregation subsets may be renumbered using the zipWithUniqueId command of the Spark framework.
Step S35: repeat steps S31 to S34 for n iterations until the number of remaining initial number (n+2)-th aggregation subsets is 0 or 1, and output the remaining initial number (n+2)-th aggregation subset, if any, as a target initial number aggregation subset, where n is a natural number.
Step S36: integrate the target initial number aggregation subsets output in the preceding steps to obtain the initial number aggregation subset result, and number the result with uniform identifiers to obtain the uniform identifier-initial number aggregation subset relationship table.
In this step, the uniform identifier, which may be denoted dmid, serves as the uniform number of the resulting initial number aggregation subsets.
In a preferred embodiment, as shown in FIG. 5, step S32 may be further implemented as follows:
taking the initial number as the key, aggregate the secondary number-initial number-primary aggregation subset relationship table to obtain one or more initial number secondary aggregation subsets and the corresponding secondary number aggregation subsets, where each secondary number aggregation subset consists of secondary numbers, and each initial number secondary aggregation subset is the union of the initial number primary aggregation subsets corresponding to the secondary numbers in its corresponding secondary number aggregation subset.
Accordingly, step S33 may be further implemented as follows:
determine whether each of the one or more secondary number aggregation subsets is an isolated subset, that is, one that has no intersection with any other secondary number aggregation subset; if so, deduplicate the initial number secondary aggregation subset corresponding to that secondary number aggregation subset and output it as a target initial number aggregation subset.
Further, the determination in step S33 of whether each of the one or more secondary number aggregation subsets is an isolated subset may be implemented as follows:
count the number of elements contained in each secondary number aggregation subset;
if a secondary number aggregation subset contains 1 element, determine that it is an isolated subset.
When a secondary number aggregation subset contains only 1 element, its corresponding initial number secondary aggregation subset can no longer be aggregated (or merged) with any other initial number secondary aggregation subset, so that subset can be output directly after deduplication.
Further, after the number of elements contained in each secondary number aggregation subset has been counted in step S33, the following may also be performed:
determine whether the number of elements contained in each secondary number aggregation subset is greater than a given threshold;
if so, deduplicate the initial number secondary aggregation subset corresponding to that secondary number aggregation subset and output it as a target initial number aggregation subset.
In this step, the given threshold may be set to 20, 30, 50, or another value according to actual needs. Introducing a threshold on the number of elements in a subset limits the number of IDs owned by each user and also prevents data skew during computation, so that no single computing node has to handle an excessive number of IDs and memory overflow is avoided.
In the above step S4, the final mapping result of the user ID is obtained according to the unified identifier-initial number aggregation subset relationship table obtained in step S3 and the initial number-ID pair relationship table obtained in step S1.
In an alternative embodiment, as shown in fig. 6, step S4 may specifically include the following steps:
Step S41: split the uniform identifier-initial number aggregation subset relationship table obtained in step S3 into an initial number-uniform identifier relationship table.
Specifically, the correspondence between each initial number aggregation subset and its uniform identifier is split into correspondences between each initial number in that subset and the uniform identifier of the subset.
Step S42: obtain a uniform identifier-ID relationship table from the initial number-uniform identifier relationship table and the initial number-ID pair relationship table.
Preferably, the leftOuterJoin command is executed on the initial number-uniform identifier relationship table and the initial number-ID pair relationship table, and the uniform identifier and the IDs are then separated out through the map command, giving the uniform identifier-ID relationship table.
Step S43: taking the uniform identifier as the key, aggregate the uniform identifier-ID relationship table to obtain a unified ID representation table.
Preferably, the uniform identifier-ID relationship table may be aggregated using the reduceByKey command of the Spark framework with the uniform identifier as the key.
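Steps S41 to S43 can be sketched in plain Python, without a Spark cluster, as a minimal illustration. The dictionary lookup stands in for the join and the grouping loop stands in for the keyed aggregation; the function name `unify_ids` and the variable names are hypothetical, and the input values are copied from Tables 7 and 18 of the specific embodiment described later.

```python
from collections import defaultdict

# Values copied from Table 18 (dmid -> initial numbers) and Table 7 (num1, id1, id2).
dmid_to_num1set = {1: {1, 2, 6}, 2: {3, 5}, 3: {4, 7}}
initial_table = [
    (1, "imei_1", "aid_1"),
    (2, "imei_1", "sn_1"),
    (3, "imei_2", "sn_3"),
    (4, "imei_3", "tel_1"),
    (5, "aid_2", "sn_3"),
    (6, "aid_1", "mac_1"),
    (7, "sn_2", "tel_1"),
]


def unify_ids(dmid_to_num1set, initial_table):
    """Hypothetical helper: map every uniform identifier to the IDs it covers."""
    # S41: split each (dmid, subset) row into (num1 -> dmid) entries.
    num1_to_dmid = {
        num1: dmid
        for dmid, num1set in dmid_to_num1set.items()
        for num1 in num1set
    }
    # S42: join on the initial number and keep only (dmid, id) rows.
    dmid_id_rows = []
    for num1, id1, id2 in initial_table:
        dmid = num1_to_dmid[num1]
        dmid_id_rows.extend([(dmid, id1), (dmid, id2)])
    # S43: aggregate the IDs with the uniform identifier as the key.
    unified = defaultdict(set)
    for dmid, id_ in dmid_id_rows:
        unified[dmid].add(id_)
    return dict(unified)


unified = unify_ids(dmid_to_num1set, initial_table)
for dmid in sorted(unified):
    print(dmid, sorted(unified[dmid]))
```

In a real Spark job the same three stages would be distributed across partitions; the sketch only shows the data flow on a single machine.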
Various implementations of each step of the embodiment shown in FIG. 1 have been introduced above. The implementation of the method for implementing ID mapping based on the Spark framework according to the present invention is described in detail below through a specific embodiment.
In this embodiment, it is assumed that the IDs in the source data are of the following types: imei (International Mobile Equipment Identity), aid (Android ID), sn (Serial Number), mac (Media Access Control), and tel (Telephone). These IDs are obtained from the various channels and appear in pairs, each pairing forming a complete two-dimensional ID relationship source data table (for 5 ID types there are at most $C_5^2 = 10$ source data tables). For simplicity, in this embodiment the values of these IDs are represented by natural numbers, and all the two-dimensional ID relationship source data tables are assumed to be as follows:
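[Tables 1 to 6, the two-dimensional ID relationship source data tables, are images in the original publication and are not reproduced here.]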
The method for implementing ID mapping based on the Spark framework according to this embodiment comprises the following steps:
(1) data preprocessing: integrating a plurality of two-dimensional ID relation source data tables into a two-dimensional ID relation table comprising a plurality of ID pairs, creating corresponding RDD for the two-dimensional ID relation table, numbering each ID pair in the two-dimensional ID relation table by using a zipWithUniqueId command, and obtaining an initial number-ID pair relation table. The initial number-ID pair table includes three columns, one of which is the initial number and the other two of which are the IDs, indicating the relationship between the initial number and the IDs.
Specifically, in this embodiment, the source data tables in tables 1 to 6 are preprocessed, so as to obtain the initial number-ID pair relationship table in the form of (num1, ID1, ID2) as shown in table 7 below, where num1 column is the initial number, and ID1 column and ID2 column are two IDs in each ID relationship pair.
TABLE 7 initial number-ID pair relationship Table
num1 id1 id2
1 imei_1 aid_1
2 imei_1 sn_1
3 imei_2 sn_3
4 imei_3 tel_1
5 aid_2 sn_3
6 aid_1 mac_1
7 sn_2 tel_1
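The numbering in step (1) can be illustrated with a small plain-Python sketch. It is only a stand-in for the zipWithUniqueId command: it assigns consecutive numbers starting at 1 so that the output matches Table 7, whereas Spark's zipWithUniqueId guarantees unique numbers but not consecutive ones. The function name `number_pairs` is hypothetical, and the ID pairs are copied from Table 7 because the source tables are not reproduced in this text.

```python
# ID pairs in the order in which they appear in Table 7.
id_pairs = [
    ("imei_1", "aid_1"),
    ("imei_1", "sn_1"),
    ("imei_2", "sn_3"),
    ("imei_3", "tel_1"),
    ("aid_2", "sn_3"),
    ("aid_1", "mac_1"),
    ("sn_2", "tel_1"),
]


def number_pairs(pairs, start=1):
    """Hypothetical helper: attach a unique initial number to each ID pair."""
    return [(num1, id1, id2) for num1, (id1, id2) in enumerate(pairs, start=start)]


initial_table = number_pairs(id_pairs)
for row in initial_table:
    print(row)
```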
(2) Aggregation of ID relationships: the serialized initial number-ID pair relationship table shown in table 7 is split, and the ID is taken as a key, and a reduceByKey command is utilized for aggregation.
Specifically, the step of aggregating the ID relationships comprises the following sub-steps:
(2.1) splitting the relation table of initial number-ID pairs: splitting the initial number-ID pair relation table obtained in the step (1) into an initial number-ID relation table. In practice, two IDs of each row in table 7 are disassembled, and one row of data is disassembled into two rows of data. After splitting the initial number-ID pair relationship table shown in table 7, the initial number-ID relationship table shown in table 8 below is obtained.
TABLE 8 initial number-ID relationship Table
[Table 8 is an image in the original publication and is not reproduced here.]
(2.2) Aggregation with the ID as the key: the initial number-ID relationship table is aggregated with the reduceByKey command using the ID as the key, giving a plurality of initial number primary aggregation subsets, each of which consists of initial numbers.
In this embodiment, after aggregating the initial number-ID relationship table shown in table 8 by using the reduceByKey command with ID as key, the ID-initial number primary aggregation subset relationship table shown in table 9 below is obtained. Where, num1sets column in table 9 shows a plurality of initial number once aggregation subsets obtained after aggregation.
TABLE 9 ID-initial number one-time aggregation subset relation Table
id num1sets
imei_1 1,2
aid_1 1,6
sn_1 2
imei_2 3
sn_3 3,5
imei_3 4
tel_1 4,7
aid_2 5
mac_1 6
sn_2 7
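Sub-steps (2.1) and (2.2) can be sketched in plain Python as follows. This is an illustrative single-machine stand-in for the split and the reduceByKey aggregation, not the Spark job itself; `split_pairs` and `aggregate_by_id` are hypothetical names, and the input rows are those of Table 7.

```python
from collections import defaultdict

# Rows of Table 7: (num1, id1, id2).
initial_table = [
    (1, "imei_1", "aid_1"),
    (2, "imei_1", "sn_1"),
    (3, "imei_2", "sn_3"),
    (4, "imei_3", "tel_1"),
    (5, "aid_2", "sn_3"),
    (6, "aid_1", "mac_1"),
    (7, "sn_2", "tel_1"),
]


def split_pairs(table):
    """Step (2.1): turn each (num1, id1, id2) row into two (id, num1) rows."""
    rows = []
    for num1, id1, id2 in table:
        rows.append((id1, num1))
        rows.append((id2, num1))
    return rows


def aggregate_by_id(rows):
    """Step (2.2): collect, for every ID, the set of initial numbers it occurs in."""
    groups = defaultdict(set)
    for id_, num1 in rows:
        groups[id_].add(num1)
    return dict(groups)


id_to_num1sets = aggregate_by_id(split_pairs(initial_table))
for id_, num1set in id_to_num1sets.items():
    print(id_, sorted(num1set))
```

Each value of `id_to_num1sets` corresponds to one row of the num1sets column of Table 9.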
(2.3) filtering and outputting the isolated subset: and (3) judging, screening and outputting the isolated subsets in the plurality of initial number primary aggregation subsets obtained in the step (2.2) in parallel by utilizing a Spark command.
Whether each of the plurality of initial number primary aggregation subsets is an isolated subset is determined as follows: the number of occurrences and the number of elements of each subset are counted in a distributed manner, and if an initial number primary aggregation subset contains 1 element and occurs 2 times, it is determined to be an isolated subset. An isolated initial number primary aggregation subset does not intersect with any other initial number primary aggregation subset and can be output directly as a target initial number aggregation subset. This criterion works for the following reason. Taking Table 8 as an example, each initial number appears twice in the num1 column of Table 8. If the ID pair of a certain initial number shares no ID with any other ID pair, then after aggregation with reduceByKey using the ID as the key, that initial number alone forms an initial number primary aggregation subset, and this subset appears twice.
In this embodiment, the number of occurrences of each initial number primary aggregation subset in the num1sets column of Table 9 is counted, and each subset is found to occur only once. Therefore no isolated subset exists, and all the initial number primary aggregation subsets can continue to be aggregated (or merged).
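The isolated-subset test of sub-step (2.3) can be illustrated with a short plain-Python sketch. The function name `find_isolated` is hypothetical. The first call uses the subsets of Table 9; the second call adds an invented initial number 8, assumed to belong to an ID pair whose two IDs occur nowhere else, purely to show a positive case.

```python
from collections import Counter


def find_isolated(subsets):
    """Return the subsets that contain exactly 1 element and occur exactly 2 times."""
    counts = Counter(frozenset(s) for s in subsets)
    return {s for s, occurrences in counts.items() if len(s) == 1 and occurrences == 2}


# num1sets column of Table 9.
table9_subsets = [{1, 2}, {1, 6}, {2}, {3}, {3, 5}, {4}, {4, 7}, {5}, {6}, {7}]
isolated = find_isolated(table9_subsets)

# Hypothetical extra pair number 8 whose two IDs appear in no other pair:
# aggregation by ID then yields the subset {8} twice.
isolated_with_extra = find_isolated(table9_subsets + [{8}, {8}])

print(isolated)
print(isolated_with_extra)
```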
(2.4) Renumbering the remaining initial number primary aggregation subsets: the remaining initial number primary aggregation subsets are renumbered using the zipWithUniqueId command, giving a secondary number-initial number primary aggregation subset relationship table.
In this embodiment, because there is no isolated subset in the multiple primary aggregation subsets with initial numbers shown in table 9, the remaining primary aggregation subsets with initial numbers are all the primary aggregation subsets with initial numbers in table 9. After renumbering the primary-numbered aggregation subsets in the num1sets column of table 9, a secondary-numbered-primary-numbered aggregation subset relationship table as shown in table 10 below is obtained. In table 10, num2 column indicates the secondary number.
TABLE 10 Secondary number-initial number primary aggregation subset relationship table
num2 num1sets
1 1,2
2 1,6
3 2
4 3
5 3,5
6 4
7 4,7
8 5
9 6
10 7
(3) Aggregation of numbering relationships: the aggregation results in Table 10 above are split and aggregated iteratively. In each iteration, the result of the previous aggregation is first split and then aggregated with the reduceByKey command using the initial number as the key; the isolated subsets obtained after aggregation are output, and the remaining subsets are split and aggregated again. This is repeated until no remaining subset needs further aggregation, at which point the iteration terminates. The outputs of all iterations are then integrated and numbered with uniform identifiers to obtain the aggregation result (namely, the uniform identifier-initial number aggregation subset relationship table, which in effect represents the relationship between the uniform identifiers and the initial numbers).
Specifically, the aggregation step of the numbering relationship comprises the following sub-steps:
(3.1) Splitting the secondary number-initial number primary aggregation subset relationship table: each row of the secondary number-initial number primary aggregation subset relationship table shown in Table 10, obtained in step (2.4), is split into multiple rows. Specifically, the data in the num1sets column of Table 10 is split and placed in the first column of the new table (as the key), giving the secondary number-initial number-primary aggregation subset relationship table shown in Table 11 below. In Table 11, the num1 column holds the initial numbers obtained by splitting the num1sets column of Table 10, and the num2 and num1sets columns hold the corresponding secondary numbers and initial number primary aggregation subsets, respectively.
TABLE 11 Secondary number-initial number-primary aggregation subset relationship table
num1 num2 num1sets
1 1 1,2
2 1 1,2
1 2 1,6
6 2 1,6
2 3 2
3 4 3
3 5 3,5
5 5 3,5
4 6 4
4 7 4,7
7 7 4,7
5 8 5
6 9 6
7 10 7
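The split of sub-step (3.1) can be sketched in plain Python as a flat-map over the rows of Table 10. The function name `split_subsets` is hypothetical, and the input dictionary reproduces Table 10 (secondary number mapped to its initial number primary aggregation subset).

```python
# Table 10: num2 -> num1sets.
table10 = {
    1: {1, 2}, 2: {1, 6}, 3: {2}, 4: {3}, 5: {3, 5},
    6: {4}, 7: {4, 7}, 8: {5}, 9: {6}, 10: {7},
}


def split_subsets(numbered_subsets):
    """Emit one (num1, num2, subset) row per initial number in every subset."""
    rows = []
    for num2, num1set in numbered_subsets.items():
        for num1 in sorted(num1set):
            rows.append((num1, num2, frozenset(num1set)))
    return rows


table11 = split_subsets(table10)
for num1, num2, subset in table11:
    print(num1, num2, sorted(subset))
```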
(3.2) Aggregation with the initial number as the key: the secondary number-initial number-primary aggregation subset relationship table is aggregated with the reduceByKey command using the initial number as the key, giving a plurality of initial number secondary aggregation subsets and the corresponding secondary number aggregation subsets. Each secondary number aggregation subset consists of secondary numbers, and each initial number secondary aggregation subset is the union of the initial number primary aggregation subsets corresponding to the secondary numbers in its corresponding secondary number aggregation subset.
In this embodiment, after the table 11 is aggregated, an initial number-secondary number aggregation subset-initial number secondary aggregation subset relation table shown in the following table 12 is obtained, where two columns of data, num2sets and num1sets, are a plurality of secondary number aggregation subsets obtained after aggregation and corresponding initial number secondary aggregation subsets, respectively.
Table 12 relationship table of initial number-secondary number aggregation subset-initial number secondary aggregation subset
num1 num2sets num1sets
1 1,2 1,2,6
2 1,3 1,2
3 4,5 3,5
4 6,7 4,7
5 5,8 3,5
6 2,9 1,6
7 7,10 4,7
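The aggregation of sub-step (3.2) can be illustrated in plain Python: for every initial number, the secondary numbers are collected into one set and the attached subsets are merged into another. This is a single-machine stand-in for the reduceByKey command; `aggregate_by_num1` is a hypothetical name and the input rows are those of Table 11.

```python
from collections import defaultdict

# Rows of Table 11: (num1, num2, num1sets).
table11 = [
    (1, 1, {1, 2}), (2, 1, {1, 2}), (1, 2, {1, 6}), (6, 2, {1, 6}),
    (2, 3, {2}), (3, 4, {3}), (3, 5, {3, 5}), (5, 5, {3, 5}),
    (4, 6, {4}), (4, 7, {4, 7}), (7, 7, {4, 7}), (5, 8, {5}),
    (6, 9, {6}), (7, 10, {7}),
]


def aggregate_by_num1(rows):
    """Return num1 -> (set of secondary numbers, union of the attached subsets)."""
    num2sets = defaultdict(set)
    num1sets = defaultdict(set)
    for num1, num2, subset in rows:
        num2sets[num1].add(num2)
        num1sets[num1] |= subset
    return {num1: (num2sets[num1], num1sets[num1]) for num1 in num2sets}


table12 = aggregate_by_num1(table11)
for num1 in sorted(table12):
    num2set, num1set = table12[num1]
    print(num1, sorted(num2set), sorted(num1set))
```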
(3.3) Filtering and outputting the sets that need no further aggregation: using a Spark command, the number of elements contained in each of the secondary number aggregation subsets obtained in step (3.2) (that is, the subsets in the num2sets column of Table 12) is counted in parallel. If a secondary number aggregation subset contains 1 element, it is determined to be an isolated subset. An isolated secondary number aggregation subset does not intersect with any other secondary number aggregation subset, and its corresponding initial number secondary aggregation subset is likewise isolated (that is, it does not intersect with any other initial number secondary aggregation subset). The initial number secondary aggregation subset corresponding to an isolated secondary number aggregation subset can therefore be deduplicated and output as a target initial number aggregation subset. Further, after the number of elements in each secondary number aggregation subset has been counted, it is also determined whether that number is greater than a given threshold (for example, 50); if so, the corresponding initial number secondary aggregation subset is deduplicated and output as a target initial number aggregation subset.
In this example, there are no isolated subsets in the num2sets column of Table 12.
(3.4) Renumbering the remaining sets: the remaining initial number secondary aggregation subsets are deduplicated using a Spark command and then renumbered using the zipWithUniqueId command, giving a tertiary number-initial number secondary aggregation subset relationship table.
In this embodiment, since there is no isolated subset in the num2sets column of Table 12, the remaining initial number secondary aggregation subsets are all the subsets in the num1sets column. After the num1sets column data of Table 12 is deduplicated and renumbered, the tertiary number-initial number secondary aggregation subset relationship table shown in Table 13 below is obtained. In Table 13, the num2 column holds the tertiary numbers (still named num2 for convenience of iteration), and the num1sets column holds the initial number secondary aggregation subsets.
TABLE 13 Tertiary number-initial number secondary aggregation subset relationship table
num2 num1sets
1 1,2,6
2 1,2
3 3,5
4 4,7
5 1,6
It is noted that after steps (3.1) - (3.4), the number of initial number aggregation subsets has been reduced from 10 in table 10 to 5 in table 13.
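The deduplication and renumbering of sub-step (3.4) can be sketched in plain Python. The function name `dedup_and_renumber` is hypothetical; consecutive numbers in first-seen order are used only so that the output lines up with Table 13, whereas the zipWithUniqueId command guarantees uniqueness rather than any particular order.

```python
def dedup_and_renumber(subsets, start=1):
    """Drop repeated subsets (keeping first-seen order) and assign new numbers."""
    unique = list(dict.fromkeys(frozenset(s) for s in subsets))
    return {num: set(s) for num, s in enumerate(unique, start=start)}


# num1sets column of Table 12, in row order.
table12_num1sets = [{1, 2, 6}, {1, 2}, {3, 5}, {4, 7}, {3, 5}, {1, 6}, {4, 7}]
table13 = dedup_and_renumber(table12_num1sets)
for num2, num1set in table13.items():
    print(num2, sorted(num1set))
```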
(3.5) Iterating by analogy: steps (3.1) to (3.4) are repeated for n iterations until the number of remaining initial number (n+2)-th aggregation subsets is 0 or 1.
In this embodiment, the specific iterative operation is as follows:
Repeating step (3.1) on the tertiary number-initial number secondary aggregation subset relationship table shown in Table 13 gives the tertiary number-initial number-secondary aggregation subset relationship table shown in Table 14 below. In Table 14, the num1, num2, and num1sets columns hold the initial numbers, the tertiary numbers, and the initial number secondary aggregation subsets, respectively.
TABLE 14 Tertiary number-initial number-secondary aggregation subset relationship table
[Table 14 is an image in the original publication and is not reproduced here.]
Repeating step (3.2) on Table 14 gives the initial number-tertiary number aggregation subset-initial number tertiary aggregation subset relationship table shown in Table 15 below, where the num2sets and num1sets columns hold the tertiary number aggregation subsets obtained after aggregation and the corresponding initial number tertiary aggregation subsets, respectively.
TABLE 15 Initial number-tertiary number aggregation subset-initial number tertiary aggregation subset relationship table
num1 num2sets num1sets
1 1,2,5 1,2,6
2 1,2 1,2,6
3 3 3,5
4 4 4,7
5 3 3,5
6 1,5 1,2,6
7 4 4,7
Step (3.3) is performed on Table 15. The tertiary number aggregation subsets {3} and {4} in the num2sets column of Table 15 contain only 1 element, so the corresponding initial number tertiary aggregation subsets in the num1sets column can no longer be aggregated with any other initial number tertiary aggregation subset. They are screened out and output, giving the target initial number aggregation subsets shown in Table 16 below.
TABLE 16 Target initial number aggregation subsets
num2sets num1sets
3 3,5
4 4,7
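The filtering rule of step (3.3), including the optional size threshold, can be illustrated with the rows of Table 15 in plain Python. The function name `filter_finished` is hypothetical, and the default threshold of 50 is simply the example value mentioned in the text.

```python
# Table 15: num1 -> (num2sets, num1sets).
table15 = {
    1: ({1, 2, 5}, {1, 2, 6}),
    2: ({1, 2}, {1, 2, 6}),
    3: ({3}, {3, 5}),
    4: ({4}, {4, 7}),
    5: ({3}, {3, 5}),
    6: ({1, 5}, {1, 2, 6}),
    7: ({4}, {4, 7}),
}


def filter_finished(table, threshold=50):
    """Split the merged subsets into finished ones and ones that still need merging."""
    finished, remaining = [], []
    for num2set, num1set in table.values():
        # Finished if the number set is isolated (1 element) or exceeds the threshold.
        is_finished = len(num2set) == 1 or len(num2set) > threshold
        (finished if is_finished else remaining).append(frozenset(num1set))
    # Deduplicate both lists while keeping first-seen order.
    return list(dict.fromkeys(finished)), list(dict.fromkeys(remaining))


finished, remaining = filter_finished(table15)
print([sorted(s) for s in finished])
print([sorted(s) for s in remaining])
```

With the data of Table 15, `finished` holds the two subsets of Table 16 and `remaining` holds the single subset that goes on to the next renumbering step.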
Step (3.4) is performed on the remaining initial number tertiary aggregation subsets in Table 15, giving the quaternary number-initial number tertiary aggregation subset relationship table shown in Table 17 below. In Table 17, the num2 column holds the quaternary numbers (still named num2 for convenience of iteration), and the num1sets column holds the initial number tertiary aggregation subsets.
TABLE 17 Quaternary number-initial number tertiary aggregation subset relationship table
num2 num1sets
1 1,2,6
Thus, only one initial number tertiary aggregation subset, {1,2,6}, remains in the num1sets column, so no further aggregation is needed. The loop ends, and the remaining initial number tertiary aggregation subset is output as a target initial number aggregation subset.
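Step (3.4) and the termination check can be sketched as follows, using the rows of Table 15 whose num2sets column has more than one element. This is a plain-Python illustration under stated assumptions: the numbering uses enumerate as a stand-in for zipWithUniqueId (whose identifiers need not be consecutive), and the variable names are hypothetical.

```python
# Rows of Table 15 that remain after step (3.3): (num2sets, num1sets).
remaining_rows = [
    ({1, 2, 5}, {1, 2, 6}),
    ({1, 2}, {1, 2, 6}),
    ({1, 5}, {1, 2, 6}),
]

# De-duplicate the num1sets column.
unique_num1sets = sorted({tuple(sorted(num1sets)) for _, num1sets in remaining_rows})

# Renumber the remaining subsets starting from 1.
table_17 = [(num2, list(subset)) for num2, subset in enumerate(unique_num1sets, start=1)]

# The loop ends when at most one subset remains.
loop_finished = len(table_17) <= 1
print(table_17, loop_finished)  # [(1, [1, 2, 6])] True
```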
(3.6) Integrating and outputting the results: the target initial number aggregation subsets output in steps (2.3), (3.3), and (3.5) are integrated and numbered uniformly with a unified identifier (denoted dmid), yielding the unified identifier-initial number aggregation subset relation table shown in Table 18 below, where the dmid column holds the unified identifier data and the num1sets column holds the initial number aggregation subset data.
Table 18. Unified identifier-initial number aggregation subset relation table
dmid    num1sets
1       1,2,6
2       3,5
3       4,7
(4) Unified representation of ID: the correspondence between the unified identifier and the ID is obtained from the unified identifier-initial number aggregation subset relation table obtained in step (3) and the initial number-ID pair relation table obtained in step (1), thereby realizing the unified representation of the user ID.
Specifically, the step of uniformly representing the ID includes the following sub-steps:
(4.1) Splitting the unified identifier-initial number aggregation subset relation table: the initial number aggregation subset column of the unified identifier-initial number aggregation subset relation table shown in Table 18 is exploded so that each row of Table 18 becomes multiple rows, and each (dmid, initial number) relation pair is swapped into an (initial number, dmid) relation pair, yielding the initial number-unified identifier relation table shown in Table 19 below. In Table 19, the num1 column holds the initial number data, and the dmid column holds the unified identifier data.
Table 19. Initial number-unified identifier relation table
num1    dmid
1       1
2       1
6       1
3       2
5       2
4       3
7       3
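The split of step (4.1) amounts to exploding each subset and swapping the pair. A minimal plain-Python sketch on the data of Table 18 follows; in Spark this would typically be a flatMap, which is an assumption here since the text does not name the command used for this step.

```python
# Table 18: (dmid, num1sets).
table_18 = [
    (1, [1, 2, 6]),
    (2, [3, 5]),
    (3, [4, 7]),
]

# Explode each subset into one row per initial number and swap the pair
# from (dmid, num1) to (num1, dmid).
table_19 = [(num1, dmid) for dmid, num1sets in table_18 for num1 in num1sets]
print(table_19)
# [(1, 1), (2, 1), (6, 1), (3, 2), (5, 2), (4, 3), (7, 3)]
```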
(4.2) Obtaining the unified identifier-ID relation table: the unified identifier-ID relation table is obtained from the initial number-unified identifier relation table obtained in step (4.1) and the initial number-ID pair relation table obtained in step (1).
In this embodiment, a leftOuterJoin command is executed on the initial number-unified identifier relation table shown in Table 19 and the initial number-ID relation table shown in Table 7, and the dmid and id fields are then extracted in a distributed manner by a map command, yielding the unified identifier-ID relation table shown in Table 20 below.
Table 20. Unified identifier-ID relation table
dmid    id
1       imei_1
1       aid_1
1       sn_1
1       mac_1
2       imei_2
2       sn_3
2       aid_2
3       imei_3
3       tel_1
3       sn_2
(4.3) Generating the unified representation table of IDs: the unified identifier-ID relation table shown in Table 20 is aggregated by a reduceByKey command with dmid as key, yielding the unified representation table of IDs shown in Table 21 below. In Table 21, the idsets column holds the ID aggregation subset data.
Table 21. Unified representation table of IDs
dmid    idsets
1       imei_1,aid_1,sn_1,mac_1
2       imei_2,sn_3,aid_2
3       imei_3,tel_1,sn_2
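The aggregation of step (4.3) can be illustrated on the data of Table 20. The sketch below uses a small plain-Python helper, reduce_by_key, as a hypothetical stand-in for an aggregation by key; it is not the Spark API itself.

```python
# Table 20: (dmid, id) rows.
table_20 = [
    (1, "imei_1"), (1, "aid_1"), (1, "sn_1"), (1, "mac_1"),
    (2, "imei_2"), (2, "sn_3"), (2, "aid_2"),
    (3, "imei_3"), (3, "tel_1"), (3, "sn_2"),
]

def reduce_by_key(pairs, combine):
    """Combine all values that share a key, in input order."""
    result = {}
    for key, value in pairs:
        result[key] = combine(result[key], value) if key in result else value
    return result

# Wrap each id in a list so that the lists can be concatenated per dmid.
table_21 = reduce_by_key(((dmid, [id_]) for dmid, id_ in table_20), lambda a, b: a + b)
print(table_21)
```

Each dmid then maps to the list of IDs shown in the idsets column of Table 21.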
Equivalently, Table 21 may also be organized into the form of Table 22 as follows (a dash marks an empty cell):
Table 22. Unified representation of IDs
dmid    imei    aid    sn    mac    tel
1       1       1      1     1      -
2       2       2      3     -      -
3       3       -      2     -      1
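The reorganization of Table 21 into one column per ID type can be sketched as below. This is an illustration only: it assumes that every ID has the form "type_index" and that each dmid holds at most one ID per type, which is true of the example data but is not stated as a general property.

```python
# Table 21: dmid -> list of IDs of the form "<type>_<index>".
table_21 = {
    1: ["imei_1", "aid_1", "sn_1", "mac_1"],
    2: ["imei_2", "sn_3", "aid_2"],
    3: ["imei_3", "tel_1", "sn_2"],
}
columns = ["imei", "aid", "sn", "mac", "tel"]

table_22 = {}
for dmid, ids in table_21.items():
    row = dict.fromkeys(columns)  # None marks an empty cell
    for id_ in ids:
        id_type, index = id_.rsplit("_", 1)
        row[id_type] = int(index)
    table_22[dmid] = row

for dmid, row in table_22.items():
    print(dmid, [row[c] for c in columns])
```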
Thus, the final ID mapping result is obtained, and the uniform representation of the user ID is realized.
Based on the same inventive concept, an embodiment of the present invention further provides a device for implementing ID mapping based on a Spark framework, which is used to support the method for implementing ID mapping based on a Spark framework provided in any one of the above embodiments or a combination thereof. Fig. 7 is a schematic structural diagram illustrating an apparatus 700 for implementing ID mapping based on a Spark framework according to an embodiment of the present invention. Referring to fig. 7, the apparatus may include at least: a data preprocessing module 710, an ID relationship aggregation module 720, a number relationship aggregation module 730, and an ID unified representation module 740.
The functions of the components of the apparatus for implementing ID mapping based on the Spark framework, and the connections between them, are described below:
The data preprocessing module 710 is adapted to obtain a two-dimensional ID relationship table including a plurality of ID pairs, and to number each ID pair to obtain an initial number-ID pair relationship table.
The ID relation aggregation module 720 is connected to the data preprocessing module 710, and is adapted to split and aggregate the initial number-ID pair relation table with the ID as key to obtain a plurality of initial number primary aggregation subsets, where each initial number primary aggregation subset consists of initial numbers.
The number relation aggregation module 730 is connected with the ID relation aggregation module 720, and is adapted to split and aggregate the multiple initial number primary aggregation subsets by using the initial numbers as keys to obtain initial number aggregation subset results, where no intersection exists between any two initial number aggregation subsets in the initial number aggregation subset results; and numbering the initial number aggregation subset results by using the uniform identifier to obtain a relationship table of uniform identifier-initial number aggregation subset.
The ID uniform representation module 740 may be connected to the number relationship aggregation module 730 and the data preprocessing module 710, respectively, and is adapted to obtain a corresponding relationship between the uniform identifier and the ID according to the uniform identifier-initial number aggregation subset relationship table and the initial number-ID pair relationship table, so as to implement uniform representation of the ID.
In an alternative embodiment, the numbering relationship aggregation module 730 is further adapted to:
taking the initial number as key and the plurality of initial number primary aggregation subsets as the objects to be split and aggregated in the first iteration, iteratively performing splitting and aggregation to obtain the initial number aggregation subset result; in each iteration, outputting the initial number aggregation subsets that are obtained after aggregation and have no intersection with other initial number aggregation subsets, and taking the remaining initial number aggregation subsets as the objects to be split and aggregated in the next iteration; when, in some iteration, the remaining initial number aggregation subsets can no longer be aggregated with one another, outputting the remaining initial number aggregation subsets and terminating the iteration; and integrating the initial number aggregation subsets output in all iterations to obtain the initial number aggregation subset result.
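The iterative split-and-aggregate operation can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function name and the input data are hypothetical, and the isolation test here checks directly that no element of a subset occurs in any other subset, rather than using the element-count shortcut described elsewhere in the text.

```python
from collections import Counter, defaultdict

def aggregate_until_disjoint(primary_subsets):
    """Iteratively merge overlapping subsets; return the disjoint target subsets."""
    remaining = {frozenset(s) for s in primary_subsets}
    targets = []
    while remaining:
        # Split into (initial number, subset) pairs, then aggregate:
        # each initial number collects the union of the subsets containing it.
        merged = defaultdict(set)
        for subset in remaining:
            for num1 in subset:
                merged[num1] |= subset
        candidates = {frozenset(s) for s in merged.values()}  # de-duplicate

        # A candidate is isolated if none of its elements occurs in another candidate.
        occurrences = Counter(num1 for s in candidates for num1 in s)
        isolated = {s for s in candidates if all(occurrences[n] == 1 for n in s)}

        # Output the isolated subsets and iterate on the rest.
        targets.extend(isolated)
        remaining = candidates - isolated
    return sorted(sorted(s) for s in targets)

result = aggregate_until_disjoint([{1, 2}, {2, 3}, {3, 4}, {5, 6}, {7}])
print(result)  # [[1, 2, 3, 4], [5, 6], [7]]
```

In this run the subsets {5, 6} and {7} are output in the first iteration, so only the overlapping subsets are carried into the following iterations.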
In an alternative embodiment, as shown in fig. 8, the ID relationship aggregation module 720 may include:
a first splitting unit 721 adapted to split the initial number-ID pair relationship table into initial number-ID relationship tables;
a first aggregation unit 722, connected to the first splitting unit 721, and adapted to aggregate the initial number-ID relation table with the ID as key to obtain a plurality of initial number primary aggregation subsets, where each initial number primary aggregation subset consists of initial numbers; and
a first numbering unit 723, connected to the first aggregation unit 722, and adapted to renumber the plurality of initial number primary aggregation subsets to obtain a secondary number-initial number primary aggregation subset relation table.
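The split and aggregation performed by the first splitting unit and the first aggregation unit can be sketched as follows. The ID values and variable names are hypothetical and chosen only for illustration; the grouping is written in plain Python rather than with Spark commands.

```python
from collections import defaultdict

# Hypothetical initial number-ID pair relation table: num1 -> (ID, ID).
numbered_pairs = {
    1: ("imei_a", "sn_a"),
    2: ("sn_a", "mac_a"),
    3: ("imei_b", "tel_b"),
}

# Split: one (ID, num1) row per ID of each pair.
id_rows = [(id_, num1) for num1, pair in numbered_pairs.items() for id_ in pair]

# Aggregate with the ID as key: each ID collects the initial numbers it appears in.
primary_subsets = defaultdict(set)
for id_, num1 in id_rows:
    primary_subsets[id_].add(num1)

print(dict(primary_subsets))
```

Here the ID sn_a appears in pairs 1 and 2, so its initial number primary aggregation subset is {1, 2}, while every other ID yields a single-element subset.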
Preferably, still referring to fig. 8, the ID relationship aggregation module 720 may further include a first filter output unit 724. The first filtering output unit 724 may be connected to the first aggregating unit 722 and the first numbering unit 723, respectively, and is adapted to determine, after the first aggregating unit 722 aggregates to obtain a plurality of primary aggregation subsets of initial numbers, whether each subset in the plurality of primary aggregation subsets of initial numbers is an isolated subset that does not intersect with other primary aggregation subsets of initial numbers; if so, the initial number primary aggregation subset is output as a target initial number aggregation subset, and the first numbering unit 723 is triggered to renumber the remaining initial number primary aggregation subsets.
Further, the first filtering output unit 724 is further adapted to:
count, for each subset among the plurality of initial number primary aggregation subsets, the number of times it occurs and the number of elements it contains;
and if an initial number primary aggregation subset occurs 2 times and contains 1 element, determine that the initial number primary aggregation subset is an isolated subset.
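The counting rule can be sketched in plain Python as below, on hypothetical subsets. Since every initial number identifies one ID pair, it is contributed to two primary aggregation subsets, one per ID; a single-element subset that occurs twice therefore indicates a pair whose two IDs appear in no other pair.

```python
from collections import Counter

# Hypothetical primary aggregation subsets, one per ID (duplicates kept on purpose).
primary_subsets = [{1}, {1, 2}, {2}, {3}, {3}]

# Count how many times each distinct subset occurs.
occurrences = Counter(frozenset(s) for s in primary_subsets)

# Rule from the text: a subset that occurs 2 times and has 1 element is isolated.
isolated = [set(s) for s, count in occurrences.items() if count == 2 and len(s) == 1]
print(isolated)  # [{3}]
```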
In an alternative embodiment, still referring to fig. 8, the numbering relationship aggregation module 730 may include:
a second splitting unit 731, adapted to split the secondary number-initial number primary aggregation subset relation table into a secondary number-initial number-initial number primary aggregation subset relation table;
a second aggregation unit 732, connected to the second splitting unit 731, and adapted to aggregate the secondary number-initial number-initial number primary aggregation subset relation table with the initial number as key to obtain one or more initial number secondary aggregation subsets;
a second filtering output unit 733, connected to the second aggregation unit 732, adapted to filter and output the initial-number secondary aggregation subset having no intersection with other initial-number secondary aggregation subsets as a target initial-number aggregation subset;
the second numbering unit 734, connected to the second filtering output unit 733, is adapted to renumber the remaining initial number secondary aggregation subsets to obtain a relationship table of tertiary number-initial number secondary aggregation subsets;
the analogizing iteration unit 735, which may be connected to the second numbering unit 734 and the second splitting unit 731, is adapted to trigger the second splitting unit 731, the second aggregation unit 732, the second filtering output unit 733, and the second numbering unit 734 to perform n iterations until the number of remaining initial number (n+2)-th aggregation subsets is 0 or 1, and to output the remaining initial number (n+2)-th aggregation subsets as target initial number aggregation subsets, where n is a natural number; and
the output result integration unit 736 may be connected to the second filtering output unit 733 and the analogizing iteration unit 735, and is adapted to integrate the output target initial number aggregation subsets to obtain an initial number aggregation subset result, and number the initial number aggregation subset result by using the uniform identifier to obtain a relationship table of uniform identifier-initial number aggregation subset.
Preferably, the second aggregation unit 732 is further adapted to:
aggregate the secondary number-initial number-initial number primary aggregation subset relation table with the initial number as key to obtain one or more initial number secondary aggregation subsets and the corresponding secondary number aggregation subsets, where each secondary number aggregation subset consists of secondary numbers, and each initial number secondary aggregation subset is formed by merging the initial number primary aggregation subsets corresponding to the secondary numbers in the corresponding secondary number aggregation subset.
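The aggregation with the initial number as key can be sketched as follows. The input table is hypothetical, and the column names num1, num2, num2sets, and num1sets follow the naming used in the worked example; the grouping is written in plain Python as a stand-in for the Spark aggregation.

```python
from collections import defaultdict

# Hypothetical table: secondary number -> initial number primary aggregation subset.
secondary_table = {1: {1, 2, 3}, 2: {2, 3}, 3: {4, 5}}

# Split into (num1, num2, subset) rows, then aggregate with num1 as key.
num2sets = defaultdict(set)
num1sets = defaultdict(set)
for num2, subset in secondary_table.items():
    for num1 in subset:
        num2sets[num1].add(num2)   # secondary number aggregation subset
        num1sets[num1] |= subset   # merged initial number secondary aggregation subset

rows = [(num1, sorted(num2sets[num1]), sorted(num1sets[num1])) for num1 in sorted(num2sets)]
for row in rows:
    print(row)
```

Each output row pairs an initial number with the secondary numbers of the subsets that contain it and with the union of those subsets.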
Correspondingly, the second filter output unit 733 is further adapted to:
determine whether each subset among the one or more secondary number aggregation subsets is an isolated subset having no intersection with the other secondary number aggregation subsets and, if so, de-duplicate the initial number secondary aggregation subset corresponding to that secondary number aggregation subset and output it as a target initial number aggregation subset.
Further, the second filter output unit 733 is further adapted to:
count the number of elements contained in each secondary number aggregation subset;
and determine that a secondary number aggregation subset containing 1 element is an isolated subset.
Still further, the second filter output unit 733 is further adapted to:
after counting the number of elements contained in each secondary number aggregation subset, determine whether the number of elements contained in each secondary number aggregation subset is greater than a given threshold;
and if so, de-duplicate the initial number secondary aggregation subset corresponding to that secondary number aggregation subset and output it as a target initial number aggregation subset.
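The two output rules of the second filtering output unit (single-element subsets and subsets larger than a threshold) can be sketched together as below. The rows and the threshold value are hypothetical; the text only refers to "a given threshold" without fixing a value.

```python
# Hypothetical aggregated rows: (secondary number aggregation subset,
# initial number secondary aggregation subset).
rows = [
    ({3}, {4, 5}),
    ({3}, {4, 5}),
    ({1, 2}, {1, 2, 3}),
    ({4, 5, 6, 7}, {6, 7, 8, 9}),
]
THRESHOLD = 3  # hypothetical cut-off value

def is_output(num2set):
    """Output a row if its num2set has one element or exceeds the threshold."""
    return len(num2set) == 1 or len(num2set) > THRESHOLD

# De-duplicate the num1sets of the output rows and of the remaining rows.
targets = sorted({tuple(sorted(n1)) for n2, n1 in rows if is_output(n2)})
remaining = sorted({tuple(sorted(n1)) for n2, n1 in rows if not is_output(n2)})
print(targets, remaining)  # [(4, 5), (6, 7, 8, 9)] [(1, 2, 3)]
```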
In an alternative embodiment, still referring to fig. 8, the ID uniform representation module 740 may include:
a third splitting unit 741 adapted to split the uniform identifier-initial number aggregation subset relation table into an initial number-uniform identifier relation table;
a relation connection unit 742, connected to the third splitting unit 741, adapted to obtain a unified identifier-ID relation table according to the initial number-unified identifier relation table and the initial number-ID pair relation table; and
a third aggregating unit 743, connected to the relation connecting unit 742, and adapted to aggregate the unified identifier-ID relation table with the unified identifier as key to obtain a unified representation table of the ID.
Preferably, the relation connection unit 742 is further adapted to:
execute a leftOuterJoin command on the initial number-unified identifier relation table and the initial number-ID pair relation table, and separate the unified identifier and the ID through a map command to obtain the unified identifier-ID relation table.
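The join followed by the map step can be sketched in plain Python as follows. The helper left_outer_join is a hypothetical stand-in that mimics the (key, (left value, right value or None)) shape of a left outer join; the input rows are illustrative, and the final de-duplication is an assumption made so that each ID is listed once per unified identifier.

```python
# Hypothetical initial number-unified identifier rows: (num1, dmid).
num1_to_dmid = [(1, 1), (2, 1), (3, 2)]
# Hypothetical initial number-ID rows: (num1, id).
num1_to_id = [(1, "imei_a"), (1, "sn_a"), (2, "sn_a"), (2, "mac_a"), (3, "imei_b")]

def left_outer_join(left, right):
    """Join two (key, value) lists, keeping every left row."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    joined = []
    for key, left_value in left:
        for right_value in index.get(key, [None]):
            joined.append((key, (left_value, right_value)))
    return joined

joined = left_outer_join(num1_to_dmid, num1_to_id)

# Map step: drop the join key and keep only the (dmid, id) fields.
dmid_id_rows = [(dmid, id_) for _, (dmid, id_) in joined]

# Assumption: remove duplicate rows while preserving order.
dmid_to_id = list(dict.fromkeys(dmid_id_rows))
print(dmid_to_id)
# [(1, 'imei_a'), (1, 'sn_a'), (1, 'mac_a'), (2, 'imei_b')]
```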
In an alternative embodiment, the data pre-processing module 710 is further adapted to:
integrate a plurality of two-dimensional ID relationship source data tables into the two-dimensional ID relationship table including a plurality of ID pairs.
In an alternative embodiment, the numbering operations of the modules are performed by a zipWithUniqueId command.
In an alternative embodiment, the aggregation operations of the modules are performed by a reduceByKey command.
Based on the same inventive concept, the embodiment of the invention also provides a computer storage medium. The computer storage medium stores computer program code which, when run on a computing device, causes the computing device to perform a method of implementing ID mapping based on Spark framework as described in any one or combination of the above embodiments.
Based on the same inventive concept, the embodiment of the invention also provides the computing equipment. The computing device may include:
a processor; and
a memory storing computer program code;
the computer program code, when executed by a processor, causes the computing device to perform a method for implementing ID mapping based on a Spark framework according to any of the above embodiments or combinations thereof.
According to any one or a combination of multiple optional embodiments, the embodiment of the present invention can achieve the following advantages:
In the method and device for implementing ID mapping based on the Spark framework provided by the embodiments of the present invention, each ID pair in the obtained two-dimensional ID relationship table is first numbered to obtain an initial number-ID pair relationship table. The initial number-ID pair relationship table is then split and aggregated with the ID as key (ID relation aggregation) to obtain a plurality of initial number primary aggregation subsets. Next, the plurality of initial number primary aggregation subsets are split and aggregated with the initial number as key (number relation aggregation) to obtain an initial number aggregation subset result, which is numbered with unified identifiers to obtain a unified identifier-initial number aggregation subset relationship table. Finally, the correspondence between the unified identifier and the ID is obtained from the unified identifier-initial number aggregation subset relationship table and the initial number-ID pair relationship table, thereby realizing the unified representation of the user ID. The method implements the ID mapping algorithm on the Spark distributed computing framework and uses ideas from mathematical set theory to store, filter, split, and aggregate massive user data sets, thereby improving the efficiency, accuracy, and reliability of the ID mapping algorithm.
Further, the number relation aggregation with the initial number as key is implemented as an iterative operation. In each iteration, the initial number aggregation subsets that are obtained after aggregation and have no intersection with other initial number aggregation subsets (i.e., that need no further aggregation) are output, so the number of remaining subsets (i.e., the subsets to be split and aggregated in the next iteration) keeps shrinking, which further reduces memory overhead and improves operating efficiency.
It is clear to those skilled in the art that the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and for the sake of brevity, further description is omitted here.
In addition, the functional units in the embodiments of the present invention may be physically independent of each other, two or more functional units may be integrated together, or all the functional units may be integrated in one processing unit. The integrated functional units may be implemented in the form of hardware, or in the form of software or firmware.
Those of ordinary skill in the art will understand that the integrated functional units, if implemented in software and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions that, when executed, cause a computing device (e.g., a personal computer, a server, or a network device) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
Alternatively, all or part of the steps of implementing the foregoing method embodiments may be implemented by hardware (such as a computing device, e.g., a personal computer, a server, or a network device) associated with program instructions, which may be stored in a computer-readable storage medium, and when the program instructions are executed by a processor of the computing device, the computing device executes all or part of the steps of the method according to the embodiments of the present invention.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments can be modified or some or all of the technical features can be equivalently replaced within the spirit and principle of the present invention; such modifications or substitutions do not depart from the scope of the present invention.
According to an aspect of the embodiments of the present invention, there is provided A1. A method for implementing ID mapping based on a Spark framework, including:
step S1: acquiring a two-dimensional ID relation table comprising a plurality of ID pairs, numbering each ID pair, and acquiring an initial number-ID pair relation table;
step S2: taking the ID as a key, splitting and aggregating the initial number-ID pair relation table to obtain a plurality of initial number primary aggregation subsets, wherein each initial number primary aggregation subset is composed of the initial numbers;
step S3: taking the initial numbers as keys, splitting and aggregating the multiple initial number once aggregated subsets to obtain initial number aggregated subset results, wherein no intersection exists between any two initial number aggregated subsets in the initial number aggregated subset results; numbering the initial number aggregation subset results by using a uniform identifier to obtain a relationship table of uniform identifier-initial number aggregation subset;
step S4: and obtaining the corresponding relation between the uniform identifier and the ID according to the uniform identifier-initial number aggregation subset relation table and the initial number-ID pair relation table, so as to realize uniform representation of the ID.
A2. The method according to A1, wherein splitting and aggregating the plurality of initial number primary aggregation subsets with the initial number as key to obtain an initial number aggregation subset result includes:
taking the initial number as key and the plurality of initial number primary aggregation subsets as the objects to be split and aggregated in the first iteration, iteratively performing splitting and aggregation to obtain the initial number aggregation subset result; in each iteration, outputting the initial number aggregation subsets that are obtained after aggregation and have no intersection with other initial number aggregation subsets, and taking the remaining initial number aggregation subsets as the objects to be split and aggregated in the next iteration; when, in some iteration, the remaining initial number aggregation subsets can no longer be aggregated with one another, outputting the remaining initial number aggregation subsets and terminating the iteration; and integrating the initial number aggregation subsets output in all iterations to obtain the initial number aggregation subset result.
A3. The method according to A1 or A2, wherein step S2 specifically comprises:
splitting the initial number-ID pair relation table into an initial number-ID relation table;
aggregating the initial number-ID relation table by taking the ID as a key to obtain a plurality of initial number primary aggregation subsets, wherein each initial number primary aggregation subset is composed of the initial numbers;
and renumbering the plurality of initial number primary aggregation subsets to obtain a secondary number-initial number primary aggregation subset relation table.
A4. The method according to A3, wherein, after aggregating to obtain the plurality of initial number primary aggregation subsets, step S2 further comprises:
judging whether each subset in the multiple primary aggregation subsets with the initial numbers is an isolated subset without intersection with other primary aggregation subsets with the initial numbers;
if so, outputting the initial number primary aggregation subset as a target initial number aggregation subset, and renumbering the remaining initial number primary aggregation subsets.
A5. The method of A4, wherein determining whether each subset of the plurality of initial number primary aggregation subsets is an isolated subset that does not intersect with other initial number primary aggregation subsets comprises:
counting, for each initial number primary aggregation subset, the number of times it occurs and the number of elements it contains;
determining that an initial number primary aggregation subset that occurs 2 times and contains 1 element is an isolated subset.
A6. The method according to any one of A3-A5, wherein step S3 specifically comprises:
step S31: splitting the secondary number-initial number primary aggregation subset relation table into a secondary number-initial number-initial number primary aggregation subset relation table;
step S32: aggregating the secondary number-initial number-primary aggregation subset relation table by taking the initial number as a key to obtain one or more initial number secondary aggregation subsets;
step S33: filtering and outputting the initial number secondary aggregation subsets without intersection with other initial number secondary aggregation subsets as target initial number aggregation subsets;
step S34: carrying out duplicate removal on the remaining initial number secondary aggregation subsets, and numbering again to obtain a relation table of the third number-initial number secondary aggregation subsets;
step S35: repeating steps S31 to S34 for n iterations until the number of remaining initial number (n+2)-th aggregation subsets is 0 or 1, and outputting the remaining initial number (n+2)-th aggregation subsets as the target initial number aggregation subsets, wherein n is a natural number;
step S36: and integrating the target initial number aggregation subsets output in the previous step to obtain an initial number aggregation subset result, numbering the initial number aggregation subset result by using the uniform identifier, and obtaining a relationship table of uniform identifier-initial number aggregation subset.
A7. The method according to A6, wherein the step S32 specifically comprises:
aggregating the secondary number-initial number-initial number primary aggregation subset relation table with the initial number as key to obtain one or more initial number secondary aggregation subsets and the corresponding secondary number aggregation subsets, wherein each secondary number aggregation subset consists of secondary numbers, and each initial number secondary aggregation subset is formed by merging the initial number primary aggregation subsets corresponding to the secondary numbers in the corresponding secondary number aggregation subset;
step S33 specifically includes:
and judging whether each subset in the one or more secondary number aggregation subsets is an isolated subset without intersection with other secondary number aggregation subsets, if so, performing duplication elimination on the initial number secondary aggregation subset corresponding to the secondary number aggregation subset, and outputting the initial number secondary aggregation subset as a target initial number aggregation subset.
A8. The method of A7, wherein determining whether each of the one or more secondary number aggregation subsets is an isolated subset that does not intersect with other secondary number aggregation subsets comprises:
counting the number of elements contained in each secondary number aggregation subset;
and judging the secondary number aggregation subset with the element number of 1 as an isolated subset.
A9. The method according to A8, wherein, after counting the number of elements contained in each secondary number aggregation subset, step S33 further includes:
judging whether the number of elements contained in each secondary number aggregation subset is greater than a given threshold value or not;
and if so, performing duplicate removal on the initial number secondary aggregation subset corresponding to the secondary number aggregation subset and outputting the initial number secondary aggregation subset as a target initial number aggregation subset.
A10. The method according to any one of A1-A9, wherein step S4 specifically comprises:
splitting the uniform identifier-initial number aggregation subset relation table into an initial number-uniform identifier relation table;
obtaining a unified identifier-ID relation table according to the initial number-unified identifier relation table and the initial number-ID pair relation table;
and aggregating the uniform identifier-ID relation table by taking the uniform identifier as a key to obtain a uniform representation table of the ID.
A11. The method according to A10, wherein obtaining a unified identifier-ID relationship table from the initial number-unified identifier relationship table and the initial number-ID pair relationship table includes:
executing a leftOuterJoin command on the initial number-unified identifier relation table and the initial number-ID pair relation table, and separating the unified identifier and the ID through a map command to obtain the unified identifier-ID relation table.
A12. The method of any one of A1-A11, wherein obtaining a two-dimensional ID relationship table comprising a plurality of ID pairs comprises:
and integrating a plurality of two-dimensional ID relation source data tables into the two-dimensional ID relation table comprising a plurality of ID pairs.
A13. The method of any one of A1-A12, wherein the numbering operations are performed by a zipWithUniqueId command.
A14. The method of any one of A1-A13, wherein the aggregation operations are performed by a reduceByKey command.
According to another aspect of the embodiments of the present invention, there is also provided B15. An apparatus for implementing ID mapping based on a Spark framework, including:
the data preprocessing module is suitable for acquiring a two-dimensional ID relation table comprising a plurality of ID pairs, numbering each ID pair and acquiring an initial number-ID pair relation table;
the ID relation aggregation module is suitable for splitting and aggregating the initial number-ID pair relation table by taking the ID as a key to obtain a plurality of initial number primary aggregation subsets, wherein each initial number primary aggregation subset is composed of the initial numbers;
the number relation aggregation module is suitable for splitting and aggregating the multiple initial number once aggregated subsets by taking the initial numbers as keys to obtain initial number aggregated subset results, wherein no intersection exists between any two initial number aggregated subsets in the initial number aggregated subset results; numbering the initial number aggregation subset results by using a uniform identifier to obtain a relationship table of uniform identifier-initial number aggregation subset; and
and the ID unified representation module is suitable for obtaining the corresponding relation between the unified identifier and the ID according to the unified identifier-initial number aggregation subset relation table and the initial number-ID pair relation table, so as to realize the unified representation of the ID.
B16. The apparatus of B15, wherein the numbering relationship aggregation module is further adapted to:
taking the initial number as key and the plurality of initial number primary aggregation subsets as the objects to be split and aggregated in the first iteration, iteratively performing splitting and aggregation to obtain the initial number aggregation subset result; in each iteration, outputting the initial number aggregation subsets that are obtained after aggregation and have no intersection with other initial number aggregation subsets, and taking the remaining initial number aggregation subsets as the objects to be split and aggregated in the next iteration; when, in some iteration, the remaining initial number aggregation subsets can no longer be aggregated with one another, outputting the remaining initial number aggregation subsets and terminating the iteration; and integrating the initial number aggregation subsets output in all iterations to obtain the initial number aggregation subset result.
B17. The apparatus of B15 or B16, wherein the ID relation aggregation module comprises:
a first splitting unit adapted to split the initial number-ID pair relation table into an initial number-ID relation table;
a first aggregation unit adapted to aggregate the initial number-ID relation table with the ID as the key, to obtain a plurality of initial number primary aggregation subsets, wherein each initial number primary aggregation subset consists of initial numbers; and
a first numbering unit adapted to renumber the initial number primary aggregation subsets, to obtain a secondary number-initial number primary aggregation subset relation table.
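As a non-authoritative illustration of the three units in B17, the following sketch performs the split, the aggregation keyed by ID, and the renumbering on a small in-memory table. It uses plain Python instead of Spark; the sample pairs and the names `numbered_pairs`, `primary_subsets` and `secondary_table` are assumptions made for the example and do not appear in the source.

```python
from collections import defaultdict

# Hypothetical initial number-ID pair relation table: initial number -> (ID, ID).
numbered_pairs = {0: ("a", "b"), 1: ("b", "c"), 2: ("x", "y")}

# Split: each numbered pair yields two (ID, initial number) records.
id_number_records = [
    (id_, number)
    for number, pair in numbered_pairs.items()
    for id_ in pair
]

# Aggregate with the ID as the key: every ID collects the initial numbers
# of the pairs it appears in (the role an aggregation command plays on Spark).
by_id = defaultdict(set)
for id_, number in id_number_records:
    by_id[id_].add(number)
primary_subsets = [frozenset(numbers) for numbers in by_id.values()]

# Renumber: attach a secondary number to each primary aggregation subset.
secondary_table = dict(enumerate(primary_subsets))
```

Each primary aggregation subset here contains only initial numbers, which is the property the first aggregation unit requires.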
B18. The apparatus of B17, wherein the ID relation aggregation module further comprises:
a first filtering output unit adapted to judge, after the first aggregation unit has obtained the plurality of initial number primary aggregation subsets, whether each of those subsets is an isolated subset having no intersection with any other initial number primary aggregation subset;
and, if so, to output that initial number primary aggregation subset as a target initial number aggregation subset and to trigger the first numbering unit to renumber the remaining initial number primary aggregation subsets.
B19. The apparatus of B18, wherein the first filtering output unit is further adapted to:
count the number of occurrences of, and the number of elements contained in, each initial number primary aggregation subset; and
judge an initial number primary aggregation subset whose number of occurrences is 2 and whose number of elements is 1 to be an isolated subset.
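The counting rule in B19 can be sketched as follows. One reading of it, offered here as an assumption, is that an ID pair contributes two IDs, so if neither ID appears in any other pair, the same single-element subset is produced twice. The sample data and variable names are hypothetical, and plain Python stands in for Spark.

```python
from collections import Counter

# Hypothetical primary aggregation subsets, one per ID (frozensets of initial numbers).
primary_subsets = [
    frozenset({0}), frozenset({0, 1}), frozenset({1}),  # pairs 0 and 1 share an ID
    frozenset({2}), frozenset({2}),                     # pair 2 shares no ID with any other pair
]

# Count how often each distinct subset occurs.
occurrences = Counter(primary_subsets)

# Isolated: the subset occurs twice and holds exactly one initial number.
isolated = {s for s, count in occurrences.items() if count == 2 and len(s) == 1}

# Everything else is kept for renumbering and further aggregation.
remaining = [s for s in primary_subsets if s not in isolated]
```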
B20. The apparatus of any one of B17-B19, wherein the number relation aggregation module comprises:
a second splitting unit adapted to split the secondary number-initial number primary aggregation subset relation table into a secondary number-initial number primary aggregation subset relation table;
a second aggregation unit adapted to aggregate the secondary number-initial number primary aggregation subset relation table with the initial number as the key, to obtain one or more initial number secondary aggregation subsets;
a second filtering output unit adapted to filter out, and output as target initial number aggregation subsets, the initial number secondary aggregation subsets that have no intersection with any other initial number secondary aggregation subset;
a second numbering unit adapted to renumber the remaining initial number secondary aggregation subsets, to obtain a tertiary number-initial number secondary aggregation subset relation table;
an iteration unit adapted to trigger the second splitting unit, the second aggregation unit, the second filtering output unit and the second numbering unit to perform n further iterations in the same manner, until the number of remaining initial number (n+2)-th aggregation subsets is 0 or 1, and to output the remaining initial number (n+2)-th aggregation subsets as target initial number aggregation subsets, wherein n is a natural number; and
an output result integration unit adapted to combine the output target initial number aggregation subsets to obtain the initial number aggregation subset result, and to number the initial number aggregation subset result with unified identifiers, to obtain the unified identifier-initial number aggregation subset relation table.
B21. The apparatus of B20, wherein the second aggregation unit is further adapted to:
aggregate the secondary number-initial number primary aggregation subset relation table with the initial number as the key, to obtain one or more initial number secondary aggregation subsets and the corresponding secondary number aggregation subsets, wherein each secondary number aggregation subset consists of secondary numbers, and each initial number secondary aggregation subset is formed by merging the initial number primary aggregation subsets corresponding to the secondary numbers in the corresponding secondary number aggregation subset;
and the second filtering output unit is further adapted to:
judge whether each of the one or more secondary number aggregation subsets is an isolated subset having no intersection with any other secondary number aggregation subset, and, if so, deduplicate the initial number secondary aggregation subset corresponding to that secondary number aggregation subset and output it as a target initial number aggregation subset.
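A single round of the aggregation and filtering described in B21 might look like the sketch below. It is an in-memory approximation in plain Python, not the Spark implementation; the secondary numbers 10 to 12 and all variable names are assumptions. Isolation is tested directly, by checking that a secondary number aggregation subset shares no secondary number with any other distinct one.

```python
from collections import defaultdict

# Hypothetical secondary number -> initial number primary aggregation subset table.
secondary_table = {
    10: {0, 1}, 11: {1, 2},   # these two overlap on initial number 1
    12: {3, 4},               # shares nothing with the others
}

# Split into (initial number, secondary number) records, then aggregate with the
# initial number as the key, collecting the secondary numbers seen for it.
secondary_sets = defaultdict(set)
for secondary, subset in secondary_table.items():
    for initial in subset:
        secondary_sets[initial].add(secondary)

# Map each distinct secondary number aggregation subset to the merged
# (and, being a set, deduplicated) initial numbers it corresponds to.
merged = {
    frozenset(group): frozenset().union(*(secondary_table[s] for s in group))
    for group in secondary_sets.values()
}

# Isolated: no secondary number in common with any other distinct group.
targets = [
    initials
    for group, initials in merged.items()
    if all(group.isdisjoint(other) for other in merged if other != group)
]
```

Only the subset built from secondary number 12 is output in this round; the overlapping subsets stay behind for the next iteration.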
B22. The apparatus of B21, wherein the second filtering output unit is further adapted to:
count the number of elements contained in each secondary number aggregation subset; and
judge a secondary number aggregation subset containing exactly 1 element to be an isolated subset.
B23. The apparatus of B22, wherein the second filtering output unit is further adapted to:
judge, after counting the number of elements contained in each secondary number aggregation subset, whether that number is greater than a given threshold; and
if so, deduplicate the initial number secondary aggregation subset corresponding to that secondary number aggregation subset and output it as a target initial number aggregation subset.
B24. The apparatus of any one of B15-B23, wherein the ID unified representation module comprises:
a third splitting unit adapted to split the unified identifier-initial number aggregation subset relation table into an initial number-unified identifier relation table;
a relation connection unit adapted to obtain a unified identifier-ID relation table from the initial number-unified identifier relation table and the initial number-ID pair relation table; and
a third aggregation unit adapted to aggregate the unified identifier-ID relation table with the unified identifier as the key, to obtain a unified representation table of the IDs.
B25. The apparatus according to B24, wherein the relation connection unit is further adapted to:
execute a leftOuterJoin command on the initial number-unified identifier relation table and the initial number-ID pair relation table, and separate the unified identifier and the ID by a map command, to obtain the unified identifier-ID relation table.
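The join-then-map step of B25 can be illustrated with a pure-Python stand-in for a keyed left outer join. The helper `left_outer_join`, the identifiers `U0` and `U1`, and the sample tables are hypothetical; on Spark the join and map commands named in the claim would take their place.

```python
def left_outer_join(left, right):
    """Pure-Python stand-in for a keyed left outer join over (key, value) records."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    # Every left record is kept; a missing match is represented by None.
    return [
        (key, (value, match))
        for key, value in left
        for match in index.get(key, [None])
    ]

# Hypothetical tables, both keyed by initial number.
number_to_uid = [(0, "U0"), (1, "U0"), (2, "U1")]
number_to_pair = [(0, ("a", "b")), (1, ("b", "c")), (2, ("x", "y"))]

joined = left_outer_join(number_to_uid, number_to_pair)

# "map" step: drop the initial number and emit one (unified identifier, ID) record per ID.
uid_id = [
    (uid, id_)
    for _, (uid, pair) in joined
    if pair is not None
    for id_ in pair
]
```

The resulting records may contain repeats (ID `b` belongs to two pairs), which a later aggregation keyed by the unified identifier can collapse.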
B26. The apparatus of any one of B15-B25, wherein the data preprocessing module is further adapted to:
integrate a plurality of two-dimensional ID relation source data tables into the two-dimensional ID relation table comprising a plurality of ID pairs.
B27. The apparatus of any one of B15-B26, wherein the numbering operations are performed by a zipWithUniqueId command.
B28. The apparatus of any one of B15-B27, wherein the aggregation operations are performed by a reduceByKey command.
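To make the two commands named in B27 and B28 concrete, the sketch below emulates their behaviour in plain Python. In Spark, zipWithUniqueId gives the items of partition k the ids k, n+k, 2n+k, and so on for n partitions, so the ids are unique but not necessarily consecutive; reduceByKey combines the values of each key with an associative function. The helper names and the sample partitions are assumptions for illustration only.

```python
from functools import reduce

def zip_with_unique_id(partitions):
    """Assign ids k, n+k, 2n+k, ... to the items of partition k (n partitions)."""
    n = len(partitions)
    return [
        (item, k + i * n)
        for k, partition in enumerate(partitions)
        for i, item in enumerate(partition)
    ]

def reduce_by_key(records, func):
    """Combine all values that share a key using an associative function."""
    grouped = {}
    for key, value in records:
        grouped.setdefault(key, []).append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

# Hypothetical ID pairs spread over two partitions.
partitions = [[("a", "b"), ("b", "c")], [("x", "y")]]
numbered = zip_with_unique_id(partitions)

# Aggregate initial numbers with the ID as the key, using set union as the reducer.
records = [(id_, {number}) for pair, number in numbered for id_ in pair]
by_id = reduce_by_key(records, lambda left, right: left | right)
```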
According to yet another aspect of the embodiments of the present invention, there is also provided a computer storage medium storing computer program code which, when run on a computing device, causes the computing device to perform the method for implementing ID mapping based on the Spark framework according to any one of A1-A14.
According to yet another aspect of the embodiments of the present invention, there is also provided a computing device, comprising:
a processor; and
a memory storing computer program code;
wherein the computer program code, when executed by the processor, causes the computing device to perform the method for implementing ID mapping based on the Spark framework according to any one of A1-A14.

Claims (10)

1. A method for implementing ID mapping based on the Spark framework, comprising:
step S1: acquiring a two-dimensional ID relation table comprising a plurality of ID pairs, and numbering each ID pair to obtain an initial number-ID pair relation table;
step S2: splitting and aggregating the initial number-ID pair relation table with the ID as the key, to obtain a plurality of initial number primary aggregation subsets, wherein each initial number primary aggregation subset consists of initial numbers;
step S3: splitting and aggregating the plurality of initial number primary aggregation subsets with the initial number as the key, to obtain an initial number aggregation subset result, wherein no two initial number aggregation subsets in the result intersect; and numbering the initial number aggregation subset result with unified identifiers, to obtain a unified identifier-initial number aggregation subset relation table;
step S4: obtaining the correspondence between unified identifiers and IDs from the unified identifier-initial number aggregation subset relation table and the initial number-ID pair relation table, thereby achieving a unified representation of the IDs.
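Steps S1 to S4 can be traced end to end on a toy input with the sketch below. It is a simplified, single-machine illustration in plain Python rather than the claimed Spark procedure: in particular, step S3 is replaced by a naive incremental merge of overlapping subsets, and the function name `id_mapping` and the `U0`, `U1` identifier format are assumptions.

```python
from collections import defaultdict

def id_mapping(id_pairs):
    # S1: number each ID pair.
    numbered = dict(enumerate(id_pairs))

    # S2: split to (ID, initial number) and aggregate with the ID as the key.
    by_id = defaultdict(set)
    for number, pair in numbered.items():
        for id_ in pair:
            by_id[id_].add(number)

    # S3 (simplified): merge overlapping subsets so the result is pairwise disjoint.
    groups = []
    for subset in by_id.values():
        subset = set(subset)
        overlapping = [g for g in groups if g & subset]
        for g in overlapping:
            subset |= g
            groups.remove(g)
        groups.append(subset)
    uid_table = {f"U{i}": group for i, group in enumerate(groups)}

    # S4: replace initial numbers by the IDs of the pairs they denote.
    return {
        uid: {id_ for number in group for id_ in numbered[number]}
        for uid, group in uid_table.items()
    }

result = id_mapping([("a", "b"), ("b", "c"), ("x", "y")])
```

With this input, IDs `a`, `b` and `c` end up under one unified identifier and `x`, `y` under another, since the first two pairs share ID `b`.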
2. The method of claim 1, wherein splitting and aggregating the plurality of initial number primary aggregation subsets with the initial number as the key to obtain the initial number aggregation subset result comprises:
performing, with the initial number as the key, iterative splitting and aggregation operations, taking the plurality of initial number primary aggregation subsets as the splitting and aggregation objects of the first iteration, to obtain the initial number aggregation subset result; wherein, in each iteration, the initial number aggregation subsets obtained after aggregation that have no intersection with any other initial number aggregation subset are output, and the remaining initial number aggregation subsets are taken as the splitting and aggregation objects of the next iteration; and when, in some iteration, the remaining initial number aggregation subsets can no longer be aggregated with one another, the remaining initial number aggregation subsets are output, the iteration terminates, and the initial number aggregation subsets output in all iterations are combined to obtain the initial number aggregation subset result.
3. The method according to claim 1 or 2, wherein step S2 specifically comprises:
splitting the initial number-ID pair relation table into an initial number-ID relation table;
aggregating the initial number-ID relation table with the ID as the key, to obtain a plurality of initial number primary aggregation subsets, wherein each initial number primary aggregation subset consists of initial numbers; and
renumbering the initial number primary aggregation subsets, to obtain a secondary number-initial number primary aggregation subset relation table.
4. The method according to claim 3, wherein, after the aggregation that obtains the plurality of initial number primary aggregation subsets, step S2 further comprises:
judging whether each of the plurality of initial number primary aggregation subsets is an isolated subset having no intersection with any other initial number primary aggregation subset; and
if so, outputting that initial number primary aggregation subset as a target initial number aggregation subset, and renumbering the remaining initial number primary aggregation subsets.
5. The method of claim 4, wherein judging whether each of the plurality of initial number primary aggregation subsets is an isolated subset having no intersection with any other initial number primary aggregation subset comprises:
counting the number of occurrences of, and the number of elements contained in, each initial number primary aggregation subset; and
judging an initial number primary aggregation subset whose number of occurrences is 2 and whose number of elements is 1 to be an isolated subset.
6. The method according to any one of claims 3-5, wherein step S3 specifically comprises:
step S31: splitting the secondary number-initial number primary aggregation subset relation table into a secondary number-initial number primary aggregation subset relation table;
step S32: aggregating the secondary number-initial number primary aggregation subset relation table with the initial number as the key, to obtain one or more initial number secondary aggregation subsets;
step S33: filtering out, and outputting as target initial number aggregation subsets, the initial number secondary aggregation subsets that have no intersection with any other initial number secondary aggregation subset;
step S34: deduplicating the remaining initial number secondary aggregation subsets and renumbering them, to obtain a tertiary number-initial number secondary aggregation subset relation table;
step S35: repeating steps S31 to S34 for n iterations until the number of remaining initial number (n+2)-th aggregation subsets is 0 or 1, and outputting the remaining initial number (n+2)-th aggregation subsets as target initial number aggregation subsets, wherein n is a natural number;
step S36: combining the target initial number aggregation subsets output in the preceding steps to obtain the initial number aggregation subset result, and numbering the initial number aggregation subset result with unified identifiers, to obtain the unified identifier-initial number aggregation subset relation table.
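The loop formed by steps S31 to S35 can be approximated in memory as shown below. This is a sketch under stated assumptions, not the claimed Spark procedure: the function name `aggregate_until_disjoint` is hypothetical, the input subsets are deduplicated before the first round, and isolation is tested by checking that a group of subset numbers shares no number with any other distinct group.

```python
from collections import defaultdict

def aggregate_until_disjoint(primary_subsets):
    """Iterate split, aggregate, filter and renumber until at most one subset remains."""
    targets = []
    # Renumbered table: number -> subset of initial numbers (duplicates dropped).
    current = dict(enumerate({frozenset(s) for s in primary_subsets}))
    while len(current) > 1:
        # S31/S32: split and aggregate with the initial number as the key.
        groups = defaultdict(set)
        for number, subset in current.items():
            for initial in subset:
                groups[initial].add(number)
        merged = {
            frozenset(g): frozenset().union(*(current[n] for n in g))
            for g in groups.values()
        }
        # S33: output merged subsets whose group shares no number with another group.
        isolated = {
            g for g in merged
            if all(g.isdisjoint(o) for o in merged if o != g)
        }
        targets.extend(merged[g] for g in isolated)
        # S34: deduplicate what remains and renumber it for the next round.
        remaining = {merged[g] for g in merged if g not in isolated}
        current = dict(enumerate(remaining))
    # S35: zero or one subset is left; output it as a target as well.
    targets.extend(current.values())
    return targets

targets = aggregate_until_disjoint([{0}, {0, 1}, {1}, {2}, {2}])
```

In each round every surviving subset absorbs all subsets it overlaps, so overlapping subsets converge to one shared subset, which is then either output as isolated or left as the final remaining subset.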
7. The method according to claim 6, wherein step S32 specifically comprises:
aggregating the secondary number-initial number primary aggregation subset relation table with the initial number as the key, to obtain one or more initial number secondary aggregation subsets and the corresponding secondary number aggregation subsets, wherein each secondary number aggregation subset consists of secondary numbers, and each initial number secondary aggregation subset is formed by merging the initial number primary aggregation subsets corresponding to the secondary numbers in the corresponding secondary number aggregation subset;
and step S33 specifically comprises:
judging whether each of the one or more secondary number aggregation subsets is an isolated subset having no intersection with any other secondary number aggregation subset, and, if so, deduplicating the initial number secondary aggregation subset corresponding to that secondary number aggregation subset and outputting it as a target initial number aggregation subset.
8. An apparatus for implementing ID mapping based on the Spark framework, comprising:
a data preprocessing module adapted to acquire a two-dimensional ID relation table comprising a plurality of ID pairs, and to number each ID pair to obtain an initial number-ID pair relation table;
an ID relation aggregation module adapted to split and aggregate the initial number-ID pair relation table with the ID as the key, to obtain a plurality of initial number primary aggregation subsets, wherein each initial number primary aggregation subset consists of initial numbers;
a number relation aggregation module adapted to split and aggregate the plurality of initial number primary aggregation subsets with the initial number as the key, to obtain an initial number aggregation subset result, wherein no two initial number aggregation subsets in the result intersect; and to number the initial number aggregation subset result with unified identifiers, to obtain a unified identifier-initial number aggregation subset relation table; and
an ID unified representation module adapted to obtain the correspondence between unified identifiers and IDs from the unified identifier-initial number aggregation subset relation table and the initial number-ID pair relation table, thereby achieving a unified representation of the IDs.
9. A computer storage medium storing computer program code which, when run on a computing device, causes the computing device to perform the method for implementing ID mapping based on the Spark framework according to any one of claims 1-7.
10. A computing device, comprising:
a processor; and
a memory storing computer program code;
wherein the computer program code, when executed by the processor, causes the computing device to perform the method for implementing ID mapping based on the Spark framework according to any one of claims 1-7.
CN201910199055.2A 2019-03-15 2019-03-15 Method and device for realizing ID mapping based on Spark framework Pending CN111694876A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910199055.2A CN111694876A (en) 2019-03-15 2019-03-15 Method and device for realizing ID mapping based on Spark framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910199055.2A CN111694876A (en) 2019-03-15 2019-03-15 Method and device for realizing ID mapping based on Spark framework

Publications (1)

Publication Number Publication Date
CN111694876A true CN111694876A (en) 2020-09-22

Family

ID=72475317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910199055.2A Pending CN111694876A (en) 2019-03-15 2019-03-15 Method and device for realizing ID mapping based on Spark framework

Country Status (1)

Country Link
CN (1) CN111694876A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023280207A1 (en) * 2021-07-07 2023-01-12 清华大学 Data processing method, execution workstation, distributed computing system, and storage medium

Similar Documents

Publication Publication Date Title
CN107807982B (en) Consistency checking method and device for heterogeneous database
JP5635691B2 (en) Data analysis using multiple systems
CN104036187B (en) Method and system for determining computer virus types
WO2019052162A1 (en) Method, apparatus and device for improving data cleaning efficiency, and readable storage medium
CN111625561B (en) Data query method and device
CN104778123A (en) Method and device for detecting system performance
WO2020088262A1 (en) Data analysis method and device, and storage medium
CN113051448A (en) Data processing method and device, electronic equipment and storage medium
CN109359109B (en) Data processing method and system based on distributed stream computing
WO2019061667A1 (en) Electronic apparatus, data processing method and system, and computer-readable storage medium
CN111694876A (en) Method and device for realizing ID mapping based on Spark framework
CN107798007B (en) Distributed database data verification method, device and related device
CN110876072A (en) Batch registered user identification method, storage medium, electronic device and system
CN110362540B (en) Data storage and visitor number acquisition method and device
CN112948460A (en) Method and device for screening network flow data and computer readable storage medium
CN116628215A (en) Data asset management method, control device and readable storage medium
WO2019153546A1 (en) Ten-thousand-level dimension data generation method, apparatus and device, and storage medium
CN108090095B (en) Method and device for reconstructing database in batches
CN110866037B (en) Message filtering method and device
CN110489460B (en) Optimization method and system for rapid statistics
CN111158994A (en) Pressure testing performance testing method and device
CN111881110A (en) Data migration method and device
CN113419896A (en) Data recovery method and device, electronic equipment and computer readable medium
CN114185890B (en) Database retrieval method and device, storage medium and electronic equipment
CN117610815A (en) Resource quota data processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination