CN109241419B - ID data network data analysis method and device and computing equipment - Google Patents
ID data network data analysis method and device and computing equipment
- Publication number
- CN109241419B CN201810973827.9A
- Authority
- CN
- China
- Prior art keywords
- data
- relationship
- directed
- packet
- relation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention discloses an ID data network data analysis method and device, a computing device and a computer storage medium. The method comprises: acquiring an ID data network containing ID data and the association relationships between the ID data, the ID data including user ID data and/or device ID data; constructing ID relationship data according to the ID data contained in the ID data network and the association relationships between the ID data, the ID relationship data including a plurality of ID relationship pairs, each of which includes two IDs and the relationship between the two IDs; and grouping the ID relationship data to obtain a plurality of ID data subnets. This technical scheme improves the efficiency of data analysis on the ID data network: the plurality of ID data subnets can be obtained accurately and quickly, the ID data network is effectively divided, and the ID data contained in each ID data subnet have a strong and reliable association relationship, so that they can be identified as ID data of the same user, which facilitates the construction of a complete and effective user portrait.
Description
Technical Field
The invention relates to the field of internet technology, and in particular to an ID data network data analysis method and device, a computing device and a computer storage medium.
Background
To meet the different needs of users, many services have been developed for users to choose from, such as internet browsing, shopping, food ordering, train ticket booking and payment. A service assigns ID data to a user, based on the user's account in the service or the device the user uses, in order to identify the user. An ID data network can be constructed from the ID data of a plurality of services, and user characteristics such as gender, age, browsing preference, clicking preference, activity level, item purchasing preference, item purchasing potential and game preference can be analyzed on the basis of the ID data network, so that a complete and effective user portrait is constructed and accurate recommendation of news, games, advertisements and the like is realized. However, the ID data of multiple services are numerous, the association relationships between ID data are complex, the amount of data to be processed is large, and different services use different rules for setting ID data, so that the ID data corresponding to the same user cannot be identified accurately and quickly from the large amount of ID data contained in the ID data network.
Disclosure of Invention
In view of the above, the present invention has been made to provide an ID data network data analysis method, apparatus, computing device and computer storage medium that overcome or at least partially address the above-mentioned problems.
According to an aspect of the present invention, there is provided an ID data network data analysis method, including: acquiring an ID data network containing ID data and the association relationships between the ID data, the ID data including user ID data and/or device ID data; constructing ID relationship data according to the ID data contained in the ID data network and the association relationships between the ID data, the ID relationship data including a plurality of ID relationship pairs, each ID relationship pair including two IDs and the relationship between the two IDs; and grouping the ID relationship data to obtain a plurality of ID data subnets.
According to another aspect of the present invention, there is provided an ID data network data analysis apparatus, including: an acquisition module adapted to acquire an ID data network containing ID data and the association relationships between the ID data, the ID data including user ID data and/or device ID data; a second construction module adapted to construct ID relationship data according to the ID data contained in the ID data network and the association relationships between the ID data, the ID relationship data including a plurality of ID relationship pairs, each ID relationship pair including two IDs and the relationship between the two IDs; and a grouping module adapted to group the ID relationship data to obtain a plurality of ID data subnets.
According to yet another aspect of the present invention, there is provided a computing device comprising a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the ID data network data analysis method described above.
According to still another aspect of the present invention, there is provided a computer storage medium having at least one executable instruction stored therein, the executable instruction causing a processor to perform operations corresponding to the ID data network data analysis method as described above.
According to the technical scheme provided by the invention, ID relationship data can be constructed based on the ID data contained in the ID data network and the association relationships between the ID data, and the ID relationship data are grouped, so that a plurality of ID data subnets can be obtained accurately and quickly. Because the data volume of an ID data subnet is far smaller than that of the ID data network, user characteristics can be analyzed accurately and quickly based on the ID data subnets, and a complete and effective user portrait can be constructed, so that accurate recommendation of news, games, advertisements and the like is realized.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a flow diagram of an ID data network processing method according to one embodiment of the invention;
FIG. 2a shows a schematic flow diagram of an ID data network processing method according to another embodiment of the invention;
FIG. 2b shows a schematic of an ID data network;
FIG. 3 shows a flow diagram of an ID data network pruning preprocessing method according to one embodiment of the present invention;
FIG. 4 shows a flow diagram of a method of ID data network data analysis according to one embodiment of the invention;
FIG. 5a shows a schematic flow diagram of a method for ID data network data analysis according to another embodiment of the present invention;
FIG. 5b shows a process diagram for forward and reverse directed ordering of ID relationship pairs;
FIG. 6 shows a flow diagram of an ID data subnet processing method according to one embodiment of the invention;
FIG. 7 shows a block diagram of an ID data network processing apparatus according to an embodiment of the present invention;
FIG. 8 shows a block diagram of an ID data network pruning preprocessing apparatus according to an embodiment of the present invention;
FIG. 9 shows a block diagram of an ID data network data analysis apparatus according to an embodiment of the present invention;
FIG. 10 shows a block diagram of an ID data network data analysis apparatus according to another embodiment of the present invention;
FIG. 11 shows a block diagram of an ID data subnet processing apparatus according to an embodiment of the present invention;
FIG. 12 shows a schematic structural diagram of a computing device according to an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Fig. 1 shows a flow diagram of an ID data network processing method according to an embodiment of the present invention. As shown in fig. 1, the method includes the following steps:
Step S100, an ID data network containing ID data and the association relationships between the ID data is acquired.
The ID data network may be constructed from log data of a plurality of services, and contains ID data and the association relationships between the ID data. ID data is data used to identify a user, and may include user ID data and/or device ID data. Association relationships exist between ID data, and they include direct association relationships and indirect association relationships.
Specifically, user ID data refers to the account data of the user in a service, such as a mobile phone number, a WeChat ID, a QQ number or a browser ID. For example, if a user logs in to the WeChat application and the QQ application with the mobile phone number "189 × 2677", the user's WeChat ID in the WeChat application is "wxid_1", and the user's QQ number in the QQ application is "12345", then the mobile phone number "189 × 2677" has a direct association relationship with the WeChat ID "wxid_1", and the mobile phone number "189 × 2677" also has a direct association relationship with the QQ number "12345".
Device ID data is identification data of the device used when the user uses a service, and includes, for example, an MD5 value of the device number of a mobile device, an MD5 value of the device number of the mobile device + system program version number + mobile phone serial number, 32 bits in the MD5 value of the MAC address of the mobile device, 44 bits in the MD5 value of the MAC address of the mobile device, and the like. Different services use different rules for setting device ID data. If the same device uses the same service with a plurality of user ID data, the device ID data that the service assigns to the device has an association relationship with each of those user ID data. For example, if a mobile phone logs in to the WeChat application with the WeChat ID "wxid_1" and with the WeChat ID "wxid_2", and the WeChat application marks the device ID data of the mobile phone as "m1", then the device ID data "m1" has a direct association relationship with the WeChat ID "wxid_1", and the device ID data "m1" also has a direct association relationship with the WeChat ID "wxid_2".
Step S101, data analysis is performed on the ID data network to obtain a plurality of ID data subnets.
The ID data network is divided into a plurality of ID data subnets by performing data analysis on the ID data contained in the ID data network and the association relationships between the ID data. The plurality of ID data subnets may be divided into n ID data subnet sets according to the number of ID data contained in each ID data subnet, where n is a natural number greater than 0. ID data subnets in different ID data subnet sets contain different numbers of ID data. For example, if the plurality of ID data subnets include 200 ID data subnets each containing 2 pieces of ID data, 300 ID data subnets each containing 3 pieces of ID data, and 100 ID data subnets each containing 4 pieces of ID data, the plurality of ID data subnets may be divided into 3 ID data subnet sets: the 200 ID data subnets containing 2 pieces of ID data form a first ID data subnet set, the 300 ID data subnets containing 3 pieces of ID data form a second ID data subnet set, and the 100 ID data subnets containing 4 pieces of ID data form a third ID data subnet set.
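The division into ID data subnet sets described above can be sketched as follows. This is a minimal illustration only; the function name `group_subnets_by_size` and the sample subnets are hypothetical and do not come from the original text.

```python
from collections import defaultdict


def group_subnets_by_size(subnets):
    """Group ID data subnets into sets keyed by the number of IDs they contain."""
    sets_by_size = defaultdict(list)
    for subnet in subnets:
        sets_by_size[len(subnet)].append(subnet)
    return dict(sets_by_size)


# Hypothetical subnets; each subnet is modelled as a set of ID strings.
subnets = [
    {"a1", "b1"},
    {"a2", "b2", "c2"},
    {"x1", "y1"},
    {"a3", "b3", "c3", "d3"},
]
subnet_sets = group_subnets_by_size(subnets)
```

Here `subnet_sets` has one entry per distinct subnet size, so n equals the number of keys in the resulting dictionary.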
Compared with the ID data network as a whole, the ID data contained in an ID data subnet have a stronger and more reliable association relationship, and can be identified as ID data of the same user. In addition, the number of ID data contained in an ID data subnet is far smaller than the number contained in the ID data network, so the data volume of the ID data subnet is far smaller than that of the ID data network. Based on the ID data subnets, user characteristics such as gender, age, browsing preference, clicking preference, activity level, item purchasing preference, item purchasing potential and game preference can be analyzed accurately and quickly to construct a complete and effective user portrait.
According to the ID data network processing method provided by this embodiment, data analysis is performed on the ID data contained in the ID data network and the association relationships between the ID data, so that the ID data network can be quickly divided into a plurality of ID data subnets. Compared with the ID data network as a whole, the ID data contained in an ID data subnet have a stronger and more reliable association relationship and can be identified as ID data of the same user. Because the data volume of an ID data subnet is far smaller than that of the ID data network, user characteristics can be analyzed accurately and quickly based on the ID data subnets, and a complete and effective user portrait can be constructed, so that accurate recommendation of news, games, advertisements and the like is realized.
Fig. 2a shows a schematic flow diagram of an ID data network processing method according to another embodiment of the present invention. As shown in fig. 2a, the method includes the following steps:
Step S200, data analysis is performed on the log data of a plurality of services to determine the ID data and the association relationships between the ID data.
Log data of a plurality of services are acquired; the log data may be uploaded actively by the services or obtained by sending requests to the services. The log data of a service records the ID data used with the service and other ID data, and describes the association relationships between the ID data used with the service and the other ID data. Specifically, the ID data may include user ID data and/or device ID data.
Step S201, the ID data are used as nodes, the connection relationships between the nodes are determined according to the association relationships between the ID data, and an ID data network is constructed.
After the ID data and the association relationships between the ID data are determined, an ID data network can be constructed from them. Specifically, each piece of ID data is used as a node, and the connection relationships between the nodes are determined according to the association relationships between the ID data. The resulting ID data network contains the ID data and the association relationships between the ID data, and clearly shows how each piece of ID data is associated with the others.
Assume that the determined ID data include "a1", "b1", "a2", "b2", "c2", "a3", "b3", "c3", "d3", "a4", "b4", "c4", "d4", "e4", "f4", "g4" and "h4", and that direct association relationships exist between ID data "a1" and "b1"; between "a2" and "b2" and between "a2" and "c2"; between "a3" and "b3", between "a3" and "c3" and between "c3" and "d3"; between "a4" and "b4", between "a4" and "c4" and between "a4" and "f4"; between "b4" and "d4", between "b4" and "e4" and between "b4" and "h4"; and between "e4" and "g4". Indirect association relationships then exist between, for example, ID data "b2" and "c2" and between ID data "a3" and "d3". Each piece of ID data is used as a node in the ID data network, and the nodes are connected according to the association relationships between the ID data: node a1 is connected with node b1; node a2 is connected with nodes b2 and c2; node a3 is connected with nodes b3 and c3, and node c3 is connected with node d3; node a4 is connected with nodes b4, c4 and f4; node b4 is connected with nodes d4, e4 and h4; and node e4 is connected with node g4. The resulting ID data network 210 is shown in fig. 2b.
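As a rough illustration of step S201, the sketch below builds an ID data network as an adjacency map and distinguishes direct from indirect association relationships. The edge list is one reading of the node connections described for fig. 2b and should be treated as an assumption, as should the helper names `build_network`, `directly_associated` and `indirectly_associated`.

```python
from collections import defaultdict, deque

# Edge list as read from the description of fig. 2b (an assumption of this sketch).
EDGES = [
    ("a1", "b1"),
    ("a2", "b2"), ("a2", "c2"),
    ("a3", "b3"), ("a3", "c3"), ("c3", "d3"),
    ("a4", "b4"), ("a4", "c4"), ("a4", "f4"),
    ("b4", "d4"), ("b4", "e4"), ("b4", "h4"),
    ("e4", "g4"),
]


def build_network(edges):
    """Use each ID as a node and each direct association as an undirected edge."""
    adjacency = defaultdict(set)
    for left, right in edges:
        adjacency[left].add(right)
        adjacency[right].add(left)
    return dict(adjacency)


def directly_associated(network, x, y):
    """True if x and y share an edge."""
    return y in network.get(x, set())


def indirectly_associated(network, x, y):
    """True if y is reachable from x, but only through intermediate nodes."""
    if x == y or directly_associated(network, x, y):
        return False
    seen, queue = {x}, deque([x])
    while queue:  # breadth-first search from x
        node = queue.popleft()
        for neighbor in network.get(node, set()):
            if neighbor == y:
                return True
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


network = build_network(EDGES)
```

With this edge list, "b2" and "c2" are indirectly associated through "a2", while IDs in different connected parts of the network, such as "a1" and "a2", are not associated at all.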
Step S202, an ID data network containing the ID data and the association relation between the ID data is obtained.
After the construction of the ID data network is completed, the ID data network is acquired so that pruning preprocessing, data analysis and other processing can be performed on it.
Step S203, pruning preprocessing is performed on the ID data network to obtain the ID data network after pruning preprocessing.
Pruning preprocessing can be performed on the ID data network according to, for example, the association frequency between ID data and the number of other ID data directly associated with a piece of ID data, to obtain the ID data network after pruning preprocessing. Specifically, the association relationships between some ID data and the other ID data directly associated with them are removed. This effectively removes unreliable association relationships between ID data in the ID data network, improves the accuracy of ID data network processing, and reduces the data volume of the subsequent data analysis.
Step S204, data analysis is performed on the ID data network after pruning preprocessing to obtain a plurality of ID data subnets.
After the ID data network after pruning preprocessing is obtained, it can be divided into a plurality of ID data subnets by performing data analysis on the ID data it contains and the association relationships between the ID data. The plurality of ID data subnets may be divided into n ID data subnet sets according to the number of ID data contained in each ID data subnet, where n is a natural number greater than 0. ID data subnets in different ID data subnet sets contain different numbers of ID data. Compared with the ID data network as a whole, the ID data contained in an ID data subnet have a stronger and more reliable association relationship.
Step S205, for any ID data subnet whose number of contained ID data is greater than the first preset number threshold, clustering and partitioning the ID data in the ID data subnet, to obtain a plurality of third ID data subnets corresponding to the ID data subnet.
The ID data subnets obtained after the data analysis in step S204 may still include ID data subnets with a large number of contained ID data, and the ID data in these ID data subnets may not belong to ID data of the same user although having a strong association relationship, and if these ID data are identified as ID data of the same user, the user characteristics obtained based on the ID data subnet analysis may not effectively and truly reflect the actual situation of the user. In order to further improve the reliability of the ID data subnets, further processing, such as clustering and partitioning, is performed on the ID data subnets.
Specifically, a first preset number threshold and a second preset number threshold may be set in advance. For any ID data subnet among the plurality of ID data subnets whose number of contained ID data is greater than the first preset number threshold, the ID data in the ID data subnet are clustered and partitioned to obtain a plurality of third ID data subnets corresponding to the ID data subnet, so that ID data having a stronger and more reliable association relationship within the ID data subnet are clustered into one class and assigned to the same third ID data subnet. Here, "any" means any one of them, and the number of ID data contained in each third ID data subnet is less than or equal to the second preset number threshold. Compared with an ID data subnet whose number of contained ID data is greater than the first preset number threshold, the ID data in a third ID data subnet have a stronger and more reliable association relationship and can be identified as ID data of the same user, so user characteristics can be analyzed accurately and effectively based on the third ID data subnets to construct a complete and effective user portrait. In addition, the data volume of a third ID data subnet is far smaller than that of the original oversized ID data subnet, which makes user characteristic analysis more convenient and improves analysis efficiency.
The first preset number threshold and the second preset number threshold may be set by those skilled in the art according to actual needs, and are not limited herein. For example, if the first preset number threshold is set to 50 and the second preset number threshold is set to 10, then for an ID data subnet in which the number of ID data included in any one of the plurality of ID data subnets is greater than 50, the ID data in the ID data subnet needs to be clustered and divided, and the ID data subnet is divided into a plurality of third ID data subnets in which the number of ID data included in the third ID data subnets is less than or equal to 10.
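The text does not specify which clustering algorithm is used. Purely as an illustration of how the two thresholds interact, the sketch below assumes one simple strategy: repeatedly drop the weakest association (lowest association frequency) and recompute connected components until every component fits within the second threshold. The function names, the edge weights and the strategy itself are assumptions, not the method of the invention.

```python
def connected_components(nodes, edges):
    """Split nodes into connected components using the given undirected edges."""
    adjacency = {node: set() for node in nodes}
    for left, right in edges:
        adjacency[left].add(right)
        adjacency[right].add(left)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def split_large_subnet(nodes, weighted_edges, first_threshold, second_threshold):
    """Partition a subnet only if it holds more than first_threshold IDs.

    weighted_edges maps (left, right) to an association frequency.
    Every returned third subnet holds at most second_threshold IDs.
    """
    if len(nodes) <= first_threshold:
        return [set(nodes)]
    # Strongest edges first, so that pop() always drops the weakest one.
    remaining = sorted(
        weighted_edges, key=lambda edge: (weighted_edges[edge], edge), reverse=True
    )
    while True:
        components = connected_components(nodes, remaining)
        if not remaining or all(len(c) <= second_threshold for c in components):
            return components
        remaining.pop()


# Hypothetical subnet with one weak link between two tightly associated groups.
nodes = {"u1", "u2", "u3", "u4", "u5", "u6"}
weighted_edges = {
    ("u1", "u2"): 30, ("u2", "u3"): 25,
    ("u3", "u4"): 2,
    ("u4", "u5"): 40, ("u5", "u6"): 35,
}
third_subnets = split_large_subnet(nodes, weighted_edges, 4, 3)
```

In this example the subnet holds 6 IDs, which exceeds the first threshold of 4, so the weak link between "u3" and "u4" is dropped and two third subnets of 3 IDs each remain.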
According to the ID data network processing method provided by this embodiment, the ID data network can be constructed quickly by performing data analysis on the log data of a plurality of services. Pruning preprocessing of the ID data network effectively and quickly removes unreliable association relationships between ID data, which improves the accuracy of ID data network processing and reduces the data volume of data analysis. In addition, by performing data analysis on the ID data contained in the ID data network and the association relationships between the ID data, the ID data network can be quickly divided into a plurality of ID data subnets whose ID data have a strong and reliable association relationship and can be identified as ID data of the same user, so that user characteristics can be analyzed accurately and quickly based on the ID data subnets to construct a complete and effective user portrait.
The invention also provides an ID data network pruning preprocessing method, which comprises the following steps: acquiring an ID data network containing ID data and the association relationships between the ID data; and performing pruning preprocessing on the ID data network to obtain the ID data network after pruning preprocessing. The ID data includes user ID data and/or device ID data. The ID data network pruning preprocessing method is described below with reference to the specific embodiment shown in fig. 3.
Fig. 3 shows a flow diagram of an ID data network pruning preprocessing method according to an embodiment of the present invention. As shown in fig. 3, the method includes the following steps:
Step S300, an ID data network containing ID data and the association relationships between the ID data is acquired.
The description of this step can refer to the description of step S100 in the embodiment shown in fig. 1, and is not repeated here.
Step S301, performing data analysis on the log data of a plurality of services to obtain the association frequency among the ID data.
The log data of a service records the ID data used with the service and other ID data, and describes the association relationships between the ID data used with the service and the other ID data.
Specifically, log data of a plurality of services are subjected to data analysis, and the actual association frequency between ID data is calculated. In practical applications, the actual frequency of association between ID data may be calculated in a preset unit time. Taking a preset unit time as a day as an example, if log data are analyzed, and a certain ID data and another ID data have an association relationship for 50 days, the actual association frequency between the two ID data is recorded as 50. According to the method, the actual association frequency between each ID data and other ID data is calculated.
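The day-based counting described above can be sketched as follows, assuming each log record has already been reduced to a hypothetical `(day, id_a, id_b)` tuple; the record layout and the function name are illustrative assumptions.

```python
from collections import defaultdict


def actual_association_frequency(log_records):
    """Count, for each pair of IDs, the number of distinct days they were associated.

    log_records is an iterable of (day, id_a, id_b) tuples.
    """
    days_by_pair = defaultdict(set)
    for day, id_a, id_b in log_records:
        pair = tuple(sorted((id_a, id_b)))  # treat (x, y) and (y, x) as one pair
        days_by_pair[pair].add(day)
    return {pair: len(days) for pair, days in days_by_pair.items()}


# Hypothetical records; the dates are placeholders.
records = [
    ("2018-08-01", "wxid_1", "m1"),
    ("2018-08-01", "m1", "wxid_1"),  # same day and same pair: counted once
    ("2018-08-02", "wxid_1", "m1"),
    ("2018-08-02", "wxid_2", "m1"),
]
frequency = actual_association_frequency(records)
```

Storing the days in a set means that repeated associations within the same unit of time contribute only once to the actual association frequency.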
In practical applications, a plurality of users may also use the same service on the same device one after another at different times. The user ID data of each of these users has an association relationship with the device ID data of the device, but the actual association frequency does not truly reflect which user actually corresponds to the device at the current time. For example, suppose two users use the 360 Security Guard application on the same mobile phone, so that the 360 accounts of both users have an association relationship with the device ID data of the mobile phone. Suppose further that the log data of the 360 Security Guard application show that the first 360 account frequently logged in to the application through the mobile phone one year ago, giving an actual association frequency of 100 between the first 360 account and the device ID data of the mobile phone, but stopped logging in through the mobile phone half a year ago, whereas the second 360 account has frequently logged in to the application through the mobile phone since half a year ago, giving an actual association frequency of 50 between the second 360 account and the device ID data of the mobile phone.
Although the actual association frequency between the first 360 account and the device ID data of the mobile phone is higher than that between the second 360 account and the device ID data of the mobile phone, the log data corresponding to the first 360 account date from up to a year ago and are far from the current time. The user corresponding to the second 360 account is clearly the user who actually corresponds to the mobile phone at the current time, so the actual association frequency alone cannot truly reflect which user actually corresponds to the mobile phone at the current time.
To solve this problem, the invention introduces a time weight for the log data corresponding to the ID data, and calculates the association frequency between ID data according to the actual association frequency between the ID data, the time information of the log data corresponding to the ID data, and the time weight. The value of the time weight depends on how far the log data corresponding to the ID data are from the current time: the closer the time information of the log data is to the current time, the larger the time weight; the farther the time information of the log data is from the current time, the smaller the time weight. The actual association frequency between ID data is attenuated by the time weight, and the value obtained after attenuation is taken as the association frequency between the ID data. The association frequency obtained in this way accurately reflects the real degree of association between ID data in the current period, has a higher reference value, and helps pruning preprocessing of the ID data network to be performed accurately.
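The text does not give a formula for the time weight; it only requires that the weight decrease as the log data get older. The sketch below assumes an exponential decay with a hypothetical 90-day half-life and sums one weight per day of association. Both the decay form and the half-life are assumptions made for illustration.

```python
def time_weight(age_days, half_life_days=90.0):
    """Weight of a log record that is age_days old; halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def decayed_association_frequency(ages_in_days, half_life_days=90.0):
    """Sum the time weights of every day on which two IDs were associated."""
    return sum(time_weight(age, half_life_days) for age in ages_in_days)


# Loosely modelled on the two-account example: the first account was active on
# 100 days roughly a year ago, the second on 50 days within the last half year.
first_account_ages = list(range(266, 366))   # 100 days, all old
second_account_ages = list(range(0, 150, 3))  # 50 days, all recent

first_frequency = decayed_association_frequency(first_account_ages)
second_frequency = decayed_association_frequency(second_account_ages)
```

Under these assumptions the second account ends up with the higher association frequency even though its actual association frequency (50) is half that of the first account (100), which is the behaviour the time weight is meant to produce.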
Step S302, for any ID data in the ID data network, pruning preprocessing is performed on the association relationships between the ID data and other ID data according to the number of other ID data directly associated with the ID data and/or the association frequency between the ID data and the other ID data.
The pruning rules, and the thresholds specified in them, were set by the invention through repeated data analysis. The pruning rules are as follows. For any ID data in the ID data network: first, if the number of other ID data directly associated with the ID data is greater than a first threshold and the association frequency between the ID data and any one of the other ID data is less than or equal to a second threshold, the association relationship between the ID data and that other ID data is removed; second, if the number of other ID data directly associated with the ID data is greater than a third threshold and the sum of the association frequencies between the ID data and each of the other ID data is greater than or equal to a fourth threshold, the association relationships between the ID data and each of the other ID data are removed; third, if the sum of the association frequencies between the ID data and each of the other ID data is greater than or equal to a fifth threshold, the association relationships between the ID data and each of the other ID data are removed. In all other cases, the association relationships between the ID data and each of the other ID data are retained. The corresponding association relationships are removed as long as any one of the three removal cases is satisfied.
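The three removal cases can be sketched as a single function over the neighbors of one piece of ID data. The function name, the dictionary layout and all threshold values in the example are hypothetical; the text states that the real thresholds are obtained through data analysis.

```python
def relations_to_prune(neighbor_freq, first_threshold, second_threshold,
                       third_threshold, fourth_threshold, fifth_threshold):
    """Return the neighbors whose association with the central ID should be removed.

    neighbor_freq maps each directly associated ID to its association frequency.
    """
    count = len(neighbor_freq)
    total = sum(neighbor_freq.values())
    # Cases two and three: remove every association of the central ID.
    if (count > third_threshold and total >= fourth_threshold) or total >= fifth_threshold:
        return set(neighbor_freq)
    # Case one: remove only the weak associations of a well-connected ID.
    if count > first_threshold:
        return {other for other, freq in neighbor_freq.items()
                if freq <= second_threshold}
    # Otherwise every association is retained.
    return set()


# Hypothetical ID shared by many others: 6 neighbors with frequency 1 each.
shared = {f"u{i}": 1 for i in range(1, 7)}
pruned_shared = relations_to_prune(shared, 2, 0, 5, 6, 1000)

# Hypothetical ID with two strong neighbors: nothing is removed.
pruned_pair = relations_to_prune({"x": 10, "y": 10}, 2, 5, 5, 6, 1000)
```

Cases two and three are checked first because they remove every association of the ID, which makes the per-neighbor check of case one unnecessary when they apply.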
To make it easier to determine whether any ID data in the ID data network meets the pruning rules, an intermediate subnet centered on the ID data may first be constructed for each piece of ID data in the ID data network. Specifically, ID relationship data may be constructed according to the ID data contained in the ID data network and the association relationships between the ID data, where the ID relationship data includes a plurality of ID relationship pairs and each ID relationship pair includes two IDs and the relationship between the two IDs. For example, if ID data "a1" has a direct association relationship with ID data "b1", the corresponding ID relationship pair is (a1, b1), where a1 and b1 are the two IDs included in the ID relationship pair and the parentheses indicate that a relationship exists between the two IDs. All the ID relationship pairs are then grouped according to a primary key ID grouping method, that is, a method of grouping by a designated primary key ID, and the intermediate subnets are obtained from the grouping result. For example, with the left ID of each ID relationship pair taken as the primary key ID, all the ID relationship pairs are grouped by a groupByKey method, and the grouping result gives every intermediate subnet centered on a left ID, that is, the intermediate subnet centered on any ID data in the ID data network. Once the intermediate subnet is obtained, it is easy to determine whether the ID data meets the pruning rules.
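The grouping by left ID can be sketched in plain Python as a stand-in for a groupByKey-style operation. The sketch assumes that every direct association is emitted as a pair in both directions, so that each ID appears as a left ID at least once; that assumption and the names used are illustrative only.

```python
from collections import defaultdict


def group_by_primary_key(relationship_pairs):
    """Group ID relationship pairs by their left ID (the primary key ID).

    Returns {left_id: [right_id, ...]}, i.e. one intermediate subnet per left ID.
    """
    groups = defaultdict(list)
    for left, right in relationship_pairs:
        groups[left].append(right)
    return dict(groups)


# Direct associations around "a4" and "b4", taken from the running example.
associations = [("a4", "b4"), ("a4", "c4"), ("a4", "f4"), ("b4", "d4")]

# Assumption: emit each association in both directions before grouping.
pairs = associations + [(right, left) for left, right in associations]
intermediate_subnets = group_by_primary_key(pairs)
```

Each value in `intermediate_subnets` lists the IDs directly associated with the key, which is the information the pruning rules need for that ID.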
In practical application, after the judgment of whether the pruning rule is met, a pruning marking bit can be set for the ID relationship pair for marking whether the relationship between two IDs in the ID relationship pair is the incidence relationship which needs to be removed. If the relationship between two IDs in a certain ID relationship pair is an incidence relationship needing to be removed, setting the pruning mark bit of the ID relationship pair to be 1; and if the relationship between two IDs in a certain ID relationship pair is not the incidence relationship which needs to be removed, setting the pruning mark bit of the ID relationship pair to be 0. Whether the relationship between two IDs in the ID relationship pair is the incidence relationship which needs to be removed or not can be clearly known through the pruning mark bit.
Specifically, for any ID data in the ID data network, it may be determined, according to the intermediate subnet centered on the ID data, whether the number of other ID data directly associated with the ID data is greater than a first threshold and whether the association frequency between the ID data and any one of the other ID data is less than or equal to a second threshold; if so, the association relationship between the ID data and that other ID data is removed. For example, the first threshold may be 2 and the second threshold may be 5. It is then determined whether the number of other ID data directly associated with the ID data is greater than 2 and whether the association frequency between the ID data and any one of the other ID data is less than or equal to 5; if so, the association relationship between the ID data and that other ID data is considered unreliable and is removed. Suppose that, for ID data "a4" in the ID data network, the intermediate subnet centered on "a4" shows that the ID data directly associated with "a4" are "b4", "c4" and "f4", where the association frequency between "a4" and "b4" is 20, the association frequency between "a4" and "c4" is 30, and the association frequency between "a4" and "f4" is 3. The number of other ID data directly associated with "a4" is 3, which is greater than 2, and the association frequency between "a4" and "f4" is less than 5, so the association relationship between "a4" and "f4" is removed.
For any ID data in the ID data network, it may also be determined, according to the intermediate subnet centered on the ID data, whether the number of other ID data directly associated with the ID data is greater than a third threshold and whether the sum of the association frequencies between the ID data and each of the other ID data is greater than or equal to a fourth threshold; if so, the association relationships between the ID data and all of the other ID data are removed. For example, the third threshold may be 299 and the fourth threshold may be 100. It is then determined whether the number of other ID data directly associated with the ID data is greater than 299 and whether the sum of the association frequencies between the ID data and each of the other ID data is greater than or equal to 100; if so, the association relationships between the ID data and all of the other ID data are considered unreliable and are removed. In addition, it may be determined whether the sum of the association frequencies between the ID data and each of the other ID data is greater than or equal to a fifth threshold; if so, the association relationships between the ID data and all of the other ID data are removed. For example, the fifth threshold may be 1000. It is then determined whether the sum of the association frequencies between the ID data and each of the other ID data is greater than or equal to 1000; if so, the association relationships between the ID data and all of the other ID data are considered unreliable and are removed.
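The three pruning rules can be sketched as a single check over one intermediate subnet. This is an illustrative reading of the rules, not the patented implementation: the function name `associations_to_remove` and its parameter names are hypothetical, and the default thresholds are the example values given in the text (2, 5, 299, 100 and 1000).

```python
def associations_to_remove(neighbor_freq, t1=2, t2=5, t3=299, t4=100, t5=1000):
    """Return the neighbor IDs whose association with the center ID is pruned.

    neighbor_freq maps each directly associated ID to its association
    frequency with the center ID of the intermediate subnet.
    """
    count = len(neighbor_freq)
    total = sum(neighbor_freq.values())
    # Second and third rules: every association of the center ID is removed.
    if (count > t3 and total >= t4) or total >= t5:
        return set(neighbor_freq)
    # First rule: only the low-frequency associations are removed.
    if count > t1:
        return {nb for nb, freq in neighbor_freq.items() if freq <= t2}
    # All other cases: nothing is removed.
    return set()


# The "a4" example from the text.
removed_a4 = associations_to_remove({"b4": 20, "c4": 30, "f4": 3})
```

For the "a4" example the function selects only the association with "f4", which matches the outcome described in the text. The returned set could equally be used to set the pruning mark bit of each ID relationship pair to 1 or 0.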
Step S303, obtaining the ID data network after the pruning preprocessing.
The method comprises the steps of judging whether any ID data in the ID data network meets pruning rules or not, and carrying out pruning pretreatment on the incidence relation between the ID data and other ID data according to the judgment result to obtain the ID data network after the pruning pretreatment, so that the unreliable incidence relation between the ID data in the ID data network is effectively removed, the incidence relation between the ID data in the ID data network after the pruning pretreatment is stronger and reliable, the accuracy of ID data network treatment can be improved, and the data volume of subsequent data analysis can be reduced.
According to the ID data network pruning preprocessing method provided by the embodiment, data analysis is performed on log data of a plurality of services, association frequency among ID data is obtained quickly, pruning preprocessing is performed on association relations between the ID data and other ID data according to the number of other ID data directly associated with the ID data and/or the association frequency between the ID data and other ID data aiming at any ID data in an ID data network, and unreliable association relations among the ID data in the ID data network are effectively and quickly removed, so that the association relations among the ID data in the ID data network after pruning preprocessing are strong and reliable association relations, the accuracy of ID data network processing can be improved, and the data quantity of data analysis can be reduced. Optionally, a corresponding time weight is introduced into the log data, the actual association frequency between the ID data is attenuated through the time weight, and a numerical value obtained after attenuation is used as the association frequency between the ID data, so that the real association degree between the ID data in the current period is accurately reflected, and the method has a high reference value and is beneficial to accurately performing pruning pretreatment on the ID data network.
The invention also provides an ID data network data analysis method, which comprises the following steps: acquiring an ID data network containing ID data and an association relation between the ID data; according to the ID data contained in the ID data network and the incidence relation between the ID data, ID relation data are constructed; the ID relationship data includes a plurality of ID relationship pairs; and comparing and combining the ID relation data to obtain a plurality of ID data subnets. Wherein the ID data includes: user ID data and/or device ID data. The ID data network data analysis method is described below with the specific embodiment shown in fig. 4.
Fig. 4 is a flow chart of an ID data network data analysis method according to an embodiment of the present invention, and as shown in fig. 4, the method includes the following steps:
Step S400, obtaining an ID data network containing the ID data and the association relationships between the ID data.
The description of this step can refer to the description of step S100 in the embodiment shown in fig. 1, and is not repeated here.
Step S401, according to the ID data contained in the ID data network and the incidence relation between the ID data, the ID relation data is constructed.
After the ID data network is acquired, ID relationship data can be constructed according to the ID data contained in the ID data network and the association relationships between the ID data. The constructed ID relationship data includes a plurality of ID relationship pairs, and each ID relationship pair includes two IDs and the relationship between the two IDs. For example, if ID data "a1" has a direct association relationship with ID data "b1", ID data "a2" has a direct association relationship with ID data "b2", and ID data "a2" has a direct association relationship with ID data "c2", then the corresponding ID relationship pairs are (a1, b1), (a2, b2) and (a2, c2). Taking the ID relationship pair (a1, b1) as an example, the two IDs it includes are a1 and b1, and enclosing the two IDs together in parentheses () indicates that there is a relationship between them. ID relationship pairs are constructed in this way for all the ID data contained in the ID data network and all the association relationships among them, which completes the construction of the ID relationship data.
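A minimal sketch of this construction step is given below. It assumes, purely for illustration, that the ID data network is available as a dictionary from each ID to its directly associated IDs and that each association is stored once with its two IDs in sorted order; neither the input format nor the name `build_relationship_pairs` comes from the text.

```python
def build_relationship_pairs(network):
    """Build ID relationship pairs from an adjacency-style ID data network.

    network maps each ID to an iterable of directly associated IDs.
    Each association yields one pair, even if it is listed from both ends.
    """
    pairs = set()
    for id_data, neighbors in network.items():
        for neighbor in neighbors:
            pairs.add(tuple(sorted((id_data, neighbor))))
    return sorted(pairs)


# The associations from the example; a1-b1 is listed from both ends on purpose.
network = {"a1": ["b1"], "b1": ["a1"], "a2": ["b2", "c2"]}
relationship_pairs = build_relationship_pairs(network)
```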
In step S402, the ID relationship data is copied to the memory in full.
Before the comparison and combination are carried out, the ID relation data are required to be copied to the memory in a full amount, so that the memory comprises the ID relation data in a full amount, and the ID relation data can be compared and combined quickly and conveniently.
Step S403, comparing and combining the ID relation data with the ID relation data copied to the memory in full, and performing data integration according to the comparison and combination result to obtain a plurality of ID data subnets.
After the ID relationship data has been copied to the memory in full, each ID relationship pair in the ID relationship data can be compared and combined with the full copy in the memory, and data integration is then performed according to the comparison and combination result to obtain a plurality of ID data subnets. For each ID relationship pair in the ID relationship data, the ID relationship pairs that share at least one ID with it are found in the ID relationship data in the memory by comparison, and the IDs in the ID relationship pair are combined with the IDs in the found ID relationship pairs according to the relationships between the IDs, giving the comparison and combination intermediate result of that ID relationship pair. For example, for the ID relationship pair (a2, b2), the ID relationship pairs found in the memory that share at least one ID with (a2, b2) are (a2, b2) and (a2, c2). Combining the IDs of (a2, b2) with the IDs of the found pairs gives the comparison and combination intermediate result "c2-a2-b2", where "-" between two IDs indicates that there is a relationship between the two IDs.
Considering that the obtained comparison combination intermediate result may still have incomplete combination, then continuously comparing and combining the comparison combination intermediate results of all ID relation pairs with the ID relation data copied to the memory in full quantity to obtain the comparison combination intermediate result of the next iteration operation, and iteratively executing the step until the preset iteration condition is met. And obtaining a comparison combination result after the iteration process is finished. Wherein, the comparison combination result records a plurality of groups of IDs and the relationship between IDs in each group of IDs, and each group of IDs comprises one or more IDs. And performing data integration according to the multiple groups of IDs in the comparison combination result and the relation between the IDs in each group of IDs to obtain a plurality of ID data subnets, specifically, performing data integration according to the relation between the IDs in the group of IDs aiming at any group of IDs in the comparison combination result to integrate the IDs into one ID data subnet.
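The iterative compare-and-combine procedure can be sketched as below. This is a simplified illustration: it tracks only which IDs belong together and not the individual relationships between them, the name `compare_and_combine` is hypothetical, and the preset iteration condition is assumed to be a fixed iteration count of 3, as in the example given later in the text.

```python
def compare_and_combine(pairs, max_iterations=3):
    """Iteratively merge each pair with in-memory pairs that share an ID."""
    # Full copy of the ID relationship data held in memory.
    memory = [frozenset(pair) for pair in pairs]
    # One intermediate result per ID relationship pair.
    results = [set(pair) for pair in pairs]
    for _ in range(max_iterations):
        for group in results:
            for mem_pair in memory:
                if group & mem_pair:      # at least one ID in common
                    group |= mem_pair     # combine the IDs
    # Data integration: identical groups collapse into one ID data subnet.
    return sorted(sorted(g) for g in {frozenset(g) for g in results})


pairs = [("a1", "b1"), ("a2", "b2"), ("a2", "c2"),
         ("a3", "b3"), ("a3", "c3"), ("c3", "d3")]
subnets = compare_and_combine(pairs)
```

If the iteration count is too small for a long chain of associations, some groups remain partial, which is why the text leaves the iteration condition to be chosen according to actual needs.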
Optionally, the ID relationship data may be divided into a plurality of fragments, and the fragments are compared and combined in parallel, so as to further improve the data analysis efficiency of the ID data network. And comparing and combining the plurality of fragments with the ID relation data copied to the memory in full to obtain comparison and combination results of all the fragments, and then performing data integration on the comparison and combination results of all the fragments to obtain a plurality of ID data subnets. And the comparison combination result of all the fragments records the relationship between the multiple groups of IDs and the IDs in each group of IDs, and data integration is carried out according to the multiple groups of IDs in the comparison combination result of all the fragments and the relationship between the IDs in each group of IDs to obtain a plurality of ID data subnets. And comparing and combining the fragment with the ID relation data copied to the memory in full quantity to obtain a comparison and combination intermediate result of the fragment. Specifically, for each ID relationship pair in the fragment, an ID relationship pair having at least one same ID as the ID relationship pair is found from the ID relationship data in the memory by comparison, and according to a relationship between two IDs included in the ID relationship pair, the ID in the ID relationship pair and the ID in the found ID relationship pair are combined to obtain a comparison combination intermediate result of the ID relationship pair, until all the ID relationship pairs in the fragment complete comparison combination with the ID relationship data in the memory, the comparison combination intermediate result of the fragment is obtained, and the comparison combination intermediate result of the fragment includes: the comparison of all ID relationship pairs in the fragment combines the intermediate results.
Considering that the obtained comparison combination intermediate results of all the fragments may still have incomplete combination, after obtaining the comparison combination intermediate results of all the fragments, the invention iteratively executes the following intermediate comparison steps until the preset iteration conditions are met, wherein the intermediate comparison steps are as follows: and dividing the comparison and combination intermediate result of all the fragments into a plurality of intermediate sub-fragments, and comparing and combining the plurality of intermediate sub-fragments with the ID relation data copied to the memory in full quantity in parallel to obtain the comparison and combination intermediate result of all the fragments in the next iteration operation. And when the iteration process is finished, obtaining the comparison combination result of all the fragments. Through the iterative execution mode, the comparison and combination intermediate results of the fragments can be fully combined so as to carry out data integration. The preset iteration condition can be set by a person skilled in the art according to actual needs, and is not limited herein. For example, the preset iteration condition may include: the iteration number reaches a preset iteration number, wherein a person skilled in the art can set the preset iteration number according to actual needs, for example, the preset iteration number is set to 3.
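The fragment-based variant can be sketched in the same simplified form. The sketch assumes a thread pool as a stand-in for whatever parallel workers are actually used, a round-robin split into fragments, and a fixed iteration count; the names `combine_fragment` and `sharded_compare_and_combine` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def combine_fragment(fragment, memory):
    """Compare and combine every group in one fragment with the full copy."""
    combined = []
    for group in fragment:
        merged = set(group)
        for mem_pair in memory:
            if merged & mem_pair:
                merged |= mem_pair
        combined.append(frozenset(merged))
    return combined


def sharded_compare_and_combine(pairs, num_fragments=2, iterations=3):
    memory = [frozenset(pair) for pair in pairs]   # full in-memory copy
    current = list(memory)                         # intermediate results
    for _ in range(iterations):
        # Re-split the intermediate results into fragments each iteration.
        fragments = [current[i::num_fragments] for i in range(num_fragments)]
        with ThreadPoolExecutor(max_workers=num_fragments) as pool:
            parts = list(pool.map(lambda f: combine_fragment(f, memory), fragments))
        current = [group for part in parts for group in part]
    # Data integration over the results of all fragments.
    return sorted(sorted(group) for group in set(current))


pairs = [("a1", "b1"), ("a2", "b2"), ("a2", "c2"),
         ("a3", "b3"), ("a3", "c3"), ("c3", "d3")]
sharded_subnets = sharded_compare_and_combine(pairs)
```

Because every fragment is compared against the same read-only full copy, the fragments do not need to communicate with each other within an iteration.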
According to the ID data network data analysis method provided by the embodiment, the ID relation data can be constructed based on the ID data contained in the ID data network and the association relation between the ID data, then the ID relation data and the ID relation data copied to the memory in full quantity are compared and combined, data integration is performed according to the comparison and combination result, a plurality of ID data subnets are obtained accurately and quickly, and therefore effective division of the ID data network is achieved. Optionally, the ID relationship data may be further divided into a plurality of fragments, and the fragments are compared and combined with the ID relationship data copied to the memory in full, so as to further improve the data analysis efficiency of the ID data network. Compared with an ID data network, the ID data contained in the ID data subnet has stronger and reliable incidence relation and can be identified as the ID data of the same user, and the user characteristics can be accurately and quickly analyzed on the basis of the ID data subnet so as to construct a complete and effective user portrait.
The invention also provides another ID data network data analysis method, which comprises the following steps: acquiring an ID data network containing ID data and an association relation between the ID data; according to the ID data contained in the ID data network and the incidence relation between the ID data, ID relation data are constructed; the ID relationship data includes a plurality of ID relationship pairs, each ID relationship pair including: two IDs and the relationship between the two IDs; and grouping the ID relation data to obtain a plurality of ID data subnets. Wherein the ID data includes: user ID data and/or device ID data. The ID data network data analysis method is described below with the specific embodiment shown in fig. 5.
Fig. 5a shows a schematic flow chart of a method for analyzing ID data network data according to another embodiment of the present invention, as shown in fig. 5a, the method includes the following steps:
Step S500, obtaining an ID data network containing the ID data and the association relationships between the ID data.
The description of this step can refer to the description of step S100 in the embodiment shown in fig. 1, and is not repeated here.
Step S501, according to the ID data contained in the ID data network and the incidence relation between the ID data, the ID relation data is constructed.
Wherein the ID relationship data comprises a plurality of ID relationship pairs, each ID relationship pair comprising: two IDs and the relationship between the two IDs. The description of this step can refer to the description of step S401 in the embodiment shown in fig. 4, and is not repeated here.
Step S502, performing directed forward-order processing and directed reverse-order processing on each ID relationship pair to obtain two ID directed relationship pairs corresponding to each ID relationship pair.
To facilitate the grouping processing, the invention provides a directed forward-order processing method and a directed reverse-order processing method. Specifically, the order from the left ID to the right ID of an ID relationship pair is defined as the forward order, and the order from the right ID to the left ID is defined as the reverse order. Arranging the two IDs of an ID relationship pair in forward order is called directed forward-order processing, and arranging them in reverse order is called directed reverse-order processing. After each ID relationship pair has undergone both kinds of processing, two ID directed relationship pairs corresponding to that ID relationship pair are obtained. To make it easy to tell whether ID directed relationship pairs correspond to the same ID relationship pair, a relationship bit may be set for each ID directed relationship pair: the two ID directed relationship pairs corresponding to the same ID relationship pair have the same relationship bit, and ID directed relationship pairs corresponding to different ID relationship pairs have different relationship bits.
The directed forward-order and reverse-order processing of ID relationship pairs is illustrated in fig. 5b. The left part of fig. 5b shows that the ID relationship pairs included in the ID relationship data are (a1, b1), (a2, b2), (a2, c2), (a3, b3), (a3, c3) and (c3, d3). For the ID relationship pair (a1, b1), directed forward-order processing gives the ID directed relationship pair (a1-b1-01) and directed reverse-order processing gives the ID directed relationship pair (b1-a1-01). The ID directed relationship pairs (a1-b1-01) and (b1-a1-01) are thus the two ID directed relationship pairs corresponding to the ID relationship pair (a1, b1), and their relationship bits are the same, both being 01. In the same manner, (a2, b2), (a2, c2), (a3, b3), (a3, c3) and (c3, d3) are each subjected to directed forward-order and reverse-order processing to obtain the ID directed relationship pairs shown in the right part of fig. 5b. The primary key ID in any ID directed relationship pair is determined according to a preset rule. A person skilled in the art can set the preset rule according to actual needs, and it is not limited here. For example, the preset rule may be that the left ID of the ID directed relationship pair is taken as the primary key ID.
Step S503, grouping all the ID directed relationship pairs by using a primary key ID grouping method, and obtaining a plurality of ID data subnets according to the grouping result.
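The directed forward-order and reverse-order processing can be sketched as follows. The tuple layout (primary key ID, other ID, relationship bit), the two-digit sequential relationship bits and the name `to_directed_pairs` are assumptions chosen to match the example values in the text.

```python
def to_directed_pairs(pairs):
    """Expand each ID relationship pair into two ID directed relationship pairs.

    Both directed pairs derived from the same ID relationship pair carry
    the same relationship bit; different pairs get different bits.
    """
    directed = []
    for index, (left_id, right_id) in enumerate(pairs, start=1):
        relation_bit = f"{index:02d}"
        directed.append((left_id, right_id, relation_bit))  # forward order
        directed.append((right_id, left_id, relation_bit))  # reverse order
    return directed


pairs = [("a1", "b1"), ("a2", "b2"), ("a2", "c2"),
         ("a3", "b3"), ("a3", "c3"), ("c3", "d3")]
directed_pairs = to_directed_pairs(pairs)
```

With the left ID taken as the primary key ID, the first element of each tuple is the key used by the subsequent grouping step.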
All the ID directed relationship pairs are grouped by using the primary key ID grouping method to obtain a plurality of first groups. For any first group, the count bit of the first group is determined according to the number of ID directed relationship pairs it contains. At least one first group whose count bit equals a first count value is extracted, and the ID directed relationship pairs contained in the extracted first groups are combined according to their relationship bits to obtain at least one first ID data subnet. The number of ID data contained in a first ID data subnet is 2, and the first count value is 1.
Taking the ID directed relationship pairs shown in the right part of fig. 5b as an example, with the left ID of each ID directed relationship pair taken as the primary key ID, all the ID directed relationship pairs are grouped by the groupByKey method, that is, ID directed relationship pairs with the same primary key ID are placed in the same first group. This gives the following first groups: first group 1 containing the ID directed relationship pair (a1-b1-01); first group 2 containing (a2-b2-02) and (a2-c2-03); first group 3 containing (a3-b3-04) and (a3-c3-05); first group 4 containing (b1-a1-01); first group 5 containing (b2-a2-02); first group 6 containing (b3-a3-04); first group 7 containing (c2-a2-03); first group 8 containing (c3-a3-05) and (c3-d3-06); and first group 9 containing (d3-c3-06). Then, for each first group, the count bit is determined according to the number of ID directed relationship pairs it contains: the count bits of first groups 1, 4, 5, 6, 7 and 9 are all 1, and the count bits of first groups 2, 3 and 8 are all 2.
The first groups whose count bit is 1 are extracted from first groups 1 to 9, namely first groups 1, 4, 5, 6, 7 and 9. The ID directed relationship pairs contained in the extracted first groups are then combined according to their relationship bits, that is, ID directed relationship pairs with the same relationship bit are combined into one first ID data subnet, which contains 2 ID data. Among the ID directed relationship pairs contained in the extracted first groups, only (a1-b1-01) and (b1-a1-01) have the same relationship bit, so these two ID directed relationship pairs are combined into one first ID data subnet. Specifically, a1 and b1 are each taken as a node, and the connection between the two nodes is determined according to the association relationship between a1 and b1, which gives the first ID data subnet.
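The extraction of first ID data subnets can be sketched as follows, using the ID directed relationship pairs of fig. 5b. A dictionary again stands in for the groupByKey method, and the name `first_id_data_subnets` and the tuple layout (primary key ID, other ID, relationship bit) are hypothetical.

```python
from collections import defaultdict


def first_id_data_subnets(directed_pairs):
    """Find subnets of exactly two IDs from ID directed relationship pairs."""
    # Group by primary key ID (the left ID) to obtain the first groups.
    first_groups = defaultdict(list)
    for key_id, other_id, relation_bit in directed_pairs:
        first_groups[key_id].append((key_id, other_id, relation_bit))
    # Keep the single pair of every first group whose count bit is 1.
    extracted = [group[0] for group in first_groups.values() if len(group) == 1]
    # Combine extracted pairs that share a relationship bit.
    by_bit = defaultdict(list)
    for key_id, _other_id, relation_bit in extracted:
        by_bit[relation_bit].append(key_id)
    return [tuple(sorted(ids)) for ids in by_bit.values() if len(ids) == 2]


pairs = [("a1", "b1"), ("a2", "b2"), ("a2", "c2"),
         ("a3", "b3"), ("a3", "c3"), ("c3", "d3")]
directed_pairs = [d for i, (l, r) in enumerate(pairs, start=1)
                  for d in ((l, r, f"{i:02d}"), (r, l, f"{i:02d}"))]
first_subnets = first_id_data_subnets(directed_pairs)
```

A relationship bit appears twice among the extracted pairs only when both of its IDs have no other association, which is exactly the condition for an isolated two-ID subnet.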
By the grouping processing mode, the first ID data subnet with the number of the contained ID data being 2 can be conveniently and quickly obtained. In addition, the invention can conveniently and quickly obtain the second ID data subnet with the number of the contained ID data being 3, and the specific processing mode is as follows:
During the grouping processing, after the count bits of all the first groups have been determined, at least one first group whose count bit equals a second count value is extracted. For any extracted first group, the ID directed relationship groups corresponding to that first group are obtained from the ID directed relationship pairs it contains; each ID directed relationship group contains three IDs and the relationships between the three IDs. The primary key ID in any ID directed relationship group is determined according to a preset rule, and a relationship bit is set for each ID directed relationship group: ID directed relationship groups corresponding to the same first group have the same relationship bit, and ID directed relationship groups corresponding to different first groups have different relationship bits. Next, all the ID directed relationship groups are grouped by using the primary key ID grouping method to obtain a plurality of second groups. For any second group, the count bit of the second group is determined according to the number of ID directed relationship groups it contains. At least one second group whose count bit equals a third count value is then extracted, and the ID directed relationship groups contained in the extracted second groups are combined according to their relationship bits to obtain at least one second ID data subnet. The number of ID data contained in a second ID data subnet is 3, the second count value is 2, and the third count value is 1.
As can be seen from the above example, the count bits of first groups 1, 4, 5, 6, 7 and 9 are all 1, and the count bits of first groups 2, 3 and 8 are all 2. The first groups whose count bit is 2 are extracted from first groups 1 to 9, namely first groups 2, 3 and 8. First group 2 contains the ID directed relationship pairs (a2-b2-02) and (a2-c2-03), and the ID directed relationship groups corresponding to first group 2 are obtained from these two pairs. Specifically, there are three of them: the ID directed relationship group (a2-b2-c2-001), the ID directed relationship group (b2-a2-c2-001) and the ID directed relationship group (c2-a2-b2-001), whose relationship bits are the same, all being 001. In the same manner, the ID directed relationship groups corresponding to first group 3 and to first group 8 are obtained: those corresponding to first group 3 are (a3-b3-c3-002), (b3-a3-c3-002) and (c3-a3-b3-002), and those corresponding to first group 8 are (c3-a3-d3-003), (a3-c3-d3-003) and (d3-c3-a3-003).
With the left ID of each ID directed relationship group taken as the primary key ID, all the ID directed relationship groups are grouped by the groupByKey method, that is, ID directed relationship groups with the same primary key ID are placed in the same second group. This gives the following second groups: second group 1 containing the ID directed relationship group (a2-b2-c2-001); second group 2 containing (a3-b3-c3-002) and (a3-c3-d3-003); second group 3 containing (b2-a2-c2-001); second group 4 containing (b3-a3-c3-002); second group 5 containing (c2-a2-b2-001); second group 6 containing (c3-a3-b3-002) and (c3-a3-d3-003); and second group 7 containing (d3-c3-a3-003). Then, for each second group, the count bit is determined according to the number of ID directed relationship groups it contains: the count bits of second groups 1, 3, 4, 5 and 7 are all 1, and the count bits of second groups 2 and 6 are both 2.
The second groups whose count bit is 1 are extracted from second groups 1 to 7, namely second groups 1, 3, 4, 5 and 7. The ID directed relationship groups contained in the extracted second groups are then combined according to their relationship bits, that is, ID directed relationship groups with the same relationship bit are combined into one second ID data subnet, which contains 3 ID data. Among the ID directed relationship groups contained in the extracted second groups, only (a2-b2-c2-001), (b2-a2-c2-001) and (c2-a2-b2-001) have the same relationship bit, so these three ID directed relationship groups are combined into one second ID data subnet. Specifically, a2, b2 and c2 are each taken as a node, and the connections among the three nodes are determined according to the association relationships among a2, b2 and c2. In particular, according to the ID directed relationship pairs (a2-b2-02) and (a2-c2-03) to which the ID directed relationship groups (a2-b2-c2-001), (b2-a2-c2-001) and (c2-a2-b2-001) correspond, node a2 is connected with node b2 and node a2 is connected with node c2, which gives the second ID data subnet.
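The extraction of second ID data subnets can be sketched in the same style. The ordering of IDs inside each ID directed relationship group follows the pattern of the worked example (center first, then each other ID placed first in turn); the name `second_id_data_subnets`, the tuple layout and the three-digit sequential relationship bits are assumptions made for illustration.

```python
from collections import defaultdict


def second_id_data_subnets(directed_pairs):
    """Find subnets of exactly three IDs from ID directed relationship pairs."""
    # First groups: neighbors of each primary key ID.
    first_groups = defaultdict(list)
    for key_id, other_id, _relation_bit in directed_pairs:
        first_groups[key_id].append(other_id)
    # Build three ID directed relationship groups per first group of count 2.
    triples = []
    count_two = [(k, others) for k, others in first_groups.items() if len(others) == 2]
    for index, (center, (first, second)) in enumerate(count_two, start=1):
        bit = f"{index:03d}"
        triples += [(center, first, second, bit),
                    (first, center, second, bit),
                    (second, center, first, bit)]
    # Second groups: group the triples by their left ID.
    second_groups = defaultdict(list)
    for triple in triples:
        second_groups[triple[0]].append(triple)
    # Keep second groups of count 1, then combine by relationship bit.
    by_bit = defaultdict(list)
    for group in second_groups.values():
        if len(group) == 1:
            by_bit[group[0][3]].append(group[0][0])
    return [tuple(sorted(ids)) for ids in by_bit.values() if len(ids) == 3]


pairs = [("a1", "b1"), ("a2", "b2"), ("a2", "c2"),
         ("a3", "b3"), ("a3", "c3"), ("c3", "d3")]
directed_pairs = [d for i, (l, r) in enumerate(pairs, start=1)
                  for d in ((l, r, f"{i:02d}"), (r, l, f"{i:02d}"))]
second_subnets = second_id_data_subnets(directed_pairs)
```

For the data of fig. 5b, the IDs a3, b3, c3 and d3 are not returned, because a3 and c3 each appear as the primary key ID of two ID directed relationship groups and so fall into second groups of count 2.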
By the above grouping processing method, the first ID data subnet containing 2 ID data and the second ID data subnet containing 3 ID data can be obtained conveniently and quickly, and of course, those skilled in the art can refer to the above grouping processing method and so on to obtain other ID data subnets containing 4, 5, 6, etc. ID data, and details are not described here.
According to the data analysis method for the ID data network provided by the embodiment, ID relation data can be constructed based on the ID data contained in the ID data network and the incidence relation between the ID data, then two ID directed relation pairs corresponding to each ID relation pair in the ID relation data are obtained through directed forward order and directed reverse order processing, and then all the ID directed relation pairs are grouped by using a primary key ID grouping method, so that the data analysis efficiency of the ID data network is effectively improved, a plurality of ID data subnets can be accurately and quickly obtained, and the effective division of the ID data network is realized. Optionally, the first ID data subnet and the second ID data subnet can be conveniently and quickly obtained by using the obtained count bits of the packet and the relationship bits set for the ID directed relationship pair and the ID directed relationship group.
Those skilled in the art can also combine the ID data network data analysis method shown in fig. 5a with the ID data network data analysis method shown in fig. 4 to further improve the data analysis efficiency of the ID data network. For example, the ID data network data analysis method shown in fig. 5a is used to group ID relationship data to obtain a first ID data subnet with the number of contained ID data being 2 and a second ID data subnet with the number of contained ID data being 3, then the other ID relationship pairs except the ID relationship pair corresponding to the first ID data subnet and the second ID data subnet in the ID relationship data are divided into a plurality of fragments, the plurality of fragments are compared and combined with the ID relationship data copied to the memory in full quantity in parallel to obtain a comparison and combination result of all the fragments, and then the comparison and combination result of all the fragments is subjected to data integration to obtain other ID data subnets with the number of contained ID data being 4, 5, 6, and the like. By the processing mode, the first ID data subnet with the number of contained ID data being 2 and the second ID data subnet with the number of contained ID data being 3 can be conveniently and quickly obtained, the data processing amount of comparison combination is effectively reduced, and the data analysis efficiency of the ID data network is improved.
The invention also provides an ID data subnet processing method, which comprises the following steps: calculating the number of ID data contained in each ID data subnet in a plurality of ID data subnets; extracting ID data subnets with the number exceeding a first preset number threshold; for any ID data subnet with the number of the contained ID data larger than a first preset number threshold, clustering and dividing the ID data in the ID data subnet to obtain a plurality of third ID data subnets corresponding to the ID data subnet; the number of ID data contained in the third ID data subnet is less than or equal to a second preset number threshold. The ID data subnet processing method is described below by a specific embodiment shown in fig. 6.
Fig. 6 is a flowchart illustrating an ID data subnet processing method according to an embodiment of the present invention, and as shown in fig. 6, the method includes the following steps:
Step S600, the number of ID data included in each of the plurality of ID data subnets is calculated.
The ID data subnetworks are obtained by analyzing data of the ID data network, the ID data subnetworks comprise ID data and an incidence relation between the ID data, and the number of the ID data contained in the ID data subnetworks is far smaller than that of the ID data contained in the ID data network. The ID data subnets may still include ID data subnets with a large number of contained ID data, and the ID data in the ID data subnets may not belong to ID data of the same user although having a strong association relationship, and if the ID data are identified as ID data of the same user, user characteristics obtained based on the ID data subnet analysis may not effectively and truly reflect the actual situation of the user. In order to further increase the reliability of these ID data subnets, further processing of these ID data subnets is required. In order to conveniently find the ID data subnet needing to be processed from the plurality of ID data subnets, the number of ID data included in each of the plurality of ID data subnets may be calculated.
Step S601, extracting the ID data subnet in which the number of the included ID data exceeds a first preset number threshold.
After the number of ID data contained in each ID data subnet is calculated, the ID data subnets whose number of contained ID data exceeds a first preset number threshold are extracted from the plurality of ID data subnets. The first preset number threshold may be set by a person skilled in the art according to actual needs and is not limited herein. For example, the first preset number threshold may be set to 50, in which case the ID data subnets containing more than 50 ID data are extracted from the plurality of ID data subnets.
Step S602, selecting an ID data subnet that has not yet been selected from the extracted ID data subnets whose number of contained ID data exceeds the first preset number threshold.
After the ID data subnets with the number of the included ID data exceeding the first preset number threshold are extracted, for effectively obtaining the third ID data subnets, for any ID data subnet with the number of the included ID data larger than the first preset number threshold, the ID data in the ID data subnet are clustered and divided, so as to obtain a plurality of third ID data subnets corresponding to the ID data subnet. Specifically, in step S602, an ID data subnet that has not been selected is selected among the extracted ID data subnets in which the number of included ID data exceeds the first preset number threshold.
Step S603, performing data analysis on the log data of the multiple services corresponding to the ID data subnet, and determining the association frequency between the ID data in the ID data subnet.
Specifically, the log data of a service records the ID data that used the service together with the other ID data involved, which indicates that an association relationship exists between the ID data using the service and those other ID data. The log data corresponding to the ID data in the ID data subnet can therefore be looked up in the log data of the plurality of services, and the association frequency between the ID data in the ID data subnet can be determined by performing data analysis on the log data of the plurality of services corresponding to the ID data subnet.
Specifically, data analysis is performed on the log data of the plurality of services corresponding to the ID data subnet, and the actual association frequency between the ID data in the ID data subnet is calculated. In practical applications, the actual association frequency between ID data may be counted per preset unit time. Taking a day as the preset unit time, if data analysis of the log data shows that a certain ID data in the ID data subnet has an association relationship with another ID data in the ID data subnet on 50 days, the actual association frequency between the two ID data is recorded as 50. In this way, the actual association frequency between each ID data in the ID data subnet and the other ID data in the ID data subnet is calculated.
In consideration of the fact that in practical application, a plurality of users use the same service through the same device at different times, the user ID data of the users and the device ID data of the device have an association relationship, but the actual association frequency of the users cannot truly reflect the users actually corresponding to the device at the current time. Therefore, the invention introduces corresponding time weight for the log data corresponding to the ID data, and calculates the association frequency between the ID data according to the actual association frequency between the ID data, the time information of the log data corresponding to the ID data and the time weight. The weight value of the time weight corresponding to the log data corresponding to the ID data is related to the distance between the log data corresponding to the ID data and the current time. If the time information of the log data corresponding to the ID data is closer to the current time, the weight of the time weight corresponding to the log data corresponding to the ID data is larger; and if the time information of the log data corresponding to the ID data is farther from the current time, the smaller the weight of the time weight corresponding to the log data corresponding to the ID data is. And performing attenuation processing on the actual association frequency between the ID data through time weight, and taking the numerical value obtained after the attenuation processing as the association frequency between the ID data. The association frequency between the ID data obtained by the method can accurately reflect the real association degree between the ID data of the current period, has higher reference value, and is beneficial to accurately clustering the ID data in the ID data subnet.
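The attenuation processing described above can be sketched as follows. The embodiment does not fix a particular weighting function, so the exponential half-life weighting, the parameter half_life_days and the function name decayed_association_frequency are assumptions introduced purely for illustration; any weighting that grows as the log data approaches the current time would fit the description equally well.

```python
from datetime import date

def decayed_association_frequency(association_days, today, half_life_days=30.0):
    """Sum one time weight per day on which two IDs were associated."""
    total = 0.0
    for day in association_days:
        age = (today - day).days
        # Assumed weight: 1.0 for today, halving every `half_life_days` days,
        # so log data closer to the current time contributes more.
        total += 0.5 ** (age / half_life_days)
    return total
```

Without attenuation, each associated day would contribute 1 to the actual association frequency; with the assumed weighting, a day 30 days in the past contributes only 0.5.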
Step S604, for any ID data in the ID data subnet, the distance between the ID data and other ID data is calculated according to the association frequency between the ID data and other ID data.
The greater the association frequency between the ID data and other ID data, the smaller the resulting distance between them. The specific calculation method can be set by those skilled in the art according to actual needs and is not limited herein. For example, a preset value may be divided by the association frequency between the ID data and other ID data, and the resulting value is taken as the distance between the ID data and other ID data. Assuming that the preset value is 1 and that step S603 determined the association frequency between the ID data "d5" and the ID data "e5" in the ID data subnet to be 50, dividing 1 by the association frequency gives 0.02, and 0.02 is taken as the distance between the ID data "d5" and the ID data "e5". Once the distances between each ID data in the ID data subnet and the other ID data have been calculated, the distances between the ID data in the ID data subnet are obtained.
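The worked example above (preset value 1, association frequency 50, distance 0.02) corresponds to a reciprocal mapping, which can be sketched as follows. The function name association_distance and the treatment of a zero frequency as an infinite distance are assumptions for illustration only.

```python
def association_distance(frequency, preset_value=1.0):
    """Map an association frequency to a distance: higher frequency, smaller distance."""
    if frequency <= 0:
        # Assumption: IDs that were never associated are infinitely far apart.
        return float("inf")
    return preset_value / frequency
```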
Step S605, clustering the ID data in the ID data subnet according to the distance between the ID data in the ID data subnet and a preset clustering rule, to obtain a plurality of cluster sets.
The preset clustering rule can be set by a person skilled in the art according to actual needs, and is not limited herein. For example, the preset clustering rule specifies a preset neighborhood radius, a preset minimum value and a second preset number threshold, specifically, according to the distance between ID data in the ID data subnet and the preset neighborhood radius, a plurality of core ID data are determined from the ID data in the ID data subnet, then, for any core ID data, other ID data in the ID data subnet within the preset neighborhood radius of the core ID data are searched, and according to the second preset number threshold, the core ID data and the searched other ID data are clustered to obtain a cluster set, so that ID data having a stronger and more reliable association relationship in the ID data subnet are clustered into the cluster set.
For any ID data in the ID data subnet, the number of other ID data within the preset neighborhood radius of the ID data is calculated according to the distances between the ID data and other ID data, and ID data for which this number exceeds the preset minimum value are determined as core ID data. For example, the preset neighborhood radius is 1, the preset minimum value is 3, and the ID data subnet contains the ID data "d5", "e5", "f5", "g5", "h5", etc. For the ID data "d5", the distances between "d5" and each of "e5", "f5", "g5" and "h5" are all less than or equal to 1, while the distances between "d5" and every ID data other than "e5", "f5", "g5" and "h5" are greater than 1. The other ID data within the preset neighborhood radius of the ID data "d5" are therefore "e5", "f5", "g5" and "h5", that is, the number of other ID data within the preset neighborhood radius of "d5" is 4, which exceeds the preset minimum value, so the ID data "d5" is determined as core ID data. In this manner, all the core ID data are determined from the ID data in the ID data subnet.
After all the core ID data in the ID data subnet are determined, aiming at any core ID data in all the core ID data, searching other ID data in the ID data subnet within the preset neighborhood radius of the core ID data, and clustering the core ID data and the searched other ID data according to a second preset quantity threshold to obtain a cluster set. Specifically, ID data whose number is smaller than a second preset number threshold may be selected from the other found ID data according to the distance between the core ID data and the other found ID data, and then the core ID data and the selected ID data are clustered to obtain a cluster set. For example, the second preset number threshold is 10, the number of the other found ID data within the preset neighborhood radius of the core ID data is 15, and is greater than the second preset number threshold, then 9 ID data closest to the core ID data may be selected from the 15 found other ID data, and the core ID data and the selected 9 ID data are clustered to obtain a cluster set. For another example, if the number of the other ID data found within the preset neighborhood radius of the core ID data is 8 and is smaller than the second preset number threshold, the core ID data and the 8 ID data may be directly clustered to obtain a cluster set without selecting ID data from the 8 ID data.
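The determination of core ID data and the size-limited clustering described above can be sketched as follows. The data layout (a dictionary mapping each ID to its distances to other IDs) and the names cluster_sets, radius, min_neighbors and max_size are hypothetical. The sketch builds one cluster set per core ID and does not resolve the case in which one ID falls within the neighborhoods of several core IDs, which the embodiment leaves open.

```python
def cluster_sets(distances, radius, min_neighbors, max_size):
    """distances: dict mapping id -> {other_id: distance}. Illustrative only."""
    clusters = {}
    for id_, row in distances.items():
        # Other IDs inside the preset neighborhood radius, nearest first.
        near = sorted((d, other) for other, d in row.items() if d <= radius)
        # Core ID data: the neighbor count must exceed the preset minimum value.
        if len(near) <= min_neighbors:
            continue
        # Select fewer than `max_size` neighbors (the second preset number threshold),
        # so the cluster set holds at most `max_size` IDs including the core ID.
        chosen = [other for _, other in near[:max_size - 1]]
        clusters[id_] = [id_] + chosen
    return clusters
```

With a second preset number threshold of 10, a core ID with 15 neighbors keeps its 9 nearest, while a core ID with 8 neighbors keeps all of them, matching the two examples above.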
Step S606, according to the plurality of cluster sets, the ID data subnet is divided to obtain a plurality of third ID data subnets corresponding to the ID data subnet.
After the plurality of cluster sets are obtained, the ID data subnet needs to be divided according to the plurality of cluster sets. In the ID data subnet, for any cluster set, the association relationships between the ID data in the cluster set and the ID data outside the cluster set are removed, so that the ID data subnet is effectively divided and a plurality of third ID data subnets corresponding to the ID data subnet are obtained. Specifically, the association relationships between the ID data in the cluster set and the ID data in other cluster sets are removed, and the association relationships between the ID data in the cluster set and the ID data in the ID data subnet that have not been clustered into any cluster set are also removed. For example, if the ID data "d5" in the cluster set has an association relationship with the ID data "a5" in another cluster set, and also has an association relationship with the ID data "b5" in the ID data subnet that has not been clustered into any cluster set, then the association relationship between "d5" and "a5" and the association relationship between "d5" and "b5" are both removed.
Compared with the ID data subnets with the number of the contained ID data larger than the first preset number threshold, the ID data in the third ID data subnet has stronger and more reliable incidence relation and can be identified as the ID data of the same user, and the user characteristics can be accurately and effectively analyzed according to the third ID data subnet so as to construct a complete and effective user portrait. And the data volume of the third ID data subnet is far smaller than the data volume of the ID data subnet with the number of the contained ID data larger than the first preset number threshold, so that the user characteristic analysis is more convenient, and the analysis efficiency is improved. In practical application, a splitting flag bit may be set for an ID relationship pair corresponding to ID data in which an association relationship needs to be removed in an ID data subnet, so as to flag whether a relationship between two IDs in the ID relationship pair is an association relationship that needs to be removed in a splitting process. If the relationship between two IDs in a certain ID relationship pair is an incidence relationship which needs to be removed in the segmentation process, setting the segmentation mark bit of the ID relationship pair to be 1; if the relationship between two IDs in a certain ID relationship pair is not the association relationship which needs to be removed in the splitting process, the splitting flag bit of the ID relationship pair is set to 0. Whether the relationship between two IDs in the ID relationship pair is an association relationship which needs to be removed in the splitting process can be clearly known through the splitting mark bit.
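The setting of the splitting flag bit can be sketched as follows. The sketch assumes that the cluster sets are disjoint and that an association relationship between two IDs that both lie outside every cluster set is left untouched, a case the embodiment does not address; the function name split_flags is hypothetical.

```python
def split_flags(relation_pairs, clusters):
    """Return {pair: 1 if the relation is to be removed during splitting, else 0}."""
    # Map each clustered ID to the index of its cluster set (assumed disjoint).
    owner = {id_: i for i, members in enumerate(clusters) for id_ in members}
    flags = {}
    for a, b in relation_pairs:
        same_cluster = a in owner and b in owner and owner[a] == owner[b]
        # Assumption: relations between two unclustered IDs are not flagged.
        both_unclustered = a not in owner and b not in owner
        flags[(a, b)] = 0 if (same_cluster or both_unclustered) else 1
    return flags
```

Dropping every ID relationship pair whose flag is 1 leaves each cluster set as its own third ID data subnet.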
Step S607, determining whether all the ID data subnets in the extracted ID data subnets have been selected; if yes, the method is ended; if not, go to step S602.
If it is determined that all of the extracted ID data subnets whose number of contained ID data exceeds the first preset number threshold have been selected, every extracted ID data subnet has been clustered and divided, and the method ends; if it is determined that not all of them have been selected, step S602 is performed again.
According to the ID data subnet processing method provided in this embodiment, for any ID data subnet in which the number of ID data included exceeds the first preset number threshold, ID data having a stronger and more reliable association relationship in the ID data subnet can be grouped into one type according to the association frequency between ID data and the preset clustering rule, and divided into the same third ID data subnet, thereby obtaining a plurality of corresponding third ID data subnets, and implementing effective processing on the ID data subnet. Compared with the ID data subnet before processing, the ID data in the third ID data subnet has stronger and more reliable incidence relation and can be identified as the ID data of the same user, and the user characteristics can be accurately and effectively analyzed based on the third ID data subnet so as to construct a complete and effective user portrait. And the data volume of the third ID data subnet is far smaller than that of the ID data subnet before processing, so that the user characteristic analysis is more convenient, and the analysis efficiency is improved.
Fig. 7 is a block diagram showing the structure of an ID data network processing apparatus according to an embodiment of the present invention, which includes, as shown in fig. 7: an acquisition module 710 and an ID data network analysis module 720.
The obtaining module 710 is adapted to: acquiring an ID data network containing ID data and an association relation between the ID data; the ID data includes: user ID data and/or device ID data.
The ID data network analysis module 720 is adapted to: carrying out data analysis on the ID data network to obtain a plurality of ID data subnets; dividing a plurality of ID data subnets into n ID data subnet sets according to the number of ID data contained in the ID data subnets, wherein n is a natural number greater than 0; the number of ID data contained in the ID data subnets in different ID data subnet sets is different.
Optionally, the apparatus further comprises: the log data analysis module 730 is suitable for performing data analysis on the log data of a plurality of services to determine the ID data and the association relation between the ID data; and the constructing module 740 is adapted to determine the connection relationship between the nodes according to the association relationship between the ID data by using the ID data as the nodes, and construct the ID data network.
Optionally, the apparatus further comprises: the pruning preprocessing module 750 is suitable for carrying out pruning preprocessing on the ID data network to obtain the ID data network after the pruning preprocessing; the ID data network analysis module 720 is further adapted to: and carrying out data analysis on the ID data network subjected to pruning pretreatment to obtain a plurality of ID data subnetworks.
Optionally, the pruning pre-processing module 750 is further adapted to: performing data analysis on the log data of the plurality of services to obtain the association frequency among the ID data; aiming at any ID data in an ID data network, carrying out pruning pretreatment on the incidence relation between the ID data and other ID data according to the quantity of other ID data directly correlated with the ID data and/or the incidence frequency between the ID data and other ID data; and obtaining the ID data network after pruning pretreatment.
Optionally, the pruning pre-processing module 750 is further adapted to: performing data analysis on the log data of the plurality of services, and calculating the actual association frequency among the ID data; and calculating the association frequency among the ID data according to the actual association frequency among the ID data, the time information of the log data corresponding to the ID data and the time weight.
Optionally, the pruning pre-processing module 750 is further adapted to: judging whether the number of other ID data directly associated with the ID data is greater than a first threshold and the association frequency between the ID data and any other ID data is less than or equal to a second threshold; and if so, removing the association relationship between the ID data and that other ID data. The pruning pre-processing module 750 is further adapted to: judging whether the number of other ID data directly associated with the ID data is greater than a third threshold and the sum of the association frequencies between the ID data and each other ID data is greater than or equal to a fourth threshold; and if so, removing the association relationship between the ID data and each other ID data. The pruning pre-processing module 750 is further adapted to: judging whether the sum of the association frequencies between the ID data and the other ID data is greater than or equal to a fifth threshold; and if so, removing the association relationship between the ID data and each other ID data.
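The three pruning judgments can be sketched together as follows. The embodiment presents them as separate optional adaptations; applying all three in a single pass, the dictionary layout of the ID data network, and the names prune and t1 to t5 (standing for the first to fifth thresholds) are assumptions made for illustration.

```python
def prune(network, t1, t2, t3, t4, t5):
    """network: dict id -> {other_id: association_frequency}, stored symmetrically.

    Returns the set of association relationships (as frozensets) to remove.
    """
    removed = set()
    for id_, row in network.items():
        degree = len(row)            # number of directly associated other IDs
        total = sum(row.values())    # sum of association frequencies
        for other, freq in row.items():
            rule1 = degree > t1 and freq <= t2    # many neighbors, weak relation
            rule2 = degree > t3 and total >= t4   # many neighbors, high total
            rule3 = total >= t5                   # high total alone
            if rule1 or rule2 or rule3:
                removed.add(frozenset((id_, other)))
    return removed
```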
Optionally, the ID data network analysis module 720 is further adapted to: according to the ID data contained in the ID data network and the incidence relation between the ID data, ID relation data are constructed; the ID relationship data includes a plurality of ID relationship pairs; copying the ID relation data to the memory in full; and comparing and combining the ID relation data with the ID relation data copied into the memory in full quantity, and integrating data according to a comparison and combination result to obtain a plurality of ID data subnets.
Optionally, the ID data network analysis module 720 is further adapted to: dividing the ID relationship data into a plurality of fragments; comparing and combining the plurality of fragments in parallel with the ID relationship data copied to the memory in full to obtain comparison and combination results of all the fragments; and performing data integration on the comparison and combination results of all the fragments to obtain a plurality of ID data subnets. The ID data network analysis module 720 is further adapted to: for any fragment, comparing and combining the fragment with the ID relationship data copied to the memory in full to obtain a comparison and combination intermediate result of the fragment; and iteratively executing the following step until a preset iteration condition is met: dividing the comparison and combination intermediate results of all the fragments into a plurality of intermediate sub-fragments, and comparing and combining the plurality of intermediate sub-fragments in parallel with the ID relationship data copied to the memory in full to obtain the comparison and combination intermediate results of all the fragments for the next iteration. After the iteration process is finished, the comparison and combination results of all the fragments are obtained. The preset iteration condition includes: the number of iterations reaches a preset number of iterations.
Optionally, the ID data network analysis module 720 is further adapted to: according to the ID data contained in the ID data network and the incidence relation between the ID data, ID relation data are constructed; the ID relationship data includes a plurality of ID relationship pairs, each ID relationship pair including: two IDs and the relationship between the two IDs; performing directed forward order and directed reverse order processing on each ID relationship pair to obtain two ID directed relationship pairs corresponding to each ID relationship pair; determining the ID of the primary key in any ID directed relationship pair according to a preset rule; and grouping all ID directed relation pairs by using a primary key ID grouping method, and obtaining a plurality of ID data subnets according to a grouping result. The ID data network analysis module 720 is further adapted to: setting a relationship bit for each ID directed relationship pair; the relationship bits of two ID directed relationship pairs corresponding to the same ID relationship pair are the same, and the relationship bits of the ID directed relationship pairs corresponding to different ID relationship pairs are different; grouping all ID directed relation pairs by using a primary key ID grouping method to obtain a plurality of first groups; for any first packet, determining the counting bits of the first packet according to the number of ID directed relationship pairs contained in the first packet; extracting at least one first packet with counting bits as a first counting value, and performing combined processing on ID directed relation pairs contained in the extracted at least one first packet according to the relation bits to obtain at least one first ID data subnet; the number of ID data included in the first ID data subnet is 2.
Optionally, the ID data network analysis module 720 is further adapted to: extracting at least one first packet whose count bit is a second count value; for any extracted first packet, obtaining the ID directed relationship groups corresponding to the first packet according to the ID directed relationship pairs contained in the first packet, each ID directed relationship group containing three IDs and the relationships between the three IDs; determining the primary key ID in any ID directed relationship group according to a preset rule; setting a relationship bit for each ID directed relationship group, where the ID directed relationship groups corresponding to the same first packet have the same relationship bit, and the ID directed relationship groups corresponding to different first packets have different relationship bits; grouping all the ID directed relationship groups by using a primary key ID grouping method to obtain a plurality of second packets; for any second packet, determining the count bit of the second packet according to the number of ID directed relationship groups contained in the second packet; extracting at least one second packet whose count bit is a third count value, and combining the ID directed relationship groups contained in the extracted at least one second packet according to the relationship bits to obtain at least one second ID data subnet; the number of ID data contained in the second ID data subnet is 3.
Optionally, the apparatus further comprises: the cluster segmentation module 760 is adapted to cluster and segment the ID data in any ID data subnet, where the number of the included ID data is greater than a first preset number threshold, to obtain a plurality of third ID data subnets corresponding to the ID data subnet; the number of ID data contained in the third ID data subnet is less than or equal to a second preset number threshold.
Optionally, the cluster segmentation module 760 is further adapted to: aiming at any ID data in the ID data subnet, calculating the distance between the ID data and other ID data according to the association frequency between the ID data and other ID data; clustering the ID data in the ID data subnets according to the distance between the ID data in the ID data subnets and a preset clustering rule to obtain a plurality of clustering sets; and according to a plurality of cluster sets, dividing the ID data subnet to obtain a plurality of third ID data subnets corresponding to the ID data subnet. The cluster segmentation module 760 is further adapted to: determining a plurality of core ID data from the ID data in the ID data subnet according to the distance between the ID data in the ID data subnet and a preset neighborhood radius; and aiming at any core ID data, searching other ID data in the ID data subnet within the preset neighborhood radius of the core ID data, and clustering the core ID data and the searched other ID data according to a second preset quantity threshold to obtain a cluster set. The cluster segmentation module 760 is further adapted to: in the ID data subnet, aiming at any cluster set, removing the incidence relation between the ID data in the cluster set and the ID data in other cluster sets; and obtaining a plurality of third ID data subnets corresponding to the ID data subnets.
According to the ID data network processing apparatus provided in this embodiment, an ID data network can be quickly constructed by performing data analysis on log data of a plurality of services; the pruning pretreatment is carried out on the ID data network, unreliable association relation among the ID data in the ID data network is effectively and quickly removed, the accuracy of the ID data network treatment can be improved, and the data volume of data analysis can be reduced; in addition, the data analysis is carried out on the ID data contained in the ID data network and the incidence relation among the ID data, the ID data network can be quickly divided into a plurality of ID data subnets, the ID data contained in the ID data subnets have strong and reliable incidence relation and can be identified as the ID data of the same user, and the user characteristics can be accurately and quickly analyzed on the basis of the ID data subnets to construct a complete and effective user portrait.
Fig. 8 is a block diagram showing a structure of an ID data net pruning preprocessing apparatus according to an embodiment of the present invention, and as shown in fig. 8, the apparatus includes: an acquisition module 810 and a pruning pre-processing module 820.
The obtaining module 810 is adapted to: acquiring an ID data network containing ID data and an association relation between the ID data; the ID data includes: user ID data and/or device ID data.
Pruning pre-processing module 820 is adapted to: and carrying out pruning pretreatment on the ID data network to obtain the ID data network subjected to pruning pretreatment.
Optionally, the pruning pre-processing module 820 is further adapted to: performing data analysis on the log data of the plurality of services to obtain the association frequency among the ID data; aiming at any ID data in an ID data network, carrying out pruning pretreatment on the incidence relation between the ID data and other ID data according to the quantity of other ID data directly correlated with the ID data and/or the incidence frequency between the ID data and other ID data; and obtaining the ID data network after pruning pretreatment.
Optionally, the pruning pre-processing module 820 is further adapted to: performing data analysis on the log data of the plurality of services, and calculating the actual association frequency among the ID data; and calculating the association frequency among the ID data according to the actual association frequency among the ID data, the time information of the log data corresponding to the ID data and the time weight. The pruning pre-processing module 820 is further adapted to: judging whether the number of other ID data directly associated with the ID data is greater than a first threshold and the association frequency between the ID data and any other ID data is less than or equal to a second threshold; and if so, removing the association relationship between the ID data and that other ID data. The pruning pre-processing module 820 is further adapted to: judging whether the number of other ID data directly associated with the ID data is greater than a third threshold and the sum of the association frequencies between the ID data and each other ID data is greater than or equal to a fourth threshold; and if so, removing the association relationship between the ID data and each other ID data. The pruning pre-processing module 820 is further adapted to: judging whether the sum of the association frequencies between the ID data and the other ID data is greater than or equal to a fifth threshold; and if so, removing the association relationship between the ID data and each other ID data.
According to the ID data network pruning preprocessing device provided by the embodiment, data analysis is carried out on log data of a plurality of services, association frequency among ID data is obtained quickly, pruning preprocessing is carried out on association relations between the ID data and other ID data according to the number of other ID data directly associated with the ID data and/or the association frequency between the ID data and other ID data aiming at any ID data in an ID data network, and unreliable association relations among the ID data in the ID data network are effectively and quickly removed, so that the association relations among the ID data in the ID data network after pruning preprocessing are strong and reliable association relations, the accuracy of ID data network processing can be improved, and the data quantity of data analysis can be reduced. Optionally, a corresponding time weight is introduced into the log data, the actual association frequency between the ID data is attenuated through the time weight, and a numerical value obtained after attenuation is used as the association frequency between the ID data, so that the real association degree between the ID data in the current period is accurately reflected, and the method has a high reference value and is beneficial to accurately performing pruning pretreatment on the ID data network.
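The time-weighted association frequency and the first pruning rule described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the exponential half-life decay used as the time weight, the function names `decayed_frequency` and `prune`, and the parameters `half_life_days`, `max_degree` and `min_freq` are hypothetical and are not specified by this embodiment.

```python
from collections import defaultdict


def decayed_frequency(events, now, half_life_days=30.0):
    """Sum time-weighted co-occurrences per ID pair.

    events: iterable of (id_a, id_b, day) taken from service logs, with id_a != id_b.
    The exponential half-life weight is an assumption; the embodiment only
    says that a time weight attenuates the actual association frequency.
    """
    freq = defaultdict(float)
    for a, b, day in events:
        weight = 0.5 ** ((now - day) / half_life_days)
        freq[frozenset((a, b))] += weight
    return dict(freq)


def prune(freq, max_degree, min_freq):
    """Remove weak associations of IDs that have too many direct neighbours.

    max_degree plays the role of the first threshold and min_freq the role
    of the second threshold; degrees are computed once on the unpruned network.
    """
    neighbours = defaultdict(set)
    for edge in freq:
        a, b = tuple(edge)
        neighbours[a].add(b)
        neighbours[b].add(a)
    kept = {}
    for edge, f in freq.items():
        a, b = tuple(edge)
        crowded = len(neighbours[a]) > max_degree or len(neighbours[b]) > max_degree
        if crowded and f <= min_freq:
            continue  # unreliable association: drop it
        kept[edge] = f
    return kept


events = [
    ("u1", "d1", 100.0), ("u1", "d1", 100.0), ("u1", "d1", 100.0),
    ("u1", "d2", 100.0),
    ("u1", "d3", 70.0),
    ("u2", "d4", 100.0),
]
freq = decayed_frequency(events, now=100.0)
kept = prune(freq, max_degree=2, min_freq=1.0)
```

In this example the ID `u1` is directly associated with three other IDs, so only its strong association with `d1` survives, while the association between `u2` and `d4` is untouched because neither ID exceeds the degree threshold.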
Fig. 9 is a block diagram illustrating a structure of an ID data network data analysis apparatus according to an embodiment of the present invention. As shown in fig. 9, the apparatus includes: an obtaining module 910, a first construction module 920 and an alignment combination module 930.
The obtaining module 910 is adapted to: acquiring an ID data network containing ID data and an association relation between the ID data; the ID data includes: user ID data and/or device ID data.
The first construction module 920 is adapted to: constructing ID relationship data according to the ID data contained in the ID data network and the association relationships between the ID data; the ID relationship data includes a number of ID relationship pairs.
Optionally, the alignment combination module 930 is further adapted to: copying the ID relation data to the memory in full; and comparing and combining the ID relation data with the ID relation data copied into the memory in full quantity, and integrating data according to a comparison and combination result to obtain a plurality of ID data subnets.
Optionally, the alignment combination module 930 is further adapted to: dividing ID relation data into a plurality of fragments; comparing and combining the plurality of fragments with the ID relation data copied to the memory in full to obtain comparison and combination results of all the fragments; and performing data integration on the comparison combination result of all the fragments to obtain a plurality of ID data subnets. The alignment combination module 930 is further adapted to: aiming at any fragment, comparing and combining the fragment and ID relation data copied to the memory in full quantity to obtain a comparison and combination intermediate result of the fragment; and (3) iteratively executing the step until a preset iteration condition is met: dividing the comparison combination intermediate result of all the fragments into a plurality of intermediate sub-fragments, and comparing and combining the plurality of intermediate sub-fragments with the ID relation data copied to the memory in full in parallel to obtain the comparison combination intermediate result of all the fragments in the next iteration operation; and after the iteration process is finished, obtaining comparison combination results of all the fragments. Wherein the preset iteration condition comprises the following steps: the iteration times reach the preset iteration times.
According to the ID data network data analysis device provided in this embodiment, ID relation data can be constructed based on the ID data included in the ID data network and the association relationship between the ID data, then the ID relation data and the ID relation data copied to the memory in full are compared and combined, and data integration is performed according to the comparison and combination result, so that a plurality of ID data subnets are obtained accurately and quickly, thereby realizing effective division of the ID data network. Optionally, the ID relationship data may be further divided into a plurality of fragments, and the fragments are compared and combined with the ID relationship data copied to the memory in full, so as to further improve the data analysis efficiency of the ID data network. Compared with an ID data network, the ID data contained in the ID data subnet has stronger and reliable incidence relation and can be identified as the ID data of the same user, and the user characteristics can be accurately and quickly analyzed on the basis of the ID data subnet so as to construct a complete and effective user portrait.
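The fragment-based comparison and combination described above can be sketched as follows. The embodiment does not state what the comparison and combination operation computes, so this sketch assumes a minimum-label propagation: every ID starts with itself as its label, each fragment of (ID, label) records is compared against the full in-memory copy of the ID relation data, and data integration keeps the smallest label per ID. All function names, the early-stop check and the parameters `n_shards` and `max_iters` are hypothetical.

```python
from collections import defaultdict


def build_index(pairs):
    """Full in-memory copy of the ID relation data: ID -> directly related IDs."""
    index = defaultdict(set)
    for a, b in pairs:
        index[a].add(b)
        index[b].add(a)
    return index


def split(records, n_shards):
    """Divide the records into n_shards fragments."""
    return [records[i::n_shards] for i in range(n_shards)]


def compare_and_combine(shard, index):
    """Compare one fragment with the full copy and pass labels to neighbours."""
    out = []
    for node, label in shard:
        out.append((node, label))
        for neighbour in index[node]:
            out.append((neighbour, label))
    return out


def integrate(records):
    """Data integration: keep the smallest label seen for each ID."""
    best = {}
    for node, label in records:
        if node not in best or label < best[node]:
            best[node] = label
    return best


def find_subnets(pairs, n_shards=2, max_iters=10):
    index = build_index(pairs)
    labels = {node: node for node in index}
    for _ in range(max_iters):  # preset number of iterations
        merged = []
        for shard in split(sorted(labels.items()), n_shards):
            # Each fragment is independent, so this loop could run in parallel.
            merged.extend(compare_and_combine(shard, index))
        new_labels = integrate(merged)
        if new_labels == labels:  # assumed early stop, not stated in the source
            break
        labels = new_labels
    subnets = defaultdict(set)
    for node, label in labels.items():
        subnets[label].add(node)
    return sorted(sorted(s) for s in subnets.values())


subnets = find_subnets([("a", "b"), ("b", "c"), ("d", "e")])
```

Because the relation data is fully available in memory for every fragment, each fragment can be processed without exchanging data with other fragments, which is the property the sketch is meant to illustrate.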
Fig. 10 is a block diagram showing a configuration of an ID data network data analysis apparatus according to another embodiment of the present invention, which includes, as shown in fig. 10: an acquisition module 1010, a second construction module 1020, and a grouping module 1030.
The acquisition module 1010 is adapted to: acquiring an ID data network containing ID data and an association relation between the ID data; the ID data includes: user ID data and/or device ID data.
The second construction module 1020 is adapted to: constructing ID relationship data according to the ID data contained in the ID data network and the association relationships between the ID data; the ID relationship data includes a plurality of ID relationship pairs, each ID relationship pair including: two IDs and the relationship between the two IDs.
The grouping module 1030 is adapted to: and grouping the ID relation data to obtain a plurality of ID data subnets.
Optionally, the grouping module 1030 is further adapted to: performing directed forward-order and directed reverse-order processing on each ID relationship pair to obtain two ID directed relationship pairs corresponding to each ID relationship pair; determining the primary key ID in any ID directed relationship pair according to a preset rule; and grouping all ID directed relationship pairs by using a primary key ID grouping method, and obtaining a plurality of ID data subnets according to the grouping result. The grouping module 1030 is further adapted to: setting a relationship bit for each ID directed relationship pair, wherein the two ID directed relationship pairs corresponding to the same ID relationship pair have the same relationship bit, and ID directed relationship pairs corresponding to different ID relationship pairs have different relationship bits; grouping all ID directed relationship pairs by using a primary key ID grouping method to obtain a plurality of first groups; for any first group, determining the count bit of the first group according to the number of ID directed relationship pairs contained in the first group; extracting at least one first group whose count bit is a first count value, and combining the ID directed relationship pairs contained in the extracted at least one first group according to the relationship bits to obtain at least one first ID data subnet; the number of ID data included in the first ID data subnet is 2.
Optionally, the grouping module 1030 is further adapted to: extracting at least one first group whose count bit is a second count value; for any extracted first group, obtaining an ID directed relationship group corresponding to the first group according to the ID directed relationship pairs contained in the first group; each ID directed relationship group contains: three IDs and the relationships between the three IDs; determining the primary key ID in any ID directed relationship group according to a preset rule; setting a relationship bit for each ID directed relationship group, wherein the ID directed relationship groups corresponding to the same first group have the same relationship bit, and the ID directed relationship groups corresponding to different first groups have different relationship bits; grouping all the ID directed relationship groups by using a primary key ID grouping method to obtain a plurality of second groups; for any second group, determining the count bit of the second group according to the number of ID directed relationship groups contained in the second group; extracting at least one second group whose count bit is a third count value, and combining the ID directed relationship groups contained in the extracted at least one second group according to the relationship bits to obtain at least one second ID data subnet; the number of ID data contained in the second ID data subnet is 3.
According to the ID data network data analysis device provided in this embodiment, ID relationship data can be constructed based on ID data included in an ID data network and an association relationship between the ID data, then two ID directed relationship pairs corresponding to each ID relationship pair in the ID relationship data are obtained through directed forward and reverse processing, and then all the ID directed relationship pairs are grouped by using a primary key ID grouping method, so that the ID data network data analysis efficiency is effectively improved, a plurality of ID data subnets can be accurately and quickly obtained, and thus, the ID data network can be effectively divided. Optionally, the first ID data subnet and the second ID data subnet can be conveniently and quickly obtained by using the obtained count bits of the packet and the relationship bits set for the ID directed relationship pair and the ID directed relationship group.
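The extraction of first ID data subnets (subnets containing exactly 2 IDs) can be sketched as follows. The sketch makes several assumptions that the embodiment leaves open: the preset rule takes the first ID of each directed pair as the primary key ID, the relationship bit is the index of the undirected ID relationship pair, the first count value is 1, and the input pairs are de-duplicated with two distinct IDs each. The function name `two_id_subnets` is hypothetical.

```python
from collections import defaultdict


def two_id_subnets(pairs):
    """Return subnets that consist of exactly two IDs.

    pairs: de-duplicated list of undirected ID relationship pairs (a, b), a != b.
    """
    # Directed forward-order and reverse-order processing; both directed
    # pairs of one relationship pair share the same relationship bit.
    directed = []
    for rel_bit, (a, b) in enumerate(pairs):
        directed.append((a, b, rel_bit))  # forward order, primary key a
        directed.append((b, a, rel_bit))  # reverse order, primary key b

    # Group by primary key ID to obtain the first groups.
    groups = defaultdict(list)
    for key, other, rel_bit in directed:
        groups[key].append((key, other, rel_bit))

    # Count bit = number of directed pairs in the group; extract count bit 1,
    # i.e. primary key IDs that are related to exactly one other ID.
    extracted = [group[0] for group in groups.values() if len(group) == 1]

    # Combine by relationship bit: when both directions of the same pair were
    # extracted, the two IDs are related only to each other.
    by_rel = defaultdict(list)
    for key, _other, rel_bit in extracted:
        by_rel[rel_bit].append(key)
    return sorted(tuple(sorted(keys)) for keys in by_rel.values() if len(keys) == 2)


result = two_id_subnets([("u1", "d1"), ("u2", "d2"), ("u2", "d3")])
```

Here `u1` and `d1` are related only to each other and form a first ID data subnet, whereas `u2` has two related IDs, so its first group has a count bit of 2 and is left for the later processing steps.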
Fig. 11 is a block diagram showing a configuration of an ID data subnet processing apparatus according to an embodiment of the present invention, which includes, as shown in fig. 11: a calculation module 1110, an extraction module 1120, and a cluster segmentation module 1130.
The calculation module 1110 is adapted to: the number of ID data included in each of the plurality of ID data subnets is calculated.
The extraction module 1120 is adapted to: and extracting the ID data subnets with the number exceeding a first preset number threshold value.
The cluster segmentation module 1130 is adapted to: for any ID data subnet with the number of the contained ID data larger than a first preset number threshold, clustering and dividing the ID data in the ID data subnet to obtain a plurality of third ID data subnets corresponding to the ID data subnet; the number of ID data contained in the third ID data subnet is less than or equal to a second preset number threshold.
Optionally, the cluster segmentation module 1130 is further adapted to: aiming at any ID data in the ID data subnet, calculating the distance between the ID data and other ID data according to the association frequency between the ID data and other ID data; clustering the ID data in the ID data subnets according to the distance between the ID data in the ID data subnets and a preset clustering rule to obtain a plurality of clustering sets; and according to a plurality of cluster sets, dividing the ID data subnet to obtain a plurality of third ID data subnets corresponding to the ID data subnet.
Optionally, the apparatus further comprises: the association frequency determining module 1140 is adapted to, for any ID data subnet containing ID data whose number is greater than the first preset number threshold, perform data analysis on log data of multiple services corresponding to the ID data subnet, and determine an association frequency between ID data in the ID data subnet. The association frequency determination module 1140 is further adapted to: performing data analysis on log data of a plurality of services corresponding to the ID data subnet, and calculating the actual association frequency between the ID data in the ID data subnet; and calculating the association frequency among the ID data according to the actual association frequency among the ID data, the time information of the log data corresponding to the ID data and the time weight.
Optionally, the cluster segmentation module 1130 is further adapted to: determining a plurality of core ID data from the ID data in the ID data subnet according to the distance between the ID data in the ID data subnet and a preset neighborhood radius; and aiming at any core ID data, searching other ID data in the ID data subnet within the preset neighborhood radius of the core ID data, and clustering the core ID data and the searched other ID data according to a second preset quantity threshold to obtain a cluster set. The cluster segmentation module 1130 is further adapted to: in the ID data subnet, aiming at any cluster set, removing the incidence relation between the ID data in the cluster set and the ID data outside the cluster set; and obtaining a plurality of third ID data subnets corresponding to the ID data subnets.
According to the ID data subnet processing apparatus provided in this embodiment, for any ID data subnet in which the number of ID data included exceeds the first preset number threshold, ID data having a stronger and more reliable association relationship in the ID data subnet can be grouped into one type according to the association frequency between ID data and the preset clustering rule, and divided into the same third ID data subnet, so as to obtain a plurality of corresponding third ID data subnets, thereby implementing effective processing on the ID data subnet. Compared with the ID data subnet before processing, the ID data in the third ID data subnet has stronger and more reliable incidence relation and can be identified as the ID data of the same user, and the user characteristics can be accurately and effectively analyzed based on the third ID data subnet so as to construct a complete and effective user portrait. And the data volume of the third ID data subnet is far smaller than that of the ID data subnet before processing, so that the user characteristic analysis is more convenient, and the analysis efficiency is improved.
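The clustering segmentation can be sketched as follows. The embodiment does not define the distance function, the criterion for core ID data, or the preset clustering rule, so this sketch assumes that the distance is the reciprocal of the association frequency, that an ID is a core ID when at least `min_neighbours` other IDs lie within the neighborhood radius, and that each cluster set takes the nearest unassigned IDs up to `max_size` (the second preset number threshold). These choices and all names are hypothetical.

```python
from collections import defaultdict


def segment(freq, radius, min_neighbours, max_size):
    """Split one oversized ID data subnet into smaller third ID data subnets.

    freq: dict mapping frozenset({a, b}) to the association frequency (> 0).
    Returns (cluster sets, association relationships kept inside cluster sets).
    """
    # Assumed distance: reciprocal of the association frequency.
    near = defaultdict(dict)
    for edge, f in freq.items():
        a, b = tuple(edge)
        d = 1.0 / f
        if d <= radius:
            near[a][b] = d
            near[b][a] = d

    ids = sorted({i for edge in freq for i in edge})
    cores = [i for i in ids if len(near[i]) >= min_neighbours]

    assigned, clusters = {}, []
    for core in cores:
        if core in assigned:
            continue
        members = [core]
        # Take the nearest IDs first, capped by the size threshold.
        for other in sorted(near[core], key=lambda o: (near[core][o], o)):
            if len(members) >= max_size:
                break
            if other not in assigned:
                members.append(other)
        for m in members:
            assigned[m] = len(clusters)
        clusters.append(set(members))

    # IDs not reached by any core ID become singleton cluster sets (assumption).
    for i in ids:
        if i not in assigned:
            assigned[i] = len(clusters)
            clusters.append({i})

    # Remove association relationships that cross cluster set boundaries.
    kept = {e: f for e, f in freq.items() if len({assigned[i] for i in e}) == 1}
    return clusters, kept


freq = {
    frozenset(("u1", "d1")): 10.0,
    frozenset(("u1", "d2")): 8.0,
    frozenset(("d2", "u2")): 1.0,
    frozenset(("u2", "d3")): 10.0,
    frozenset(("u2", "d4")): 5.0,
}
clusters, kept = segment(freq, radius=0.25, min_neighbours=2, max_size=3)
```

In this example the weak association between `d2` and `u2` is the only one that crosses cluster set boundaries, so removing it splits the subnet into two third ID data subnets of three IDs each.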
The invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores at least one executable instruction, and the executable instruction can execute the ID data network data analysis method in any method embodiment.
Fig. 12 is a schematic structural diagram of a computing device according to an embodiment of the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the computing device.
As shown in fig. 12, the computing device may include: a processor 1202, a communication interface 1204, a memory 1206, and a communication bus 1208.
Wherein:
the processor 1202, communication interface 1204, and memory 1206 communicate with one another via a communication bus 1208.
The communication interface 1204 is used for communicating with network elements of other devices, such as clients or other servers.
The processor 1202 is configured to execute the program 1210, and may specifically execute relevant steps in the ID data network data analysis method embodiment described above.
In particular, program 1210 may include program code comprising computer operating instructions.
The processor 1202 may be a central processing unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the present invention. The computing device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
The memory 1206 is used for storing the program 1210. The memory 1206 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.
The program 1210 may be specifically configured to enable the processor 1202 to execute the ID data network data analysis method in any of the above-described method embodiments. The specific implementation of each step in the program 1210 may refer to the corresponding steps and corresponding descriptions in the units in the ID data network data analysis embodiment, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components in accordance with embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering. These words may be interpreted as names.
Claims (6)
1. A method of ID data network data analysis, the method comprising:
acquiring an ID data network containing ID data and an association relation between the ID data; the ID data includes: user ID data and/or device ID data;
according to the ID data contained in the ID data network and the incidence relation between the ID data, ID relation data are constructed; the ID relationship data includes a plurality of ID relationship pairs, each ID relationship pair including: two IDs and the relationship between the two IDs;
performing directed forward order and directed reverse order processing on each ID relationship pair to obtain two ID directed relationship pairs corresponding to each ID relationship pair; determining the ID of the primary key in any ID directed relationship pair according to a preset rule;
grouping all ID directed relation pairs by using a primary key ID grouping method, and obtaining a plurality of ID data subnets according to a grouping result;
setting a relationship bit for each ID directed relationship pair; the relationship bits of two ID directed relationship pairs corresponding to the same ID relationship pair are the same, and the relationship bits of the ID directed relationship pairs corresponding to different ID relationship pairs are different;
wherein the grouping all ID directed relationship pairs by using a primary key ID grouping method, and obtaining a plurality of ID data subnets according to the grouping result further comprises: grouping all ID directed relationship pairs by using a primary key ID grouping method to obtain a plurality of first groups; for any first group, determining the count bit of the first group according to the number of ID directed relationship pairs contained in the first group; extracting at least one first group whose count bit is a first count value, and combining the ID directed relationship pairs contained in the extracted at least one first group according to the relationship bits to obtain at least one first ID data subnet; the number of ID data included in the first ID data subnet is 2.
2. The method of claim 1, wherein the grouping all ID-directed relationship pairs using a primary key ID-based grouping method, and deriving a plurality of ID data subnets from the grouping result further comprises:
extracting at least one first group whose count bit is a second count value;
for any extracted first group, obtaining an ID directed relationship group corresponding to the first group according to the ID directed relationship pairs contained in the first group; each ID directed relationship group contains: three IDs and the relationships between the three IDs; determining the primary key ID in any ID directed relationship group according to a preset rule;
setting a relationship bit for each ID directed relationship group; the ID directed relationship groups corresponding to the same first group have the same relationship bit, and the ID directed relationship groups corresponding to different first groups have different relationship bits;
grouping all the ID directed relationship groups by using a primary key ID grouping method to obtain a plurality of second groups;
for any second group, determining the count bit of the second group according to the number of ID directed relationship groups contained in the second group;
extracting at least one second group whose count bit is a third count value, and combining the ID directed relationship groups contained in the extracted at least one second group according to the relationship bits to obtain at least one second ID data subnet; the number of ID data included in the second ID data subnet is 3.
3. An ID data network data analysis apparatus, the apparatus comprising:
the acquisition module is suitable for acquiring an ID data network containing the ID data and the incidence relation among the ID data; the ID data includes: user ID data and/or device ID data;
the second construction module is suitable for constructing ID relation data according to the ID data contained in the ID data network and the incidence relation among the ID data; the ID relationship data includes a plurality of ID relationship pairs, each ID relationship pair including: two IDs and the relationship between the two IDs;
the grouping module is adapted to: performing directed forward-order and directed reverse-order processing on each ID relationship pair to obtain two ID directed relationship pairs corresponding to each ID relationship pair; determining the primary key ID in any ID directed relationship pair according to a preset rule; setting a relationship bit for each ID directed relationship pair; the two ID directed relationship pairs corresponding to the same ID relationship pair have the same relationship bit, and the ID directed relationship pairs corresponding to different ID relationship pairs have different relationship bits; grouping all ID directed relationship pairs by using a primary key ID grouping method to obtain a plurality of first groups; for any first group, determining the count bit of the first group according to the number of ID directed relationship pairs contained in the first group; extracting at least one first group whose count bit is a first count value, and combining the ID directed relationship pairs contained in the extracted at least one first group according to the relationship bits to obtain at least one first ID data subnet; the number of ID data included in the first ID data subnet is 2.
4. The apparatus of claim 3, wherein the grouping module is further adapted to:
extracting at least one first group whose count bit is a second count value;
for any extracted first group, obtaining an ID directed relationship group corresponding to the first group according to the ID directed relationship pairs contained in the first group; each ID directed relationship group contains: three IDs and the relationships between the three IDs; determining the primary key ID in any ID directed relationship group according to a preset rule;
setting a relationship bit for each ID directed relationship group; the ID directed relationship groups corresponding to the same first group have the same relationship bit, and the ID directed relationship groups corresponding to different first groups have different relationship bits;
grouping all the ID directed relationship groups by using a primary key ID grouping method to obtain a plurality of second groups;
for any second group, determining the count bit of the second group according to the number of ID directed relationship groups contained in the second group;
extracting at least one second group whose count bit is a third count value, and combining the ID directed relationship groups contained in the extracted at least one second group according to the relationship bits to obtain at least one second ID data subnet; the number of ID data included in the second ID data subnet is 3.
5. A computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the ID data network data analysis method of any one of claims 1-2.
6. A computer storage medium having at least one executable instruction stored therein, the executable instruction causing a processor to perform operations corresponding to the ID data network data analysis method according to any one of claims 1-2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810973827.9A CN109241419B (en) | 2018-08-24 | 2018-08-24 | ID data network data analysis method and device and computing equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109241419A (en) | 2019-01-18 |
CN109241419B (en) | 2021-06-29 |
Family
ID=65067877
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201810973827.9A | Active | CN109241419B (en) | 2018-08-24 | 2018-08-24 | ID data network data analysis method and device and computing equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109241419B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105224606A (en) * | 2015-09-02 | 2016-01-06 | 新浪网技术(中国)有限公司 | A kind of disposal route of user ID and device |
WO2016029178A1 (en) * | 2014-08-22 | 2016-02-25 | Adelphic, Inc. | Audience on networked devices |
Also Published As
Publication number | Publication date |
---|---|
CN109241419A (en) | 2019-01-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110222170B (en) | Method, device, storage medium and computer equipment for identifying sensitive data | |
CN107423613B (en) | Method and device for determining device fingerprint according to similarity and server | |
CN110209660B (en) | Cheating group mining method and device and electronic equipment | |
CN110046929B (en) | Fraudulent party identification method and device, readable storage medium and terminal equipment | |
CN106682906B (en) | Risk identification and service processing method and equipment | |
CN106921504B (en) | Method and equipment for determining associated paths of different users | |
CN106302104B (en) | User relationship identification method and device | |
KR102086936B1 (en) | User data sharing method and device | |
CN111090807B (en) | Knowledge graph-based user identification method and device | |
CN111062013A (en) | Account filtering method and device, electronic equipment and machine-readable storage medium | |
CN109376362A (en) | A kind of the determination method and relevant device of corrected text | |
CN106909619B (en) | Hybrid social network clustering method and system based on offset adjustment and bidding | |
CN109241421B (en) | ID data network processing method, device, computing equipment and computer storage medium | |
US11412063B2 (en) | Method and apparatus for setting mobile device identifier | |
CN109145588A (en) | Data processing method and device | |
CN109241419B (en) | ID data network data analysis method and device and computing equipment | |
CN109829099B (en) | ID data subnet processing method and device, computing equipment and computer storage medium | |
CN112241820A (en) | Risk identification method and device for key nodes in fund flow and computing equipment | |
CN107092650A (en) | A kind of Web Log Analysis method and device | |
CN112532414B (en) | Method, device, equipment and computer storage medium for determining ISP attribution | |
CN108154177B (en) | Service identification method, device, terminal equipment and storage medium | |
CN111159347B (en) | Article content quality data calculation method, calculation device and storage medium | |
CN115361231B (en) | Host abnormal flow detection method, system and equipment based on access baseline | |
CN111127064B (en) | Method and device for determining social attribute of user and electronic equipment | |
CN109299349B (en) | Application recommendation method and device, equipment and computer-readable storage medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||