Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In embodiments of the present invention, the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. In the description of the following examples, "plurality" means two or more unless specifically limited otherwise.
The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Example one
Fig. 1 is a flowchart of a data processing method according to a first embodiment of the present invention. This embodiment provides a data processing method that addresses the following problems of existing DPI systems: when collecting users' behavior characteristic data, such a system builds separate user behavior data for each user of each application, so that a large amount of redundant data is stored and panoramic user characteristic data cannot be formed. The method in this embodiment is applied to a deep packet inspection (DPI) device, which may be the computer device on which a DPI system runs. In other embodiments of the present invention, the method may also be applied to other computer devices; this embodiment takes a deep packet inspection device as an example. As shown in Fig. 1, the method comprises the following specific steps:
Step S101, extracting first identity characteristic data of a first user and of a second user from first user data and second user data to be processed, respectively, wherein the first identity characteristic data comprises at least one type of identity information for uniquely identifying a user principal.
In this embodiment, the first user data and the second user data are user behavior data acquired by the DPI system for two different user accounts of one application, or user behavior data acquired for two different user accounts of two different applications.
In practical application, the first user data and the second user data to be processed may be specified by a technician by specifying an application identifier and a user registration account, or may be user data corresponding to any two user registration accounts in the obtained user data by the DPI system, which is not specifically limited in this embodiment.
The first identity characteristic data comprises at least one type of identity information for uniquely identifying a user principal. Such identity information may include at least: an identity card number, a mobile phone number, an email address, and the like.
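The extraction in step S101 can be illustrated with a minimal sketch. It assumes each piece of user data is a flat dictionary. The field names `id_card_number`, `mobile_number`, and `email` are hypothetical, since the description names only the kinds of identity information, and the whitespace and case normalization is likewise an assumption.

```python
# Hypothetical field names; the description lists only the kinds of identity
# information (identity card number, mobile phone number, email).
FIRST_IDENTITY_FIELDS = ("id_card_number", "mobile_number", "email")


def extract_first_identity(user_data):
    """Return the uniquely identifying fields present in one user-data record."""
    features = {}
    for field in FIRST_IDENTITY_FIELDS:
        value = user_data.get(field)
        if value:  # skip missing or empty values
            # Assumed normalization so that trivially different spellings compare equal.
            features[field] = str(value).strip().lower()
    return features
```

A record that carries only an email address therefore yields a one-entry dictionary, and a record with no identifying fields yields an empty one.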
Step S102, determining whether the first user and the second user belong to the same user principal according to the first identity characteristic data of the first user and the second user.
The first identity characteristic data of each user comprises at least one type of identity information for uniquely identifying a user principal. Therefore, if the first identity characteristic data of the first user and that of the second user both include at least one common type of such identity information, and the values of any such common type are consistent, it can be determined that the first user and the second user belong to the same user principal.
If the first identity characteristic data of the first user and that of the second user both include at least one common type of such identity information, but the values of a common type are inconsistent, it can be determined that the first user and the second user do not belong to the same user principal.
If the first identity characteristic data of the first user and that of the second user have no type of such identity information in common, the first identity characteristic data alone can neither establish that the two users belong to the same user principal nor establish that they do not.
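The three outcomes described for step S102 can be sketched as follows. The `Match` enumeration and the function name are hypothetical. The description does not say what happens when one shared type agrees while another disagrees, so this sketch assumes, as a labeled simplification, that any agreement wins.

```python
from enum import Enum


class Match(Enum):
    SAME = "same"                   # a shared identity type has consistent values
    DIFFERENT = "different"         # shared types exist, but none is consistent
    UNDETERMINED = "undetermined"   # no identity type in common


def compare_first_identity(first, second):
    """Compare two dicts of first identity characteristic data."""
    shared = set(first) & set(second)
    if not shared:
        return Match.UNDETERMINED
    # Assumption: agreement on any shared type takes precedence over disagreement.
    if any(first[key] == second[key] for key in shared):
        return Match.SAME
    return Match.DIFFERENT
```

Returning an explicit undetermined result, rather than a boolean, keeps the case where no conclusion can be drawn separate from a negative conclusion.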
Step S103, if it is determined that the first user and the second user belong to the same user principal, merging the first user data and the second user data.
Once the first user and the second user are determined to belong to the same user principal, the first user data and the second user data are merged.
Specifically, merging the first user data and the second user data includes:
generating a unified user data identifier corresponding to the first user data and the second user data, removing redundant information from the first user data and the second user data, and generating more comprehensive user data corresponding to that user data identifier.
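One possible reading of the merging step is sketched below. It assumes user data is a dictionary whose values are scalars or lists, uses a random UUID as the unified identifier, and treats "redundant information" as duplicate values under the same key. The key name `user_data_id` and all three choices are assumptions, not details from the description.

```python
import uuid


def merge_user_data(first, second):
    """Merge two user-data dicts under a newly generated unified identifier."""
    merged = {"user_data_id": uuid.uuid4().hex}  # hypothetical identifier scheme
    for record in (first, second):
        for key, value in record.items():
            if key == "user_data_id":
                continue  # any old identifier is superseded by the unified one
            values = value if isinstance(value, list) else [value]
            bucket = merged.setdefault(key, [])
            for item in values:
                if item not in bucket:  # drop values already recorded
                    bucket.append(item)
    return merged
```

Distinct values for the same key (for example two different registered accounts) are both kept, while a value that appears in both inputs is stored once.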
The embodiment of the invention respectively extracts first identity characteristic data of a first user and second user from first user data and second user data to be processed, wherein the first identity characteristic data comprises at least one type of identity information for uniquely identifying a user main body; determining whether the first user and the second user belong to the same user main body or not according to the first identity characteristic data of the first user and the second user; and if the first user and the second user belong to the same user main body, merging the first user data and the second user data, so that a plurality of user data of the same user main body are merged to form panoramic user characteristic data, and the data redundancy of the DPI system is reduced.
Example two
Fig. 2 is a flowchart of a data processing method according to a second embodiment of the present invention. On the basis of the first embodiment, in this embodiment, if it is not determined that the first user and the second user belong to the same user principal, second identity characteristic data of the first user and the second user are respectively extracted from the first user data and the second user data to be processed, where the second identity characteristic data includes at least: a home address, friend information, an association relationship, and behavior characteristic data. The similarity between the second identity characteristic data of the first user and the second user is calculated and compared with a first preset threshold. If the similarity is greater than the first preset threshold, it is determined that the first user and the second user belong to the same user principal, and the first user data and the second user data are merged. If the similarity is less than or equal to the first preset threshold, the similarity is compared with a second preset threshold, the second preset threshold being smaller than the first preset threshold; if the similarity is greater than the second preset threshold, an association relationship is established between the first user data and the second user data.
As shown in fig. 2, the method comprises the following specific steps:
step S201, extracting first identity feature data of the first user and the second user from the first user data and the second user data to be processed, respectively, where the first identity feature data includes at least one kind of identity information for uniquely identifying a user principal.
In this embodiment, the first user data and the second user data are user behavior data acquired by the DPI system for two different user accounts of one application, or user behavior data acquired for two different user accounts of two different applications.
In practical application, the first user data and the second user data to be processed may be specified by a technician by specifying an application identifier and a user registration account, or may be user data corresponding to any two user registration accounts in the obtained user data by the DPI system, which is not specifically limited in this embodiment.
The first identity characteristic data comprises at least one type of identity information for uniquely identifying a user principal. Such identity information may include at least: an identity card number, a mobile phone number, an email address, and the like.
Optionally, the first identity feature data of the first user and the first identity feature data of the second user may be extracted from the first user data and the second user data to be processed, respectively, and recorded in the data list.
Step S202, determining whether the first user and the second user belong to the same user subject according to the first identity characteristic data of the first user and the second user.
In this embodiment, whether the first user and the second user belong to the same user subject is determined according to the first identity feature data of the first user and the second user, which may be specifically implemented in the following manner:
determining whether any identical item of identity information exists in the first identity characteristic data of the first user and of the second user; and if such an identical item exists, determining that the first user and the second user belong to the same user principal.
The first identity characteristic data of each user comprises at least one type of identity information for uniquely identifying a user principal. Therefore, if the first identity characteristic data of the first user and that of the second user both include at least one common type of such identity information, and the values of any such common type are consistent, it can be determined that the first user and the second user belong to the same user principal.
If the first identity characteristic data of the first user and that of the second user both include at least one common type of such identity information, but the values of a common type are inconsistent, it can be determined that the first user and the second user do not belong to the same user principal.
If the first identity characteristic data of the first user and that of the second user have no type of such identity information in common, the first identity characteristic data alone can neither establish that the two users belong to the same user principal nor establish that they do not.
Step S203, if it is determined that the first user and the second user belong to the same user subject, merging the first user data and the second user data.
Once the first user and the second user are determined to belong to the same user principal, the first user data and the second user data are merged.
Specifically, merging the first user data and the second user data includes:
generating a unified user data identifier corresponding to the first user data and the second user data, removing redundant information from the first user data and the second user data, and generating more comprehensive user data corresponding to that user data identifier.
Step S204, if the first user and the second user are not determined to belong to the same user subject, second identity characteristic data of the first user and second user are respectively extracted from the first user data and second user data to be processed.
The second identity characteristic data comprises at least: a home address, friend information, an association relationship, and behavior characteristic data. The association relationship may be mobile phone contact information. Optionally, the second identity characteristic data may further include an instant messaging account and the like.
Optionally, if it is not determined that the first user and the second user belong to the same user subject, before extracting second identity feature data of the first user and the second user from the first user data and the second user data to be processed, the method further includes:
respectively extracting registered accounts of a first user and a second user from first user data and second user data to be processed; judging whether the registered accounts of the first user and the second user are consistent; and if the registered accounts of the first user and the second user are consistent, then executing the subsequent step of respectively extracting second identity characteristic data of the first user and the second user from the first user data and the second user data to be processed.
If the registered accounts of the first user and the second user are inconsistent, calculating the similarity of the registered accounts of the first user and the second user; judging whether the similarity of the registered accounts of the first user and the second user is greater than a third preset threshold value or not; and if the similarity of the registered accounts of the first user and the second user is greater than a third preset threshold, then executing a subsequent step of respectively extracting second identity characteristic data of the first user and the second user from the first user data and the second user data to be processed.
The registered accounts of the first user and the second user are two character strings, so their similarity is calculated as a similarity between character strings. For example, the two strings may be matched against each other, their longest matching substring determined, and the proportion that this longest substring accounts for calculated.
In addition, the third preset threshold may be set by a technician according to actual needs, and this embodiment is not specifically limited herein.
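The longest-matching-substring example above can be sketched with the standard library's `difflib.SequenceMatcher`. The description does not state what the proportion is taken relative to, so dividing by the length of the longer account is an assumption. The default value of `third_threshold` is a placeholder, not a value from the description.

```python
from difflib import SequenceMatcher


def account_similarity(first_account, second_account):
    """Length of the longest common substring divided by the longer account's length."""
    if not first_account or not second_account:
        return 0.0
    matcher = SequenceMatcher(None, first_account, second_account, autojunk=False)
    match = matcher.find_longest_match(0, len(first_account), 0, len(second_account))
    # Assumption: the "proportion" is measured against the longer string.
    return match.size / max(len(first_account), len(second_account))


def should_compare_second_identity(first_account, second_account, third_threshold=0.6):
    """Gate before step S204: identical accounts, or similarity above the third threshold."""
    if first_account == second_account:
        return True
    return account_similarity(first_account, second_account) > third_threshold
```

For the accounts `alice01` and `alice_2`, the longest common substring is `alice` (5 characters) and the longer account has 7 characters, giving a similarity of 5/7.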
Step S205, calculating the similarity between the second identity characteristic data of the first user and the second user; and comparing the similarity between the second identity characteristic data of the first user and the second user with the first preset threshold value.
In this embodiment, the similarity between the second identity characteristic data of the first user and the second user may be calculated by any existing method for computing the similarity between two users from their behavior data and attribute information, which is not specifically limited in this embodiment.
The first preset threshold may be set by a technician according to actual needs, and this embodiment is not specifically limited herein.
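Since the description leaves the similarity method open, the sketch below swaps in one simple option purely for illustration: the mean Jaccard similarity over the fields that both users have. The field names are hypothetical, and each field is assumed to hold a collection of hashable tokens (for example friend identifiers or contact numbers).

```python
# Hypothetical field names for the second identity characteristic data.
SECOND_IDENTITY_FIELDS = ("home_address", "friends", "contacts", "behaviors")


def jaccard(left, right):
    """Jaccard similarity of two collections; 0.0 when both are empty."""
    left, right = set(left), set(right)
    if not left and not right:
        return 0.0
    return len(left & right) / len(left | right)


def second_identity_similarity(first, second):
    """Mean Jaccard similarity over the fields present in both records."""
    scores = [
        jaccard(first[field], second[field])
        for field in SECOND_IDENTITY_FIELDS
        if field in first and field in second
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

With friends `{u1, u2, u3}` versus `{u2, u3, u4}` (similarity 0.5) and identical contact lists (similarity 1.0), the overall score is 0.75.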
Step S206, if the similarity between the second identity characteristic data of the first user and the second user is larger than a first preset threshold value, determining that the first user and the second user belong to the same user subject, and merging the first user data and the second user data.
In this embodiment, if the similarity between the second identity characteristic data of the first user and the second user is greater than the first preset threshold, it is indicated that the similarity between the second identity characteristic data of the first user and the second user is very high, and the first user and the second user may be considered to belong to the same user subject, and the first user data and the second user data are merged.
In addition, the process of merging the first user data and the second user data is the same as step S203, and details are not repeated here in this embodiment.
Step S207, if the similarity between the second identity characteristic data of the first user and the second user is less than or equal to the first preset threshold, comparing the similarity between the second identity characteristic data of the first user and the second user with a second preset threshold, where the second preset threshold is less than the first preset threshold.
The second preset threshold may be set by a technician according to actual needs, and this embodiment is not specifically limited here.
If the similarity between the second identity characteristic data of the first user and the second user is smaller than or equal to a second preset threshold, it is indicated that the association degree between the first user data and the second user data is small, the first user data and the second user data are not combined, and the association relationship between the first user data and the second user data is not required to be established.
Step S208, if the similarity between the second identity characteristic data of the first user and the second user is greater than a second preset threshold, establishing an association relationship between the first user data and the second user data.
If the similarity between the second identity characteristic data of the first user and the second user is greater than the second preset threshold, the existing user data is insufficient to determine that the first user and the second user belong to the same user principal, but the two users are closely associated. An association relationship is therefore established between the first user data and the second user data, so that once more important identity data of the first user and the second user is subsequently acquired, whether they belong to the same user principal can be determined more accurately, which improves the accuracy of merging user data.
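The two-threshold decision of steps S206 to S208 can be summarized in a short sketch. The function name, the returned labels, and the default threshold values are placeholders; the description says only that both thresholds are set by a technician and that the second is smaller than the first.

```python
def decide_action(similarity, first_threshold=0.9, second_threshold=0.6):
    """Map a second-identity similarity score to merge / associate / none."""
    if not second_threshold < first_threshold:
        raise ValueError("second_threshold must be smaller than first_threshold")
    if similarity > first_threshold:
        return "merge"       # step S206: treated as the same user principal
    if similarity > second_threshold:
        return "associate"   # step S208: record an association relationship only
    return "none"            # neither merged nor associated
```

Both comparisons are strict, so a score exactly equal to the first threshold falls into the association band, and a score exactly equal to the second threshold results in no action.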
According to this embodiment of the invention, when it is not determined that the first user and the second user belong to the same user principal, second identity characteristic data of the first user and the second user are respectively extracted from the first user data and the second user data to be processed, and the similarity between them is calculated. If the similarity is greater than a first preset threshold, it is determined that the first user and the second user belong to the same user principal, and the first user data and the second user data are merged. If the similarity is less than or equal to the first preset threshold and greater than a second preset threshold, an association relationship is established between the first user data and the second user data. In this way, the multiple pieces of user data of the same user principal are accurately identified and merged to form panoramic user characteristic data, and the overall data redundancy of the DPI system is reduced.
Example three
Fig. 3 is a schematic structural diagram of a data processing apparatus according to a third embodiment of the present invention. The data processing device provided by the embodiment of the invention can execute the processing flow provided by the embodiment of the data processing method. As shown in fig. 3, the apparatus 30 includes: a data extraction module 301, a determination module 302 and a processing module 303.
Specifically, the data extraction module 301 is configured to extract first identity feature data of the first user and first identity feature data of the second user from the first user data and the second user data to be processed, where the first identity feature data includes at least one type of identity information for uniquely identifying a user principal.
The determining module 302 is configured to determine whether the first user and the second user belong to the same user subject according to the first identity feature data of the first user and the second user.
The processing module 303 is configured to perform merging processing on the first user data and the second user data if it is determined that the first user and the second user belong to the same user main body.
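The division into modules 301 to 303 can be pictured with the following minimal sketch. The class and method names are hypothetical, the identity field names are assumed, and the merge is deliberately simplified to a dictionary update. It illustrates only how the three modules cooperate, not how the apparatus is actually built.

```python
class DataExtractionModule:  # corresponds to module 301
    FIELDS = ("id_card_number", "mobile_number", "email")  # hypothetical names

    def extract(self, user_data):
        return {f: user_data[f] for f in self.FIELDS if user_data.get(f)}


class DeterminationModule:  # corresponds to module 302
    def same_principal(self, first_identity, second_identity):
        shared = set(first_identity) & set(second_identity)
        return any(first_identity[k] == second_identity[k] for k in shared)


class ProcessingModule:  # corresponds to module 303
    def merge(self, first, second):
        merged = dict(first)
        merged.update(second)  # simplified: the second record's values win on conflict
        return merged


class DataProcessingApparatus:  # corresponds to apparatus 30
    def __init__(self):
        self.extractor = DataExtractionModule()
        self.determiner = DeterminationModule()
        self.processor = ProcessingModule()

    def process(self, first, second):
        """Return merged data, or None when the same principal is not determined."""
        first_identity = self.extractor.extract(first)
        second_identity = self.extractor.extract(second)
        if self.determiner.same_principal(first_identity, second_identity):
            return self.processor.merge(first, second)
        return None
```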
The apparatus provided in the embodiment of the present invention may be specifically configured to execute the method embodiment provided in the first embodiment, and specific functions are not described herein again.
The embodiment of the invention respectively extracts first identity characteristic data of a first user and second user from first user data and second user data to be processed, wherein the first identity characteristic data comprises at least one type of identity information for uniquely identifying a user main body; determining whether the first user and the second user belong to the same user main body or not according to the first identity characteristic data of the first user and the second user; and if the first user and the second user belong to the same user main body, merging the first user data and the second user data, so that a plurality of user data of the same user main body are merged to form panoramic user characteristic data, and the data redundancy of the DPI system is reduced.
Example four
On the basis of the third embodiment, in this embodiment, the processing module is further configured to:
if it is not determined that the first user and the second user belong to the same user principal, respectively extracting second identity characteristic data of the first user and the second user from the first user data and the second user data to be processed, wherein the second identity characteristic data comprises at least: a home address, friend information, an association relationship, and behavior characteristic data; calculating the similarity between the second identity characteristic data of the first user and the second user; comparing the similarity with a first preset threshold; and if the similarity is greater than the first preset threshold, determining that the first user and the second user belong to the same user principal, and merging the first user data and the second user data.
Optionally, the processing module is further configured to:
if the similarity between the second identity characteristic data of the first user and the second user is smaller than or equal to a first preset threshold, comparing the similarity between the second identity characteristic data of the first user and the second user with a second preset threshold, wherein the second preset threshold is smaller than the first preset threshold; and if the similarity between the second identity characteristic data of the first user and the second user is greater than a second preset threshold value, establishing an association relationship between the first user data and the second user data.
Optionally, the processing module is further configured to:
respectively extracting registered accounts of a first user and a second user from first user data and second user data to be processed; judging whether the registered accounts of the first user and the second user are consistent; and if the registered accounts of the first user and the second user are consistent, then executing the subsequent step of respectively extracting second identity characteristic data of the first user and the second user from the first user data and the second user data to be processed.
Optionally, the processing module is further configured to:
if the registered accounts of the first user and the second user are inconsistent, calculating the similarity of the registered accounts of the first user and the second user; judging whether the similarity of the registered accounts of the first user and the second user is greater than a third preset threshold value or not; and if the similarity of the registered accounts of the first user and the second user is greater than a third preset threshold, then executing a subsequent step of respectively extracting second identity characteristic data of the first user and the second user from the first user data and the second user data to be processed.
Optionally, the processing module is further configured to:
determining whether any identical item of identity information exists in the first identity characteristic data of the first user and of the second user; and if such an identical item exists, determining that the first user and the second user belong to the same user principal.
The apparatus provided in the embodiment of the present invention may be specifically configured to execute the method embodiment provided in the second embodiment, and specific functions are not described herein again.
According to this embodiment of the invention, when it is not determined that the first user and the second user belong to the same user principal, second identity characteristic data of the first user and the second user are respectively extracted from the first user data and the second user data to be processed, and the similarity between them is calculated. If the similarity is greater than a first preset threshold, it is determined that the first user and the second user belong to the same user principal, and the first user data and the second user data are merged. If the similarity is less than or equal to the first preset threshold and greater than a second preset threshold, an association relationship is established between the first user data and the second user data. In this way, the multiple pieces of user data of the same user principal are accurately identified and merged to form panoramic user characteristic data, and the overall data redundancy of the DPI system is reduced.
Example five
Fig. 4 is a schematic structural diagram of a deep packet inspection device according to a fifth embodiment of the present invention. As shown in fig. 4, the apparatus 40 includes: a processor 401, a memory 402, and computer programs stored on the memory 402 and executable by the processor 401.
The processor 401, when executing the computer program stored on the memory 402, implements the data processing method provided by any of the method embodiments described above.
The embodiment of the invention respectively extracts first identity characteristic data of a first user and second user from first user data and second user data to be processed, wherein the first identity characteristic data comprises at least one type of identity information for uniquely identifying a user main body; determining whether the first user and the second user belong to the same user main body or not according to the first identity characteristic data of the first user and the second user; and if the first user and the second user belong to the same user main body, merging the first user data and the second user data, so that a plurality of user data of the same user main body are merged to form panoramic user characteristic data, and the data redundancy of the DPI system is reduced.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the data processing method provided in any of the above method embodiments.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in practice. A plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working process of the device described above, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.