CN109088788B

CN109088788B - Data processing method, apparatus, device, and computer-readable storage medium

Info

Publication number: CN109088788B
Application number: CN201810752308.XA
Authority: CN
Inventors: 袁晓静; 翟京卿
Original assignee: China United Network Communications Group Co Ltd
Current assignee: China United Network Communications Group Co Ltd
Priority date: 2018-07-10
Filing date: 2018-07-10
Publication date: 2021-02-02
Anticipated expiration: 2038-07-10
Also published as: CN109088788A

Abstract

The present invention provides a data processing method, apparatus, device, and computer-readable storage medium. In the method, first identity feature data of a first user and of a second user are respectively extracted from first user data and second user data to be processed, where the first identity feature data includes at least one type of identity information for uniquely identifying a user subject. Whether the first user and the second user belong to the same user subject is determined according to their first identity feature data. If it is determined that they belong to the same user subject, the first user data and the second user data are merged, so that multiple pieces of user data of the same user subject are merged to form panoramic user characteristic data, which reduces the overall data redundancy of the DPI system.

Description

Data processing method, device, equipment and computer readable storage medium

Technical Field

The present invention relates to the field of information data processing technologies, and in particular, to a data processing method, apparatus, device, and computer readable storage medium.

Background

Deep Packet Inspection (DPI) is a packet-based technology for inspecting and controlling application-layer traffic. It performs in-depth inspection and analysis of the information at different layers of a data packet to obtain application-layer information for the whole data stream or data packet, and then performs statistical analysis and control of the traffic according to a policy defined by the DPI system.

With the development of big data and internet technology, various applications are entering people's lives. Because different applications impose no uniform requirements on user registration information, the same user may register with different applications under different user identifiers, and different users may register with different applications under the same user identifier. When an existing DPI system acquires a user's behavior data, it establishes separate user behavior data for each user of each application, so a large amount of redundant data is stored and panoramic user characteristic data cannot be formed.

Disclosure of Invention

The invention provides a data processing method, a data processing device, data processing equipment and a computer readable storage medium, which are used for solving the problems that when the prior DPI system acquires behavior characteristic data of users, user behavior characteristic data corresponding to each user is established for each application, a large amount of redundant data is stored, and panoramic user characteristic data cannot be formed.

One aspect of the present invention provides a data processing method, including:

respectively extracting first identity characteristic data of a first user and first identity characteristic data of a second user from first user data and second user data to be processed, wherein the first identity characteristic data comprises at least one type of identity information used for uniquely identifying a user main body;

determining whether the first user and the second user belong to the same user subject according to the first identity characteristic data of the first user and the second user;

and if the first user and the second user belong to the same user main body, merging the first user data and the second user data.

Another aspect of the present invention provides a data processing apparatus comprising:

the data extraction module is used for respectively extracting first identity characteristic data of a first user and first identity characteristic data of a second user from first user data and second user data to be processed, wherein the first identity characteristic data comprises at least one type of identity information used for uniquely identifying a user main body;

the determining module is used for determining whether the first user and the second user belong to the same user main body according to the first identity characteristic data of the first user and the second user;

a processing module, configured to perform merging processing on the first user data and the second user data if it is determined that the first user and the second user belong to the same user subject.

Another aspect of the present invention provides a deep packet inspection device, including:

a memory, a processor, and a computer program stored on the memory and executable on the processor,

the processor, when running the computer program, implements the method described above.

Another aspect of the present invention provides a computer-readable storage medium storing a computer program,

which when executed by a processor implements the method described above.

According to the data processing method, the data processing device, the data processing equipment and the computer readable storage medium, first identity characteristic data of a first user and first identity characteristic data of a second user are respectively extracted from first user data and second user data to be processed, wherein the first identity characteristic data comprise at least one type of identity information used for uniquely identifying a user main body; determining whether the first user and the second user belong to the same user main body or not according to the first identity characteristic data of the first user and the second user; and if the first user and the second user belong to the same user main body, merging the first user data and the second user data, so that a plurality of user data of the same user main body are merged to form panoramic user characteristic data, and the data redundancy of the DPI system is reduced.

Drawings

Fig. 1 is a flowchart of a data processing method according to a first embodiment of the present invention;

Fig. 2 is a flowchart of a data processing method according to a second embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a data processing apparatus according to a third embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a deep packet inspection device according to a fifth embodiment of the present invention.

The above drawings illustrate specific embodiments of the invention, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concept in any way, but rather to explain the concept to those skilled in the art with reference to specific embodiments.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

In embodiments of the present invention, the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. In the description of the following examples, "plurality" means two or more unless specifically limited otherwise.

The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.

Example one

Fig. 1 is a flowchart of a data processing method according to a first embodiment of the present invention. This embodiment provides a data processing method that addresses the following problem: when an existing DPI system acquires users' behavior characteristic data, it establishes separate user behavior data for each user of each application, so a large amount of redundant data is stored and panoramic user characteristic data cannot be formed. The method in this embodiment is applied to a deep packet inspection device, which may be the computer device on which a DPI system is located. In other embodiments of the present invention, the method may also be applied to other computer devices; this embodiment takes a deep packet inspection device as an example. As shown in Fig. 1, the method comprises the following specific steps:

step S101, first identity characteristic data of a first user and first identity characteristic data of a second user are respectively extracted from first user data and second user data to be processed, wherein the first identity characteristic data comprise at least one type of identity information used for uniquely identifying a user main body.

In this embodiment, the first user data and the second user data are user behavior data acquired by the DPI system for two different user accounts of one application, or user behavior data acquired for two different user accounts of two different applications.

In practical applications, the first user data and the second user data to be processed may be designated by a technician by specifying an application identifier and a user registration account, or may be the user data corresponding to any two user registration accounts among the user data that the DPI system has acquired, which is not specifically limited in this embodiment.

The first identity characteristic data comprises at least one type of identity information for uniquely identifying a user subject. Such identity information may include at least: an identity card number, a mobile phone number, an email address, and the like.
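The extraction in step S101 can be sketched in Python as follows. This is only an illustration under stated assumptions: the record layout, the field names `id_card_number`, `mobile_number` and `email`, and the normalisation of values are hypothetical and are not specified by this embodiment.

```python
from typing import Dict

# Hypothetical field names; the embodiment only lists the kinds of
# identity information (identity card number, mobile phone number, email).
UNIQUE_ID_FIELDS = ("id_card_number", "mobile_number", "email")


def extract_first_identity_features(user_data: Dict[str, object]) -> Dict[str, str]:
    """Return the uniquely identifying fields that are present in one user record."""
    features: Dict[str, str] = {}
    for field in UNIQUE_ID_FIELDS:
        value = user_data.get(field)
        if isinstance(value, str) and value.strip():
            # Assumption: trimming and lower-casing avoids spurious mismatches.
            features[field] = value.strip().lower()
    return features


# Hypothetical user record collected for one account of one application.
first_user_data = {
    "app": "app_a",
    "account": "zhang_wei88",
    "email": "User@Example.com ",
    "visited": ["news"],
}
first_features = extract_first_identity_features(first_user_data)
```

Fields that are absent or empty are simply omitted, so the result may contain anywhere from zero to three entries.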

And step S102, determining whether the first user and the second user belong to the same user main body according to the first identity characteristic data of the first user and the second user.

The first identity characteristic data of each user comprises at least one type of identity information for uniquely identifying a user subject. Therefore, if the first identity characteristic data of the first user and of the second user both include at least one common type of such identity information, and the values of any common type are consistent, it can be determined that the first user and the second user belong to the same user subject.

If the first identity characteristic data of the first user and of the second user both include at least one common type of such identity information, and the values of any common type are inconsistent, it can be determined that the first user and the second user do not belong to the same user subject.

If the first identity characteristic data of the first user and of the second user do not include any common type of identity information for uniquely identifying a user subject, the first identity characteristic data alone cannot establish either that the first user and the second user belong to the same user subject or that they do not.
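The three outcomes of step S102 can be sketched as a three-valued comparison. The names `SubjectMatch` and `compare_first_identity_features` are hypothetical. The embodiment does not say which outcome applies when one common type matches and another differs; the sketch assumes that a match takes precedence.

```python
from enum import Enum
from typing import Dict


class SubjectMatch(Enum):
    SAME = "same"
    DIFFERENT = "different"
    UNDETERMINED = "undetermined"


def compare_first_identity_features(first: Dict[str, str],
                                    second: Dict[str, str]) -> SubjectMatch:
    """Compare the identity information types that both users have in common."""
    shared_types = set(first) & set(second)
    if not shared_types:
        # No common type of identity information: nothing can be concluded.
        return SubjectMatch.UNDETERMINED
    # Assumption: one consistent value is enough, even if another type differs.
    if any(first[t] == second[t] for t in shared_types):
        return SubjectMatch.SAME
    return SubjectMatch.DIFFERENT
```

For example, two records that share only an `email` entry with the same value yield `SubjectMatch.SAME`, while a record with only an email and a record with only a mobile number yield `SubjectMatch.UNDETERMINED`.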

And step S103, if the first user and the second user belong to the same user main body, merging the first user data and the second user data.

And after determining that the first user and the second user belong to the same user main body, merging the first user data and the second user data.

Specifically, the merging the first user data and the second user data includes:

and generating a uniform user data identifier corresponding to the first user data and the second user data, removing redundant information in the first user data and the second user data, and generating more comprehensive user data corresponding to the user data identifier.
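A minimal sketch of such a merge is given below. It assumes, purely for illustration, that each piece of user data is a mapping from a field name to a list of values, that a random UUID is an acceptable unified user data identifier, and that "redundant information" means values already present in the merged record. None of these choices is prescribed by this embodiment.

```python
import uuid
from typing import Dict, List


def merge_user_data(first: Dict[str, List[str]],
                    second: Dict[str, List[str]]) -> Dict[str, object]:
    """Merge two user records under one newly generated identifier."""
    fields: Dict[str, List[str]] = {}
    for record in (first, second):
        for key, values in record.items():
            bucket = fields.setdefault(key, [])
            for value in values:
                if value not in bucket:  # drop values that are already present
                    bucket.append(value)
    # Hypothetical identifier scheme: a random 32-character hex string.
    return {"user_data_id": uuid.uuid4().hex, "fields": fields}


merged = merge_user_data(
    {"email": ["a@x.com"], "sites": ["news", "shop"]},
    {"email": ["a@x.com"], "sites": ["shop", "video"]},
)
```

In this example the duplicated email and the duplicated `shop` entry are stored once, and the merged record contains the union of the sites seen under both accounts.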

The embodiment of the invention respectively extracts first identity characteristic data of a first user and second user from first user data and second user data to be processed, wherein the first identity characteristic data comprises at least one type of identity information for uniquely identifying a user main body; determining whether the first user and the second user belong to the same user main body or not according to the first identity characteristic data of the first user and the second user; and if the first user and the second user belong to the same user main body, merging the first user data and the second user data, so that a plurality of user data of the same user main body are merged to form panoramic user characteristic data, and the data redundancy of the DPI system is reduced.

Example two

Fig. 2 is a flowchart of a data processing method according to a second embodiment of the present invention. On the basis of the first embodiment, in this embodiment, if it is not determined that the first user and the second user belong to the same user subject, second identity characteristic data of the first user and the second user are respectively extracted from the first user data and the second user data to be processed, where the second identity characteristic data at least includes: family address, friend information, association relation, and behavior characteristic data. The similarity between the second identity characteristic data of the first user and the second user is calculated and compared with a first preset threshold. If the similarity is greater than the first preset threshold, it is determined that the first user and the second user belong to the same user subject, and the first user data and the second user data are merged. If the similarity is less than or equal to the first preset threshold, it is compared with a second preset threshold, where the second preset threshold is smaller than the first preset threshold; if the similarity is greater than the second preset threshold, an association relationship is established between the first user data and the second user data.

As shown in fig. 2, the method comprises the following specific steps:

step S201, extracting first identity feature data of the first user and the second user from the first user data and the second user data to be processed, respectively, where the first identity feature data includes at least one kind of identity information for uniquely identifying a user principal.

Optionally, the first identity feature data of the first user and the first identity feature data of the second user may be extracted from the first user data and the second user data to be processed, respectively, and recorded in the data list.

Step S202, determining whether the first user and the second user belong to the same user subject according to the first identity characteristic data of the first user and the second user.

In this embodiment, whether the first user and the second user belong to the same user subject is determined according to the first identity feature data of the first user and the second user, which may be specifically implemented in the following manner:

judging whether any identity information in the first identity characteristic data of the first user is consistent with the corresponding identity information in the first identity characteristic data of the second user; and if any such identity information is consistent, determining that the first user and the second user belong to the same user subject.

Step S203, if it is determined that the first user and the second user belong to the same user subject, merging the first user data and the second user data.

Step S204, if the first user and the second user are not determined to belong to the same user subject, second identity characteristic data of the first user and second user are respectively extracted from the first user data and second user data to be processed.

Wherein the second identity characteristic data comprises at least: family address, friend information, association relation and behavior characteristic data. The association relationship may be mobile phone contact information. Optionally, the second identity characteristic data may further include an account number of the instant messaging tool, and the like.

Optionally, if it is not determined that the first user and the second user belong to the same user subject, before extracting second identity feature data of the first user and the second user from the first user data and the second user data to be processed, the method further includes:

respectively extracting registered accounts of a first user and a second user from first user data and second user data to be processed; judging whether the registered accounts of the first user and the second user are consistent; and if the registered accounts of the first user and the second user are consistent, then executing the subsequent step of respectively extracting second identity characteristic data of the first user and the second user from the first user data and the second user data to be processed.

If the registered accounts of the first user and the second user are inconsistent, calculating the similarity of the registered accounts of the first user and the second user; judging whether the similarity of the registered accounts of the first user and the second user is greater than a third preset threshold value or not; and if the similarity of the registered accounts of the first user and the second user is greater than a third preset threshold, then executing a subsequent step of respectively extracting second identity characteristic data of the first user and the second user from the first user data and the second user data to be processed.

The registered accounts of the first user and the second user are two character strings, so their similarity is calculated as the similarity between two character strings. For example, the two strings may be matched against each other, their longest matching substring determined, and the proportion that this longest substring accounts for calculated.

In addition, the third preset threshold may be set by a technician according to actual needs, and this embodiment is not specifically limited herein.
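The optional registered-account check can be sketched as follows, using `difflib.SequenceMatcher.find_longest_match` from the Python standard library to find the longest matching substring. Two points are assumptions: the proportion is taken relative to the longer of the two accounts, and the check returns `False` when the accounts are neither identical nor sufficiently similar. The embodiment does not state the denominator or what happens in that last case.

```python
from difflib import SequenceMatcher


def account_similarity(first_account: str, second_account: str) -> float:
    """Longest common substring length divided by the longer account length."""
    if not first_account or not second_account:
        return 0.0
    matcher = SequenceMatcher(None, first_account, second_account, autojunk=False)
    match = matcher.find_longest_match(0, len(first_account), 0, len(second_account))
    # Assumption: the proportion is measured against the longer string.
    return match.size / max(len(first_account), len(second_account))


def should_extract_second_features(first_account: str,
                                   second_account: str,
                                   third_threshold: float) -> bool:
    """Decide whether to go on to extract the second identity characteristic data."""
    if first_account == second_account:
        return True
    return account_similarity(first_account, second_account) > third_threshold
```

With the hypothetical accounts `zhang_wei88` and `zhang_wei`, the longest matching substring has 9 characters and the longer account has 11, giving a similarity of 9/11.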

Step S205, calculating the similarity between the second identity characteristic data of the first user and the second user; and comparing the similarity between the second identity characteristic data of the first user and the second user with the first preset threshold value.

In this embodiment, the similarity between the second identity characteristic data of the first user and the second user may be calculated by any existing method for calculating the similarity between two users according to their behavior data and attribute information, which is not specifically limited in this embodiment.
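Since the embodiment leaves the similarity measure open, the following sketch uses one common choice purely as an illustration: the Jaccard similarity of each field that both users have, averaged over those fields. The measure, the field names and the representation of each field as a set of strings are all assumptions, not the method of this embodiment.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def second_identity_similarity(first: Dict[str, Set[str]],
                               second: Dict[str, Set[str]]) -> float:
    """Average Jaccard similarity over the fields present for both users."""
    shared_fields = set(first) & set(second)
    if not shared_fields:
        return 0.0
    total = sum(jaccard(first[f], second[f]) for f in shared_fields)
    return total / len(shared_fields)


# Hypothetical second identity characteristic data for two accounts.
first_features = {"friends": {"li", "wang", "zhao"}, "family_address": {"beijing haidian"}}
second_features = {"friends": {"li", "wang", "chen"}, "family_address": {"beijing haidian"}}
similarity = second_identity_similarity(first_features, second_features)
```

Here the friend lists share 2 of 4 distinct names (0.5) and the addresses are identical (1.0), so the averaged similarity is 0.75.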

The first preset threshold may be set by a technician according to actual needs, and this embodiment is not specifically limited herein.

Step S206, if the similarity between the second identity characteristic data of the first user and the second user is larger than a first preset threshold value, determining that the first user and the second user belong to the same user subject, and merging the first user data and the second user data.

In this embodiment, if the similarity between the second identity characteristic data of the first user and the second user is greater than the first preset threshold, it is indicated that the similarity between the second identity characteristic data of the first user and the second user is very high, and the first user and the second user may be considered to belong to the same user subject, and the first user data and the second user data are merged.

In addition, the process of merging the first user data and the second user data is the same as step S203, and details are not repeated here in this embodiment.

Step S207, if the similarity between the second identity characteristic data of the first user and the second user is less than or equal to the first preset threshold, comparing the similarity between the second identity characteristic data of the first user and the second user with a second preset threshold, where the second preset threshold is less than the first preset threshold.

The second preset threshold may be set by a technician according to actual needs, and this embodiment is not specifically limited here.

If the similarity between the second identity characteristic data of the first user and the second user is smaller than or equal to a second preset threshold, it is indicated that the association degree between the first user data and the second user data is small, the first user data and the second user data are not combined, and the association relationship between the first user data and the second user data is not required to be established.

Step S208, if the similarity between the second identity characteristic data of the first user and the second user is greater than a second preset threshold, establishing an association relationship between the first user data and the second user data.

If the similarity between the second identity characteristic data of the first user and the second user is greater than the second preset threshold (and not greater than the first preset threshold), the existing user data of the first user and the second user is not sufficient to determine that they belong to the same user subject, but the association between the two users is strong. An association relationship is therefore established between the first user data and the second user data, so that once more important identity data of the first user and the second user is subsequently acquired, whether they belong to the same user subject can be determined more accurately, which improves the accuracy of merging user data.
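Steps S205 to S208 amount to a decision against two thresholds, which can be sketched as follows. The names `Action` and `decide` and the threshold values in the example are hypothetical, since the embodiment leaves the thresholds to the technician.

```python
from enum import Enum


class Action(Enum):
    MERGE = "merge"          # same user subject: merge the two pieces of user data
    ASSOCIATE = "associate"  # strong association: only record an association
    NONE = "none"            # weak association: leave both records untouched


def decide(similarity: float, first_threshold: float, second_threshold: float) -> Action:
    """Map a similarity value to the action described in steps S206 to S208."""
    if second_threshold >= first_threshold:
        raise ValueError("the second threshold must be smaller than the first")
    if similarity > first_threshold:
        return Action.MERGE
    if similarity > second_threshold:
        return Action.ASSOCIATE
    return Action.NONE


# Hypothetical threshold values chosen only for this example.
FIRST_THRESHOLD, SECOND_THRESHOLD = 0.8, 0.5
```

Note that a similarity exactly equal to the first threshold leads to an association rather than a merge, and a similarity exactly equal to the second threshold leads to no action, matching the "less than or equal to" wording of the steps.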

In this embodiment of the invention, when it is not determined that the first user and the second user belong to the same user subject, second identity characteristic data of the first user and the second user are respectively extracted from the first user data and the second user data to be processed, and the similarity between them is calculated. If the similarity is greater than a first preset threshold, it is determined that the first user and the second user belong to the same user subject, and the first user data and the second user data are merged. If the similarity is less than or equal to the first preset threshold and greater than a second preset threshold, an association relationship is established between the first user data and the second user data. In this way, the multiple user accounts of the same user subject can be determined accurately, and the multiple pieces of user data of that user subject are merged to form panoramic user characteristic data, which reduces the overall data redundancy of the DPI system.

Example three

Fig. 3 is a schematic structural diagram of a data processing apparatus according to a third embodiment of the present invention. The data processing device provided by the embodiment of the invention can execute the processing flow provided by the embodiment of the data processing method. As shown in fig. 3, the apparatus 30 includes: a data extraction module 301, a determination module 302 and a processing module 303.

Specifically, the data extraction module 301 is configured to extract first identity feature data of the first user and first identity feature data of the second user from the first user data and the second user data to be processed, where the first identity feature data includes at least one type of identity information for uniquely identifying a user principal.

The determining module 302 is configured to determine whether the first user and the second user belong to the same user subject according to the first identity feature data of the first user and the second user.

The processing module 303 is configured to perform merging processing on the first user data and the second user data if it is determined that the first user and the second user belong to the same user main body.
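The division into three modules can be sketched as a small Python class in which each module is a pluggable function. The class name, the `process` method and the lambda implementations in the example are hypothetical and serve only to show how modules 301, 302 and 303 could cooperate.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Record = Dict[str, str]


@dataclass
class DataProcessingApparatus:
    extract: Callable[[Record], Record]             # data extraction module 301
    same_subject: Callable[[Record, Record], bool]  # determining module 302
    merge: Callable[[Record, Record], Record]       # processing module 303

    def process(self, first: Record, second: Record) -> Optional[Record]:
        """Return the merged record, or None if the users are not determined to match."""
        first_features = self.extract(first)
        second_features = self.extract(second)
        if self.same_subject(first_features, second_features):
            return self.merge(first, second)
        return None


# Hypothetical module implementations, kept trivial for the example.
apparatus = DataProcessingApparatus(
    extract=lambda r: {k: v for k, v in r.items() if k in ("email", "mobile_number")},
    same_subject=lambda a, b: any(a[k] == b[k] for k in set(a) & set(b)),
    merge=lambda a, b: {**a, **b},
)
```

Keeping the modules as separate callables mirrors the statement that the division into units is only a logical one and may be realized differently in practice.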

The apparatus provided in the embodiment of the present invention may be specifically configured to execute the method embodiment provided in the first embodiment, and specific functions are not described herein again.

Example four

On the basis of the third embodiment, in this embodiment, the processing module is further configured to:

if the first user and the second user are not determined to belong to the same user subject, respectively extracting second identity characteristic data of the first user and the second user from the first user data and the second user data to be processed, wherein the second identity characteristic data at least comprises: family address, friend information, association relation, and behavior characteristic data; calculating the similarity between the second identity characteristic data of the first user and the second user; comparing the similarity between the second identity characteristic data of the first user and the second user with a first preset threshold; and if the similarity between the second identity characteristic data of the first user and the second user is greater than the first preset threshold, determining that the first user and the second user belong to the same user subject, and merging the first user data and the second user data.

Optionally, the processing module is further configured to:

if the similarity between the second identity characteristic data of the first user and the second user is smaller than or equal to a first preset threshold, comparing the similarity between the second identity characteristic data of the first user and the second user with a second preset threshold, wherein the second preset threshold is smaller than the first preset threshold; and if the similarity between the second identity characteristic data of the first user and the second user is greater than a second preset threshold value, establishing an association relationship between the first user data and the second user data.

Optionally, the processing module is further configured to:

The apparatus provided in the embodiment of the present invention may be specifically configured to execute the method embodiment provided in the second embodiment, and specific functions are not described herein again.

Example five

Fig. 4 is a schematic structural diagram of a deep packet inspection device according to a fifth embodiment of the present invention. As shown in fig. 4, the apparatus 40 includes: a processor 401, a memory 402, and computer programs stored on the memory 402 and executable by the processor 401.

The processor 401, when executing the computer program stored on the memory 402, implements the data processing method provided by any of the method embodiments described above.

In addition, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the data processing method provided in any of the above method embodiments.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a logical division, and other divisions are possible in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in an electrical, mechanical, or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working process of the device described above, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. A data processing method, comprising:

if the first user and the second user belong to the same user subject, merging the first user data and the second user data;

after determining whether the first user and the second user belong to the same user subject according to the first identity characteristic data of the first user and the second user, the method further includes:

if it is not determined that the first user and the second user belong to the same user subject, respectively extracting second identity characteristic data of the first user and the second user from the first user data and the second user data to be processed, wherein the second identity characteristic data at least comprises: home address, friend information, association relationships and behavior characteristic data;

calculating the similarity between the second identity characteristic data of the first user and the second user;

comparing the similarity between the second identity characteristic data of the first user and the second user with a first preset threshold value;

if the similarity between the second identity characteristic data of the first user and the second user is greater than a first preset threshold value, determining that the first user and the second user belong to the same user subject, and merging the first user data and the second user data;

after comparing the similarity between the second identity characteristic data of the first user and the second user with the first preset threshold, the method further includes:

if the similarity between the second identity characteristic data of the first user and the second user is smaller than or equal to the first preset threshold, comparing the similarity between the second identity characteristic data of the first user and the second user with a second preset threshold, wherein the second preset threshold is smaller than the first preset threshold;

if the similarity between the second identity characteristic data of the first user and the second user is greater than the second preset threshold, establishing an association relationship between the first user data and the second user data;

if it is not determined that the first user and the second user belong to the same user subject, before extracting second identity feature data of the first user and the second user from the first user data and the second user data to be processed, respectively, the method further includes:

respectively extracting registered accounts of the first user and the second user from the first user data and the second user data to be processed;

judging whether the registered accounts of the first user and the second user are consistent;

and if the registered accounts of the first user and the second user are consistent, then executing the subsequent step of respectively extracting second identity characteristic data of the first user and the second user from the first user data and the second user data to be processed.
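The decision flow recited in claim 1 can be illustrated with a minimal sketch. It assumes that the first identity characteristic data has already failed to establish a common user subject. The claim names no similarity measure, so `jaccard` is a placeholder; the record layout (`account`, `features`), the function names, and the threshold values 0.8 and 0.5 are all hypothetical. Returning `NONE` when the registered accounts differ is a simplification of this sketch, since claim 1 does not recite that branch.

```python
from enum import Enum


class Outcome(Enum):
    MERGE = "merge"          # same user subject: merge the two records
    ASSOCIATE = "associate"  # probably related: only link the two records
    NONE = "none"            # no action taken


def jaccard(a: set, b: set) -> float:
    """Placeholder similarity (assumption); the claim names no measure."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify(similarity: float, first_threshold: float,
             second_threshold: float) -> Outcome:
    """Apply the two-threshold rule to a similarity score."""
    if not second_threshold < first_threshold:
        raise ValueError("second threshold must be smaller than the first")
    if similarity > first_threshold:
        return Outcome.MERGE
    # Reached only when similarity <= first_threshold.
    if similarity > second_threshold:
        return Outcome.ASSOCIATE
    return Outcome.NONE


def process_pair(first: dict, second: dict,
                 first_threshold: float = 0.8,
                 second_threshold: float = 0.5) -> Outcome:
    """Run the registered-account gate, then the two-threshold comparison."""
    # Gate: second identity features are compared only if accounts match.
    if first["account"] != second["account"]:
        return Outcome.NONE  # simplification of this sketch
    similarity = jaccard(set(first["features"]), set(second["features"]))
    return classify(similarity, first_threshold, second_threshold)
```

With two records that share the same account and three of five distinct feature values, the similarity is 0.6, which falls between the two hypothetical thresholds, so the records are associated rather than merged.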

2. The method of claim 1, wherein determining whether the first user and the second user belong to the same user subject based on the first identity characteristic data of the first user and the second user comprises:

judging whether any identity information in the first identity characteristic data of the first user and the second user is consistent;

and if any one item of identity information in the first identity characteristic data of the first user and the second user is consistent, determining that the first user and the second user belong to the same user subject.
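The any-match rule of claim 2 can be sketched as follows. The field names (`id_number`, `phone_number`) are hypothetical examples of identity information, and treating a missing or empty value as "not consistent" is an assumption of this sketch rather than something the claim states.

```python
def same_subject_by_identity(first: dict, second: dict) -> bool:
    """Return True if any shared identity field has the same non-empty value."""
    shared_fields = first.keys() & second.keys()
    return any(
        first[field] not in (None, "") and first[field] == second[field]
        for field in shared_fields
    )
```

For example, two records with different phone numbers but the same `id_number` would be treated as one user subject, while two records whose only shared field is empty would not.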

3. The method of claim 1, wherein after determining whether the registered accounts of the first user and the second user are consistent, the method further comprises:

if the registered accounts of the first user and the second user are inconsistent, calculating the similarity of the registered accounts of the first user and the second user;

judging whether the similarity of the registered accounts of the first user and the second user is greater than a third preset threshold value or not;

and if the similarity of the registered accounts of the first user and the second user is greater than a third preset threshold, then executing a subsequent step of respectively extracting second identity characteristic data of the first user and the second user from the first user data and the second user data to be processed.
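The account gate of claims 1 and 3 taken together can be sketched as below. The claim does not specify how account similarity is computed; this sketch assumes the ratio from Python's standard `difflib.SequenceMatcher`, and the function name and the third threshold value of 0.8 are hypothetical.

```python
from difflib import SequenceMatcher


def accounts_allow_second_stage(account_a: str, account_b: str,
                                third_threshold: float = 0.8) -> bool:
    """Decide whether to go on to extract second identity characteristic data."""
    # Claim 1: consistent accounts proceed directly.
    if account_a == account_b:
        return True
    # Claim 3: inconsistent accounts proceed only if similar enough.
    similarity = SequenceMatcher(None, account_a, account_b).ratio()
    return similarity > third_threshold
```

Under these assumptions, `alice01` and `alice02` share six of seven characters in order (ratio 12/14, about 0.857), so the pair proceeds to the second-stage comparison, whereas `alice01` and `bob` share no characters and do not.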

4. A data processing apparatus, comprising:

the processing module is configured to merge the first user data and the second user data if the first user and the second user belong to the same user subject;

the processing module is further configured to:

the processing module is further configured to: respectively extract registered accounts of the first user and the second user from the first user data and the second user data to be processed;

5. The apparatus of claim 4, wherein the processing module is further configured to:

6. A deep packet inspection device, comprising:

the processor, when executing the computer program, implements the method of any of claims 1-3.

7. A computer-readable storage medium, in which a computer program is stored,

the computer program, when executed by a processor, implementing the method of any one of claims 1-3.