WO2020014916A1

WO2020014916A1 - Method for identifying user and related device

Info

Publication number: WO2020014916A1
Application number: PCT/CN2018/096239
Authority: WO
Inventors: 黄晓光; 岳晓贫; 郑春芳
Original assignee: 华为技术有限公司
Priority date: 2018-07-19
Filing date: 2018-07-19
Publication date: 2020-01-23

Abstract

Provided in an embodiment of the present invention are a method for identifying a user and an identification server. The method mainly comprises: an identification server acquiring a set of user whitelists and network data of a user to be identified; identifying noise data in a positive dataset corresponding to the set of user whitelists, and performing calculation to obtain a ratio of the noise data in the positive dataset; establishing an EM model to calculate a probability that each data item in an unlabeled dataset is a positive data item; determining an identification threshold according to the probability and the ratio, and obtaining a negative dataset; and performing identification of the user. The embodiment of the present invention allows more reliable positive samples and negative samples to be obtained, thereby enhancing accuracy of subsequent modeling and user identification.

Description

User identification method and related equipment

Technical field

The present application relates to the technical field of mobile Internet service identification, and in particular, to a service identification method and related equipment.

Background technique

Service identification is a very important topic in the mobile Internet industry, and it is the basis for topics such as user network behavior research and operator intelligent pipelines.

In wireless networks, it is impossible to ensure 100% timely and effective tracking of user information, and in existing communication protocols the wireless side has no function for recording and tracking user information. As a result, classification information about users and services (such as Machine to Machine (M2M) communication and customer premises equipment (CPE)) cannot be obtained on the wireless side. However, with the diversification of service types, more and more types of services will be carried on the network in the future, which requires the network to support multiple services with different service characteristics and resource requirements at the same time. Classifying the users and services in the network is an important step toward more reasonable network planning and more efficient network optimization. In some scenarios, real-time service type identification is also required to better perform service scheduling and resource allocation.

Wireless-side service characteristics include information such as the packet length, terminal capabilities, service duration, and access frequency of the service. Services of the same type (such as point of sale (POS) terminals, meters, etc.) show high similarity in these respects.

At present, user classification modeling relies on obtaining the full user classification information of the existing sites. This requires obtaining user account opening information, stitching together all service record information from the core network to the wireless side, and obtaining wireless-side service record information. Obtaining data from multiple nodes is difficult in practice; in addition, multiple data sources must be spliced together, so the storage and computation costs are huge. The prior art offers no solution that achieves accurate classification of network users and services.

Summary of the invention

This application provides a user identification method and related equipment, which can obtain more reliable positive and negative data, and improve the accuracy of user identification.

In a first aspect, a user identification method is provided. The method includes: an identification server obtains a user whitelist set and network data of a user to be identified, where the network data of the user to be identified includes a positive data set corresponding to the user whitelist set and an unlabeled data set; identifies noise data in the positive data set, and calculates the proportion value of the noise data in the positive data set; calculates a probability value that each data item in the unlabeled data set is positive data; determines an identification threshold according to the probability value and the proportion value, and identifies a counter-example data set from the unlabeled data set according to the identification threshold; and identifies the user to be identified according to the positive data set and the counter-example data set.

By executing the above method, the recognition server identifies the noise data in the positive data set corresponding to the user whitelist set and calculates the proportion value of the noise data in the positive data set, then uses an EM model to calculate the probability value that each data item in the unlabeled data set is positive data. The identification threshold is determined according to the probability value and the proportion value, and the negative data set is determined accordingly. More reliable positive data and negative data can thus be obtained, improving the accuracy of user identification.

In a possible implementation manner, the identification server may obtain the user whitelist set as follows: the identification server obtains current whitelist information, maps a user whitelist set based on the current whitelist information and the network data of the user to be identified, and merges the user whitelist set.

By executing the above method, based on the initially obtained white list information, and by combining with network data for mapping, the white list information can be expanded and the space of reliable positive data samples can be increased.

In another possible implementation manner, the current whitelist information includes multiple different types of whitelists; the whitelists include a current user whitelist Ai and/or a current address whitelist Bi; the user whitelist includes user identification information and industry information, and the address whitelist includes address information and industry information.

By performing the above method, the recognition server can perform iterative mapping based on the current user whitelist Ai or the current address whitelist Bi to expand the whitelist information and increase the space of reliable positive data samples.

In yet another possible implementation manner, the user identification information includes an international mobile subscriber identity (IMSI), and the address information includes an Internet Protocol (IP) address.

In yet another possible implementation manner, the merging of the user whitelist set by the identification server includes: the identification server performs conflict deduplication on the user whitelist set through a preset rule, where the preset rule is based on whitelist priority or on mapping time.

By executing the above method, the identification server performs conflict deduplication on the user whitelist set based on the priority or mapping time of the whitelist, which ensures that the obtained user whitelist set contains no conflicting or duplicate users and improves the accuracy of the space of positive data samples.

In another possible implementation manner, after the identification server merges the user whitelist set, the method further includes:

The identification server maps the user whitelist to multiple addresses in combination with the network data to obtain an address whitelist Bj, where the addresses include public addresses, and the public addresses are identified and marked during the mapping process;

The identification server judges whether the address whitelist Bj is consistent with the obtained current address whitelist Bi; if they are consistent, it outputs the user whitelist; if they are not consistent, the address whitelist Bj is used as the current whitelist information, and the step of mapping a user whitelist set based on the current whitelist information and the network data of the user to be identified is repeated.

By executing the above method, when the initial whitelist information is an address whitelist, the identification server obtains an address whitelist by performing address mapping on the obtained user whitelist. Public addresses can be identified and marked in this process so that they no longer participate in mapping in subsequent iterations, which simplifies the mapping process and improves mapping efficiency.

The identification server judges whether the user whitelist Aj corresponding to the user whitelist set is consistent with the obtained current user whitelist Ai; if they are consistent, it outputs the user whitelist Aj; if they are not consistent, the user whitelist Aj is used as the current whitelist information, and the step of mapping a user whitelist set based on the current whitelist information and the network data of the user to be identified is repeated.

By executing the above method, when the initial whitelist information is a user whitelist, the identification server decides whether to continue iterative mapping by judging whether the user whitelist obtained after the mapping is consistent with the user whitelist before the mapping, which can effectively expand the user whitelist and provide a sufficiently comprehensive positive sample data space.

In another possible implementation manner, the method further includes:

The recognition server performs cluster analysis on the positive data based on the uplink and downlink packet lengths and the uplink and downlink durations corresponding to the positive data set, identifies and marks the cluster with the smallest uplink and downlink packet lengths and uplink and downlink durations as noise data, and calculates the proportion value of the noise data in the positive data set.

By executing the above method, the recognition server performs cluster analysis on the positive data over four dimensions (uplink packet length, downlink packet length, uplink duration, and downlink duration), which can accurately distinguish the noise data and yield the proportion value of the noise data in the positive data set.

In another possible implementation manner, the recognition server calculating the probability value that each data item in the unlabeled data set is positive data includes:
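As a rough illustration of the noise-identification step, the sketch below takes cluster labels that have already been produced by some clustering algorithm (the text does not name one) and picks the cluster with the smallest centroid across the four dimensions. The function name `noise_ratio`, the row layout, and the way "smallest" is scored (sum of the centroid coordinates after scaling each dimension by its largest centroid value) are assumptions made for this example only.

```python
from collections import defaultdict


def noise_ratio(rows, labels):
    """rows: (ul_len, dl_len, ul_dur, dl_dur) per positive record.
    labels: cluster id per record, produced by any clustering step.
    Returns (noise_label, proportion of positive records in that cluster)."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)

    # Centroid of every cluster over the four dimensions.
    centroids = {
        label: tuple(sum(col) / len(members) for col in zip(*members))
        for label, members in groups.items()
    }

    # Assumption: scale each dimension by its largest centroid value so that
    # packet lengths and durations contribute comparably to the score.
    dims = range(4)
    scale = [max(c[d] for c in centroids.values()) or 1.0 for d in dims]

    def score(label):
        return sum(centroids[label][d] / scale[d] for d in dims)

    noise_label = min(centroids, key=score)
    return noise_label, len(groups[noise_label]) / len(rows)


if __name__ == "__main__":
    demo_rows = [(10, 12, 1, 1), (11, 13, 1, 2),
                 (500, 900, 30, 40), (520, 880, 28, 35), (480, 910, 33, 38)]
    print(noise_ratio(demo_rows, [0, 0, 1, 1, 1]))
```

The returned proportion plays the role of the proportion value that is later used when the identification threshold is chosen.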

The identification server divides the positive data set into i groups of spy data;

The recognition server constructs an iterative EM model based on M and Pi, where M = U + Si and Pi = P - Si, Si represents each group of the spy data, P represents the positive data set, and U represents the unlabeled data set;

The recognition server analyzes each data item in M according to the EM model, and obtains a probability value tj that each data item in M is positive data;

Wherein, i and j are positive integers greater than or equal to 1.

By performing the above method, the recognition server can analyze each data in M by constructing an EM model, and can accurately obtain the probability value that each data in M is positive data.
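The construction of M and Pi can be sketched as follows. This is only an illustration of the set arithmetic M = U + Si and Pi = P - Si; the EM model itself is not shown. The function name `make_spy_splits`, the random and roughly equal partition of P, and the `seed` parameter are assumptions, since the text only states that the positive data set is divided into groups of spy data.

```python
import random


def make_spy_splits(positive, unlabeled, groups, seed=0):
    """Split the positive set P into `groups` disjoint spy groups S_i and
    build, for each group, M_i = U + S_i and P_i = P - S_i.
    Items are assumed to be hashable record identifiers."""
    shuffled = list(positive)
    random.Random(seed).shuffle(shuffled)  # assumption: random partition
    spy_groups = [shuffled[k::groups] for k in range(groups)]

    splits = []
    for spies in spy_groups:
        spy_set = set(spies)
        mixed = list(unlabeled) + spies                         # M = U + S_i
        remaining = [p for p in positive if p not in spy_set]   # P_i = P - S_i
        splits.append((spies, mixed, remaining))
    return splits


if __name__ == "__main__":
    P = ["p1", "p2", "p3", "p4", "p5", "p6"]
    U = ["u1", "u2", "u3"]
    for spies, mixed, remaining in make_spy_splits(P, U, groups=3):
        print(spies, mixed, remaining)
```

Each (M, Pi) pair would then be handed to whatever EM-based classifier is used, with Pi treated as positive and M as the set to be scored.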

In another possible implementation manner, the identification server determines an identification threshold according to the probability value and the proportion value, and identifies a counter-example data set from the unlabeled data set according to the identification threshold, including:

Based on the probability value tj that each data item in M is positive data, and using the proportion value of the noise data in the positive data set as the confidence level, the recognition server obtains the probability value t at which the noise data in M is determined to be negative data;

The recognition server judges the magnitude relationship between tj and t, and adds all data whose tj is smaller than t to the counter-example data set RNi.

By executing the above method, the recognition server uses the proportion value of the noise data in the positive data set as the confidence level to obtain the probability value t at which the noise data in M is determined to be negative data, then compares tj with t and adds all data whose tj is smaller than t to the counter-example data set RNi. This can improve the accuracy of the counter-example data set and thereby the accuracy of user identification.
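One possible reading of the threshold step is sketched below: t is chosen so that roughly the noise proportion of the spy records score below it, and unlabeled records scoring below t are taken as reliable counter-examples. The text does not spell out the exact computation, so the quantile rule, the function names `spy_threshold` and `reliable_negatives`, and the dictionary layout are all assumptions.

```python
def spy_threshold(spy_probs, noise_ratio):
    """Pick t so that about `noise_ratio` of the spies have t_j < t.
    spy_probs: probabilities t_j that the EM model assigned to the spies in M."""
    if not spy_probs:
        raise ValueError("need at least one spy probability")
    ordered = sorted(spy_probs)
    k = min(int(noise_ratio * len(ordered)), len(ordered) - 1)
    return ordered[k]


def reliable_negatives(unlabeled_probs, t):
    """unlabeled_probs: dict mapping an unlabeled record id to its t_j.
    Returns the ids whose probability of being positive is below t (RN_i)."""
    return {item for item, p in unlabeled_probs.items() if p < t}


if __name__ == "__main__":
    spies = [0.9, 0.8, 0.1, 0.95, 0.85, 0.7, 0.92, 0.05]
    t = spy_threshold(spies, noise_ratio=0.25)
    print(t, reliable_negatives({"u1": 0.02, "u2": 0.6, "u3": 0.7, "u4": 0.93}, t))
```

With this rule, a larger noise proportion tolerates more low-scoring spies and therefore raises t, which enlarges the counter-example set.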

In another possible implementation manner, the method further includes:

The identification server obtains a counter-example set RN by combining the i counter-example data sets RNi obtained from the i groups of spy data.

By executing the above method, the accuracy of the counter-example set RN can be further improved, thereby ensuring that the user identification can be performed accurately.

In a second aspect, an identification server is provided. The identification server includes:

An obtaining unit, configured to obtain a user whitelist set and network data of a user to be identified, where the network data of the user to be identified includes a positive data set and an unlabeled data set corresponding to the user whitelist set;

A recognition unit, configured to identify noise data in the positive data set;

A calculation unit, configured to calculate a proportion value of the noise data in the positive data set;

A determining unit, configured to determine an identification threshold according to the probability value and the ratio value;

The calculation unit is further configured to calculate a probability value that each data item in the unlabeled data set is positive data; the recognition unit is further configured to identify a counter-example data set from the unlabeled data set according to the recognition threshold; and the recognition unit is further configured to identify the user to be identified based on the positive data set and the counter-example data set.

In another possible implementation manner, the obtaining unit is further configured to: obtain current whitelist information;

The obtaining unit further includes a mapping subunit, configured to map a user whitelist set based on the current whitelist information and network data of the user to be identified;

The obtaining unit further includes a merging subunit for merging the user whitelist set.

In another possible implementation manner, the merging subunit further includes a conflict deduplication subunit, and the conflict deduplication subunit is configured to perform conflict deduplication on the user whitelist set through a preset rule, where the preset rule is based on the priority of the whitelist or on mapping time.

In another possible implementation manner, the mapping subunit is further configured to map the user whitelist to multiple addresses in combination with the network data to obtain an address whitelist Bj, where the addresses include public addresses;

The identification unit is further configured to identify and mark the public address;

The server further includes a judging unit, configured to judge whether the address whitelist Bj is consistent with the obtained current address whitelist Bi; if they are consistent, output the user whitelist; if they are not consistent, use the address whitelist Bj as the current whitelist information and repeat the step of mapping a user whitelist set based on the current whitelist information and the network data of the user to be identified.

In another possible implementation manner, the determining unit is further configured to:

Determine whether the user whitelist Aj corresponding to the user whitelist set is consistent with the obtained current user whitelist Ai; if they are consistent, output the user whitelist Aj; if they are not consistent, use the user whitelist Aj as the current whitelist information and repeat the step of mapping a user whitelist set based on the current whitelist information and the network data of the user to be identified.

In another possible implementation manner, the identification unit further includes a cluster analysis subunit, configured to perform cluster analysis on the positive data based on the uplink and downlink packet lengths and the uplink and downlink durations corresponding to the positive data set, and to identify and mark the cluster with the smallest uplink and downlink packet lengths and the smallest uplink and downlink durations as noise data.

In another possible implementation manner, the calculation unit further includes a grouping subunit, configured to divide the positive data set into i groups of spy data;

The calculation unit further includes a construction subunit, configured to construct an iterative EM model according to M and Pi, where M = U + Si and Pi = P - Si, Si represents each group of the spy data, P represents the positive data set, and U represents the unlabeled data set;

The calculation unit further includes an analysis subunit, configured to analyze each data item in M according to the EM model, to obtain a probability value tj that each data item in M is positive data;

Wherein, i and j are positive integers greater than or equal to 1.

Based on the probability value tj that each data item in M is positive data, and using the proportion value of the noise data in the positive data set as the confidence level, obtain the probability value t at which the noise data in M is determined to be negative data;

Determine the magnitude relationship between tj and t, and add all data whose tj is smaller than t to the counter-example data set RNi.

In yet another possible implementation manner, the identification unit further includes a set-combining subunit, configured to combine the i counter-example data sets RNi obtained from the i groups of spy data to obtain a counter-example set RN.

According to a third aspect, an identification server is provided. The identification server includes a processor, a memory, and a transceiver, where:

The processor, the memory, and the transceiver are connected to each other. The memory is used to store a computer program. The computer program includes program instructions. The processor is configured to call the program instructions and execute the following steps:

Acquiring a user whitelist set and network data of a user to be identified, where the network data of the user to be identified includes a positive data set and an unlabeled data set corresponding to the user whitelist set;

Identify the noise data in the positive data set, and calculate the proportion value of the noise data in the positive data set;

Calculate the probability value of each data in the unlabeled data set as positive data;

Determining a recognition threshold according to the probability value and the proportion value, and identifying a counter-example data set from the unlabeled data set according to the recognition threshold;

According to the positive data set and the negative data set, the user to be identified is identified.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, where the computer program includes program instructions that, when executed by a processor of an identification server, cause the processor of the identification server to execute the method described in the first aspect or any optional implementation manner of the first aspect.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic flowchart of a user identification method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a user whitelist according to an embodiment of the present application;

FIG. 3 is a schematic diagram of an address whitelist according to an embodiment of the present application;

FIG. 4 is a schematic diagram of an iterative matching process of an address whitelist according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a topology structure and measurement level distribution characteristics of a stationary user according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a service distribution provided by an embodiment of the present application;

FIG. 7 is a schematic diagram of a user recognition effect comparison provided by an embodiment of the present application;

FIG. 8 is a schematic structural diagram of an identification server according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of another identification server according to an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.

The identification server involved in the embodiments of the present application may be a server for communicating with a terminal device. The identification server may be any device with a wireless transceiver function, or a chip that can be set in such a device. The device includes but is not limited to: an evolved Node B (eNB), a radio network controller (RNC), a Node B (NB), a base station controller (BSC), a base transceiver station (BTS), a home base station (for example, a home evolved NodeB, or home NodeB, HNB), a baseband unit (BBU), an access point (AP) in a wireless fidelity (WIFI) system, a wireless relay node, a wireless backhaul node, a transmission point (TP), or a transmission and reception point (TRP). It may also be a gNB or a transmission point (TRP or TP) in a 5G system such as NR, one antenna panel or a group of antenna panels (including multiple antenna panels) of a base station in a 5G system, or a network node constituting a gNB or a transmission point, such as a baseband unit (BBU) or a distributed unit (DU).

In the embodiment of the present application, the identification server can obtain required whitelist information and network data (such as service data or service RBI). The whitelist information and network data may be stored by the identification server, or may be obtained by the identification server from other devices, such as a network node or a maintenance node, via the Internet. The network data applied by the identification server can be divided into sample data and test data. The sample data is used for mapping to obtain a user whitelist set to increase the sample space of reliable positive examples, and the test data is used for the identification server to identify the user carrying the test data. Understandably, the sample data may be part of the test data.

The embodiments of the present application can be applied to identify a user type offline or online in a mixed service scenario in a network, and can realize resource and experience optimization based on the user type.

The following describes in detail a user identification method and related equipment provided by the embodiments of the present application. It should be noted that the display order of the embodiments of the present application only represents the order of the embodiments, and does not represent the merits of the technical solutions provided by the embodiments.

Please refer to FIG. 1. FIG. 1 is a schematic flowchart of a user identification method according to an embodiment of the present application. The method includes, but is not limited to, the following steps:

S110: Obtain a user whitelist set and network data of a user to be identified. The network data of the user to be identified includes a positive data set and an unlabeled data set corresponding to the user whitelist set.

Specifically, the identification server may send a request message to an operator (mobile operator) to request to obtain the user whitelist set, or the identification server may also obtain the user whitelist set based on the account opening information of the user terminal.

Further, the identification server may obtain network data of a user to be identified from a network node or a network maintenance node, such as an access point (AP). These network data reflect the user's service characteristics. The corresponding service characteristics can be obtained by analyzing the network data, and from them it can be determined which type of user has these characteristics.

Optionally, the manner in which the identification server obtains the user whitelist set may include:

Acquiring current whitelist information, and mapping a user whitelist set based on the current whitelist information and network data of the user to be identified;

Merging the user whitelist set.

Specifically, when a user opens an account or registers, the mobile operator can record the user's whitelist information. The identification server obtains the whitelist information from the mobile operator, uses it as the current whitelist information, and then maps it against the network data obtained from network maintenance nodes and the like to obtain a user whitelist set.

Optionally, the current whitelist information includes multiple different types of whitelists; the whitelists include a current user whitelist Ai and/or a current address whitelist Bi; the user whitelist includes user identification information and industry information, and the address whitelist includes address information and industry information.

Specifically, the identification server can obtain different types of whitelists, each of which corresponds to a different industry type, such as shared bicycles, smart meters, or smart street lights. The whitelists include the user whitelist Ai and the address whitelist Bi. It is worth noting that Ai represents the user whitelists corresponding to different industries, and Bi represents the address whitelists corresponding to different industries.

See FIG. 2, which is a schematic diagram of a user whitelist. The user whitelist includes two columns: the left column holds the user's International Mobile Subscriber Identity (IMSI), and the right column holds the industry name. It should be noted that the IMSI is used to represent different users; it can be understood that other identification information can also be used to distinguish different users, which is not limited in this application. The right column holds a specific industry name, such as smart meters, which is also not limited in this application.

Refer to FIG. 3, which is a schematic diagram of an address whitelist. The address whitelist also includes two columns: the left column holds the Internet Protocol (IP) address, and the right column holds the industry name. It should be noted that an IP address represents the peer address of different users in the same industry; different users who communicate with that IP address can be considered to belong to the same industry. It can be understood that other address information, such as SMS numbers or access point names (APN), can also be used to represent the peer addresses of different users in the same industry, which is not limited in this application. The right column holds a specific industry name, such as smart meters, which is also not limited in this application.

Optionally, the manner in which the identification server merges the user whitelist set may include:

The identification server performs conflict deduplication on the user whitelist set through a preset rule, and the preset rule includes a priority based on the whitelist or a mapping time.

Specifically, after the identification server obtains the current whitelist information, it performs a first round of matching and mapping on it. For example, suppose the current whitelist information obtained by the identification server states that IMSI1, IMSI2, and IMSI3 belong to industry A (A represents a specific industry name), and the peer addresses of IMSI1 (that is, the addresses that communicate with IMSI1) include the IP address of server B and the address of SMS station C (the address information may be the SMS number of the SMS station, such as 10086). Users who communicate with server B and users who communicate with SMS station C are then also considered to belong to industry A, and the label information of industry A is added to them. The other users in the current whitelist information are traversed in the same way to expand the users belonging to the industry. It can be understood that after the above matching and mapping, the user whitelist set corresponding to industry A is expanded and the sample space of reliable positive examples is increased.

When the current whitelist information obtained by the identification server states that the address of SMS station 1 and the IP address of server 1 belong to industry B (B represents a specific industry name; the address of the SMS station may be an SMS number, for example, 10086), what the identification server has obtained is an address whitelist. In this case the address whitelist must first be mapped and converted into a user whitelist, that is, industry B label information is added to the users who communicate with the address of SMS station 1 or the IP address of server 1. For example, if the users who communicate with the IP address of server 1 include user 1, user 2, and user 3, industry B label information is added to the IMSIs of these three users (IMSI1, IMSI2, and IMSI3). After the address whitelist has been mapped to a user whitelist, each user in the user whitelist is traversed. If the peer addresses of IMSI1 also include the IP address of server 2, the users who communicate with the IP address of server 2 are also considered to be users in industry B, industry B label information is added to them, and the set of users belonging to industry B is expanded.
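The two mapping directions described above can be sketched as follows. The record layout (a list of `(imsi, peer_address)` pairs), the dictionary-based whitelists, and the function names `users_from_addresses` and `addresses_from_users` are assumptions chosen for illustration; the text does not prescribe a data format.

```python
def users_from_addresses(address_whitelist, records):
    """address_whitelist: dict peer_address -> industry.
    records: iterable of (imsi, peer_address) taken from the network data.
    Returns dict imsi -> set of candidate industry labels."""
    tagged = {}
    for imsi, peer in records:
        industry = address_whitelist.get(peer)
        if industry is not None:
            tagged.setdefault(imsi, set()).add(industry)
    return tagged


def addresses_from_users(user_whitelist, records):
    """user_whitelist: dict imsi -> industry.
    Returns dict peer_address -> set of candidate industry labels."""
    tagged = {}
    for imsi, peer in records:
        industry = user_whitelist.get(imsi)
        if industry is not None:
            tagged.setdefault(peer, set()).add(industry)
    return tagged


if __name__ == "__main__":
    records = [("IMSI1", "server1"), ("IMSI2", "server1"), ("IMSI3", "server1"),
               ("IMSI1", "server2"), ("IMSI4", "server2"), ("IMSI5", "server9")]
    users = users_from_addresses({"server1": "B"}, records)
    print(users)
    print(addresses_from_users({imsi: "B" for imsi in users}, records))
```

Keeping a set of candidate labels per user or address, rather than a single label, leaves any conflicting labels visible so that they can be resolved in the later merging step.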

It can be understood that the above method can effectively expand the user whitelist set corresponding to different industries, but there may be a problem of conflict and duplication, that is, a user may have label information of two or more different industries. For example, IMSI1 belongs to industry A, and the peer address of IMSI1 includes the IP address of server 1, and the user communicating with server 1 also includes IMSI2, then IMSI2 should also belong to industry A; IMSI3 belongs to industry B, and the peer address of IMSI3 also includes The IP address of server 1, then IMSI2 should also belong to the B industry. This is obviously incorrect. A user cannot belong to two different industries at the same time.

Therefore, in order to solve the above conflict duplication problem, the identification server needs to perform conflict deduplication through preset rules when merging the user whitelist set. The identification server can perform conflict deduplication according to the priority of the whitelist. For example, the identification server can Set the whitelist priority in advance. If the identification server obtains two initial user whitelists (user whitelist 1 and user whitelist 2), the information in user whitelist 1 includes IMSI1 and IMSI2 belong to industry A, and user whitelist 2 The information includes that IMSI3 and IMSI4 belong to industry B. The identification server sets the level of user whitelist 1 to be higher than the level of user whitelist 2, so it is necessary to give priority to user whitelist 1 when expanding the user whitelist set. Extension, for example, the peer address of IMSI1 includes the IP address of server 1, and the peer address of IMSI3 also includes the address of server 1. For other users communicated by the IP address of server 1, the priority of user whitelist 1 is The level is higher than the level of user whitelist 2, so you can only add the label information of industry A instead of B plus tag information industry, by setting the priority of the user whitelist can effectively resolve the conflict duplication.

Alternatively, the identification server performs conflict deduplication according to the order in which the whitelists are mapped and expanded. For example, the identification server obtains two initial user whitelists (user whitelist 1 and user whitelist 2), where user whitelist 1 states that IMSI1 and IMSI2 belong to industry A and user whitelist 2 states that IMSI3 and IMSI4 belong to industry B. The peer addresses of IMSI1 include the IP address of server 1, and the peer addresses of IMSI3 also include the IP address of server 1. If the identification server first expands the user whitelist set for user whitelist 1, the other users communicating with the IP address of server 1 are given the label information of industry A. When the identification server then expands the user whitelist set for user whitelist 2, it no longer needs to add the label information of industry B to the other users communicating with the IP address of server 1. Deduplicating according to the order of whitelist mapping and expansion in this way effectively resolves the conflict and duplication problem.

It should be noted that the above only analyzes the case where the peer addresses of users in two user whitelists overlap. Where the peer addresses of users in more than two user whitelists overlap, conflict deduplication can still be performed by setting whitelist priorities as described above, according to the order of whitelist mapping and expansion, or by other similar methods, which is not limited in this application.
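Both rules above amount to keeping the first label assigned to a user, and they differ only in the order in which the whitelists are processed. The following is a minimal Python sketch under stated assumptions: each whitelist has already been expanded into an IMSI-to-industry mapping, and the function name `merge_whitelists` and the numeric priority field are illustrative choices that do not appear in the application.

```python
def merge_whitelists(whitelists, by_priority=True):
    """Merge expanded user whitelists, keeping one industry label per IMSI.

    whitelists: list of (priority, labels) tuples given in expansion order,
    where labels maps IMSI -> industry and a smaller priority number ranks
    higher (hypothetical convention, not taken from the application).
    """
    ordered = sorted(whitelists, key=lambda w: w[0]) if by_priority else list(whitelists)
    merged = {}
    for _, labels in ordered:
        for imsi, industry in labels.items():
            # setdefault keeps the label written first and ignores later conflicts.
            merged.setdefault(imsi, industry)
    return merged


# Whitelist 1 (industry A) outranks whitelist 2 (industry B); after expansion
# both reach IMSI2 through server 1.
wl1 = (1, {"IMSI1": "A", "IMSI2": "A"})
wl2 = (2, {"IMSI3": "B", "IMSI4": "B", "IMSI2": "B"})
result = merge_whitelists([wl2, wl1])
print(result["IMSI2"])
```

With `by_priority=False` the list order is treated as the expansion order, which corresponds to the time-sequence rule.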

Optionally, after the identification server merges the user whitelist set, the method further includes:

Map the users in the user whitelist to multiple addresses in combination with the network data to obtain an address whitelist Bj, where the addresses include public addresses, and identify and mark the public addresses during the mapping process;

Determine whether the address whitelist Bj is consistent with the obtained current address whitelist Bi; if they are consistent, output the user whitelist; if they are not, use the address whitelist Bj as the current whitelist information and repeat the step of mapping a user whitelist set based on the current whitelist information and the network data of the user to be identified.

Specifically, when the current whitelist information obtained by the identification server is the address whitelist Bi, the identification server first maps the address whitelist to a user whitelist, then expands and merges the user whitelist to obtain the expanded user whitelist set, and then maps the user whitelist to multiple addresses (that is, each user in the user whitelist set is mapped to its peer addresses) to obtain the address whitelist Bj. It is worth noting that the Bj obtained from this mapping may contain some public addresses, which need to be identified and marked. The Bj obtained after the mapping is compared with the current Bi. If the length of Bj (that is, the number of peer addresses contained in Bj) is greater than the length of Bi, Bj is used as the new current address whitelist and the above iterative process is repeated; during the repetition, the marked public addresses in Bj are no longer mapped. If the length of Bj is equal to the length of Bi, the user whitelist set corresponding to Bj is output. It can be understood that the user whitelist set is expanded through repeated iterative mapping and is output only when the length of the obtained address whitelist Bj equals the length of the current address whitelist Bi, which fully expands the user whitelist set and increases the sample space of reliable positive examples.

Referring to FIG. 4, FIG. 4 is a schematic diagram of an iterative matching process of an address whitelist according to an embodiment of the present application.

S401: Enter the current address white list.

The current address whitelist entered may be a different type of address whitelist.

The length of the current address whitelist indicates the number of peer addresses included in the address whitelist.

S402: Map the address whitelist to the user whitelist.

S403: Map and expand the user whitelist, and add industry tag information to the users in the user whitelist obtained after the expansion.

The user whitelist is mapped and expanded based on the network data.

S404: Merge and deduplicate the expanded user whitelist.

S405: Perform mapping of multiple addresses on the merged and deduplicated user whitelist to obtain the address whitelist.

S406: Compare the obtained address whitelist with the input current address whitelist. If they are the same, execute S407; if they are not the same, use the obtained address whitelist as the current address whitelist and execute S401.

S407: Output the user white list.

For example, the current address whitelist obtained by the identification server states that the IP address of server 1 and the IP address of server 2 belong to industry A. The users communicating with the IP address of server 1 include IMSI1 and IMSI2, and the users communicating with the IP address of server 2 include IMSI3 and IMSI4, so the user whitelist obtained by mapping this address whitelist states that IMSI1, IMSI2, IMSI3, and IMSI4 belong to industry A. The peer addresses of IMSI1 also include the IP address of server 3, and the peer addresses of IMSI3 also include the address of SMS station 1, which is a public address (that is, all users may communicate with SMS station 1). The identification server identifies and marks the address of SMS station 1, and adds the label information of industry A to the users communicating with the IP address of server 3 to obtain the expanded user whitelist set. It then maps the expanded user whitelist set to multiple addresses to obtain a new address whitelist, which should state that the IP addresses of server 1, server 2, and server 3 belong to industry A. This address whitelist is compared with the current address whitelist. Because the length of the obtained address whitelist is greater than the length of the input current address whitelist (that is, it can be expanded further), the obtained address whitelist is used as the current address whitelist. The above steps are performed iteratively until the length of the obtained address whitelist equals the length of the input current address whitelist, at which point the iteration stops and the user whitelist set corresponding to the obtained address whitelist is output.
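The loop S401 to S407 is a fixed-point iteration, which can be sketched in Python as follows. This is only an illustration under assumptions: the network data is reduced to (IMSI, peer address) pairs, a single industry is handled, and the public addresses are supplied as a ready-made set because the application does not state how they are recognized. The name `expand_address_whitelist` is hypothetical.

```python
def expand_address_whitelist(address_whitelist, records, public_addresses):
    """Alternate address -> user -> address mapping until no new address appears.

    records: iterable of (imsi, peer_address) pairs taken from network data.
    public_addresses: addresses marked as public; they are never used for mapping.
    Returns (user whitelist, final address whitelist).
    """
    records = list(records)
    current = set(address_whitelist)
    while True:
        # S402/S403: users that communicate with any whitelisted address.
        users = {imsi for imsi, addr in records if addr in current}
        # S405: peer addresses of those users, skipping marked public addresses.
        mapped = {addr for imsi, addr in records
                  if imsi in users and addr not in public_addresses}
        expanded = current | mapped
        # S406: stop when the address whitelist did not grow.
        if len(expanded) == len(current):
            return users, current
        current = expanded


records = [
    ("IMSI1", "server1"), ("IMSI2", "server1"),
    ("IMSI3", "server2"), ("IMSI4", "server2"),
    ("IMSI1", "server3"), ("IMSI5", "server3"),
    ("IMSI3", "sms1"), ("IMSI6", "sms1"),
]
users, addresses = expand_address_whitelist({"server1", "server2"}, records, {"sms1"})
print(sorted(users), sorted(addresses))
```

In this sample IMSI5 is reached through server 3, while IMSI6 is not added because it is connected only through the public SMS station address.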

Specifically, when the current whitelist information obtained by the identification server is a user whitelist Ai, the identification server expands and merges the user whitelist Ai to obtain an expanded user whitelist set Aj. The Aj obtained after the mapping is compared with the current Ai. If the length of Aj (that is, the number of IMSIs contained in Aj) is greater than the length of Ai, Aj is used as the new current user whitelist and the above expansion process is repeated; if the length of Aj is equal to the length of Ai, the user whitelist set corresponding to Aj is output. It can be understood that the user whitelist set is expanded through repeated iterative mapping and is output only when the length of the obtained user whitelist Aj equals the length of the current user whitelist Ai, which fully expands the user whitelist set and increases the sample space of reliable positive examples.

S120: Identify the noise data in the positive data set, and calculate the proportion value of the noise data in the positive data set.

Specifically, the network carries a large number of services, including all services of the users in the user whitelist set and of other users, and this service information is reflected in data. The services of the users in the user whitelist set are labeled and can be referred to as the positive data set, while the services of all other users can be referred to as the unlabeled data set. Understandably, the unlabeled data set also contains some positive data.

It should be noted that although the positive data set is labeled, some of the data corresponds to a very small amount of traffic, which in terms of user services generally corresponds to services such as heartbeat messages. Such services are very frequent and are very similar across different user types, so they cannot reflect the industry characteristics of the user and greatly affect the accuracy of user classification. This data can be called noise data. For example, in a Long Term Evolution (LTE) network, most service records (service data) counted by Call History Records (CHR) are very small and contribute very little traffic, and close to 50% of downlink service packets have a length of 0. This type of data is noise data for user identification. In fact, the service characteristics approach a Gaussian distribution only after this noise data is removed.

Optionally, the identification server identifying the noise data in the positive data set and calculating the proportion value of the noise data in the positive data set may include:

Perform cluster analysis on the positive data based on the uplink and downlink packet lengths and the uplink and downlink durations corresponding to the positive data set, identify and mark the category with the smallest uplink and downlink packet lengths and uplink and downlink durations as noise data, and calculate the proportion value of the noise data in the positive data set.

Specifically, the identification server can perform cluster analysis on the services corresponding to the positive data set along four dimensions: uplink packet length, downlink packet length, uplink duration, and downlink duration. From this it can determine the characteristic distribution of the services, mark the positive data in the category with the smallest packet lengths and durations as noise data, and calculate the proportion value of the noise data in the positive data set.
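The application does not name a clustering algorithm. As an illustration only, the sketch below assumes plain k-means with three clusters over the four dimensions and treats the cluster whose centroid has the smallest coordinate sum as noise. The functions `kmeans` and `noise_ratio`, the deterministic seeding, and the sample records are all assumptions made for this example.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means; centroids are seeded from evenly spaced sorted points."""
    ordered = sorted(points)
    n = len(ordered)
    centroids = [ordered[i * (n - 1) // max(1, k - 1)] for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


def noise_ratio(records, k=3):
    """records: (ul_len, ul_duration, dl_len, dl_duration) tuples of positive data."""
    labels, centroids = kmeans(records, k)
    # Assumed rule: the cluster with the smallest centroid sum is the noise cluster.
    noise_cluster = min(range(k), key=lambda c: sum(centroids[c]))
    flags = [label == noise_cluster for label in labels]
    return flags, sum(flags) / len(records)


records = [
    (0, 0, 0, 0), (1, 0, 1, 0), (0, 1, 0, 1),
    (100, 5, 200, 6), (110, 5, 210, 6), (105, 6, 190, 5), (95, 5, 205, 7),
    (1000, 50, 2000, 60), (1100, 55, 2100, 65), (1050, 52, 1900, 58),
]
flags, ratio = noise_ratio(records)
print(flags, ratio)
```

In practice the four dimensions have different units, so some normalization would probably be needed before clustering; it is omitted here to keep the sketch short.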

S130: Calculate a probability value that each data in the unlabeled data set is positive data.

Specifically, the identification server may obtain network data from a network node (such as a maintenance node), where the network data includes positive data and data to be identified. For example, the identification server obtains 100,000 call records from the maintenance node, of which 10,000 are labeled positive data corresponding to users in the user whitelist and the remaining 90,000 are unlabeled data corresponding to other users. It can be understood that the unlabeled data may contain both positive data (which is merely unlabeled) and negative data (that is, data whose corresponding user does not belong to the user whitelist).

Optionally, the calculating the probability value that each data in the unlabeled data set is positive data includes:

Divide the positive data set into i groups of spy data;

Build an iterative EM model according to M and Pi, where M = U + Si and Pi = P - Si, Si represents each group of the spy data, P represents the positive data set, and U represents the unlabeled data set;

Analyze each data item in M according to the EM model to obtain a probability value tj that each data item in M is positive data;

Wherein, i and j are positive integers greater than or equal to 1.

Specifically, the identification server randomly divides the obtained positive data set (which can be represented by P) into i groups of spy data. The value of i can be, for example, between 5 and 10 inclusive; the specific value is not limited in this application.

Further, each group of spy data (which can be represented by Si) is added to the unlabeled data set (which can be represented by U) to obtain a new unlabeled data set (which can be represented by M). With P - Si as the positive data set (which can be represented by Pi) and M as the counter-example data set, Pi and M are used to construct an iterative EM model.
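The construction of Si, Pi, and M can be sketched as follows. This is a minimal Python illustration in which P and U are plain lists; the function name `spy_splits`, the fixed random seed, and the round-robin grouping after shuffling are assumptions of this example rather than details given in the application.

```python
import random


def spy_splits(P, U, groups=5, seed=0):
    """Yield (Si, Pi, M) for each spy group, with M = U + Si and Pi = P - Si."""
    rng = random.Random(seed)
    order = list(range(len(P)))
    rng.shuffle(order)                       # random division of P
    for g in range(groups):
        spy_idx = set(order[g::groups])      # every groups-th shuffled index
        Si = [P[n] for n in sorted(spy_idx)]
        Pi = [P[n] for n in range(len(P)) if n not in spy_idx]
        yield Si, Pi, U + Si


P = list(range(10))            # stand-ins for positive records
U = list(range(100, 120))      # stand-ins for unlabeled records
splits = list(spy_splits(P, U, groups=5))
print(len(splits), [len(s[0]) for s in splits])
```

Each positive record serves as a spy in exactly one group, so the i runs see different mixtures of spies and unlabeled data.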

Specifically, the implementation of the EM algorithm involved in the embodiments of the present application includes two parts: initialization and EM iteration. During initialization, a naive Bayesian model N is constructed using the M and Pi sets. In the E step, N is used to predict the M set; in the M step, the model is rebuilt using the new prediction results.

Using the obtained EM model, each data item in M is analyzed to obtain the probability value (which can be represented by tj) that the data item is positive data.
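The initialization and the E and M steps described above can be illustrated with a small Python sketch. It assumes numeric feature vectors and a Gaussian likelihood for the naive Bayesian model, neither of which is specified in the application, and it runs a fixed number of iterations instead of testing convergence. All function names are hypothetical.

```python
import math


def fit_naive_bayes(X, w):
    """Weighted Gaussian naive Bayes; w[n] is the weight of row n being positive."""
    model = {}
    dims = range(len(X[0]))
    for cls, weights in (("pos", w), ("neg", [1.0 - v for v in w])):
        total = max(sum(weights), 1e-9)
        mean = [sum(wt * x[d] for wt, x in zip(weights, X)) / total for d in dims]
        var = [sum(wt * (x[d] - mean[d]) ** 2 for wt, x in zip(weights, X)) / total + 1e-6
               for d in dims]
        model[cls] = (total / len(X), mean, var)
    return model


def prob_positive(model, x):
    """Posterior probability that x is positive under the fitted model."""
    log_p = {}
    for cls, (prior, mean, var) in model.items():
        lp = math.log(max(prior, 1e-12))
        for xd, m, v in zip(x, mean, var):
            lp += -0.5 * math.log(2 * math.pi * v) - (xd - m) ** 2 / (2 * v)
        log_p[cls] = lp
    top = max(log_p.values())                # subtract the maximum for stability
    pos, neg = math.exp(log_p["pos"] - top), math.exp(log_p["neg"] - top)
    return pos / (pos + neg)


def spy_em(Pi, M, iterations=10):
    """Return t_j, the probability of being positive, for every item of M."""
    X = Pi + M
    t = [0.0] * len(M)                       # initialization: M treated as negative
    for _ in range(iterations + 1):
        model = fit_naive_bayes(X, [1.0] * len(Pi) + t)   # (re)build the model
        t = [prob_positive(model, x) for x in M]          # predict the M set
    return t


Pi = [[10.0], [10.5], [9.5], [10.2]]
M = [[0.0], [0.3], [-0.2], [0.1], [0.5], [-0.4], [9.8], [10.1]]
t = spy_em(Pi, M)
print([round(p, 3) for p in t])
```

In this toy data the last two items of M resemble the positive examples and receive high probabilities, while the first six receive probabilities near zero.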

S140: Determine a recognition threshold according to the probability value and the proportion value, and identify a counter-example data set from the unlabeled data set according to the recognition threshold.

It should be noted that after the EM model analysis yields the probability value tj that each data item in M is positive data, the existing practice is to determine the recognition threshold directly from experience. For example, data with a probability value greater than 0.5 is considered positive data and data with a probability value less than 0.5 is considered counter-example data, which yields a counter-example data set. A counter-example data set obtained in this way introduces large errors into the subsequent user identification process and makes the identification inaccurate.

Optionally, the identification server determines an identification threshold according to the probability value and the proportion value, and identifies a counter-example data set from the unlabeled data set according to the identification threshold, which may include:

For example, assume that M contains 100 data items and that the probability values corresponding to the 100 data items are sorted in ascending order. Assume that the proportion of noise data in the positive data set is 30% (this value is only illustrative and is not a limitation). The recognition threshold is the probability value at which noise data is judged to be counter-example data with 30% confidence, that is, among the sorted probability values of the 100 data items, the probability value at the 30% position is taken as the recognition threshold. For example, if the 30th of the 100 ascending probability values is 0.4, then 0.4 is used as the recognition threshold.

Further, the probability value corresponding to each data is compared with the recognition threshold (0.4), and all data corresponding to the probability values less than 0.4 are added to the counter-example data set RNi.
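The threshold selection in this example can be written as a short Python sketch. It assumes the probabilities are held in a dictionary keyed by a data item identifier and that the position is obtained by rounding the ratio times the number of items; the function name `counter_examples` is hypothetical.

```python
def counter_examples(probabilities, noise_ratio):
    """Pick the threshold at the noise-ratio position of the ascending probabilities.

    probabilities: dict mapping a data item id -> t_j.
    Returns (threshold, ids whose probability is strictly below the threshold).
    """
    ordered = sorted(probabilities.values())
    # round() avoids float artifacts such as 0.3 * 100 == 30.000000000000004.
    position = min(len(ordered), max(1, round(noise_ratio * len(ordered))))
    threshold = ordered[position - 1]
    rn = {item for item, t in probabilities.items() if t < threshold}
    return threshold, rn


probs = {f"d{k}": k / 100 for k in range(1, 101)}   # 100 illustrative values
threshold, rn = counter_examples(probs, 0.3)
print(threshold, len(rn))
```

With 100 items and a 30% ratio, the 30th smallest probability becomes the threshold, as in the example above.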

It can be seen that the recognition threshold and the counter-example data set RNi obtained by the above method have higher reliability than the recognition threshold and the counter-example set obtained through experience, which can make user identification more accurate.

Optionally, after the identification server obtains the counter-example data set RNi, the method may further include:

Take the union of the i counter-example data sets RNi obtained from the i groups of spy data to obtain a counter-example data set RN.

Specifically, for each group of spy data, a counter-example data set RNi can be obtained through the above method. It can be understood that because each group of spy data is different, the resulting counter-example data sets RNi are not identical. The union of all the obtained counter-example data sets RNi is taken as the final counter-example data set RN.

It can be understood that taking the union of all the counter-example data sets RNi to obtain the counter-example data set RN expands the scope of the counter-example data set, improves its reliability, and facilitates subsequent user identification.
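As a one-line illustration of this step, assuming each RNi is represented as a Python set of data item identifiers (the function name is hypothetical):

```python
def merge_counter_examples(rn_sets):
    """Union of the per-spy-group counter-example sets RNi."""
    return set().union(*rn_sets)


rn = merge_counter_examples([{"d1", "d2"}, {"d2", "d3"}, {"d4"}])
print(sorted(rn))
```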

S150: Identify the user to be identified according to the positive data set and the negative data set.

Specifically, using the obtained expanded positive data set and the counter-example data set, the service data corresponding to users who do not belong to the user whitelist is analyzed. The characteristics of the service data (such as time, traffic, and coverage level) are compared with the positive data and the counter-example data to determine whether the service data belongs to the positive data set or the counter-example data set, and thus whether the corresponding user belongs to the whitelisted industry, which completes the identification and classification of the user.

Further, the embodiments of the present application can implement online or offline user identification. To identify a user offline, the service data that the user has already generated can be obtained from the relevant network nodes and analyzed with the above method, using the expanded positive data set and the counter-example data set, to identify and classify the user. To identify a user online, the services generated by the user can be obtained in real time from the relevant network nodes and the service data analyzed with the above method, using the expanded positive data set and the counter-example data set, to classify the user. It is worth noting that during online identification, the network data generated by the user must be obtained continuously from the relevant network nodes and the above method steps executed multiple times to identify and classify the user, so as to ensure the accuracy of the classification.

It should be noted that knowing the status of a user (for example, whether the user is stationary) has a significant impact on the user's identification and classification and can further improve accuracy. Therefore, the identification server can identify the user status and use the identified status to further identify the user, improving the identification accuracy.

The user involved in this embodiment of the present application may be an M2M terminal or a CPE terminal. Most M2M and CPE terminals are stationary and have no mobility. The identification server may construct the stationary characteristics of the user based on measurement reports in the network and identify the status of the user.

Specifically, because a stationary user does not move or moves only within a small range, the measured main serving cell is fixed, the measured first neighboring cell (that is, the neighboring cell with the highest level) and the network topology are relatively fixed, and at a given location the proportions in which different first neighboring cells are measured are fixed. In addition, the level of the main serving cell measured by a stationary user is relatively stable, the level of the measured first neighboring cell is relatively stable, and the ordered first neighboring cell level sequence (that is, the first neighboring cell levels sorted from largest to smallest) is stable. FIG. 5 is a schematic diagram of the topological structure and the measurement level characteristics of a stationary user. The central position indicates the location of the main serving cell, and the size of its shape indicates the level distribution characteristics of the main serving cell; the surrounding positions indicate the locations of the first neighboring cells, and the sizes of their shapes indicate the level distribution characteristics of each first neighboring cell; the shape of the curve represents the distribution characteristics of the neighboring cells measured by the user, and the length of each straight line represents the proportion of the neighboring cells measured by the user.

It can be understood that, based on the above characteristics, the identification server can calculate the first neighboring cell distance between calls and the similarity of the level sequence contours between calls. If the first neighboring cell distance between calls is less than a first threshold, or the similarity of the level sequence contours between calls is less than a second threshold, the identification server determines that the user is a stationary user. It should be noted that the first threshold and the second threshold can be set as required, which is not limited in this application.

To facilitate understanding, a user identification method provided in the embodiment of the present application is specifically illustrated below:

Network CHR data covering 7 × 24 hours is used for identification modeling and analysis of the M industry, where the M industry represents a specific industry (such as shared bicycles). The number of whitelisted users is 352, the total number of network users is 96532, and the network data includes 22 features such as packet length and duration. The input before iterative matching of whitelisted users is shown in Table 1:

Table 1 Inputs before iterative matching of whitelisted users

Whitelisted users	Total network users	Network data
352	96532	22 features such as packet length and duration

Iterative matching is performed on the input whitelist users according to their IMSI. The situation after iterative matching is shown in Table 2:

Table 2 Output of iterative matching of whitelisted users

Whitelisted users	Total network users	Number of users in M industry after iterative matching
352	96532	1853

It can be seen that after iterative matching of whitelisted users, the number of users has increased to 1853, which is more than 5 times the number of whitelisted users entered previously.

Cluster analysis is performed on the obtained network data. See FIG. 6, which is a schematic diagram of a service distribution, where the horizontal axis represents any one of the dimensions of uplink packet length, uplink duration, downlink packet length, or downlink duration, and the vertical axis represents specific values. It can be seen that the services corresponding to the network data fall into three categories (represented by different lines in the figure). Category 1 services account for 37% of the total, and their average uplink and downlink packet lengths and uplink and downlink durations are close to 0. The network data corresponding to these services is noise data because it cannot reflect the characteristics of the services. Category 0 and category 2 better reflect the characteristics of the services.

After the noisy data is identified, the counter-example data set is identified through the subsequent EM algorithm, and then the user corresponding to the counter-example data set is identified. The identification is shown in Table 3:

Table 3 User identification based on packet length and duration

It can be seen that by analyzing the four dimensions of the packet length and duration, noise data can be identified. Combining the identified noise data and the proportion of the noise data in the service increases the number of reliable counterexample users.

Referring to FIG. 7, FIG. 7 is a schematic diagram comparing user identification effects according to an embodiment of the present application. Here, the recall rate is the ratio of the number of correctly identified users to the number of actually correct users, and the accuracy rate is the ratio of the number of correctly identified users to the number of identified users. When only the initial whitelist and the traditional EM algorithm are used to identify counterexamples for modeling, the accuracy rate is 59% and the recall rate is 65%. When the whitelist expanded by iterative matching and the traditional EM algorithm are used to identify counterexamples for modeling, the accuracy rate is 66% and the recall rate is 72%. When the whitelist expanded by iterative matching and the proportion of noise data in the service are used to identify reliable counterexamples for modeling, the accuracy rate is 78% and the recall rate is 83%. It can be seen that modeling with the whitelist expanded by iterative matching and with reliable counterexamples identified from the proportion of noise data in the service, as provided in the embodiments of the present application, effectively improves the accuracy of user identification: compared with modeling that uses only the initial whitelist and the traditional EM algorithm to identify counterexamples, both the accuracy rate and the recall rate increase by at least 15 percentage points.
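The two metrics as defined above can be computed as in the following minimal Python sketch, which assumes the identified and the actually correct users are given as sets of IMSIs; the function name and the sample sets are illustrative.

```python
def accuracy_and_recall(identified, actual):
    """Accuracy rate and recall rate as defined in the paragraph above.

    identified: set of users the model labels as belonging to the industry.
    actual: set of users that really belong to the industry.
    """
    correct = len(identified & actual)
    return correct / len(identified), correct / len(actual)


identified = {"IMSI1", "IMSI2", "IMSI3", "IMSI4"}
actual = {"IMSI1", "IMSI2", "IMSI3", "IMSI5", "IMSI6"}
accuracy, recall = accuracy_and_recall(identified, actual)
print(accuracy, recall)
```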

It can be understood that implementing the embodiments of the present application does not require customer authorization to splice multiple data sources; only a small amount of whitelist information and network data is needed. By combining the user whitelist expanded through iterative matching with the proportion of noise data in the service, online or offline user identification can be realized and the accuracy of user identification can be effectively improved.

The method described in the embodiments of the present application has been described in detail above. In order to facilitate better implementation of the foregoing solutions in the embodiments of the present application, corresponding devices are provided below for cooperating in implementing the foregoing solutions.

Referring to FIG. 8, FIG. 8 is a schematic structural diagram of an identification server according to an embodiment of the present application. The identification server 800 includes at least: an obtaining unit 810, an identification unit 820, a calculation unit 830, and a determining unit 840;

An obtaining unit 810, configured to obtain a user whitelist set and network data of a user to be identified, where the network data of the user to be identified includes a positive data set and an unlabeled data set corresponding to the user whitelist set;

An identification unit 820, configured to identify noise data in the positive data set;

A calculation unit 830, configured to calculate a proportion value of the noise data in the positive data set;

A determining unit 840, configured to determine a recognition threshold according to the probability value and the proportion value;

The calculation unit 830 is further configured to calculate a probability value that each data item in the unlabeled data set is positive data; the identification unit 820 is further configured to identify a counter-example data set from the unlabeled data set according to the recognition threshold; and the identification unit 820 is further configured to identify the user to be identified according to the positive data set and the counter-example data set.

In a possible implementation manner, the obtaining unit 810 is further configured to: obtain current whitelist information;

The obtaining unit 810 further includes a mapping subunit 8101, configured to map a user whitelist set based on the current whitelist information and network data of the user to be identified;

The obtaining unit 810 further includes a merging subunit 8102, configured to merge the user whitelist set.

In another possible implementation manner, the merge subunit 8102 further includes a conflict deduplication subunit 8103, and the conflict deduplication subunit 8103 is configured to perform conflict deduplication on the user whitelist set through a preset rule. The preset rule includes a priority based on the white list or a mapping time.

In another possible implementation manner, the mapping subunit 8101 is further configured to map the user whitelist to multiple addresses in combination with the network data to obtain an address whitelist Bj, where the addresses include public addresses;

The identification unit 820 is further configured to identify and mark the public address;

The server further includes a judging unit 850, configured to judge whether the address whitelist Bj is consistent with the obtained current address whitelist Bi; if they are consistent, output the user whitelist; if they are not, use the address whitelist Bj as the current whitelist information and repeat the step of mapping the user whitelist set based on the current whitelist information and the network data of the user to be identified.

In another possible implementation manner, the judging unit 850 is further configured to:

In still another possible implementation manner, the identification unit 820 further includes a cluster analysis subunit 8201, configured to perform cluster analysis on the positive data based on the uplink and downlink packet lengths and the uplink and downlink durations corresponding to the positive data set, and to identify and mark the category with the smallest uplink and downlink packet lengths and the smallest uplink and downlink durations as noise data.

In another possible implementation manner, the calculation unit 830 further includes a grouping subunit 8301, which is configured to divide the positive data set into i groups of spy data;

The calculation unit 830 further includes a construction subunit 8302, configured to construct an iterative EM model according to M and Pi, where M = U + Si and Pi = P - Si, Si represents each group of the spy data, P represents the positive data set, and U represents the unlabeled data set;

The calculation unit 830 further includes an analysis subunit 8303, configured to analyze each data in the M according to the EM model, to obtain a probability value tj where each data in the M is positive data;

Wherein, i and j are positive integers greater than or equal to 1.

In another possible implementation manner, the determining unit 840 is further configured to:

In another possible implementation manner, the identification unit 820 further includes a union subunit 8202, configured to take the union of the i counter-example data sets RNi obtained from the i groups of spy data to obtain a counter-example data set RN.

It should be noted that the implementation of each unit may also correspond to the corresponding description of the method embodiment shown in FIG. 1, which is not repeated here.

Please refer to FIG. 9, which is a schematic structural diagram of another identification server provided by an embodiment of the present application. The identification server 900 includes at least a processor 910, a memory 920, and a transceiver 930. The processor 910, the memory 920, and the transceiver 930 are connected to each other through a bus 940.

The memory 920 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or flash memory. The memory 920 is used to store related instructions and data.

The transceiver 930 may include a receiver and a transmitter, for example, a radio frequency module. Where the processor 910 is described below as receiving or sending a message, it can be understood that the processor 910 receives or sends the message through the transceiver 930.

The processor 910 may be one or more central processing units (CPUs). When the processor 910 is a CPU, the CPU may be a single-core CPU or a multi-core CPU.

The processor 910 in the identification server 900 is configured to read the program code stored in the memory 920 and perform the following operations:

The processor 910 receives the user whitelist set and the network data of the user to be identified through the transceiver 930. The network data of the user to be identified includes the positive data set and the unlabeled data set corresponding to the user whitelist set.

The processor 910 identifies noise data in the positive data set, and calculates a ratio value of the noise data in the positive data set.

The processor 910 calculates a probability value that each data in the unlabeled data set is positive data.

The processor 910 determines an identification threshold according to the probability value and the proportion value, and identifies a counter-example data set from the unlabeled data set according to the identification threshold.

The processor 910 identifies the user to be identified according to the positive data set and the counter-example data set.
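The threshold step in the operations above can be sketched as follows. This is one plausible reading, not the embodiment's definitive procedure: it assumes the probabilities of the spy data have already been produced by the EM model, and it picks t so that roughly the noise proportion of the spies falls below t. The names `spy_threshold` and `counter_examples` are hypothetical.

```python
import math
from typing import Dict, List, Set


def spy_threshold(spy_probs: List[float], noise_ratio: float) -> float:
    """Pick t so that about noise_ratio of the spy items score below t.

    Assumption: spy_probs holds the probability values of the spy data
    (known positives mixed into the unlabeled set), and noise_ratio is
    the proportion of noise data in the positive data set.
    """
    if not spy_probs:
        raise ValueError("need at least one spy probability")
    if not 0.0 <= noise_ratio < 1.0:
        raise ValueError("noise_ratio must be in [0, 1)")
    ordered = sorted(spy_probs)
    k = min(math.floor(noise_ratio * len(ordered)), len(ordered) - 1)
    return ordered[k]


def counter_examples(unlabeled_probs: Dict[str, float], t: float) -> Set[str]:
    """Return the unlabeled items whose probability tj is smaller than t."""
    return {item for item, p in unlabeled_probs.items() if p < t}


spies = [0.9, 0.2, 0.8, 0.7, 0.95, 0.6, 0.85, 0.1]
t = spy_threshold(spies, noise_ratio=0.25)
rn_i = counter_examples({"a": 0.05, "b": 0.59, "c": 0.6, "d": 0.9}, t)
```

With eight spies and a noise proportion of 0.25, the two lowest-scoring spies are treated as noise, so t is the third-smallest spy probability and only items strictly below it are added to RNi.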

It should be noted that the specific implementation of each operation may also be specifically implemented according to the method in the foregoing method embodiment, and details are not described herein again.

By implementing the embodiments of the present application, the base station can realize the support of different paging cycles in a cell by grouping the paging carriers and performing specific paging configuration for different groups of carriers, while meeting the requirements of short delay and deep coverage. UE's paging requirements.

It should be noted that the specific implementation of each operation may also correspond to the corresponding description of the method embodiment shown in FIG. 1, which is not repeated here.

In summary, according to the embodiments of the present application, the identification server obtains a small amount of user whitelist information and network data, expands the user whitelist by iterative matching, and then identifies noise data through cluster analysis and calculates its proportion. An EM model is constructed to calculate, for each item of unlabeled data, the probability that it is positive data. Combined with the proportion of the noise data, a reliable counter-example data set can be identified. Finally, the user to be identified is identified, which can effectively improve the accuracy of recognition.
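The iterative expansion of the user whitelist mentioned in the summary can be sketched as a fixed-point loop. The sketch makes several simplifying assumptions that are not taken from the embodiment: network data is reduced to (user identifier, address) pairs, industry information is omitted, and addresses marked as public are simply skipped during mapping. The name `expand_whitelist` is hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple


def expand_whitelist(
    seed_users: Set[str],
    records: Iterable[Tuple[str, str]],
    public_addresses: FrozenSet[str] = frozenset(),
) -> Set[str]:
    """Expand a user whitelist until two successive rounds agree.

    records: (user_id, address) pairs taken from the network data.
    public_addresses: addresses marked as public (assumed to be
    excluded here, because many unrelated users share them).
    """
    pairs = list(records)
    users = set(seed_users)
    while True:
        # Map the current user whitelist to an address whitelist.
        addresses = {
            addr for uid, addr in pairs
            if uid in users and addr not in public_addresses
        }
        # Map the address whitelist back to users.
        expanded = users | {uid for uid, addr in pairs if addr in addresses}
        if expanded == users:  # consistent with the previous round: stop
            return users
        users = expanded


records = [
    ("u1", "10.0.0.1"), ("u2", "10.0.0.1"),
    ("u2", "10.0.0.2"), ("u3", "10.0.0.2"),
    ("u1", "8.8.8.8"), ("u4", "8.8.8.8"),
]
whitelist = expand_whitelist({"u1"}, records, frozenset({"8.8.8.8"}))
```

The loop ends because the whitelist can only grow and the number of users in the records is finite; user u4 is not added because its only link to the whitelist is a public address.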

An embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium stores instructions. When the instructions are run on a computer or a processor, the computer or the processor executes one or more steps of any one of the foregoing methods. When each component module of the above device is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in the computer-readable storage medium.

The computer-readable storage medium may be an internal storage unit of the identification server according to any one of the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device of the identification server, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card. Further, the computer-readable storage medium may include both the internal storage unit of the identification server and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by the identification server. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.

A person of ordinary skill in the art can understand that all or part of the processes in the methods of the foregoing embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium. When the program is executed, it may include the processes of the method embodiments described above. The foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

The steps in the method of the embodiment of the present application can be adjusted, combined, and deleted according to actual needs.

The modules in the apparatus of the embodiment of the present application may be combined, divided, and deleted according to actual needs.

As mentioned above, the above embodiments are only used to describe the technical solutions of the present application, rather than to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements to some of the technical features; and these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims

A user identification method, comprising:

Acquiring a user whitelist set and network data of a user to be identified, where the network data of the user to be identified includes a positive data set and an unlabeled data set corresponding to the user whitelist set;

Identify the noise data in the positive data set, and calculate the proportion value of the noise data in the positive data set;

Calculate the probability value of each data in the unlabeled data set as positive data;

Determining a recognition threshold according to the probability value and the proportion value, and identifying a counter-example data set from the unlabeled data set according to the recognition threshold;

Identifying the user to be identified according to the positive data set and the counter-example data set.
The method according to claim 1, wherein the obtaining a user whitelist set comprises:

Acquiring current whitelist information, and mapping a user whitelist set based on the current whitelist information and network data of the user to be identified;

Merging the user whitelist set.
The method according to claim 2, wherein the current whitelist information includes a plurality of different types of whitelists, the whitelists include a current user whitelist Ai and/or a current address whitelist Bi, the user whitelist includes user identification information and industry information, and the address whitelist includes address information and industry information.
The method according to claim 3, wherein the user identification information includes an International Mobile Subscriber Identity (IMSI), and the address information includes an Internet Protocol address (IP).
The method according to claim 2, wherein the merging the user whitelist set comprises:

Deduplication of the user whitelist set is performed through a preset rule, and the preset rule includes a priority based on the whitelist or a mapping time.
The method according to claim 5, wherein after merging the user whitelist set, the method further comprises:

Map the user whitelist with multiple addresses in combination with the network data to obtain an address whitelist Bj, where the addresses include public addresses, and identify and mark the public addresses during the mapping process;

Determine whether the address whitelist Bj is consistent with the obtained current address whitelist Bi; if they are consistent, output the user whitelist; if they are not consistent, use the address whitelist Bj as the current whitelist information and repeat the step of mapping a user whitelist set based on the current whitelist information and the network data of the user to be identified.
The method according to claim 5, wherein after merging the user whitelist set, the method further comprises:

Determine whether the user whitelist Aj corresponding to the user whitelist set is consistent with the obtained current user whitelist Ai; if they are consistent, output the user whitelist Aj; if they are not consistent, use the user whitelist Aj as the current whitelist information and repeat the step of mapping a user whitelist set based on the current whitelist information and the network data of the user to be identified.
The method according to claim 1, wherein the identifying the noise data in the positive data set and calculating the proportion value of the noise data in the positive data set comprises:

Perform cluster analysis on the positive data based on the uplink and downlink packet length and the uplink and downlink duration corresponding to the positive data set, identify and mark the category with the smallest uplink and downlink packet length and the smallest uplink and downlink duration as noise data, and calculate a proportion value of the noise data in the positive data set.
The method according to claim 1, wherein the calculating the probability value that each data in the unlabeled data set is positive data comprises:

Divide the positive data set into i groups of spy data;

Build an iterative EM model according to M and Pi, where M = U + Si and Pi = P - Si, where Si represents each group of the spy data, P represents the positive data set, and U represents the unlabeled data set;

Analyzing each data in the M according to the EM model, and obtaining a probability value tj of each data in the M being positive data;

Wherein, i and j are positive integers greater than or equal to 1.
The method according to claim 9, wherein determining an identification threshold based on the probability value and the proportion value, and identifying a counter-example data set from the unlabeled data set according to the identification threshold comprises:

Based on the probability value tj that each data in M is positive data, obtain the probability value t that corresponds to determining the noise data in M as counter-example data when the proportion value of the noise data in the positive data is used as the confidence level;

Determine the magnitude relationship between tj and t, and add all data corresponding to tj smaller than t to the counter-example data set RNi.
The method of claim 10, further comprising:

Taking the union of the i counter-example data sets RNi obtained through the i groups of spy data to obtain a counter-example set RN.
An identification server, comprising:

An obtaining unit, configured to obtain a user whitelist set and network data of a user to be identified, where the network data of the user to be identified includes a positive data set and an unlabeled data set corresponding to the user whitelist set;

A recognition unit, configured to identify noise data in the positive data set;

A calculation unit, configured to calculate a proportion value of the noise data in the positive data set;

A determining unit, configured to determine an identification threshold according to the probability value and the ratio value;

The calculation unit is further configured to calculate a probability value that each data in the unlabeled data set is positive data; the recognition unit is further configured to identify a counter-example data set from the unlabeled data set according to the identification threshold; and the recognition unit is further configured to identify the user to be identified based on the positive data set and the counter-example data set.
The server according to claim 12, wherein the obtaining unit is further configured to: obtain current whitelist information;

The obtaining unit further includes a mapping subunit, configured to map a user whitelist set based on the current whitelist information and network data of the user to be identified;

The obtaining unit further includes a merging subunit for merging the user whitelist set.
The server according to claim 13, wherein the current whitelist information includes a plurality of different types of whitelists, the whitelists include a current user whitelist Ai and/or a current address whitelist Bi, the user whitelist includes user identification information and industry information, and the address whitelist includes address information and industry information.
The server according to claim 14, wherein the user identification includes an International Mobile Subscriber Identity (IMSI), and the address information includes an Internet Protocol address (IP).
The server according to claim 13, wherein the merging subunit further comprises a conflict deduplication unit, and the conflict deduplication unit is configured to perform conflict deduplication on the user whitelist set through a preset rule, where the preset rule includes a priority based on the whitelist or a mapping time.
The server according to claim 16, wherein the mapping subunit is further configured to map the user whitelist to multiple addresses in combination with the network data to obtain an address whitelist Bj, wherein the addresses include public addresses;

The identification unit is further configured to identify and mark the public address;

The server further includes a judging unit for judging whether the address whitelist Bj is consistent with the obtained current address whitelist Bi; if they are consistent, output the user whitelist; if they are not consistent, use the address whitelist Bj as the current whitelist information and repeat the step of mapping a user whitelist set based on the current whitelist information and the network data of the user to be identified.
The server according to claim 16, wherein the determining unit is further configured to:

Determine whether the user whitelist Aj corresponding to the user whitelist set is consistent with the obtained current user whitelist Ai; if they are consistent, output the user whitelist Aj; if they are not consistent, use the user whitelist Aj as the current whitelist information and repeat the step of mapping a user whitelist set based on the current whitelist information and the network data of the user to be identified.
The server according to claim 12, wherein the identification unit further comprises a cluster analysis subunit, configured to perform cluster analysis on the positive data based on the uplink and downlink packet length and the uplink and downlink duration corresponding to the positive data set, and to identify and mark the category with the smallest uplink and downlink packet length and the smallest uplink and downlink duration as noise data.
The server according to claim 12, wherein the calculation unit further comprises a grouping subunit for dividing the set of positive data into i groups of spy data;

The calculation unit further includes a construction sub-unit for constructing an iterative EM model according to M and Pi, where M = U + Si and Pi = P - Si, where Si represents each group of the spy data, P represents the positive data set, and U represents the unlabeled data set;

The calculation unit further includes an analysis subunit, configured to analyze each data in the M according to the EM model, to obtain a probability value tj where each data in the M is positive data;

Wherein, i and j are positive integers greater than or equal to 1.
The server according to claim 20, wherein the determining unit is further configured to:

Based on the probability value tj that each data in M is positive data, obtain the probability value t that corresponds to determining the noise data in M as counter-example data when the proportion value of the noise data in the positive data is used as the confidence level;

Determine the magnitude relationship between tj and t, and add all data corresponding to tj smaller than t to the counter-example data set RNi.
The server according to claim 21, wherein the identification unit further comprises a union sub-unit, configured to obtain a counter-example set RN by taking the union of the i counter-example data sets RNi obtained through the i groups of spy data.
An identification server, characterized in that the identification server includes: a processor, a memory, and a transceiver, wherein:

The processor, the memory, and the transceiver are connected to each other, the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the user identification method according to any one of claims 1 to 11.
A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, where the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to execute the method according to any one of claims 1 to 11.