WO2020014916A1 - Method for identifying user and related device - Google Patents

Method for identifying user and related device

Info

Publication number
WO2020014916A1
WO2020014916A1 PCT/CN2018/096239 CN2018096239W WO2020014916A1 WO 2020014916 A1 WO2020014916 A1 WO 2020014916A1 CN 2018096239 W CN2018096239 W CN 2018096239W WO 2020014916 A1 WO2020014916 A1 WO 2020014916A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
user
whitelist
address
positive
Prior art date
Application number
PCT/CN2018/096239
Other languages
French (fr)
Chinese (zh)
Inventor
黄晓光
岳晓贫
郑春芳
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to PCT/CN2018/096239 priority Critical patent/WO2020014916A1/en
Publication of WO2020014916A1 publication Critical patent/WO2020014916A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements

Definitions

  • the present application relates to the technical field of mobile Internet service identification, and in particular, to a service identification method and related equipment.
  • Service identification is a very important topic in the mobile Internet industry, and it is the basis for topics such as user network behavior research and operator intelligent pipelines.
  • the wireless side has no function for recording and tracking user information, so the classification information of users and services (such as machine-to-machine (M2M) communication and customer equipment (CPE)) cannot be obtained on the wireless side.
  • Wireless-side service characteristics include information such as the packet length, terminal capabilities, service duration, and access frequency of a service. Services of the same type (such as point-of-sale (POS) terminals, meters, etc.) show high similarity in these respects.
  • In the prior art, user classification modeling is performed by obtaining the full user classification information of existing sites. This requires user account-opening information, all service record information stitched together from the core network to the wireless side, and wireless-side service record information. Obtaining data from multiple nodes is difficult to implement in practice; in addition, multiple data sources need to be spliced, so storage and computation costs are high. The prior art offers no solution that achieves accurate classification of network users and services.
  • This application provides a user identification method and related equipment, which can obtain more reliable positive and negative data, and improve the accuracy of user identification.
  • a user identification method includes: an identification server obtains a user whitelist set and network data of a user to be identified, where the network data of the user to be identified includes a positive example data set corresponding to the user whitelist set and an unlabeled data set; identifies noise data in the positive data set, and calculates the proportion of the noise data in the positive data set; calculates the probability value that each data item in the unlabeled data set is positive data; determines a recognition threshold according to the probability value and the proportion value, and identifies a counter-example (negative example) data set from the unlabeled data set according to the recognition threshold; and identifies the user to be identified according to the positive data set and the counter-example data set.
  • the recognition server identifies the noise data in the positive data set corresponding to the user whitelist set, calculates the proportion of the noise data in the positive data set, and then uses an EM model to calculate the probability value that each data item in the unlabeled data set is positive example data. The identification threshold is then determined in order to determine the negative example data set. In this way, more reliable positive example data and negative example data can be obtained, thereby improving the accuracy of user identification.
  • an implementation manner in which the identification server obtains the user whitelist set may be: the identification server obtains current whitelist information; maps a user whitelist set based on the current whitelist information and the network data of the user to be identified; and merges the user whitelist set.
  • the white list information can be expanded and the space of reliable positive data samples can be increased.
  • the current whitelist information includes multiple different types of whitelists; the whitelists include a current user whitelist Ai and/or a current address whitelist Bi; the user whitelist includes user identification information and industry information; and the address whitelist includes address information and industry information.
  • the recognition server can perform iterative mapping based on the current user whitelist Ai or the current address whitelist Bi to expand the whitelist information and increase the space of reliable positive data samples.
  • the user identifier includes an International Mobile Subscriber Identity (IMSI), and the address information includes an Internet Protocol (IP) address.
  • the merging of the user whitelist set by the identification server includes: the identification server performs conflict deduplication on the user whitelist set through a preset rule, where the preset rule is based on the whitelist priority or on the mapping time.
  • the identification server performs conflict deduplication on the user whitelist set based on the priority or mapping time of the whitelist, which ensures that the obtained user whitelist set contains no conflicting or duplicate users and improves the accuracy of the positive data sample space.
  • the method further includes:
  • the identification server performs multiple-address mapping on the user whitelist in combination with the network data to obtain an address whitelist Bj, where the addresses include a public address, and the public address is identified and marked during the mapping process;
  • the identification server judges whether the address whitelist Bj is consistent with the obtained current address whitelist Bi; if they are consistent, the user whitelist is output; if they are not, the address whitelist Bj is used as the current whitelist information, and the step of mapping a user whitelist set based on the current whitelist information and the network data of the user to be identified is repeated.
  • by performing address mapping on the obtained user whitelist to obtain the address whitelist, the identification server can identify and mark public addresses so that they no longer participate in mapping in subsequent iterations, which simplifies the mapping process and improves mapping efficiency.
  • the method further includes:
  • the identification server judges whether the user whitelist Aj corresponding to the user whitelist set is consistent with the obtained current user whitelist Ai; if they are consistent, the user whitelist Aj is output; if they are not, the user whitelist Aj is used as the current whitelist information, and the step of mapping a user whitelist set based on the current whitelist information and the network data of the user to be identified is repeated.
  • the identification server determines whether to continue the iterative mapping by judging whether the user whitelist obtained after the mapping is consistent with the user whitelist before the mapping, which can effectively expand the user whitelist and provide a sufficiently comprehensive sample space of positive examples.
  • the method further includes:
  • the recognition server performs cluster analysis on the positive data based on the uplink and downlink packet lengths and the uplink and downlink durations corresponding to the positive data set, identifies and marks the cluster with the smallest uplink and downlink packet lengths and uplink and downlink durations as noise data, and calculates the proportion of the noise data in the positive data set.
  • by clustering the positive data over these four dimensions (uplink packet length, downlink packet length, uplink duration, and downlink duration), the recognition server can accurately distinguish the noise data and calculate the proportion of the noise data in the positive data set.
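  • As an illustration only, the noise-proportion step can be sketched as follows. The sketch assumes the cluster labels have already been produced by some clustering algorithm (the text does not name one), and it assumes that "smallest" means the cluster whose centroid has the lowest sum over the four dimensions after the features have been brought to comparable scales. The function name `noise_proportion` and the sample values are hypothetical.

```python
from collections import defaultdict


def noise_proportion(samples, labels):
    """Return (noise_cluster_label, proportion of samples in that cluster).

    samples: list of (ul_packet_len, dl_packet_len, ul_duration, dl_duration)
    labels:  cluster id for each sample, produced by any clustering step
    """
    groups = defaultdict(list)
    for features, label in zip(samples, labels):
        groups[label].append(features)

    def centroid(rows):
        # Column-wise mean of the four features.
        return tuple(sum(col) / len(rows) for col in zip(*rows))

    centroids = {label: centroid(rows) for label, rows in groups.items()}
    # Assumption: the "smallest" cluster is the one with the lowest centroid sum.
    noise_label = min(centroids, key=lambda label: sum(centroids[label]))
    return noise_label, len(groups[noise_label]) / len(samples)


# Hypothetical positive-example records: two tiny flows and six ordinary ones.
samples = [
    (10, 12, 1, 1), (8, 9, 1, 2),
    (500, 900, 30, 40), (450, 800, 25, 35), (520, 950, 33, 41),
    (600, 1000, 28, 38), (480, 870, 29, 36), (470, 860, 27, 37),
]
labels = [0, 0, 1, 1, 1, 1, 1, 1]
noise_label, ratio = noise_proportion(samples, labels)
```

  • In practice the features would normally be normalized before clustering, since packet lengths and durations are measured in different units.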
  • the recognition server calculates the probability value that each data in the unlabeled data set is positive data including:
  • the identification server divides the positive data set into i groups of spy data;
  • the recognition server analyzes each data item in M according to the EM model, and obtains the probability value tj that each data item in M is positive data;
  • i and j are positive integers greater than or equal to 1.
  • by constructing an EM model to analyze each data item in M, the recognition server can accurately obtain the probability value that each data item in M is positive data.
  • the identification server determines an identification threshold according to the probability value and the proportion value, and identifies a counter-example data set from the unlabeled data set according to the identification threshold, including:
  • the recognition server takes the probability value tj that each data item in M is positive data and, using the proportion of the noise data in the positive data as the confidence, obtains the probability value t corresponding to the case in which the noise data in M is treated as negative data;
  • the recognition server compares tj with t, and adds all data whose tj is smaller than t to the counter-example data set RNi.
  • using the proportion of the noise data in the positive data as the confidence, the recognition server can obtain the probability value t corresponding to the noise data in M being treated as negative data, then compare tj with t and add all data whose tj is smaller than t to the counter-example data set RNi, which improves the accuracy of the counter-example data set and thereby the accuracy of user identification.
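  • A minimal sketch of the threshold step, under one stated assumption: t is chosen so that a fraction of the spy data equal to the noise proportion has a probability below t, as in the usual spy technique for learning from positive and unlabeled data. The text does not spell out this rule, so it is an interpretation. The function names and all values are hypothetical, and the probabilities are assumed to come from an EM model that is not shown.

```python
import math


def spy_threshold(spy_probs, noise_ratio):
    """Pick t so that roughly `noise_ratio` of the spy probabilities lie below t."""
    ordered = sorted(spy_probs)
    k = min(math.floor(noise_ratio * len(ordered)), len(ordered) - 1)
    return ordered[k]


def counter_examples(unlabeled_probs, t):
    """Return the ids of unlabeled items whose probability tj is smaller than t."""
    return {item for item, tj in unlabeled_probs.items() if tj < t}


# Hypothetical probabilities tj for one group of spy data.
spy_probs = [0.9, 0.8, 0.2, 0.7, 0.95, 0.6, 0.85, 0.1]
t = spy_threshold(spy_probs, noise_ratio=0.25)

# Hypothetical probabilities tj for unlabeled items.
unlabeled_probs = {"u1": 0.05, "u2": 0.55, "u3": 0.6, "u4": 0.9}
rn_i = counter_examples(unlabeled_probs, t)
```

  • With a noise proportion of 0.25 and eight spy items, the two lowest spy probabilities fall below t, so only unlabeled items that look less positive than almost all spies are taken as counter-examples.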
  • the method further includes:
  • the identification server obtains a counter-example set RN by combining the i counter-example data sets RNi obtained through the i groups of spy data.
  • the accuracy of the counter-example set RN can be further improved, thereby ensuring that the user identification can be performed accurately.
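  • A short sketch of the combining step, assuming that "combining" means taking the set union of the per-group results (the text does not state whether a union or another rule is intended). The function name and the identifiers are hypothetical.

```python
def merge_counter_examples(rn_groups):
    """Combine the per-group counter-example sets RNi into one set RN (union)."""
    merged = set()
    for rn_i in rn_groups:
        merged |= set(rn_i)
    return merged


# Hypothetical counter-example sets obtained from three groups of spy data.
rn = merge_counter_examples([{"u1", "u2"}, {"u2", "u3"}, set()])
```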
  • an identification server includes:
  • An obtaining unit configured to obtain a user whitelist set and network data of a user to be identified, where the network data of the user to be identified includes a positive data set and an unlabeled data set corresponding to the user whitelist set;
  • a recognition unit configured to identify noise data in the positive data set
  • a calculation unit configured to calculate a proportion value of the noise data in the positive data set
  • a determining unit configured to determine an identification threshold according to the probability value and the ratio value
  • the calculation unit is further configured to calculate a probability value that each data item in the unlabeled data set is positive data; the recognition unit is further configured to identify a counter-example data set from the unlabeled data set according to the recognition threshold; and the identification unit is further configured to identify the user to be identified based on the positive-example data set and the counter-example data set.
  • the obtaining unit is further configured to: obtain current whitelist information
  • the obtaining unit further includes a mapping subunit, configured to map a user whitelist set based on the current whitelist information and network data of the user to be identified;
  • the obtaining unit further includes a merging subunit for merging the user whitelist set.
  • the current whitelist information includes multiple different types of whitelists; the whitelists include a current user whitelist Ai and/or a current address whitelist Bi; the user whitelist includes user identification information and industry information; and the address whitelist includes address information and industry information.
  • the user identifier includes an International Mobile Subscriber Identity (IMSI), and the address information includes an Internet Protocol (IP) address.
  • the merging subunit further includes a conflict deduplication subunit, configured to perform conflict deduplication on the user whitelist set through a preset rule, where the preset rule is based on the priority of the whitelist or on the mapping time.
  • the mapping subunit is further configured to perform multiple-address mapping on the user whitelist in combination with the network data to obtain an address whitelist Bj, where the addresses include public addresses;
  • the identification unit is further configured to identify and mark the public address;
  • the server further includes a judging unit for judging whether the address whitelist Bj is consistent with the obtained current address whitelist Bi; if they are consistent, the user whitelist is output; if they are not, the address whitelist Bj is used as the current whitelist information, and the step of mapping a user whitelist set based on the current whitelist information and the network data of the user to be identified is repeated.
  • the determining unit is further configured to:
  • the identification unit further includes a cluster analysis subunit, configured to perform cluster analysis on the positive data based on the uplink and downlink packet lengths and the uplink and downlink durations corresponding to the positive data set, and to identify and mark the cluster with the smallest uplink and downlink packet lengths and the smallest uplink and downlink durations as noise data.
  • the calculation unit further includes a grouping subunit, configured to divide the positive data set into i groups of spy data;
  • the calculation unit further includes an analysis subunit, configured to analyze each data item in M according to the EM model, to obtain the probability value tj that each data item in M is positive data;
  • i and j are positive integers greater than or equal to 1.
  • the determining unit is further configured to:
  • the identification unit further includes a combining subunit, configured to combine the i counter-example data sets RNi obtained through the i groups of spy data to obtain a counter-example set RN.
  • an identification server includes a processor, a memory, and a transceiver, where:
  • the processor, the memory, and the transceiver are connected to each other.
  • the memory is used to store a computer program.
  • the computer program includes program instructions.
  • the processor is configured to call the program instructions and execute the following steps:
  • the user to be identified is identified.
  • an embodiment of the present application provides a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program, where the computer program includes program instructions that, when executed by a processor of an identification server, cause the processor of the identification server to execute the method described in the first aspect or any optional implementation manner of the first aspect.
  • FIG. 1 is a schematic flowchart of a user identification method according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a user whitelist according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an address whitelist according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an iterative matching process of an address whitelist according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a topology structure and measurement level distribution characteristics of a stationary user according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a service distribution provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a user recognition effect comparison provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an identification server according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of another identification server according to an embodiment of the present application.
  • the identification server involved in the embodiment of the present application may be a server for communicating with a terminal device.
  • the identification server may be any kind of device with wireless transceiver function or a chip that can be set on the device.
  • the device includes but is not limited to: an evolved Node B (eNB), a radio network controller (RNC), a Node B (NB), a base station controller (BSC), a base transceiver station (BTS), a home base station (e.g., home NodeB, HNB), a baseband unit (BBU), an access point (AP) in a wireless fidelity (WIFI) system, a wireless relay node, a wireless backhaul node, a transmission point (TP), or a transmission and reception point (TRP); it may also be a gNB or a transmission point (TRP or TP) in a 5G system such as NR, or one antenna panel or a group of antenna panels of a base station in the 5G system;
  • the device may also be a network node constituting a gNB or
  • the identification server can obtain required whitelist information and network data (such as service data or service RBI).
  • the whitelist information and network data may be stored by the identification server, or may be obtained by the identification server from other devices, such as a network node or a maintenance node, via the Internet.
  • the network data applied by the identification server can be divided into sample data and test data.
  • the sample data is used for mapping to obtain a user whitelist set to increase the sample space of reliable positive examples, and the test data is used for the identification server to identify the user carrying the test data. Understandably, the sample data may be part of the test data.
  • the embodiments of the present application can be applied to identify a user type offline or online in a mixed service scenario in a network, and can realize resource and experience optimization based on the user type.
  • FIG. 1 is a schematic flowchart of a user identification method according to an embodiment of the present application. The method includes, but is not limited to, the following steps:
  • S110 Obtain a user whitelist set and network data of a user to be identified.
  • the network data of the user to be identified includes a positive data set and an unlabeled data set corresponding to the user whitelist set.
  • the identification server may send a request message to an operator (mobile operator) to request the user whitelist set, or the identification server may obtain the user whitelist set based on the account-opening information of the user terminal.
  • the identification server may obtain network data of a user to be identified from a network node or a network maintenance node, such as an access point (AP). These network data reflect the characteristics of the user's services: the corresponding service characteristics can be obtained by analyzing the network data, and from those characteristics it can be determined which type of user has them.
  • the manner in which the identification server obtains the user whitelist set may include:
  • the mobile operator can record the user's whitelist information; the identification server obtains the whitelist information from the mobile operator, uses it as the current whitelist information, and then maps it in combination with the network data obtained from network maintenance nodes and the like to obtain a user whitelist set.
  • the current whitelist information includes a plurality of different types of whitelists; the whitelists include a current user whitelist Ai and/or a current address whitelist Bi; the user whitelist includes user identification information and industry information; and the address whitelist includes address information and industry information.
  • the identification server can obtain different types of whitelists, each of which corresponds to a different industry type, such as shared bicycles, smart meters, smart street lights, and other industries. The whitelists include the user whitelist Ai and the address whitelist Bi. It is worth noting that Ai represents the user whitelists corresponding to different industries, and Bi represents the address whitelists corresponding to different industries.
  • Figure 2 is a schematic diagram of a user whitelist. The user whitelist includes two columns: the left column holds the user's International Mobile Subscriber Identity (IMSI), and the right column holds the industry name. It should be noted that the IMSI is used to represent different users; other identification information can also be used to distinguish different users, and this application does not limit this. The right column holds a specific industry name, such as smart meters, which is likewise not limited in this application.
  • the address whitelist also includes two columns, the left column corresponds to the Internet Protocol address (IP), and the right column corresponds to the industry name.
  • IP addresses are used to represent the peer addresses of different users in the same industry. Different users communicating with the same IP address can be considered to belong to the same industry. Other address information, such as SMS numbers or access point names (APN), can also be used to represent the peer addresses of different users in the same industry; this is not limited in this application.
  • the right column corresponds to a specific industry name, such as a smart meter. This application does not limit this either.
  • the manner in which the identification server merges the user whitelist set may include:
  • the identification server performs conflict deduplication on the user whitelist set through a preset rule, and the preset rule includes a priority based on the whitelist or a mapping time.
  • after the identification server obtains the current user whitelist information, it performs the first matching mapping on it.
  • for example, the current whitelist information obtained by the identification server is that IMSI1, IMSI2, and IMSI3 belong to industry A (A represents a specific industry name), and the peer addresses of IMSI1 (that is, the addresses that communicate with IMSI1) include the IP address of server B and the address of SMS station C (the address information can be the SMS number of the SMS station, such as 10086). Users who communicate with server B and users who communicate with SMS station C are then also considered to belong to industry A, and industry A tag information is added to them.
  • if the current whitelist information obtained by the identification server is that the address of SMS station 1 and the IP address of server 1 belong to industry B (B represents a specific industry name), that is, the identification server has obtained an address whitelist, the address whitelist needs to be mapped and converted to a user whitelist. That is, users who communicate with the address of SMS station 1 or the IP address of server 1 are given industry B label information. For example, if the users who communicate with the IP address of server 1 include user 1, user 2, and user 3, the IMSIs of these three users (i.e., IMSI1, IMSI2, and IMSI3) are given industry B label information.
  • after the address whitelist is mapped to a user whitelist, each user in the user whitelist is traversed. If the peer addresses of IMSI1 also include the IP address of server 2, the users who communicate with the IP address of server 2 are also considered users in industry B; industry B tag information is added to them, which expands the set of users belonging to industry B.
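  • The two mapping directions described above can be sketched as follows, assuming that the network data is available as simple (IMSI, peer address) pairs. That record layout, the function names, and the example identifiers are assumptions made for illustration.

```python
def users_for_addresses(address_whitelist, records):
    """Map an address whitelist {address: industry} to a user whitelist {imsi: industry}."""
    user_whitelist = {}
    for imsi, peer in records:
        industry = address_whitelist.get(peer)
        if industry is not None:
            # Keep the first label seen; conflict handling is a separate step.
            user_whitelist.setdefault(imsi, industry)
    return user_whitelist


def addresses_for_users(user_whitelist, records):
    """Map a user whitelist {imsi: industry} to the peer addresses of its users."""
    address_whitelist = {}
    for imsi, peer in records:
        industry = user_whitelist.get(imsi)
        if industry is not None:
            address_whitelist.setdefault(peer, industry)
    return address_whitelist


# Hypothetical network data: who communicated with which peer address.
records = [
    ("IMSI1", "server-1"), ("IMSI2", "server-1"), ("IMSI3", "server-1"),
    ("IMSI1", "server-2"), ("IMSI4", "server-2"),
]
address_whitelist = {"sms-station-1": "B", "server-1": "B"}

users = users_for_addresses(address_whitelist, records)   # IMSI1..IMSI3 -> B
addresses = addresses_for_users(users, records)           # adds server-2
expanded_users = users_for_addresses(addresses, records)  # adds IMSI4
```

  • One round trip (addresses to users, users to addresses, addresses to users) reproduces the expansion in the example: IMSI4 is labeled with industry B only because it shares server 2 with IMSI1.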
  • conflicts may arise during expansion. For example, IMSI1 belongs to industry A, the peer addresses of IMSI1 include the IP address of server 1, and the users communicating with server 1 also include IMSI2, so IMSI2 should also belong to industry A. If IMSI3 belongs to industry B and the peer addresses of IMSI3 also include the IP address of server 1, then IMSI2 should also belong to industry B. This is obviously incorrect, because a user cannot belong to two different industries at the same time.
  • the identification server therefore needs to perform conflict deduplication through preset rules when merging the user whitelist set.
  • the identification server can perform conflict deduplication according to the priority of the whitelist. For example, the identification server can set the whitelist priority in advance. Suppose the identification server obtains two initial user whitelists (user whitelist 1 and user whitelist 2): the information in user whitelist 1 is that IMSI1 and IMSI2 belong to industry A, and the information in user whitelist 2 is that IMSI3 and IMSI4 belong to industry B.
  • the identification server sets the priority of user whitelist 1 higher than that of user whitelist 2, so user whitelist 1 takes precedence when the user whitelist set is expanded.
  • if the peer addresses of IMSI1 include the IP address of server 1 and the peer addresses of IMSI3 also include the address of server 1, then, because the priority of user whitelist 1 is higher than that of user whitelist 2, the other users communicating with the IP address of server 1 are given only the industry A label information and not the industry B label information. Setting priorities for the user whitelists in this way effectively resolves conflicts and duplication.
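  • A minimal sketch of priority-based conflict deduplication. It assumes each expanded whitelist carries a numeric priority and that a lower number means a higher priority; that convention, the function name, and the user IMSI5 (who communicates with server 1 and is therefore claimed by both whitelists) are hypothetical.

```python
def merge_by_priority(whitelists):
    """Merge expanded user whitelists; on conflict the higher-priority label wins.

    whitelists: list of (priority, {imsi: industry}); lower number = higher priority.
    """
    merged = {}
    for _, labels in sorted(whitelists, key=lambda item: item[0]):
        for imsi, industry in labels.items():
            # A user already labeled by a higher-priority whitelist is left unchanged.
            merged.setdefault(imsi, industry)
    return merged


# Both expansions reach IMSI5 through server 1; whitelist 1 has priority.
expanded = [
    (2, {"IMSI3": "B", "IMSI4": "B", "IMSI5": "B"}),
    (1, {"IMSI1": "A", "IMSI2": "A", "IMSI5": "A"}),
]
merged = merge_by_priority(expanded)
```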
  • alternatively, the identification server performs conflict deduplication according to the order of the mapping expansion time. For example, the identification server obtains two initial user whitelists (user whitelist 1 and user whitelist 2). The information in user whitelist 1 is that IMSI1 and IMSI2 belong to industry A, and the information in user whitelist 2 is that IMSI3 and IMSI4 belong to industry B. The peer addresses of IMSI1 include the IP address of server 1, and the peer addresses of IMSI3 also include the address of server 1.
  • if the identification server first expands the user whitelist set for user whitelist 1, the other users communicating with the IP address of server 1 are given the industry A label information; when user whitelist 2 is expanded afterwards, there is no need to add the industry B label information to the other users communicating with the IP address of server 1. Ordering by the whitelist mapping expansion time in this way effectively resolves conflicts and duplication.
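  • The time-ordered rule can be sketched in the same style, assuming each label assignment is recorded as a (mapping time, IMSI, industry) event and that the earliest event for a user wins. The event layout, the function name, and the user IMSI5 are hypothetical.

```python
def merge_by_mapping_time(label_events):
    """Keep, for each user, the label assigned by the earliest mapping expansion.

    label_events: iterable of (mapping_time, imsi, industry).
    """
    merged = {}
    for _, imsi, industry in sorted(label_events, key=lambda event: event[0]):
        if imsi not in merged:
            merged[imsi] = industry
    return merged


# User whitelist 1 is expanded at time 1, user whitelist 2 at time 2.
events = [
    (2, "IMSI3", "B"), (2, "IMSI5", "B"),
    (1, "IMSI1", "A"), (1, "IMSI5", "A"),
]
merged_by_time = merge_by_mapping_time(events)
```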
  • the method further includes:
  • when the current whitelist information obtained by the identification server is the address whitelist Bi, the identification server first maps the address whitelist to a user whitelist, then expands and merges the user whitelist to obtain the extended user whitelist set, and then performs multiple-address mapping on the user whitelist (that is, each user in the user whitelist set is mapped to obtain its peer addresses) to obtain the address whitelist Bj.
  • the Bj obtained from the address mapping may contain some public addresses, which need to be identified and marked. The Bj obtained after the mapping is then compared with the current Bi.
  • if the length of Bj is greater than the length of Bi, Bj is used as the new current address whitelist and the above extended iterative process is repeated; during the repetition, the marked public addresses in Bj no longer need to be mapped. If the length of Bj is equal to the length of Bi, the user whitelist set corresponding to Bj is output. It can be understood that the user whitelist set is expanded through repeated iterative mapping and is output only when the length of the obtained address whitelist Bj equals the length of the current address whitelist Bi, which fully expands the user whitelist set and increases the sample space of reliable positive examples.
  • FIG. 4 is a schematic diagram of an iterative matching process of an address whitelist according to an embodiment of the present application.
  • the current address whitelist entered may be a different type of address whitelist.
  • the length of the current address whitelist indicates the number of peer addresses included in the address whitelist.
  • S403 Map and expand the user whitelist, and add industry tag information to the users in the user whitelist obtained after the expansion.
  • the user whitelist is mapped and expanded based on network data.
  • S405 Perform mapping of multiple addresses on the merged and deduplicated user whitelist to obtain the address whitelist.
  • S406 Compare and judge the obtained address white list with the input current address white list. If they are the same, execute S407; if they are not the same, use the obtained address white list as the current address white list, and execute S401.
  • the current address whitelist obtained by the identification server states that the IP address of server 1 and the IP address of server 2 belong to industry A. The users communicating with the IP address of server 1 include IMSI1 and IMSI2, and the users communicating with the IP address of server 2 include IMSI3 and IMSI4, so the user whitelist mapped from this address whitelist states that IMSI1, IMSI2, IMSI3, and IMSI4 belong to industry A. The peer addresses of IMSI1 also include the IP address of server 3, and the peer addresses of IMSI3 also include the address of SMS station 1, which is a public address (that is, all users may communicate with SMS station 1).
  • the identification server identifies the address of SMS station 1 and marks it.
  • the users that communicated with the IP address of server 3 are given the industry A label to obtain the expanded user whitelist set, and multiple-address mapping is then performed on the expanded user whitelist set to obtain a new address whitelist.
  • the new address whitelist states that the IP address of server 1, the IP address of server 2, and the IP address of server 3 belong to industry A.
  • the obtained address whitelist is compared with the input current address whitelist. Because the obtained address whitelist is longer than the input current address whitelist (that is, it can be expanded further), the obtained address whitelist is used as the new current whitelist.
  • the above steps are performed iteratively until the length of the obtained address white list is equal to the length of the input current address white list. At this time, the iteration process is stopped, and the user white list set corresponding to the obtained address white list is output.
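  • The iterative matching walked through above can be sketched as a fixed-point loop. This is a minimal illustration, not the claimed implementation: the function name `expand_whitelist`, the `(user, address)` record layout, and the assumption that public addresses are already known (rather than detected by the server) are all hypothetical.

```python
def expand_whitelist(seed_addresses, comm_records, public_addresses):
    """Alternate address->user and user->address mapping until no growth.

    seed_addresses:   peer addresses initially known to belong to the industry
    comm_records:     list of (user_id, peer_address) pairs from network data
    public_addresses: addresses any user may contact (assumed pre-marked here)
    """
    public = set(public_addresses)
    current = set(seed_addresses) - public
    while True:
        # Address -> user: every user that communicated with a listed address.
        users = {u for u, a in comm_records if a in current}
        # User -> address: every non-public address those users communicated with.
        expanded = current | {
            a for u, a in comm_records if u in users and a not in public
        }
        # Stop when the address whitelist did not get longer.
        if len(expanded) == len(current):
            return users, expanded
        current = expanded


records = [
    ("IMSI1", "server1"), ("IMSI1", "server3"), ("IMSI2", "server1"),
    ("IMSI3", "server2"), ("IMSI3", "sms1"), ("IMSI4", "server2"),
    ("IMSI5", "server3"), ("IMSI6", "sms1"),
]
users, addresses = expand_whitelist({"server1", "server2"}, records, {"sms1"})
print(sorted(users), sorted(addresses))
```

  • Excluding the public address from the mapping is what keeps IMSI6, which only contacted the SMS station, out of the expanded user whitelist in this toy data.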
  • the method further includes:
  • the identification server expands and merges the user whitelist Ai and, after mapping, obtains the expanded user whitelist Aj.
  • Aj is compared with the current Ai. If the length of Aj (that is, the number of IMSIs contained in Aj) is greater than the length of Ai, Aj is used as the new current user whitelist and the above expansion process is repeated; if the length of Aj is equal to the length of Ai, the user whitelist set corresponding to Aj is output.
  • the user whitelist set is expanded through repeated iterative mapping and is output only once the length of the obtained user whitelist Aj equals the length of the current user whitelist Ai, which fully expands the user whitelist set and enlarges the sample space of reliable positive examples.
  • S120 Identify the noise data in the positive data set, and calculate the proportion value of the noise data in the positive data set.
  • in the network there are a large number of services, including all services of the users in the user whitelist set and of other users, and this service information is reflected in data.
  • the services of the users in the user whitelist set are labeled and can be referred to as positive data sets, while the services of all other users can be referred to as unlabeled data sets. Understandably, the unlabeled data set will also contain some positive data.
  • As an example of noise data: in a Long Term Evolution (LTE) network, most service records (service data) counted in Call History Records (CHR) are very small and contribute very little traffic; close to 50% of downlink service packets have a length of 0.
  • This type of data is noise data for user identification. In fact, the service characteristics approach a Gaussian distribution only after this noise data is removed.
  • the identification server identifying the noise data in the positive data set and calculating the proportion value of the noise data in the positive data set includes:
  • the identification server can perform cluster analysis on the services along four dimensions of the services corresponding to the positive data set (uplink packet length, downlink packet length, uplink duration, and downlink duration), determine the feature distribution of the services, mark the positive data in the category with the smallest packet length and duration as noise data, and calculate the proportion value of the noise data in the positive data set.
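  • As a rough sketch of the last step, the noise proportion can be computed once every positive record carries a cluster label. The clustering itself is assumed to have been done already by any algorithm; the function name `noise_ratio`, the tuple layout, and the rule of summing the four per-dimension means to pick the "smallest" cluster are illustrative assumptions, not taken from the application.

```python
from collections import defaultdict


def noise_ratio(records, labels):
    """Pick the cluster with the smallest packets/durations and return its share.

    records: list of (ul_len, dl_len, ul_dur, dl_dur) tuples for positive data
    labels:  cluster id of each record, produced beforehand by any clusterer
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for rec, lab in zip(records, labels):
        counts[lab] += 1
        for k, value in enumerate(rec):
            sums[lab][k] += value

    def mean_total(lab):
        # Simplification: add the four per-dimension means without rescaling.
        return sum(s / counts[lab] for s in sums[lab])

    noise_label = min(counts, key=mean_total)
    return noise_label, counts[noise_label] / len(records)


records = [(0, 0, 0, 0), (1, 0, 0.1, 0), (500, 1500, 2, 3),
           (600, 1400, 2.5, 3.5), (50, 80, 10, 12)]
labels = [1, 1, 0, 0, 2]
print(noise_ratio(records, labels))
```

  • In practice the four dimensions have different units, so normalizing them before comparing clusters would likely be preferable to the plain sum used here.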
  • the identification server may obtain network data from a network node (such as a maintenance node), where the network data includes positive data and data to be identified.
  • the identification server obtains 100,000 call records from the maintenance node, of which 10,000 are labeled positive data corresponding to users in the user whitelist, and the remaining 90,000 are unlabeled data corresponding to other users. The unlabeled data may contain some positive data (merely unlabeled) as well as negative data (that is, data whose user does not belong to the user whitelist).
  • calculating the probability value that each data item in the unlabeled data set is positive data includes:
  • i and j are positive integers greater than or equal to 1.
  • the identification server randomly divides the obtained positive data set (denoted P) into i groups of spy data.
  • the value of i may range from 5 to 10 inclusive; the specific value is not limited in this application.
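  • A minimal sketch of the grouping step follows. It reads "divides into i groups" as a random partition into disjoint groups; the function name `split_into_spy_groups` and the fixed random seed are assumptions made only for illustration.

```python
import random


def split_into_spy_groups(positive_ids, i, seed=0):
    """Randomly partition the positive data set P into i disjoint spy groups."""
    rng = random.Random(seed)      # seeded only to make the sketch repeatable
    shuffled = list(positive_ids)
    rng.shuffle(shuffled)
    # Strided slicing yields i groups whose sizes differ by at most one.
    return [shuffled[k::i] for k in range(i)]


groups = split_into_spy_groups(range(23), i=5)
print([len(g) for g in groups])
```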
  • the EM algorithm involved in the embodiments of the present application consists of two parts: initialization and EM iteration.
  • during initialization, a naive Bayesian model N is constructed using the M and Pi sets.
  • in the E stage, N is used to predict the M set; in the M stage, the model is rebuilt using the new prediction results.
  • each sample in M is analyzed to obtain the probability value (denoted tj) that the sample is positive data.
  • S140 Determine a recognition threshold according to the probability value and the proportion value, and identify a counter-example data set from the unlabeled data set according to the recognition threshold.
  • the existing practice is to set the recognition threshold directly from experience: for example, data with a probability value greater than 0.5 is considered positive data and data with a probability value less than 0.5 is considered counter-example data, which yields the counter-example data set.
  • the counter-example data set obtained in this way introduces large errors into the subsequent user identification process, making the identification inaccurate.
  • the identification server determines an identification threshold according to the probability value and the proportion value, and identifies a counter-example data set from the unlabeled data set according to the identification threshold, which may include:
  • the probability values corresponding to the 100 data are reordered in ascending order.
  • the proportion of the noise data in the positive data is 30%; this value is only illustrative and is not a limitation.
  • The noise data among the 100 data items is treated as counter-example data with 30% confidence: among the probability values of the 100 data items sorted in ascending order, the value at the 30% position is taken as the recognition threshold. For example, if the 30th of the 100 ascending probability values is 0.4, then 0.4 is used as the recognition threshold.
  • the probability value corresponding to each data is compared with the recognition threshold (0.4), and all data corresponding to the probability values less than 0.4 are added to the counter-example data set RNi.
  • the recognition threshold and the counter-example data set RNi obtained by the above method have higher reliability than the recognition threshold and the counter-example set obtained through experience, which can make user identification more accurate.
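  • The threshold selection in the example above can be sketched as follows. The function names, the dictionary layout of the unlabeled probabilities, and the use of `round` to turn the proportion into a rank are assumptions for illustration; the application itself only fixes the idea of reading the threshold off the sorted spy probabilities at the noise proportion.

```python
def recognition_threshold(spy_probs, noise_ratio):
    """Return the spy probability at the `noise_ratio` position (ascending)."""
    if not spy_probs:
        raise ValueError("spy_probs must not be empty")
    ordered = sorted(spy_probs)
    # 1-based rank of the value closing the lowest `noise_ratio` share.
    rank = min(len(ordered), max(1, round(noise_ratio * len(ordered))))
    return ordered[rank - 1]


def reliable_negatives(unlabeled_probs, threshold):
    """Collect unlabeled items whose positive probability is below the threshold."""
    return {item for item, p in unlabeled_probs.items() if p < threshold}


spy_probs = [0.1] * 29 + [0.4] + [0.9] * 70   # 30th ascending value is 0.4
threshold = recognition_threshold(spy_probs, 0.30)
rn_i = reliable_negatives({"a": 0.2, "b": 0.4, "c": 0.7}, threshold)
print(threshold, rn_i)
```

  • Tying the threshold to the measured noise proportion, instead of a fixed 0.5, is the design choice the surrounding text argues for.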
  • the method may further include:
  • the union of the i counter-example data sets RNi obtained from the i groups of spy data is taken to obtain the counter-example set RN.
  • one counter-example data set RNi is obtained through the above method for each group of spy data. Because each group of spy data is different, the resulting counter-example data sets RNi differ as well. The union of all obtained counter-example data sets RNi is used as the final counter-example data set RN.
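  • The merge of the per-group results is a plain set union; a short sketch, with the hypothetical function name `merge_negative_sets`:

```python
def merge_negative_sets(per_group_negatives):
    """Union the counter-example sets RNi from all spy groups into RN."""
    merged = set()
    for rn_i in per_group_negatives:
        merged |= set(rn_i)
    return merged


rn = merge_negative_sets([{"r1", "r2"}, {"r2", "r3"}, set()])
print(sorted(rn))
```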
  • S150 Identify the user to be identified according to the positive data set and the negative data set.
  • the service data of a user who does not belong to the user whitelist is analyzed, and its characteristics (such as time, traffic, and coverage level) are compared with the positive data or the counter-example data to determine whether the service data belongs to the positive data set or the counter-example data set. This in turn determines whether the corresponding user belongs to the whitelisted industry, which completes the identification and classification of the user.
  • the embodiments of the present application can implement online or offline user identification. To identify a user offline, the service data that the user has already generated is obtained from the relevant network nodes and analyzed with the above method, using the expanded positive data set and counter-example data set, to identify and classify the user. To identify a user online, the service data generated by the user is obtained in real time from the relevant network node and analyzed in the same way to classify the user. Note that during online identification, the network data generated by the user must be obtained continuously from the relevant network node, and the above method steps must be executed multiple times, to ensure the accuracy of the classification.
  • if the identification server knows the status of the user (for example, whether the user is stationary), this has a significant impact on the user's classification and can further improve the accuracy of identification. Therefore, the identification server can identify the user's status and, in combination with the identified status, further identify the user to improve the identification accuracy.
  • the user involved in this embodiment of the present application may be an M2M terminal or a CPE terminal. Most of the M2M and CPE terminals are stationary and have no mobility.
  • the identification server may construct the static characteristics of the user based on the measurement reports in the network and identify the status of the user.
  • for a stationary user, the measured main serving cell is fixed, and the measured first neighboring cells (that is, the neighboring cells with the highest level) and the network topology are relatively fixed.
  • at the same location, the ratio of the number of times each different first neighboring cell is measured is fixed; in addition, the level of the main serving cell measured by a stationary user is relatively stable, the level of the measured first neighboring cell is relatively stable, and the ordered first neighboring cell level sequence (that is, the first neighboring cell levels sorted from large to small) is stable.
  • FIG. 5 is a schematic diagram of the topological structure and the measurement level characteristics of a stationary user.
  • the central position indicates the location of the main serving cell
  • the size of the shape indicates the level distribution characteristics of the main serving cell
  • the surrounding positions indicate the positions of the first neighboring cells
  • the size of each surrounding shape indicates the level distribution characteristics of the corresponding first neighboring cell
  • the shape of the curve represents the distribution characteristics of the neighboring cells measured by the user
  • the length of the straight line represents the proportion of the neighboring cells measured by the user.
  • based on the above characteristics, the identification server can calculate the first neighboring cell distance between calls and the similarity of the level sequence contours between calls. If the first neighboring cell distance between calls is less than a first threshold, or the similarity of the level sequence contours between calls is less than a second threshold, the identification server determines that the user is stationary. The first threshold and the second threshold can be set as required, and this application does not limit them.
  • the M industry represents a specific industry name (such as shared bicycles).
  • the number of whitelisted users is 352, and the total number of network users is 96532.
  • the network data includes 22 features such as packet length and duration.
  • the input of whitelisted users before iterative matching is shown in Table 1:
  • Table 1
    Whitelisted users | Total network users | Network data
    352 | 96532 | 22 features such as packet length and duration
  • FIG. 6 is a schematic diagram of a service distribution.
  • the horizontal axis represents any one of the dimensions of the uplink packet length, uplink duration, downlink packet length, or downlink duration
  • the vertical axis represents specific values.
  • the service categories corresponding to the network data fall into three categories (represented by different lines in the figure). Among them, category 1 services account for 37% of the total, and their average uplink and downlink packet lengths and uplink and downlink durations are close to 0.
  • the network data corresponding to these services is noise data because it cannot reflect the characteristics of the services.
  • Category 0 and category 2 can better reflect the characteristics of the business.
  • the counter-example data set is identified through the subsequent EM algorithm, and then the user corresponding to the counter-example data set is identified.
  • the identification is shown in Table 3:
  • noise data can be identified. Combining the identified noise data and the proportion of the noise data in the service increases the number of reliable counterexample users.
  • FIG. 7 is a schematic diagram comparing the user recognition effects provided by an embodiment of the present application. When only the initial whitelist and the traditional EM algorithm are used to identify counter-examples for modeling, the recognition accuracy rate is 59% and the recall rate is 65%. Here, the recall rate is the ratio of the number of correctly identified users to the number of actually correct users, and the accuracy rate is the ratio of the number of correctly identified users to the number of identified users.
  • when the whitelist is expanded by iterative matching and the traditional EM algorithm is used to identify counter-examples for modeling, the recognition accuracy rate is 66% and the recall rate is 72%; when the whitelist expanded by iterative matching and the proportion of noise data in the service are used to identify reliable counter-examples for modeling, the recognition accuracy rate is 78% and the recall rate is 83%. It can be seen that modeling with the whitelist expanded by iterative matching and with reliable counter-examples identified from the proportion of noise data in the service, as provided in the embodiments of the present application, effectively improves the accuracy of user identification: compared with modeling that uses only the initial whitelist and the traditional EM algorithm to identify counter-examples, both the accuracy rate and the recall rate increase by at least 15%.
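  • The two metrics defined above (the "accuracy rate" here corresponds to what is usually called precision) can be computed as in this small sketch; the function name and the toy user sets are illustrative and are not data from the application.

```python
def precision_recall(identified, actual):
    """Return (accuracy rate as defined above, recall rate) for two user sets."""
    identified, actual = set(identified), set(actual)
    correct = len(identified & actual)           # correctly identified users
    precision = correct / len(identified) if identified else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall


print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6}))
```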
  • implementing the embodiments of the present application does not require customer authorization to splice multiple data sources; only a small amount of whitelist information and network data needs to be obtained.
  • based on the user whitelist expanded by iterative matching, combined with the noise data in the service, online or offline user identification can be realized and the accuracy of user identification effectively improved.
  • FIG. 8 is a schematic structural diagram of an identification server according to an embodiment of the present application.
  • the identification server 800 includes at least: an obtaining unit 810, an identifying unit 820, a calculating unit 830, and a determining unit 840;
  • An obtaining unit 810 configured to obtain a user whitelist set and network data of a user to be identified, where the network data of the user to be identified includes a positive data set and an unlabeled data set corresponding to the user whitelist set;
  • an identifying unit 820 configured to identify noise data in the positive data set
  • a calculating unit 830 configured to calculate a proportion value of the noise data in the positive data set
  • a determining unit 840 configured to determine an identification threshold according to the probability value and the proportion value
  • the calculating unit 830 is further configured to calculate a probability value that each data item in the unlabeled data set is positive data; the identifying unit 820 is further configured to identify a counter-example data set from the unlabeled data set according to the identification threshold; and the identifying unit 820 is further configured to identify the user to be identified according to the positive data set and the counter-example data set.
  • the obtaining unit 810 is further configured to: obtain current whitelist information
  • the obtaining unit 810 further includes a mapping subunit 8101, configured to map a user whitelist set based on the current whitelist information and network data of the user to be identified;
  • the obtaining unit further includes a merging sub-unit 8102 for merging the user whitelist set.
  • the current whitelist information includes multiple different types of whitelists
  • the whitelist includes a current user whitelist Ai and / or a current address whitelist Bi
  • the user whitelist includes user identification information and industry information
  • the address whitelist includes address information and industry information.
  • the user identifier includes an international mobile subscriber identity IMSI
  • the address information includes an Internet protocol address IP.
  • the merge subunit 8102 further includes a conflict deduplication subunit 8103, and the conflict deduplication subunit 8103 is configured to perform conflict deduplication on the user whitelist set through a preset rule.
  • the preset rule includes a priority based on the white list or a mapping time.
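  • A possible shape for the conflict deduplication is sketched below. The application names whitelist priority and mapping time as alternative rules; combining them (priority first, mapping time as tie-breaker), the entry fields, and the function name `deduplicate` are assumptions made for this illustration.

```python
def deduplicate(entries):
    """Keep one industry label per IMSI.

    entries: iterable of dicts with the hypothetical keys
             'imsi', 'industry', 'priority' (higher wins), 'mapped_at' (later wins)
    """
    best = {}
    for entry in entries:
        kept = best.get(entry["imsi"])
        candidate_key = (entry["priority"], entry["mapped_at"])
        # Replace the kept entry only if the candidate ranks strictly higher.
        if kept is None or candidate_key > (kept["priority"], kept["mapped_at"]):
            best[entry["imsi"]] = entry
    return {imsi: entry["industry"] for imsi, entry in best.items()}


entries = [
    {"imsi": "IMSI1", "industry": "A", "priority": 2, "mapped_at": 10},
    {"imsi": "IMSI1", "industry": "B", "priority": 1, "mapped_at": 20},
    {"imsi": "IMSI2", "industry": "A", "priority": 1, "mapped_at": 5},
    {"imsi": "IMSI2", "industry": "B", "priority": 1, "mapped_at": 7},
]
print(deduplicate(entries))
```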
  • the mapping subunit 8101 is further configured to perform mapping of multiple addresses on the user whitelist in combination with the network data to obtain an address whitelist Bj, where the addresses include a public address;
  • the identification unit 820 is further configured to identify and mark the public address
  • the server further includes a judging unit 850 for judging whether the address whitelist Bj is consistent with the obtained current address whitelist Bi; if they are consistent, the user whitelist is output; if they are not consistent, the address whitelist Bj is used as the current whitelist information, and the step of mapping the user whitelist set based on the current whitelist information and the network data of the user to be identified is repeated.
  • the judging unit 850 is further configured to:
  • the identifying unit 820 further includes a cluster analysis subunit 8201, configured to perform cluster analysis on the positive data based on the uplink and downlink packet lengths and the uplink and downlink durations corresponding to the positive data set, and to identify and mark the category with the smallest uplink and downlink packet length and the smallest uplink and downlink duration as noise data.
  • the calculation unit 830 further includes a grouping subunit 8301, which is configured to divide the positive data set into i groups of spy data;
  • the calculating unit 830 further includes an analysis subunit 8303, configured to analyze each data item in M according to the EM model to obtain a probability value tj that each data item in M is positive data;
  • i and j are positive integers greater than or equal to 1.
  • the determining unit 840 is further configured to:
  • the identifying unit 820 further includes a union subunit 8202, configured to take the union of the i counter-example data sets RNi obtained from the i groups of spy data to obtain the counter-example set RN.
  • each unit may also correspond to the corresponding description of the method embodiment shown in FIG. 1, which is not repeated here.
  • FIG. 9 is a schematic structural diagram of another identification server provided by an embodiment of the present application.
  • the identification server 900 includes at least a processor 910, a memory 920, and a transceiver 930.
  • the processor 910, the memory 920, and the transceiver 930 are connected to each other through a bus 940.
  • the memory 920 includes, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), or Erasable Programmable Read-Only Memory (EPROM or flash memory).
  • the transceiver 930 may include a receiver and a transmitter, for example, a radio frequency module.
  • where the processor 910 is described below as receiving or sending a message, it can be understood that the processor 910 receives or sends the message through the transceiver 930.
  • the processor 910 may be one or more central processing units (CPUs).
  • the CPU may be a single-core CPU or a multi-core CPU.
  • the processor 910 in the identification server 900 is configured to read the program code stored in the memory 920 and perform the following operations:
  • the processor 910 receives the user whitelist set and the network data of the user to be identified through the transceiver 930.
  • the network data of the user to be identified includes the positive data set and the unlabeled data set corresponding to the user whitelist set.
  • the processor 910 identifies noise data in the positive data set, and calculates a ratio value of the noise data in the positive data set.
  • the processor 910 calculates a probability value that each data in the unlabeled data set is positive data.
  • the processor 910 determines a recognition threshold according to the probability value and the proportion value, and identifies a counter-example data set from the unlabeled data set according to the recognition threshold.
  • the processor 910 identifies the user to be identified according to the positive data set and the negative data set.
  • the base station can support different paging cycles within a cell by grouping the paging carriers and applying a specific paging configuration to each group of carriers, thereby meeting the paging requirements of both short-delay UEs and deep-coverage UEs.
  • the identification server obtains a small amount of user whitelist information and network data, expands the user whitelist by iterative matching, and then identifies noise data through cluster analysis and calculates its proportion value.
  • an EM model is constructed to compute, for the unlabeled data, the probability value of being positive data. Combined with the proportion value of the noise data, a reliable counter-example data set can be identified. Finally, the user to be identified is identified, which effectively improves the accuracy of identification.
  • An embodiment of the present application further provides a computer-readable storage medium.
  • the computer-readable storage medium stores instructions.
  • when the instructions run on a computer or a processor, the computer or the processor executes one or more steps of any one of the foregoing methods.
  • if each component module of the above device is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in the computer-readable storage medium.
  • the computer-readable storage medium may be an internal storage unit of the identification server according to any one of the foregoing embodiments, such as a hard disk or a memory.
  • the computer-readable storage medium may also be an external storage device of the identification server, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, and a flash memory card (Flash Card) and so on.
  • the computer-readable storage medium may further include both the internal storage unit of the identification server and an external storage device.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by the identification server.
  • the computer-readable storage medium described above may also be used to temporarily store data that has been or will be output.
  • the program can be stored in a computer-readable storage medium.
  • the foregoing storage medium includes various media that can store program codes, such as a ROM, a RAM, a magnetic disk, or an optical disc.
  • the modules in the apparatus of the embodiment of the present application may be combined, divided, and deleted according to actual needs.

Abstract

Provided in an embodiment of the present invention are a method for identifying a user and an identification server. The method mainly comprises: an identification server acquiring a set of user whitelists and network data of a user to be identified; identifying noise data in a positive dataset corresponding to the set of user whitelists, and performing calculation to obtain a ratio of the noise data in the positive dataset; establishing an EM model to calculate a probability that each data item in an unlabeled dataset is a positive data item; determining an identification threshold according to the probability and the ratio, and obtaining a negative dataset; and performing identification of the user. The embodiment of the present invention allows more reliable positive samples and negative samples to be obtained, thereby enhancing accuracy of subsequent modeling and user identification.

Description

User identification method and related device
Technical field
The present application relates to the technical field of mobile Internet service identification, and in particular to a service identification method and a related device.
Background
Service identification is a very important topic in the mobile Internet industry and is the basis for topics such as research on user network behavior and operators' intelligent pipelines.
In a wireless network, timely and effective tracking of user information cannot be guaranteed 100% of the time, and in existing communication protocols the wireless side has no function for recording and tracking user information, so the classification information of users and services (such as Machine to Machine (M2M) communication and Customer Premise Equipment (CPE)) cannot be obtained on the wireless side. However, as service types diversify, networks will carry more and more kinds of services, which requires a network to simultaneously support multiple services with different service characteristics and resource requirements. To plan networks more reasonably and optimize them more efficiently, classifying the users and services in the network will be an important breakthrough. Some scenarios also require real-time identification of the service type for better service scheduling and resource allocation.
Wireless-side service characteristics include the packet length, terminal capability, service duration, access frequency, and similar information of a service. Services of the same type (such as point-of-sale (POS) terminals and meters) are highly similar in these respects.
At present, user classification modeling relies on obtaining the complete user classification information of existing sites, which requires user account-opening information, all service records spliced from the core network to the wireless side, and wireless-side service records. Because customer authorization must be obtained and data must be collected from multiple nodes, this is very difficult to implement in practice. In addition, multiple data sources must be spliced, which incurs huge storage and computation overhead. The prior art offers no solution for accurately classifying network users and services.
发明内容Summary of the invention
本申请提供了一种用户识别方法和相关设备,能够获取更多可靠的正例数据和反例数据,提高用户识别的准确性。This application provides a user identification method and related equipment, which can obtain more reliable positive and negative data, and improve the accuracy of user identification.
第一方面,提供了一种用户识别方法,所述方法包括:识别服务器获取用户白名单集合和待识别用户的网络数据,所述待识别用户的网络数据包括所述用户白名单集合对应的正例数据集合和未标记数据集合;识别出所述正例数据集合中的噪声数据,并计算得到所述噪声数据在所述正例数据集合中的比例值;计算未标记数据集合中每个数据为正例数据的概率值;根据所述概率值和所述比例值确定识别阈值,并根据所述识别阈值从所述未标记数据集合中识别出反例数据集合;根据所述正例数据集合和所述反例数据集合,对待识别用户进行识别。In a first aspect, a user identification method is provided. The method includes: an identification server obtains a user whitelist set and network data of a user to be identified, and the network data of the user to be identified includes a positive corresponding to the user whitelist set. Example data set and unlabeled data set; identify noise data in the positive data set, and calculate the proportion of the noise data in the positive data set; calculate each data in the unlabeled data set Is a probability value of positive data; a recognition threshold value is determined according to the probability value and the proportional value, and a negative data set is identified from the unlabeled data set according to the recognition threshold; according to the positive data set and The counter-example data set is used to identify a user to be identified.
通过执行上述方法,识别服务器识别出用户白名单集合对应的正例数据集合中的噪声数据并计算得到该噪声数据在正例数据集合中的比例值,再通过EM模型计算得到未标记数据集合中每个数据为正例数据的概率值,根据该概率值和比例值确定识别阈值确定反例数 据集合,可以获得更多可靠的正例数据和反例数据,进而提高用户识别的准确性。By executing the above method, the recognition server recognizes the noise data in the positive data set corresponding to the user whitelist set and calculates the proportion of the noise data in the positive data set, and then calculates the unlabeled data set through the EM model calculation. Each data is a probability value of the positive example data. According to the probability value and the proportional value, the identification threshold is determined to determine the negative example data set. More reliable positive example data and negative example data can be obtained, thereby improving the accuracy of user identification.
In a possible implementation, the identification server may obtain the user whitelist set as follows: the identification server obtains current whitelist information and maps out a user whitelist set based on the current whitelist information and the network data of the users to be identified; the user whitelist set is then merged.
By performing the above method, the initially obtained whitelist information can be expanded by mapping it against the network data, which enlarges the sample space of reliable positive data.
In another possible implementation, the current whitelist information includes multiple different types of whitelists, the whitelists include a current user whitelist Ai and/or a current address whitelist Bi, the user whitelist includes user identification information and industry information, and the address whitelist includes address information and industry information.
By performing the above method, the identification server can perform iterative mapping based on the current user whitelist Ai or the current address whitelist Bi to expand the whitelist information and enlarge the sample space of reliable positive data.
In another possible implementation, the user identification information includes an international mobile subscriber identity (IMSI), and the address information includes an Internet Protocol (IP) address.
In another possible implementation, the merging of the user whitelist set by the identification server includes: the identification server resolves conflicts and removes duplicates in the user whitelist set according to a preset rule, where the preset rule is based on the priority of the whitelists or on the mapping time.
By performing the above method, the identification server resolves conflicts and removes duplicates in the user whitelist set according to whitelist priority or mapping time, which ensures that the resulting user whitelist set contains no conflicting or duplicate users and improves the accuracy of the positive data sample space.
In another possible implementation, after the identification server merges the user whitelist set, the method further includes:
the identification server maps the user whitelist to multiple kinds of addresses in combination with the network data to obtain an address whitelist Bj, where the addresses include public addresses, and the public addresses are identified and marked during the mapping;
the identification server determines whether the address whitelist Bj is consistent with the obtained current address whitelist Bi; if they are consistent, the user whitelist is output; if they are not consistent, the address whitelist Bj is taken as the current whitelist information and the step of mapping out a user whitelist set based on the current whitelist information and the network data of the users to be identified is repeated.
By performing the above method, when the initial whitelist information is an address whitelist, the identification server obtains an address whitelist by mapping the resulting user whitelist to addresses. Public addresses can be identified and marked so that they no longer take part in subsequent iterations of the mapping, which simplifies the mapping process and improves mapping efficiency.
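The iteration described above can be sketched as a fixed-point loop. This is a minimal illustration, not the implementation in the application: it handles a single industry, the function name `expand_address_whitelist`, the `(imsi, peer_address)` flow format, the `max_rounds` guard, and the precomputed `public_addresses` input are all assumptions, and how public addresses are detected is not shown.

```python
def expand_address_whitelist(flows, address_whitelist, public_addresses, max_rounds=20):
    """Alternate address -> user -> address mapping until the address whitelist is stable.

    flows: iterable of (imsi, peer_address) pairs taken from the network data.
    public_addresses: addresses already marked as public (hypothetical input).
    Returns the user whitelist and the final address whitelist.
    """
    flows = list(flows)
    current = set(address_whitelist) - set(public_addresses)   # plays the role of Bi
    users = set()
    for _ in range(max_rounds):
        # Address whitelist -> user whitelist: users that talk to a listed address.
        users = {imsi for imsi, peer in flows if peer in current}
        # User whitelist -> address whitelist Bj, skipping marked public addresses.
        mapped = {peer for imsi, peer in flows
                  if imsi in users and peer not in public_addresses}
        if mapped == current:        # Bj is consistent with Bi: stop and output
            break
        current = mapped             # otherwise Bj becomes the current whitelist
    return users, current
```

Excluding public addresses from the mapped set is what keeps unrelated users, who merely share a public peer with a whitelisted user, out of the result.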
In another possible implementation, after the identification server merges the user whitelist set, the method further includes:
the identification server determines whether the user whitelist Aj corresponding to the user whitelist set is consistent with the obtained current user whitelist Ai; if they are consistent, the user whitelist Aj is output; if they are not consistent, the user whitelist Aj is taken as the current whitelist information and the step of mapping out a user whitelist set based on the current whitelist information and the network data of the users to be identified is repeated.
By performing the above method, when the initial whitelist information is a user whitelist, the identification server decides whether to continue the iterative mapping by checking whether the user whitelist obtained after mapping is consistent with the user whitelist before mapping. This effectively expands the coverage of the user whitelist and yields a sufficiently comprehensive sample space of positive data.
In another possible implementation, the method further includes:
the identification server performs cluster analysis on the positive data based on the uplink and downlink packet lengths and the uplink and downlink durations corresponding to the positive data set, identifies and marks the cluster with the smallest uplink and downlink packet lengths and uplink and downlink durations as noise data, and calculates the proportion of the noise data in the positive data set.
By performing the above method, the identification server clusters the positive data along four dimensions (uplink packet length, downlink packet length, uplink duration, and downlink duration), which makes it possible to separate out the noise data accurately and to calculate its proportion in the positive data set.
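As an illustration of the clustering step, the sketch below runs a plain k-means over four-dimensional records and reports the share of records in the "smallest" cluster. The text does not name a clustering algorithm or say how "smallest" is judged across four dimensions, so k-means, the default `k=2`, and ranking clusters by the sum of their center coordinates are all assumptions; in practice the four features would likely need to be scaled to comparable ranges first.

```python
import math

def kmeans(points, k, max_rounds=100):
    """Plain k-means with a deterministic start (seeds spread by coordinate sum)."""
    ordered = sorted(points, key=sum)
    step = (len(ordered) - 1) / (k - 1) if k > 1 else 0
    centers = [ordered[round(i * step)] for i in range(k)]
    labels = []
    for _ in range(max_rounds):
        # Assign every record to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centers[c])
        if updated == centers:       # converged
            break
        centers = updated
    return labels, centers

def noise_ratio(records, k=2):
    """records: (ul_packet_len, dl_packet_len, ul_duration, dl_duration) tuples."""
    labels, centers = kmeans(records, k)
    # Assumption: the noise cluster is the one whose center has the smallest coordinate sum.
    noise_cluster = min(range(k), key=lambda c: sum(centers[c]))
    return labels.count(noise_cluster) / len(records)
```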
In another possible implementation, the calculation by the identification server of the probability value that each item of data in the unlabeled data set is positive data includes:
the identification server divides the positive data set into i groups of spy data;
the identification server constructs an iterative EM model from M and Pi, where M = U + Si and Pi = P - Si, Si denotes each group of the spy data, P denotes the positive data set, and U denotes the unlabeled data set;
the identification server analyzes each item of data in M according to the EM model to obtain the probability value tj that each item of data in M is positive data;
where i and j are positive integers greater than or equal to 1.
By performing the above method, the identification server can analyze each item of data in M by constructing an EM model and accurately obtain the probability value that each item of data in M is positive data.
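The set construction M = U + Si and Pi = P - Si can be sketched as follows. Only the bookkeeping is shown; the EM model itself is out of scope. The function name `spy_rounds`, the random equal-size partition, and the fixed `seed` are assumptions, since the text does not say how the positive data set is divided into groups. Items are assumed to be unique and hashable (for example, record identifiers).

```python
import random

def spy_rounds(positive, unlabeled, n_groups, seed=0):
    """Yield (S_i, M, P_i) for each spy group, with M = U + S_i and P_i = P - S_i."""
    shuffled = list(positive)
    random.Random(seed).shuffle(shuffled)          # assumed: random partition of P
    groups = [shuffled[g::n_groups] for g in range(n_groups)]
    for spies in groups:
        spy_set = set(spies)
        mixed = list(unlabeled) + spies            # M: spies hidden among unlabeled data
        remaining = [p for p in positive if p not in spy_set]   # P_i
        yield spies, mixed, remaining
```

In each round, a classifier would be trained with Pi as the positive class and M as the unlabeled class, and the probabilities it assigns to the hidden spies are what later calibrate the threshold.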
In another possible implementation, the determination by the identification server of the identification threshold according to the probability values and the proportion, and the identification of the negative data set from the unlabeled data set according to the identification threshold, include:
the identification server combines the probability values tj that each item of data in M is positive data to obtain the probability value t at which the noise data in M is judged to be negative data, with the proportion of the noise data in the positive data taken as the confidence level;
the identification server compares each tj with t and adds the data corresponding to every tj smaller than t to the negative data set RNi.
By performing the above method, the identification server takes the proportion of the noise data in the positive data as the confidence level, obtains the probability value t at which the noise data in M is judged to be negative data, and then compares each tj with t and adds the data corresponding to every tj smaller than t to the negative data set RNi. This improves the accuracy of the negative data set and therefore of user identification.
In another possible implementation, the method further includes:
the identification server takes the union of the i negative data sets RNi obtained from the i groups of spy data to obtain the negative set RN.
By performing the above method, the accuracy of the negative set RN can be further improved, which ensures that user identification can be performed accurately.
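The thresholding and union steps can be sketched as below. The text says only that the noise proportion serves as the confidence level for t; reading this as "choose t so that a share of the spies equal to the noise proportion scores below t" is an assumption borrowed from the usual spy technique in positive-unlabeled learning. The function names and the probability dictionary format are hypothetical.

```python
import math

def spy_threshold(spy_probs, noise_ratio):
    """Pick t so that roughly noise_ratio of the spies have a probability below t."""
    ranked = sorted(spy_probs)
    n_noise = min(math.floor(noise_ratio * len(ranked)), len(ranked) - 1)
    return ranked[n_noise]

def reliable_negatives(unlabeled_probs, t):
    """RN_i: unlabeled items whose positive probability t_j is smaller than t."""
    return {item for item, prob in unlabeled_probs.items() if prob < t}

def merge_rounds(rn_sets):
    """RN: union of the RN_i sets obtained from the individual spy groups."""
    return set().union(*rn_sets)
```

With a noise proportion of 0, t falls at the lowest spy probability, so no spy is ever treated as negative; larger proportions let the lowest-scoring spies be discounted as noise.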
In a second aspect, an identification server is provided. The identification server includes:
an obtaining unit, configured to obtain a user whitelist set and network data of users to be identified, where the network data of the users to be identified includes a positive data set corresponding to the user whitelist set and an unlabeled data set;
an identification unit, configured to identify noise data in the positive data set;
a calculation unit, configured to calculate the proportion of the noise data in the positive data set;
a determining unit, configured to determine an identification threshold according to the probability values and the proportion;
where the calculation unit is further configured to calculate the probability value that each item of data in the unlabeled data set is positive data; the identification unit is further configured to identify a negative data set from the unlabeled data set according to the identification threshold; and the identification unit is further configured to identify the users to be identified according to the positive data set and the negative data set.
In another possible implementation, the obtaining unit is further configured to obtain current whitelist information;
the obtaining unit further includes a mapping subunit, configured to map out a user whitelist set based on the current whitelist information and the network data of the users to be identified;
the obtaining unit further includes a merging subunit, configured to merge the user whitelist set.
In another possible implementation, the current whitelist information includes multiple different types of whitelists, the whitelists include a current user whitelist Ai and/or a current address whitelist Bi, the user whitelist includes user identification information and industry information, and the address whitelist includes address information and industry information.
In another possible implementation, the user identification information includes an international mobile subscriber identity (IMSI), and the address information includes an Internet Protocol (IP) address.
In another possible implementation, the merging subunit further includes a conflict deduplication subunit, configured to resolve conflicts and remove duplicates in the user whitelist set according to a preset rule, where the preset rule is based on the priority of the whitelists or on the mapping time.
In another possible implementation, the mapping subunit is further configured to map the user whitelist to multiple kinds of addresses in combination with the network data to obtain an address whitelist Bj, where the addresses include public addresses;
the identification unit is further configured to identify and mark the public addresses;
the server further includes a judging unit, configured to determine whether the address whitelist Bj is consistent with the obtained current address whitelist Bi; if they are consistent, the user whitelist is output; if they are not consistent, the address whitelist Bj is taken as the current whitelist information and the step of mapping out a user whitelist set based on the current whitelist information and the network data of the users to be identified is repeated.
In another possible implementation, the judging unit is further configured to:
determine whether the user whitelist Aj corresponding to the user whitelist set is consistent with the obtained current user whitelist Ai; if they are consistent, output the user whitelist Aj; if they are not consistent, take the user whitelist Aj as the current whitelist information and repeat the step of mapping out a user whitelist set based on the current whitelist information and the network data of the users to be identified.
In another possible implementation, the identification unit further includes a cluster analysis subunit, configured to perform cluster analysis on the positive data based on the uplink and downlink packet lengths and the uplink and downlink durations corresponding to the positive data set, and to identify and mark the cluster with the smallest uplink and downlink packet lengths and uplink and downlink durations as noise data.
In another possible implementation, the calculation unit further includes a grouping subunit, configured to divide the positive data set into i groups of spy data;
the calculation unit further includes a construction subunit, configured to construct an iterative EM model from M and Pi, where M = U + Si and Pi = P - Si, Si denotes each group of the spy data, P denotes the positive data set, and U denotes the unlabeled data set;
the calculation unit further includes an analysis subunit, configured to analyze each item of data in M according to the EM model to obtain the probability value tj that each item of data in M is positive data;
where i and j are positive integers greater than or equal to 1.
In another possible implementation, the determining unit is further configured to:
combine the probability values tj that each item of data in M is positive data to obtain the probability value t at which the noise data in M is judged to be negative data, with the proportion of the noise data in the positive data taken as the confidence level;
compare each tj with t and add the data corresponding to every tj smaller than t to the negative data set RNi.
In another possible implementation, the identification unit further includes a union subunit, configured to take the union of the i negative data sets RNi obtained from the i groups of spy data to obtain the negative set RN.
In a third aspect, an identification server is provided. The identification server includes a processor, a memory, and a transceiver, where:
the processor, the memory, and the transceiver are connected to one another, the memory is configured to store a computer program, the computer program includes program instructions, and the processor is configured to invoke the program instructions to perform the following steps:
obtaining a user whitelist set and network data of users to be identified, where the network data of the users to be identified includes a positive data set corresponding to the user whitelist set and an unlabeled data set;
identifying noise data in the positive data set and calculating the proportion of the noise data in the positive data set;
calculating the probability value that each item of data in the unlabeled data set is positive data;
determining an identification threshold according to the probability values and the proportion, and identifying a negative data set from the unlabeled data set according to the identification threshold;
identifying the users to be identified according to the positive data set and the negative data set.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor of an identification server, cause the processor of the identification server to perform the method described in the first aspect or in any optional implementation of the first aspect.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic flowchart of a user identification method according to an embodiment of this application;
FIG. 2 is a schematic diagram of a user whitelist according to an embodiment of this application;
FIG. 3 is a schematic diagram of an address whitelist according to an embodiment of this application;
FIG. 4 is a schematic flowchart of iterative matching of an address whitelist according to an embodiment of this application;
FIG. 5 is a schematic diagram of the topology and measured level distribution characteristics of a stationary user according to an embodiment of this application;
FIG. 6 is a schematic diagram of a service distribution according to an embodiment of this application;
FIG. 7 is a schematic diagram comparing user identification results according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of an identification server according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of another identification server according to an embodiment of this application.
DETAILED DESCRIPTION
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
The identification server in the embodiments of this application may be a server that communicates with terminal devices. The identification server may be any device with a wireless transceiver function, or a chip that can be installed in such a device. Such devices include but are not limited to: an evolved Node B (eNB), a radio network controller (RNC), a Node B (NB), a base station controller (BSC), a base transceiver station (BTS), a home base station (for example, a home evolved NodeB or home Node B, HNB), a baseband unit (BBU), an access point (AP) in a wireless fidelity (WIFI) system, a wireless relay node, a wireless backhaul node, a transmission point (TP), or a transmission and reception point (TRP). The device may also be a gNB or a transmission point (TRP or TP) in a 5G system such as NR, one antenna panel or a group of antenna panels (including multiple antenna panels) of a base station in a 5G system, or a network node that constitutes a gNB or a transmission point, such as a baseband unit (BBU) or a distributed unit (DU).
In the embodiments of this application, the identification server can obtain the required whitelist information and network data (for example, service data or service event tracking data). The whitelist information and network data may be stored on the identification server, or may be obtained by the identification server over the Internet from other devices such as network nodes or maintenance nodes. The network data used by the identification server can be divided into sample data and test data. The sample data is used for mapping to obtain the user whitelist set and thereby enlarge the sample space of reliable positive examples; the test data is used by the identification server to identify the users that carry the test data. The sample data may be part of the test data.
The embodiments of this application can be applied to identifying user types offline or online in mixed-service scenarios in a network, enabling resource and experience optimization based on user type.
A user identification method and related devices provided by the embodiments of this application are described in detail below. It should be noted that the order in which the embodiments are presented only reflects their sequence and does not indicate the relative merits of the technical solutions they provide.
Refer to FIG. 1. FIG. 1 is a schematic flowchart of a user identification method according to an embodiment of this application. The method includes but is not limited to the following steps:
S110: Obtain a user whitelist set and network data of users to be identified, where the network data of the users to be identified includes a positive data set corresponding to the user whitelist set and an unlabeled data set.
Specifically, the identification server may send a request message to an operator (a mobile operator) to request the user whitelist set, or the identification server may obtain the user whitelist set based on the subscription information of user terminals.
Further, the identification server may obtain the network data of the users to be identified from a network node or a network maintenance node, for example an access point (AP). This network data characterizes the users' services: by analyzing it, the corresponding service characteristics can be derived, and it can then be determined which category the users with those service characteristics belong to.
Optionally, the identification server may obtain the user whitelist set by:
obtaining current whitelist information, and mapping out a user whitelist set based on the current whitelist information and the network data of the users to be identified;
merging the user whitelist set.
Specifically, when a user registers, the mobile operator can record the user's whitelist information. The identification server obtains this whitelist information from the mobile operator, takes it as the current whitelist information, and then maps it against the network data obtained from network maintenance nodes or similar sources to obtain the user whitelist set.
Optionally, the current whitelist information includes multiple different types of whitelists, the whitelists include a current user whitelist Ai and/or a current address whitelist Bi, the user whitelist includes user identification information and industry information, and the address whitelist includes address information and industry information.
Specifically, the identification server can obtain different types of whitelists, each corresponding to a different industry type, such as shared bicycles, smart electricity meters, or smart street lights. The whitelists include user whitelists Ai and address whitelists Bi, where Ai denotes the user whitelists corresponding to different industries and Bi denotes the address whitelists corresponding to different industries.
Refer to FIG. 2, which is a schematic diagram of a user whitelist. The user whitelist has two columns: the left column holds the user's international mobile subscriber identity (IMSI), and the right column holds the industry name. Here the IMSI is used to distinguish users; other identification information may also be used, and this application places no limitation on it. The right column holds a specific industry name, for example smart electricity meter, and this application places no limitation on it either.
Refer to FIG. 3, which is a schematic diagram of an address whitelist. The address whitelist also has two columns: the left column holds the Internet Protocol (IP) address, and the right column holds the industry name. Here the IP address represents the peer address of different users in the same industry, and all users that communicate with that IP address can be regarded as belonging to the same industry. Other address information may also be used to represent the peer address of users in the same industry, for example an SMS short number or an access point name (APN), and this application places no limitation on it. The right column holds a specific industry name, for example smart electricity meter, and this application places no limitation on it either.
Optionally, the identification server may merge the user whitelist set as follows:
the identification server resolves conflicts and removes duplicates in the user whitelist set according to a preset rule, where the preset rule is based on the priority of the whitelists or on the mapping time.
Specifically, after the identification server obtains the current user whitelist information, it performs a first round of matching and mapping. For example, suppose the current whitelist information states that IMSI1, IMSI2, and IMSI3 belong to industry A (A denotes a specific industry name), and the peer addresses of IMSI1 (that is, the addresses IMSI1 communicates with) include the IP address of server B and the address of SMS center C (the address may be the SMS short number of the SMS center, for example 10086). Users that communicate with server B and users that communicate with SMS center C are then also regarded as belonging to industry A and are given the industry A label. The other users in the current whitelist information are traversed in the same way to expand the set of users belonging to industry A. This matching and mapping enlarges the user whitelist set for industry A and the sample space of reliable positive examples.
When the current whitelist information obtained by the identification server states that the address of SMS center 1 and the IP address of server 1 belong to industry B (B denotes a specific industry name; the address of an SMS center may be an SMS short number, for example 10086), the identification server has obtained an address whitelist, which must first be mapped to a user whitelist. That is, users that communicate with the address of SMS center 1 or the IP address of server 1 are given the industry B label. For example, if the users communicating with the IP address of server 1 are user 1, user 2, and user 3, the IMSIs of these three users (IMSI1, IMSI2, and IMSI3) are given the industry B label. After the address whitelist has been mapped to a user whitelist, every user in the user whitelist is traversed. If the peer addresses of IMSI1 also include the IP address of server 2, the users that communicate with the IP address of server 2 are also regarded as belonging to industry B and are given the industry B label, which expands the set of users belonging to industry B.
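The two mappings in the examples above can be sketched as follows. This is an illustration only: the function names, the `(imsi, peer_address)` flow format, and the use of a set of candidate industry labels per user are assumptions. Keeping a set rather than a single label makes any conflicting labels visible for the later merge step.

```python
from collections import defaultdict

def users_from_addresses(flows, address_labels):
    """Address whitelist -> user whitelist: label every user that talks to a listed address."""
    labels = defaultdict(set)
    for imsi, peer in flows:
        if peer in address_labels:
            labels[imsi].add(address_labels[peer])
    return dict(labels)

def expand_users(flows, user_labels):
    """One mapping round: users sharing a peer address with a labeled user inherit its labels."""
    peers_of, users_of = defaultdict(set), defaultdict(set)
    for imsi, peer in flows:
        peers_of[imsi].add(peer)
        users_of[peer].add(imsi)
    expanded = {imsi: set(industries) for imsi, industries in user_labels.items()}
    for imsi, industries in user_labels.items():
        for peer in peers_of[imsi]:
            for other in users_of[peer]:
                expanded.setdefault(other, set()).update(industries)
    return expanded
```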
The method above effectively expands the user whitelist sets of different industries, but it can produce conflicts in which one user carries the labels of two or more different industries. For example, IMSI1 belongs to industry A, the peer addresses of IMSI1 include the IP address of server 1, and IMSI2 also communicates with server 1, so IMSI2 would be assigned to industry A. IMSI3 belongs to industry B and its peer addresses also include the IP address of server 1, so IMSI2 would be assigned to industry B as well. This is clearly incorrect, because a user cannot belong to two different industries at the same time.
To resolve such conflicts, the identification server deduplicates according to a preset rule when merging the user whitelist sets. One option is to deduplicate according to whitelist priority, with the priority levels set in advance. For example, suppose the identification server obtains two initial user whitelists: user whitelist 1 states that IMSI1 and IMSI2 belong to industry A, and user whitelist 2 states that IMSI3 and IMSI4 belong to industry B. If the identification server ranks user whitelist 1 above user whitelist 2, the expansion of user whitelist 1 takes precedence when the user whitelist sets are expanded. Suppose the peer addresses of IMSI1 and of IMSI3 both include the IP address of server 1. Because user whitelist 1 has the higher priority, the other users who communicate with the IP address of server 1 receive only the industry A label and not the industry B label. Assigning priority levels to the user whitelists in this way resolves the conflicts.
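As a rough sketch of priority-based deduplication, assume each user has accumulated a set of candidate industry labels during expansion and each industry whitelist has a numeric rank, where a smaller rank means a higher priority. Both data structures and the function name are illustrative assumptions.

```python
def resolve_by_priority(candidate_labels, priority):
    """Keep, for each user, only the label whose whitelist ranks highest.

    candidate_labels: dict mapping IMSI -> set of candidate industry labels
    priority: dict mapping industry label -> rank (smaller = higher priority)
    """
    resolved = {}
    for imsi, labels in candidate_labels.items():
        if labels:  # users without any candidate label stay unlabeled
            resolved[imsi] = min(labels, key=lambda label: priority[label])
    return resolved
```

For a user reached through both industry A and industry B, with A ranked above B, only the industry A label is kept.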
Alternatively, the identification server deduplicates according to the order in which the whitelists are mapped and expanded. For example, suppose the identification server obtains two initial user whitelists: user whitelist 1 states that IMSI1 and IMSI2 belong to industry A, user whitelist 2 states that IMSI3 and IMSI4 belong to industry B, and the peer addresses of IMSI1 and of IMSI3 both include the IP address of server 1. If the identification server expands user whitelist 1 first, the other users who communicate with the IP address of server 1 are tagged with the industry A label. When user whitelist 2 is expanded afterwards, those users are no longer tagged with the industry B label. Deduplicating by the order of mapping and expansion in this way also resolves the conflicts.
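The order-based rule amounts to "first label wins". A minimal sketch, assuming the expansion results are supplied as a list of (industry, users) pairs in the order in which the whitelists were expanded (a hypothetical input format):

```python
def label_in_expansion_order(expansions):
    """Assign each user the label of the whitelist that reached it first.

    expansions: list of (industry, iterable of IMSIs), in expansion order.
    """
    labels = {}
    for industry, users in expansions:
        for imsi in users:
            # setdefault keeps an earlier label and ignores later ones.
            labels.setdefault(imsi, industry)
    return labels
```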
It should be noted that the analysis above covers only overlapping peer addresses between users in two user whitelists. When the peer addresses of users in more than two user whitelists overlap, conflicts can still be resolved by whitelist priority, by the order of mapping and expansion, or by other similar methods; this application imposes no limitation in this respect.
Optionally, after the identification server merges the user whitelist sets, the method further includes:
mapping the user whitelist to multiple kinds of addresses in combination with the network data to obtain an address whitelist Bj, where the addresses include public addresses, and the public addresses are identified and marked during the mapping; and
determining whether the address whitelist Bj is consistent with the obtained current address whitelist Bi; if they are consistent, outputting the user whitelist; if they are not consistent, taking the address whitelist Bj as the current whitelist information and repeating the step of mapping out the user whitelist set based on the current whitelist information and the network data of the users to be identified.
Specifically, when the current whitelist information obtained by the identification server is an address whitelist Bi, the identification server first maps the address whitelist to a user whitelist, then expands and merges that user whitelist to obtain an expanded user whitelist set, and then maps the user whitelist set to multiple kinds of addresses (that is, maps each user in the set to its peer addresses) to obtain an address whitelist Bj. The Bj obtained by this address mapping may contain some public addresses, which need to be identified and marked. The resulting Bj is compared with the current Bi. If the length of Bj (that is, the number of peer addresses it contains) is greater than the length of Bi, Bj is taken as the new current address whitelist and the expansion iteration above is repeated; during the repetition, the public addresses already marked in Bj are no longer mapped. If the length of Bj equals the length of Bi, the user whitelist set corresponding to Bj is output. Because the user whitelist set is expanded by repeated iterative mapping and is output only when the length of the obtained address whitelist Bj equals the length of the current address whitelist Bi, the user whitelist set is fully expanded and the sample space of reliable positive examples is increased.
Referring to FIG. 4, FIG. 4 is a schematic flowchart of iterative matching of an address whitelist according to an embodiment of this application.
S401: Input the current address whitelist.
The input current address whitelist may be an address whitelist of any of several types.
The length of the current address whitelist is the number of peer addresses that the address whitelist contains.
S402: Map the address whitelist to a user whitelist.
S403: Map and expand the user whitelist, and add industry label information to the users in the expanded user whitelist.
The user whitelist is mapped and expanded in combination with the network data.
S404: Merge and deduplicate the expanded user whitelist.
S405: Map the merged and deduplicated user whitelist to multiple kinds of addresses to obtain an address whitelist.
S406: Compare the obtained address whitelist with the input current address whitelist. If they are consistent, perform S407; if they are not consistent, take the obtained address whitelist as the current address whitelist and perform S401.
S407: Output the user whitelist.
For example, suppose the current address whitelist obtained by the identification server states that the IP address of server 1 and the IP address of server 2 belong to industry A, the users communicating with the IP address of server 1 are IMSI1 and IMSI2, and the users communicating with the IP address of server 2 are IMSI3 and IMSI4. The user whitelist mapped from this address whitelist therefore states that IMSI1, IMSI2, IMSI3, and IMSI4 belong to industry A. The peer addresses of IMSI1 also include the IP address of server 3, and the peer addresses of IMSI3 also include the address of SMS center 1, which is a public address (that is, possibly every user communicates with SMS center 1). The identification server identifies and marks the address of SMS center 1, and tags the users who communicate with the IP address of server 3 with the industry A label, which yields the expanded user whitelist set. The expanded user whitelist set is then mapped to multiple kinds of addresses to obtain a new address whitelist, which now states that the IP addresses of server 1, server 2, and server 3 belong to industry A. This address whitelist is compared with the input current address whitelist. Because the obtained address whitelist is longer than the input current address whitelist (which means further expansion is possible), the obtained address whitelist is taken as the new current address whitelist and the steps above are repeated until the length of the obtained address whitelist equals the length of the input current address whitelist. The iteration then stops, and the user whitelist set corresponding to the obtained address whitelist is output.
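The S401 to S407 loop can be sketched as a fixed-point iteration over the address whitelist. The application does not say how public addresses are recognized, so the rule below (an address contacted by more than `public_ratio` of all users is public) is purely an assumption, as are the function name and the (IMSI, peer address) record layout. The sketch also folds the expansion of S403 into the next round of the loop, which reaches the same fixed point.

```python
def iterate_address_whitelist(initial_addresses, records, public_ratio=0.5):
    """Expand an address whitelist until its length stops growing.

    records: iterable of hypothetical (imsi, peer_address) pairs.
    Returns (user whitelist, final address whitelist, marked public addresses).
    """
    peers_of_user, users_of_peer = {}, {}
    for imsi, peer in records:
        peers_of_user.setdefault(imsi, set()).add(peer)
        users_of_peer.setdefault(peer, set()).add(imsi)
    total_users = len(peers_of_user)

    # Assumed rule: an address used by most users is treated as public.
    public = {addr for addr, users in users_of_peer.items()
              if len(users) / total_users > public_ratio}

    current = set(initial_addresses) - public
    while True:
        # S402: map the address whitelist to a user whitelist.
        users = set()
        for addr in current:
            users |= users_of_peer.get(addr, set())
        # S405: map users back to their peer addresses, skipping public ones.
        mapped = set(current)
        for imsi in users:
            mapped |= peers_of_user[imsi] - public
        # S406: stop when the address whitelist no longer grows.
        if len(mapped) == len(current):
            return users, mapped, public
        current = mapped
```

On data shaped like the example above, starting from the addresses of server 1 and server 2, the loop adds server 3 in the first round, marks the SMS center as public, and stops in the second round.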
Optionally, after the identification server merges the user whitelist sets, the method further includes:
determining whether the user whitelist Aj corresponding to the user whitelist set is consistent with the obtained current user whitelist Ai; if they are consistent, outputting the user whitelist Aj; if they are not consistent, taking the user whitelist Aj as the current whitelist information and repeating the step of mapping out the user whitelist set based on the current whitelist information and the network data of the users to be identified.
Specifically, when the current whitelist information obtained by the identification server is a user whitelist Ai, the identification server expands and merges the user whitelist Ai to obtain an expanded user whitelist set Aj, and compares Aj with the current Ai. If the length of Aj (that is, the number of IMSIs it contains) is greater than the length of Ai, Aj is taken as the new current user whitelist and the expansion iteration above is repeated. If the length of Aj equals the length of Ai, the user whitelist set corresponding to Aj is output. Because the user whitelist set is expanded by repeated iterative mapping and is output only when the length of the obtained user whitelist Aj equals the length of the current user whitelist Ai, the user whitelist set is fully expanded and the sample space of reliable positive examples is increased.
S120: Identify the noise data in the positive data set, and calculate the proportion of the noise data in the positive data set.
Specifically, a network carries a large number of services, including all services of the users in the user whitelist set and of other users, and this service information is reflected in data. The services of the users in the user whitelist set are labeled and may be called the positive data set, while the services of all other users may be called the unlabeled data set. The unlabeled data set may also contain some positive data.
It should be noted that, although the whole positive data set is labeled, some of its data corresponds to very small traffic volumes; at the service level this is typically traffic such as heartbeat messages. Such services occur very frequently and look very similar across user types, so they do not reflect the industry characteristics of the user and can greatly reduce the accuracy of user classification. This data may be called noise data. For example, in a Long Term Evolution (LTE) network, most service records counted in the call history record (CHR) are very small and contribute little traffic, and nearly 50% of downlink service packet lengths are 0. For user identification, such data is noise; the actual service features approach a Gaussian distribution only after this noise data is removed.
Optionally, identifying the noise data in the positive data set and calculating the proportion of the noise data in the positive data set may include:
performing cluster analysis on the positive data based on the uplink and downlink packet lengths and the uplink and downlink durations corresponding to the positive data set, identifying and marking the cluster with the smallest uplink and downlink packet lengths and uplink and downlink durations as noise data, and calculating the proportion of the noise data in the positive data set.
Specifically, the identification server may cluster the services corresponding to the positive data set along four dimensions (uplink packet length, downlink packet length, uplink duration, and downlink duration) to determine the feature distribution of the services, mark the positive data in the cluster with the smallest packet lengths and durations as noise data, and calculate the proportion of that noise data in the positive data set.
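The application does not name a clustering algorithm. The sketch below assumes plain k-means with three clusters and a deterministic initialization, using only the standard library; each record is a hypothetical tuple (uplink packet length, downlink packet length, uplink duration, downlink duration). The cluster whose center has the smallest coordinate sum is treated as noise.

```python
def kmeans(points, k, iterations=50):
    """Plain k-means with a deterministic start (assumed, not from the source)."""
    ordered = sorted(points, key=sum)
    # Spread the initial centers evenly over the records sorted by magnitude.
    centers = [ordered[i * (len(ordered) - 1) // max(1, k - 1)] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iterations):
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return assign, centers


def noise_ratio(points, k=3):
    """Flag the cluster with the smallest center as noise and return its share."""
    assign, centers = kmeans(points, k)
    noise_cluster = min(range(k), key=lambda c: sum(centers[c]))
    flags = [a == noise_cluster for a in assign]
    return flags, sum(flags) / len(points)
```

In practice the four dimensions have different units, so some normalization would likely be needed before clustering; that step is omitted here.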
S130: Calculate, for each piece of data in the unlabeled data set, the probability that it is positive data.
Specifically, the identification server may obtain network data from a network node (for example, a maintenance node), where the network data includes positive data and data to be identified. For example, the identification server obtains 100,000 call records from the maintenance node, of which 10,000 are labeled positive data corresponding to users in the user whitelist and the remaining 90,000 are unlabeled data corresponding to other users. The unlabeled data may contain both positive data (which simply has not been labeled) and counter-example data (that is, data whose user does not belong to the user whitelist).
Optionally, calculating the probability that each piece of data in the unlabeled data set is positive data includes:
dividing the positive data set into i groups of spy data;
building an iterative EM model according to M and P, where M = U + Si and Pi = P - Si, Si denotes one group of the spy data, P denotes the positive data set, and U denotes the unlabeled data set; and
analyzing each piece of data in M according to the EM model to obtain the probability tj that each piece of data in M is positive data;
where i and j are positive integers greater than or equal to 1.
Specifically, the identification server randomly divides the obtained positive data set (denoted P) into i equal groups of spy data. The value of i may range from 5 to 10 inclusive; this application does not limit the specific value of i.
Further, each group of spy data (denoted Si) is added to the unlabeled data set (denoted U) to obtain a new unlabeled data set (denoted M). P - Si is taken as the positive data set (denoted Pi) and M as the counter-example data set, and Pi and M are used to build the iterative EM model.
Specifically, the EM algorithm in this embodiment consists of two parts, initialization and EM iteration. During initialization, a naive Bayes model N is built from the sets M and Pi. In the E step, N is used to predict the set M; in the M step, the model is rebuilt from the newly predicted results.
The resulting EM model is used to analyze each sample in M and obtain the probability (denoted tj) that each piece of data is positive data.
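The construction of Si, Pi = P - Si, and M = U + Si can be sketched as below. The sketch covers only the data split; the naive Bayes initialization and the EM iteration are left out. The function name, the fixed random seed, and the assumption that records are hashable identifiers are all illustrative.

```python
import random


def spy_rounds(positive, unlabeled, groups=5, seed=0):
    """Yield (Si, Pi, M) for each of the i spy groups.

    positive: list of hashable record ids in P
    unlabeled: list of hashable record ids in U
    """
    rng = random.Random(seed)
    shuffled = list(positive)
    rng.shuffle(shuffled)
    for g in range(groups):
        spies = shuffled[g::groups]          # Si: every groups-th record
        spy_set = set(spies)
        remaining = [x for x in shuffled if x not in spy_set]  # Pi = P - Si
        mixed = list(unlabeled) + spies      # M = U + Si
        yield spies, remaining, mixed
```

Each round would then train a classifier with Pi as positive and M as counter-example data and score every record in M.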
S140: Determine a recognition threshold according to the probability values and the proportion, and identify a counter-example data set from the unlabeled data set according to the recognition threshold.
It should be noted that, once the EM model has produced the probability tj that each piece of data in M is positive data, the existing practice is to set the recognition threshold directly from experience, for example by treating data with a probability greater than 0.5 as positive data and data with a probability less than 0.5 as counter-example data. A counter-example data set obtained in this way introduces large errors into the subsequent user identification and makes the identification inaccurate.
Optionally, determining the recognition threshold according to the probability values and the proportion, and identifying the counter-example data set from the unlabeled data set according to the recognition threshold, may include:
using the probability tj that each piece of data in M is positive data to obtain the probability value t at which the noise data in M is judged to be counter-example data, with the proportion of the noise data in the positive data taken as the confidence level; and
comparing each tj with t, and adding the data corresponding to every tj smaller than t to the counter-example data set RNi.
For example, assume that M contains 100 pieces of data, the corresponding probability values are sorted in ascending order, and the proportion of noise data in the positive data is 30% (other values are possible; this figure is only illustrative and is not a limitation). The probability value at which the noise data among these 100 pieces of data is judged to be counter-example data with 30% confidence is the value at the 30% position of the sorted probabilities, and it is taken as the recognition threshold. For example, if the 30th of the 100 probability values in ascending order is 0.4, then 0.4 is the recognition threshold.
Further, the probability value of each piece of data is compared with the recognition threshold (0.4), and the data corresponding to every probability value below 0.4 is added to the counter-example data set RNi.
The recognition threshold and counter-example data set RNi obtained in this way are more reliable than those set from experience and make user identification more accurate.
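The threshold rule in the example (sort the probabilities, take the value at the position given by the noise proportion, keep everything strictly below it) can be sketched as follows. The function names and the dictionary input are assumptions; rounding the rank to the nearest integer is also an assumed detail, since the text only gives the 100-record case.

```python
def recognition_threshold(probabilities, noise_ratio):
    """Return the probability at the noise_ratio position of the sorted values."""
    ordered = sorted(probabilities)
    rank = max(1, round(len(ordered) * noise_ratio))  # 1-based rank, e.g. 30 of 100
    return ordered[rank - 1]


def select_counter_examples(prob_by_item, noise_ratio):
    """Return (RNi, t): items whose probability tj is strictly below t."""
    t = recognition_threshold(list(prob_by_item.values()), noise_ratio)
    rn = {item for item, p in prob_by_item.items() if p < t}
    return rn, t
```

For 100 records and a noise proportion of 30%, the threshold is the 30th smallest probability, and the records ranked below it form RNi.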
Optionally, after the identification server obtains the counter-example data sets RNi, the method may further include:
taking the union of the i counter-example data sets RNi obtained from the i groups of spy data to obtain a counter-example data set RN.
Specifically, one counter-example data set RNi can be obtained by the method above for each group of spy data. Because the groups of spy data differ, the resulting counter-example data sets RNi are not identical. The identification server takes the union of all the counter-example data sets RNi and uses it as the final counter-example data set RN.
Taking the union of all the counter-example data sets RNi to obtain RN widens the coverage of the counter-example data set, improves its reliability, and facilitates the subsequent user identification.
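Merging the per-group results is a plain set union. A minimal sketch, with a hypothetical function name and hashable record identifiers assumed:

```python
def merge_counter_examples(rn_sets):
    """Return RN, the union of the counter-example sets RNi from all spy groups."""
    merged = set()
    for rn in rn_sets:
        merged |= set(rn)
    return merged
```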
S150: Identify the user to be identified according to the positive data set and the counter-example data set.
Specifically, the service data of users who are not in the user whitelist is analyzed using the expanded positive data set and the counter-example data set. The characteristics of the service data (for example time, traffic, and coverage level) are compared with the positive data or the counter-example data to determine whether the service data belongs to the positive data set or to the counter-example data set, and hence whether the corresponding user belongs to the industry of the whitelist. This completes the identification and classification of the user.
Further, the embodiments of this application support both online and offline user identification. For offline identification, the service data that the user has already generated is obtained from the relevant network node and analyzed with the method above, using the expanded positive data set and the counter-example data set, to identify and classify the user. For online identification, the services generated by the user are obtained from the relevant network node in real time and analyzed in the same way. During online identification, the network data generated by the user must be obtained from the relevant network node continuously, and the method steps above must be performed repeatedly to keep the identification and classification accurate.
It should be noted that knowing the state of a user (for example, whether the user is stationary) has a significant effect on identifying and classifying the user and can further improve identification accuracy. The identification server may therefore identify the state of the user and use the identified state to refine the identification of that user.
The users in the embodiments of this application may be M2M terminals or CPE terminals. Most M2M and CPE terminals are stationary and have no mobility, so the identification server can build the stationary characteristics of a user from the measurement reports in the network and identify the state of the user.
Specifically, because a stationary user does not move or moves only within a small range, its measured primary serving cell is fixed, its measured first neighboring cell (that is, the neighboring cell with the highest level) and network topology are relatively fixed, and the proportions in which different first neighboring cells are measured at the same location are fixed. In addition, the level of the primary serving cell measured by a stationary user is relatively stable, the level of the measured first neighboring cell is relatively stable, and the ordered sequence of first-neighboring-cell levels within the measured topology (that is, the levels of the first neighboring cells sorted from highest to lowest) is stable. FIG. 5 is a schematic diagram of the topology and the measured level characteristics of a stationary user. The shape at the center indicates the location of the primary serving cell, and its size indicates the level distribution of the primary serving cell. The shapes around it indicate the locations of the first neighboring cells, and their sizes indicate the level distribution of each first neighboring cell. The shape of the curve indicates the distribution of the neighboring cells measured by the user, and the length of each straight line indicates the share of the corresponding neighboring cell among the user's measurements.
Based on these characteristics, the identification server can calculate, for a user, the first-neighboring-cell distance between calls and the similarity of the level-sequence profile between calls. If the first-neighboring-cell distance between calls is less than a first threshold, or the level-sequence profile similarity between calls is less than a second threshold, the identification server determines that the user is stationary. The first threshold and the second threshold can be set as needed; this application imposes no limitation in this respect.
To aid understanding, a specific example of the user identification method provided in the embodiments of this application is given below.
CHR data collected from the network over 7 × 24 hours is used to build and analyze an identification model for industry M, where M denotes a specific industry (for example, shared bicycles). The input contains 352 whitelisted users, the network contains 96,532 users in total, and the network data has 22 features, including packet length and duration. Table 1 shows the whitelisted-user input before iterative matching:
Table 1 Input before iterative matching of whitelisted users
Whitelisted users    Total network users    Network data
352                  96532                  22 features, including packet length and duration
The input whitelisted users are iteratively matched according to their IMSIs. Table 2 shows the result after iterative matching:
Table 2 Output after iterative matching of whitelisted users
Whitelisted users    Total network users    Industry M users after iterative matching
352                  96532                  1853
After iterative matching, the number of whitelisted users rises to 1853, more than five times the 352 whitelisted users originally input.
对获取到的网络数据进行聚类分析,参见图6,图6是一种业务分布示意图。其中,横轴表示上行包长、上行时长、下行包长或下行时长任意一个维度,纵轴表示具体数值,可以看出,网络数据对应的业务类别包括三类(图中用不同线条进行表示),其中,类别1的业务占整体比例的37%,并且其上下行包长和上下行时长均值都接近于0,这部分业务所对应的网络数据作为噪声数据,因为其不能反映业务的特征,类别0和类别2可以较好的反映业务的特征。Perform cluster analysis on the obtained network data. See FIG. 6, which is a schematic diagram of a service distribution. Among them, the horizontal axis represents any one of the dimensions of the uplink packet length, uplink duration, downlink packet length, or downlink duration, and the vertical axis represents specific values. It can be seen that the service categories corresponding to network data include three categories (represented by different lines in the figure) Among them, category 1 services account for 37% of the overall proportion, and the average uplink and downlink packet lengths and uplink and downlink durations are close to 0. The network data corresponding to these services is noise data because it cannot reflect the characteristics of the services. Category 0 and category 2 can better reflect the characteristics of the business.
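The clustering step can be illustrated with a minimal sketch. The text does not name a clustering algorithm, so plain k-means is assumed here; the function names, the quantile-based initialization, the use of the centroid coordinate sum as the measure of "smallest", and the toy records are all hypothetical.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means on equal-length tuples (an assumed choice of algorithm)."""
    n = len(points)
    # Deterministic initialization (assumption): evenly spaced picks after sorting by feature sum.
    ordered = sorted(points, key=sum)
    centroids = [ordered[(2 * c + 1) * n // (2 * k)] for c in range(k)]
    labels = [0] * n
    for _ in range(iters):
        # Assignment step: attach each record to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids

def find_noise(points, k=3):
    """Mark the cluster whose centroid is closest to zero as noise and return its share."""
    labels, centroids = kmeans(points, k)
    noise_cluster = min(range(k), key=lambda c: sum(centroids[c]))
    noise_idx = [i for i, lab in enumerate(labels) if lab == noise_cluster]
    return noise_idx, len(noise_idx) / len(points)

# Hypothetical records: (uplink length, uplink duration, downlink length, downlink duration)
records = [
    (0, 0, 0, 0), (1, 0, 1, 0), (0, 1, 0, 1), (1, 1, 1, 1),
    (100, 10, 200, 20), (110, 12, 210, 22), (90, 9, 190, 18),
    (1000, 50, 1500, 60), (1050, 55, 1600, 65), (950, 45, 1400, 55),
]
noise_idx, ratio = find_noise(records)
```

In this toy data the four near-zero records form the noise cluster, so the ratio plays the role that the 37% figure plays in the example above. In practice the features would probably need scaling before clustering, since packet lengths and durations use different units.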
After the noise data is identified, the negative example data set is identified by the subsequent EM algorithm, and the users corresponding to the negative example data set are then identified. The identification result is shown in Table 3:
Table 3 User identification based on packet length and duration
Figure PCTCN2018096239-appb-000001
It can be seen that analysis along the four packet-length and duration dimensions makes it possible to identify the noise data, and that combining the identified noise data with its proportion in the service increases the number of reliable negative example users.
Referring to FIG. 7, FIG. 7 is a schematic comparison of user identification results provided by an embodiment of the present application. Here, recall is the ratio of the number of correctly identified users to the number of users that are actually correct, and precision is the ratio of the number of correctly identified users to the number of identified users. When modeling uses only the initial whitelist and the traditional EM algorithm to identify negative examples, the precision is 59% and the recall is 65%. When modeling uses the whitelist expanded by iterative matching and the traditional EM algorithm to identify negative examples, the precision is 66% and the recall is 72%. When modeling uses the whitelist expanded by iterative matching and identifies reliable negative examples by taking into account the proportion of noise data in the service, the precision is 78% and the recall is 83%. It can be seen that the approach provided in the embodiments of the present application effectively improves the accuracy of user identification: compared with modeling that uses only the initial whitelist and the traditional EM algorithm, precision rises from 59% to 78% and recall from 65% to 83%, an improvement of more than 15 percentage points in each case.
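The two metrics defined above can be written out as a short sketch. The user identifiers are hypothetical toy data; only the percentages in `reported` come from the comparison described for FIG. 7.

```python
def precision_recall(identified, actual):
    """Return (precision, recall) as defined in the text above."""
    identified, actual = set(identified), set(actual)
    correct = identified & actual  # correctly identified users
    precision = len(correct) / len(identified) if identified else 0.0
    recall = len(correct) / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical toy example: 4 users identified, 3 of them correct, 5 truly in the industry.
identified = {"u1", "u2", "u3", "u4"}
actual = {"u1", "u2", "u3", "u5", "u6"}
precision, recall = precision_recall(identified, actual)

# (precision %, recall %) reported for the three configurations compared in FIG. 7.
reported = {
    "initial whitelist + traditional EM": (59, 65),
    "expanded whitelist + traditional EM": (66, 72),
    "expanded whitelist + noise-ratio negatives": (78, 83),
}
baseline = reported["initial whitelist + traditional EM"]
proposed = reported["expanded whitelist + noise-ratio negatives"]
gain = (proposed[0] - baseline[0], proposed[1] - baseline[1])  # percentage points
```

For the toy sets, precision is 3/4 and recall is 3/5. The reported figures give a gain of 19 points in precision and 18 points in recall.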
It can be understood that implementing the embodiments of the present application does not require customer authorization to splice multiple data sources. Only a small amount of whitelist information and network data needs to be obtained. By expanding the user whitelist through iterative matching and taking into account the proportion of noise data in the service, online or offline user identification can be implemented, and the accuracy of user identification can be effectively improved.
The methods of the embodiments of the present application have been described in detail above. To facilitate better implementation of the foregoing solutions, related apparatuses for implementing them are correspondingly provided below.
Referring to FIG. 8, FIG. 8 is a schematic structural diagram of an identification server provided by an embodiment of the present application. The identification server 800 includes at least an obtaining unit 810, an identifying unit 820, a calculating unit 830, and a determining unit 840, where:
the obtaining unit 810 is configured to obtain a user whitelist set and network data of users to be identified, where the network data of the users to be identified includes a positive example data set corresponding to the user whitelist set and an unlabeled data set;
the identifying unit 820 is configured to identify noise data in the positive example data set;
the calculating unit 830 is configured to calculate a proportion value of the noise data in the positive example data set;
the determining unit 840 is configured to determine an identification threshold according to the probability value and the proportion value;
where the calculating unit 830 is further configured to calculate a probability value that each piece of data in the unlabeled data set is positive example data; the identifying unit 820 is further configured to identify a negative example data set from the unlabeled data set according to the identification threshold; and the identifying unit 820 is further configured to identify the users to be identified according to the positive example data set and the negative example data set.
In a possible implementation, the obtaining unit 810 is further configured to obtain current whitelist information;
the obtaining unit 810 further includes a mapping subunit 8101, configured to map out a user whitelist set based on the current whitelist information and the network data of the users to be identified;
the obtaining unit further includes a merging subunit 8102, configured to merge the user whitelist set.
In yet another possible implementation, the current whitelist information includes multiple different types of whitelists, the whitelists include a current user whitelist Ai and/or a current address whitelist Bi, the user whitelist includes user identification information and industry information, and the address whitelist includes address information and industry information.
In yet another possible implementation, the user identifier includes an international mobile subscriber identity (IMSI), and the address information includes an Internet Protocol (IP) address.
In yet another possible implementation, the merging subunit 8102 further includes a conflict deduplication subunit 8103, configured to perform conflict deduplication on the user whitelist set by using a preset rule, where the preset rule is based on the priority of the whitelist or on the mapping time.
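Conflict deduplication can be sketched as follows. The text only says that the preset rule is based on whitelist priority or on mapping time. The record fields, the convention that a larger priority or a later mapping time wins, and the tie-breaking order are assumptions made for illustration.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    user_id: str    # for example an IMSI
    industry: str
    priority: int   # assumption: larger value = more trusted source whitelist
    mapped_at: int  # assumption: larger value = mapped more recently

def deduplicate(entries, rule="priority"):
    """Keep one entry per user when mappings disagree on the industry."""
    if rule == "priority":
        key = lambda e: (e.priority, e.mapped_at)
    else:  # rule == "time"
        key = lambda e: (e.mapped_at, e.priority)
    best = {}
    for entry in entries:
        current = best.get(entry.user_id)
        if current is None or key(entry) > key(current):
            best[entry.user_id] = entry
    return best

# Hypothetical conflict: the same user is mapped to two industries.
entries = [
    Entry("460001", "shared bicycle", priority=2, mapped_at=100),
    Entry("460001", "smart meter", priority=1, mapped_at=200),
    Entry("460002", "smart meter", priority=1, mapped_at=50),
]
by_priority = deduplicate(entries, rule="priority")
by_time = deduplicate(entries, rule="time")
```

Under the priority rule the conflicting user keeps the "shared bicycle" label, and under the time rule it keeps the more recently mapped "smart meter" label.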
In yet another possible implementation, the mapping subunit 8101 is further configured to map the user whitelist to multiple kinds of addresses in combination with the network data to obtain an address whitelist Bj, where the addresses include public addresses;
the identifying unit 820 is further configured to identify and mark the public addresses;
the server further includes a judging unit 850, configured to judge whether the address whitelist Bj is consistent with the obtained current address whitelist Bi; if they are consistent, the user whitelist is output; if they are not consistent, the address whitelist Bj is used as the current whitelist information, and the step of mapping out a user whitelist set based on the current whitelist information and the network data of the users to be identified is repeated.
In yet another possible implementation, the judging unit 850 is further configured to:
judge whether the user whitelist Aj corresponding to the user whitelist set is consistent with the obtained current user whitelist Ai; if they are consistent, output the user whitelist Aj; if they are not consistent, use the user whitelist Aj as the current whitelist information, and repeat the step of mapping out a user whitelist set based on the current whitelist information and the network data of the users to be identified.
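The iterative matching loop described above amounts to repeating the user-to-address and address-to-user mappings until neither whitelist changes. The sketch below assumes a single industry, network data reduced to (user, address) pairs, and a known set of public addresses. The function name, the `max_rounds` guard, and the sample records are hypothetical.

```python
def expand_whitelist(records, seed_users, public_addresses=frozenset(), max_rounds=20):
    """Alternate user -> address and address -> user mapping until a fixed point."""
    users = set(seed_users)   # current user whitelist (Ai)
    addresses = set()         # current address whitelist (Bi)
    for _ in range(max_rounds):
        # Map whitelisted users to the addresses they contact, skipping marked public addresses.
        new_addresses = {ip for uid, ip in records if uid in users} - set(public_addresses)
        # Map those addresses back to every user that contacts them.
        new_users = users | {uid for uid, ip in records if ip in new_addresses}
        if new_addresses == addresses and new_users == users:
            break  # Bj equals Bi and Aj equals Ai: output the whitelist
        users, addresses = new_users, new_addresses
    return users, addresses

# Hypothetical network records: (user identifier, server address)
records = [
    ("u1", "10.0.0.1"), ("u2", "10.0.0.1"), ("u2", "10.0.0.2"),
    ("u3", "10.0.0.2"), ("u1", "8.8.8.8"), ("u4", "8.8.8.8"),
    ("u5", "10.0.0.9"),
]
users, addresses = expand_whitelist(records, {"u1"}, public_addresses={"8.8.8.8"})
```

Starting from a single seed user, the loop reaches u2 and u3 through shared addresses, while u4 is left out because its only link to the whitelist is a public address.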
In yet another possible implementation, the identifying unit 820 further includes a cluster analysis subunit 8201, configured to perform cluster analysis on the positive example data based on the uplink and downlink packet lengths and the uplink and downlink durations corresponding to the positive example data set, and to identify and mark the category with the smallest uplink and downlink packet lengths and uplink and downlink durations as noise data.
In yet another possible implementation, the calculating unit 830 further includes a grouping subunit 8301, configured to divide the positive example data set into i groups of spy data;
the calculating unit 830 further includes a construction subunit 8302, configured to construct an iterative EM model according to M and Pi, where M = U + Si and Pi = P - Si, Si denotes each group of the spy data, P denotes the positive example data set, and U denotes the unlabeled data set;
the calculating unit 830 further includes an analysis subunit 8303, configured to analyze each piece of data in M according to the EM model to obtain a probability value tj that each piece of data in M is positive example data;
where i and j are positive integers greater than or equal to 1.
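The construction of M and Pi for each spy group can be sketched as below. How the positive examples are divided into groups is not specified, so a round-robin partition is assumed, which only leaves a non-empty Pi when at least two groups are used. The function name and the sample items are hypothetical, and the EM model itself is not shown.

```python
def spy_splits(positives, unlabeled, groups):
    """Yield (S_i, M_i, P_i) with M_i = U + S_i and P_i = P - S_i for each spy group."""
    for g in range(groups):
        # Round-robin partition of P into spy groups (an assumed grouping scheme).
        spies = [p for idx, p in enumerate(positives) if idx % groups == g]
        remaining = [p for idx, p in enumerate(positives) if idx % groups != g]
        mixed = list(unlabeled) + spies  # spies are hidden among the unlabeled data
        yield spies, mixed, remaining

positives = ["p0", "p1", "p2", "p3"]  # hypothetical positive example records (P)
unlabeled = ["u0", "u1"]              # hypothetical unlabeled records (U)
splits = list(spy_splits(positives, unlabeled, groups=2))
```

Each tuple would then be handed to the EM model, trained with Pi as positive examples and M as the mixed set, so that the probabilities assigned to the known spies in M can calibrate the threshold.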
In yet another possible implementation, the determining unit 840 is further configured to:
combine the probability values tj that each piece of data in M is positive example data to obtain the probability value t that corresponds to determining the noise data in M as negative example data when the proportion value of the noise data in the positive example data is used as the confidence level;
compare each tj with t, and add the data corresponding to every tj smaller than t to the negative example data set RNi.
In yet another possible implementation, the identifying unit 820 further includes a union subunit 8202, configured to take the union of the i negative example data sets RNi obtained from the i groups of spy data, to obtain the negative example set RN.
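One way to read the threshold rule is sketched below: t is chosen so that roughly the noise proportion of the spy records score below it, every unlabeled record scoring below t joins RNi, and the per-group sets are then merged. This quantile reading, the function names, and the sample probabilities are assumptions. The probabilities would come from the EM model, which is not shown.

```python
def spy_threshold(spy_probs, noise_ratio):
    """Pick t so that about noise_ratio of the spies have a probability below t."""
    ranked = sorted(spy_probs)
    cut = min(int(len(ranked) * noise_ratio), len(ranked) - 1)
    return ranked[cut]

def reliable_negatives(unlabeled_probs, spy_probs, noise_ratio):
    """Return RN_i: unlabeled records whose probability t_j is below the threshold t."""
    t = spy_threshold(spy_probs, noise_ratio)
    return {item for item, prob in unlabeled_probs.items() if prob < t}

def union_negatives(rounds):
    """Merge the RN_i obtained from each spy group into RN."""
    merged = set()
    for rn_i in rounds:
        merged |= rn_i
    return merged

# Hypothetical probabilities for one spy group, with a 20% noise proportion.
spy_probs = [0.05, 0.1, 0.6, 0.7, 0.8, 0.9, 0.85, 0.75, 0.95, 0.65]
unlabeled_probs = {"a": 0.02, "b": 0.59, "c": 0.6, "d": 0.9}
t = spy_threshold(spy_probs, 0.2)
rn_1 = reliable_negatives(unlabeled_probs, spy_probs, 0.2)
rn = union_negatives([rn_1, {"a", "e"}])  # second set stands in for another spy group
```

With these numbers two of the ten spies fall below t, matching the 20% noise proportion, and only records "a" and "b" are accepted as reliable negative examples for this group.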
It should be noted that, for the implementation of each unit, reference may also be made to the corresponding description of the method embodiment shown in FIG. 1, and details are not repeated here.
Referring to FIG. 9, FIG. 9 is a schematic structural diagram of another identification server provided by an embodiment of the present application. The identification server 900 includes at least a processor 910, a memory 920, and a transceiver 930, which are connected to each other through a bus 940.
The memory 920 includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), or an erasable programmable read-only memory (EPROM, or flash memory). The memory 920 is configured to store related instructions and data.
The transceiver 930 may include a receiver and a transmitter, for example, a radio frequency module. Where the processor 910 is described below as receiving or sending a message, this can be understood as the processor 910 receiving or sending the message through the transceiver 930.
The processor 910 may be one or more central processing units (CPUs). When the processor 910 is one CPU, the CPU may be a single-core CPU or a multi-core CPU.
The processor 910 in the identification server 900 is configured to read the program code stored in the memory 920 and perform the following operations:
The processor 910 receives, through the transceiver 930, a user whitelist set and network data of users to be identified, where the network data of the users to be identified includes a positive example data set corresponding to the user whitelist set and an unlabeled data set.
The processor 910 identifies noise data in the positive example data set and calculates a proportion value of the noise data in the positive example data set.
The processor 910 calculates a probability value that each piece of data in the unlabeled data set is positive example data.
The processor 910 determines an identification threshold according to the probability value and the proportion value, and identifies a negative example data set from the unlabeled data set according to the identification threshold.
The processor 910 identifies the users to be identified according to the positive example data set and the negative example data set.
It should be noted that each operation may also be implemented according to the methods in the foregoing method embodiments, and details are not repeated here.
By implementing the embodiments of the present application, the base station groups the paging carriers and applies a specific paging configuration to each group of carriers, so that different paging cycles can be supported within one cell and the paging requirements of both short-delay UEs and deep-coverage UEs can be met.
It should be noted that, for the specific implementation of each operation, reference may also be made to the corresponding description of the method embodiment shown in FIG. 1, and details are not repeated here.
In summary, according to the embodiments of the present application, the identification server obtains a small amount of user whitelist information and network data, expands the user whitelist through iterative matching, identifies noise data through cluster analysis and calculates its proportion value, constructs an EM model to calculate the probability value that each piece of unlabeled data is positive example data, and then combines that probability with the proportion value of the noise data to identify a reliable negative example data set. Finally, the users to be identified are identified, which can effectively improve the accuracy of identification.
An embodiment of the present application further provides a computer-readable storage medium that stores instructions. When the instructions are run on a computer or a processor, the computer or processor is caused to perform one or more steps of any one of the foregoing data transmission methods. If the component modules of the foregoing apparatus are implemented in the form of software functional units and sold or used as independent products, they may be stored in the computer-readable storage medium.
The computer-readable storage medium may be an internal storage unit of the identification server described in any one of the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device of the identification server, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card. Further, the computer-readable storage medium may include both the internal storage unit and an external storage device of the identification server. The computer-readable storage medium is used to store the computer program and other programs and data required by the identification server. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the foregoing embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, the program may include the processes of the foregoing method embodiments. The foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
The steps in the methods of the embodiments of the present application can be reordered, combined, and deleted according to actual needs.
The modules in the apparatuses of the embodiments of the present application can be combined, divided, and deleted according to actual needs.
The foregoing embodiments are only intended to describe the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (24)

  1. A user identification method, comprising:
    obtaining a user whitelist set and network data of users to be identified, wherein the network data of the users to be identified comprises a positive example data set corresponding to the user whitelist set and an unlabeled data set;
    identifying noise data in the positive example data set, and calculating a proportion value of the noise data in the positive example data set;
    calculating a probability value that each piece of data in the unlabeled data set is positive example data;
    determining an identification threshold according to the probability value and the proportion value, and identifying a negative example data set from the unlabeled data set according to the identification threshold; and
    identifying the users to be identified according to the positive example data set and the negative example data set.
  2. The method according to claim 1, wherein the obtaining a user whitelist set comprises:
    obtaining current whitelist information, and mapping out a user whitelist set based on the current whitelist information and the network data of the users to be identified; and
    merging the user whitelist set.
  3. The method according to claim 2, wherein the current whitelist information comprises multiple different types of whitelists, the whitelists comprise a current user whitelist Ai and/or a current address whitelist Bi, the user whitelist comprises user identification information and industry information, and the address whitelist comprises address information and industry information.
  4. The method according to claim 3, wherein the user identification information comprises an international mobile subscriber identity (IMSI), and the address information comprises an Internet Protocol (IP) address.
  5. The method according to claim 2, wherein the merging the user whitelist set comprises:
    performing conflict deduplication on the user whitelist set by using a preset rule, wherein the preset rule is based on the priority of the whitelist or on the mapping time.
  6. The method according to claim 5, wherein after the merging the user whitelist set, the method further comprises:
    mapping the user whitelist to multiple kinds of addresses in combination with the network data to obtain an address whitelist Bj, wherein the addresses comprise public addresses, and the public addresses are identified and marked during the mapping; and
    judging whether the address whitelist Bj is consistent with the obtained current address whitelist Bi; if they are consistent, outputting the user whitelist; if they are not consistent, using the address whitelist Bj as the current whitelist information, and repeating the step of mapping out a user whitelist set based on the current whitelist information and the network data of the users to be identified.
  7. The method according to claim 5, wherein after the merging the user whitelist set, the method further comprises:
    judging whether the user whitelist Aj corresponding to the user whitelist set is consistent with the obtained current user whitelist Ai; if they are consistent, outputting the user whitelist Aj; if they are not consistent, using the user whitelist Aj as the current whitelist information, and repeating the step of mapping out a user whitelist set based on the current whitelist information and the network data of the users to be identified.
  8. The method according to claim 1, wherein the identifying noise data in the positive example data set, and calculating a proportion value of the noise data in the positive example data set comprises:
    performing cluster analysis on the positive example data based on the uplink and downlink packet lengths and the uplink and downlink durations corresponding to the positive example data set, identifying and marking the category with the smallest uplink and downlink packet lengths and uplink and downlink durations as noise data, and calculating the proportion value of the noise data in the positive example data set.
  9. The method according to claim 1, wherein the calculating a probability value that each piece of data in the unlabeled data set is positive example data comprises:
    dividing the positive example data set into i groups of spy data;
    constructing an iterative EM model according to M and Pi, wherein M = U + Si and Pi = P - Si, Si denotes each group of the spy data, P denotes the positive example data set, and U denotes the unlabeled data set; and
    analyzing each piece of data in M according to the EM model to obtain a probability value tj that each piece of data in M is positive example data;
    wherein i and j are positive integers greater than or equal to 1.
  10. The method according to claim 9, wherein the determining an identification threshold according to the probability value and the proportion value, and identifying a negative example data set from the unlabeled data set according to the identification threshold comprises:
    combining the probability values tj that each piece of data in M is positive example data to obtain the probability value t that corresponds to determining the noise data in M as negative example data when the proportion value of the noise data in the positive example data is used as the confidence level; and
    comparing each tj with t, and adding the data corresponding to every tj smaller than t to the negative example data set RNi.
  11. The method according to claim 10, wherein the method further comprises:
    taking the union of the i negative example data sets RNi obtained from the i groups of spy data to obtain the negative example set RN.
  12. 一种识别服务器,其特征在于,包括:An identification server, comprising:
    获取单元,用于获取用户白名单集合和待识别用户的网络数据,所述待识别用户的网络数据包括所述用户白名单集合对应的正例数据集合和未标记数据集合;An obtaining unit, configured to obtain a user whitelist set and network data of a user to be identified, where the network data of the user to be identified includes a positive data set and an unlabeled data set corresponding to the user whitelist set;
    识别单元,用于识别出所述正例数据集合中的噪声数据;A recognition unit, configured to identify noise data in the positive data set;
    计算单元,用于计算得到所述噪声数据在所述正例数据集合中的比例值;A calculation unit, configured to calculate a proportion value of the noise data in the positive data set;
    确定单元,用于根据所述概率值和所述比例值确定识别阈值;A determining unit, configured to determine an identification threshold according to the probability value and the ratio value;
    其中,所述计算单元,还用于计算未标记数据集合中每个数据为正例数据的概率值;所述识别单元,还用于根据所述识别阈值从所述未标记数据集合中识别出反例数据集合; 所述识别单元,还用于根据所述正例数据集合和所述反例数据集合,对待识别用户进行识别。The calculation unit is further configured to calculate a probability value that each data in the unlabeled data set is positive data; and the recognition unit is further configured to identify the unlabeled data set according to the recognition threshold. Counter-example data set; the identification unit is further configured to identify a user to be identified based on the positive-example data set and the counter-example data set.
  13. 如权利要求12所述的服务器,其特征在于,所述获取单元还用于:获取当前白名单信息;The server according to claim 12, wherein the obtaining unit is further configured to: obtain current whitelist information;
    所述获取单元还包括映射子单元,用于基于所述当前白名单信息和所述待识别用户的网络数据映射出用户白名单集合;The obtaining unit further includes a mapping subunit, configured to map a user whitelist set based on the current whitelist information and network data of the user to be identified;
    所述获取单元还包括合并子单元,用于对所述用户白名单集合进行合并。The obtaining unit further includes a merging subunit for merging the user whitelist set.
  14. 如权利要求13所述的服务器,其特征在于,所述当前白名单信息包括多种不同类型的白名单,所述白名单包括当前用户白名单Ai和/或当前地址白名单Bi,所述用户白名单包括用户标识信息和行业信息,所述地址白名单包括地址信息和行业信息。The server according to claim 13, wherein the current whitelist information includes a plurality of different types of whitelists, the whitelist includes a current user whitelist Ai and / or a current address whitelist Bi, and the user The white list includes user identification information and industry information, and the address white list includes address information and industry information.
  15. 如权利要求14所述的服务器,其特征在于,所述用户标识包括国际移动用户识别码IMSI,所述地址信息包括互联网协议地址IP。The server according to claim 14, wherein the user identification includes an International Mobile Subscriber Identity (IMSI), and the address information includes an Internet Protocol address (IP).
  16. 如权利要求13所述的服务器,其特征在于,所述合并子单元还包括冲突去重单元,所述冲突去重单元,用于通过预设规则对所述用户白名单集合进行冲突去重,所述预设规则包括基于所述白名单的优先级,或者基于映射时间。The server according to claim 13, wherein the merge subunit further comprises a conflict deduplication unit, and the conflict deduplication unit is configured to perform conflict deduplication on the user whitelist set through a preset rule, The preset rule includes a priority based on the white list or a mapping time.
  17. 如权利要求16所述的服务器,其特征在于,所述映射子单元,还用于结合所述网络数据对所述用户白名单进行多种地址的映射得到地址白名单Bj,其中,所述地址包括公共地址;The server according to claim 16, wherein the mapping subunit is further configured to perform mapping of multiple addresses on the user whitelist in combination with the network data to obtain an address whitelist Bj, wherein the address Including public address;
    所述识别单元,还用于对所述公共地址进行识别并标记;The identification unit is further configured to identify and mark the public address;
    所述服务器还包括判断单元,用于判断所述地址白名单Bj与所述获取到的当前地址白名单Bi是否一致,若一致,输出所述用户白名单;若不一致,以所述地址白名单Bj作为当前白名单信息,重复执行所述基于所述当前白名单信息和所述待识别用户的网络数据映射出用户白名单集合的步骤。The server further includes a judging unit for judging whether the address white list Bj is consistent with the obtained current address white list Bi, and if they are consistent, output the user white list; if they are not consistent, use the address white list As the current white list information, Bj repeatedly executes the step of mapping a user white list set based on the current white list information and network data of the user to be identified.
  18. The server according to claim 16, wherein the judging unit is further configured to:
    judge whether the user whitelist Aj corresponding to the user whitelist set is consistent with the obtained current user whitelist Ai; if they are consistent, output the user whitelist Aj; if they are inconsistent, take the user whitelist Aj as the current whitelist information and repeat the step of mapping out a user whitelist set based on the current whitelist information and the network data of the user to be identified.
  19. The server according to claim 12, wherein the identification unit further comprises a cluster analysis subunit configured to perform cluster analysis on the positive example data based on the uplink and downlink packet lengths and the uplink and downlink durations corresponding to the positive example data set, and to identify and mark the cluster with the smallest uplink and downlink packet lengths and uplink and downlink durations as noise data.
  20. The server according to claim 12, wherein the calculation unit further comprises a grouping subunit configured to divide the positive example data set into i groups of spy data;
    the calculation unit further comprises a construction subunit configured to construct an iterative EM model from M and Pi, where M = U + Si and Pi = P - Si, Si denotes each group of the spy data, P denotes the positive example data set, and U denotes the unlabeled data set;
    the calculation unit further comprises an analysis subunit configured to analyze each data item in M according to the EM model to obtain a probability value tj that the data item in M is positive example data;
    wherein i and j are positive integers greater than or equal to 1.
  21. The server according to claim 20, wherein the determining unit is further configured to:
    obtain, in combination with the probability value tj that each data item in M is positive example data, the probability value t at which the noise data in M is determined to be negative example data, with the proportion of the noise data in the positive example data used as the confidence level;
    compare each tj with t, and add the data corresponding to every tj smaller than t to a negative example data set RNi.
  22. The server according to claim 21, wherein the identification unit further comprises a union subunit configured to take the union of the i negative example data sets RNi obtained from the i groups of spy data, to obtain a negative example set RN.
  23. An identification server, comprising a processor, a memory, and a transceiver, wherein:
    the processor, the memory, and the transceiver are connected to one another; the memory is configured to store a computer program comprising program instructions; and the processor is configured to invoke the program instructions to perform the user identification method according to any one of claims 1 to 11.
  24. A computer-readable storage medium storing a computer program, wherein the computer program comprises program instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 11.
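
The repeat-until-consistent behaviour recited in claims 17 and 18 can be read as a fixed-point loop over the whitelists. The following is a minimal Python sketch of that reading, not the applicant's implementation. It assumes that the network data is a list of (IMSI, IP) pairs, that a whitelist is a dictionary from identifier to industry, that an address shared by at least `public_threshold` distinct users counts as a public address, and that conflicts are resolved by keeping the earliest mapping. All function names, the threshold, and the round limit are hypothetical.

```python
from collections import defaultdict

def users_to_addresses(user_wl, records, public_threshold=3):
    """Map a user whitelist {imsi: industry} to an address whitelist {ip: industry}.

    Assumption: an address used by at least `public_threshold` distinct users
    is treated as a public (shared) address and is skipped.
    """
    users_per_ip = defaultdict(set)
    for imsi, ip in records:
        users_per_ip[ip].add(imsi)
    address_wl = {}
    for imsi, ip in records:
        is_public = len(users_per_ip[ip]) >= public_threshold
        if imsi in user_wl and not is_public:
            # Conflict rule (assumed): the earliest mapping wins.
            address_wl.setdefault(ip, user_wl[imsi])
    return address_wl

def addresses_to_users(address_wl, records):
    """Map an address whitelist back to the users seen on those addresses."""
    user_wl = {}
    for imsi, ip in records:
        if ip in address_wl:
            user_wl.setdefault(imsi, address_wl[ip])
    return user_wl

def expand_user_whitelist(seed_user_wl, records, max_rounds=10):
    """Repeat the mapping until the user whitelist stops changing."""
    current = dict(seed_user_wl)                           # plays the role of Ai
    for _ in range(max_rounds):
        address_wl = users_to_addresses(current, records)  # Bj
        mapped = addresses_to_users(address_wl, records)
        candidate = {**mapped, **current}                  # Aj, existing entries keep priority
        if candidate == current:                           # consistent: output
            return current
        current = candidate                                # inconsistent: iterate again
    return current
```

For example, with the records `[("u1", "ip1"), ("u2", "ip1"), ("u2", "ip2"), ("u3", "ip2")]` and the seed `{"u1": "metering"}`, the loop reaches `u2` through `ip1` and then `u3` through `ip2` before the whitelist stabilises.
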
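
The spy-data procedure of claims 20 to 22 can be sketched in the same way. The claims do not fix the classifier underlying the iterative EM model, so this sketch swaps in a two-class Gaussian model over a single numeric feature. It further assumes that only unlabeled items are placed in RNi, that at least two spy groups are used so that Pi is never empty, and that the noise proportion from claim 21 is passed in as `noise_ratio`. The function names and default values are hypothetical.

```python
import math

def gauss(x, mu, var):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_positive_probs(p_i, m, rounds=20, var_floor=1e-3):
    """Fit a two-class 1-D Gaussian model by EM and return tj for each x in m.

    Items of p_i stay fixed as positives; items of m start as negatives and
    receive soft labels that are re-estimated in every round.
    """
    w = [0.0] * len(m)  # current P(positive) for each item of M
    for _ in range(rounds):
        # M-step: class weights, means and variances from the soft labels.
        pos_w = len(p_i) + sum(w)
        neg_w = max(len(m) - sum(w), 1e-9)
        mu_p = (sum(p_i) + sum(wj * x for wj, x in zip(w, m))) / pos_w
        mu_n = sum((1 - wj) * x for wj, x in zip(w, m)) / neg_w
        var_p = (sum((x - mu_p) ** 2 for x in p_i)
                 + sum(wj * (x - mu_p) ** 2 for wj, x in zip(w, m))) / pos_w
        var_n = sum((1 - wj) * (x - mu_n) ** 2 for wj, x in zip(w, m)) / neg_w
        var_p, var_n = max(var_p, var_floor), max(var_n, var_floor)
        prior_p = pos_w / (pos_w + neg_w)
        # E-step: posterior probability of the positive class.
        new_w = []
        for x in m:
            a = prior_p * gauss(x, mu_p, var_p)
            b = (1 - prior_p) * gauss(x, mu_n, var_n)
            new_w.append(a / (a + b) if a + b > 0 else 0.5)
        w = new_w
    return w

def reliable_negatives(P, U, groups=3, noise_ratio=0.1):
    """Return the indices into U that form the union RN of the sets RNi."""
    if groups < 2 or groups > len(P):
        raise ValueError("need 2 <= groups <= len(P)")
    rn = set()
    for g in range(groups):
        spy_idx = set(range(g, len(P), groups))
        s_i = [P[k] for k in sorted(spy_idx)]                   # Si
        p_i = [x for k, x in enumerate(P) if k not in spy_idx]  # Pi = P - Si
        m = U + s_i                                             # M = U + Si
        t_j = em_positive_probs(p_i, m)
        spy_probs = sorted(t_j[len(U):])
        # Threshold t: roughly a noise_ratio share of the spies falls below it.
        k = min(int(noise_ratio * len(spy_probs)), len(spy_probs) - 1)
        t = spy_probs[k]
        rn |= {j for j in range(len(U)) if t_j[j] < t}          # RNi
    return rn
```

With well-separated toy data, for instance positives near 10 and an unlabeled set that mixes values near 10 with values near 1, the unlabeled values near 1 score far below every spy and end up in RN, while the unlabeled values near 10 score as high as the spies and are left out.
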
PCT/CN2018/096239 2018-07-19 2018-07-19 Method for identifying user and related device WO2020014916A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/096239 WO2020014916A1 (en) 2018-07-19 2018-07-19 Method for identifying user and related device

Publications (1)

Publication Number Publication Date
WO2020014916A1 (en) 2020-01-23

Family

ID=69164209

Country Status (1)

Country Link
WO (1) WO2020014916A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679019A (en) * 2012-09-10 2014-03-26 腾讯科技(深圳)有限公司 Malicious file identifying method and device
CN105577660A (en) * 2015-12-22 2016-05-11 国家电网公司 DGA domain name detection method based on random forest
CN106682906A (en) * 2015-11-10 2017-05-17 阿里巴巴集团控股有限公司 Risk identification and business processing method and device
CN107220867A (en) * 2017-04-20 2017-09-29 北京小度信息科技有限公司 object control method and device
CN107770132A (en) * 2016-08-18 2018-03-06 中兴通讯股份有限公司 A kind of method and device detected to algorithm generation domain name

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18926618

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18926618

Country of ref document: EP

Kind code of ref document: A1