CN111106980A - Bandwidth binding detection method and device - Google Patents

Bandwidth binding detection method and device

Info

Publication number
CN111106980A
Authority
CN
China
Prior art keywords
host
key
same
user
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911300738.9A
Other languages
Chinese (zh)
Other versions
CN111106980B (en)
Inventor
韩南
雷葆华
叶志钢
王赟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Greenet Information Service Co Ltd
Original Assignee
Wuhan Greenet Information Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Greenet Information Service Co Ltd filed Critical Wuhan Greenet Information Service Co Ltd
Priority to CN201911300738.9A priority Critical patent/CN111106980B/en
Publication of CN111106980A publication Critical patent/CN111106980A/en
Application granted granted Critical
Publication of CN111106980B publication Critical patent/CN111106980B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/50 Testing arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to the technical field of bandwidth detection and provides a bandwidth binding detection method and device. Taking broadband accounts as the statistical dimension, the method counts how many times different fixed user IDs appear in the same broadband account combination at different moments; if the count exceeds a first preset number, the broadband accounts in that combination are marked as a first suspected binding. For each user ID, the method counts its occurrences over all detection periods in one day, summarized by user ID; if the user ID appears on different broadband accounts of the same combination in any specified number of consecutive detection periods, that combination is marked as a second suspected binding. A broadband account combination that satisfies both the first and the second suspected binding is confirmed as bound broadband accounts. The invention enables a telecom operator to effectively discover bound broadband accounts.

Description

Bandwidth binding detection method and device
[ technical field ]
The invention relates to the technical field of bandwidth detection, in particular to a bandwidth binding detection method and device.
[ background of the invention ]
In an operator's traditional fixed-network service, privately connected bandwidth binding is often encountered. For example, a broadband wholesaler may bind ten 100 Mbps broadband lines into a single 1 Gbps line and offer it to users as a private line. However, the price of ten 100 Mbps broadband lines is far lower than that of one 1 Gbps private line, and telecom operators lack an effective means of detecting such private bandwidth binding, which causes them great economic loss. Telecom operators urgently need an effective technical means of discovering and cracking down on users who privately bind bandwidth.
In view of the above, overcoming the drawbacks of the prior art is an urgent problem in the art.
[ summary of the invention ]
The technical problem to be solved by the invention is that telecom operators lack an effective means of detecting private bandwidth binding, which causes them great economic loss. Telecom operators urgently need an effective technical means of discovering and cracking down on users who privately bind bandwidth.
A further technical problem to be solved by the invention is how to find, in different application scenarios, a user ID that efficiently and uniquely identifies the user object, so that users can be detected and as many users as possible are covered.
The invention adopts the following technical scheme:
in a first aspect, the present invention provides a bandwidth bundling detection method, including:
taking broadband accounts as the statistical dimension, and, if the counted number of times that different fixed user IDs appear in the same broadband account combination at different moments exceeds a first preset number, marking the broadband accounts in the combination as a first suspected binding;
counting, for the same user ID, its occurrences in all detection periods in one day, summarizing by user ID, and judging whether the same user ID appears on different broadband accounts of the same broadband account combination in any specified number of consecutive detection periods; if so, marking the broadband account combination as a second suspected binding;
and confirming a broadband account combination that satisfies both the first suspected binding and the second suspected binding as bound broadband accounts.
Preferably, the first preset number is 5 to 10, and the specified number is 3 to 5.
Preferably, before performing the statistics, the method further comprises:
outputting, for each detection period, a user ID-broadband account correspondence table;
merging the data items of the same user ID according to the correspondence tables; during merging, recording the different broadband accounts associated with the same user ID into a transition table, and deleting data items whose user ID is associated with only 1 broadband account; the transition table serves as the analysis object in the statistical process.
Preferably, where the identifier for the same-IP case is I, the identifier for the different-IP case is i, the identifier for the case of the same Host-Key with the same value is V, and the identifier for the case of the same Host-Key with different values is v, the method includes:
setting the identifier for the case where the IP is the same, the Host-Key is the same and the values of the Host-Key are the same as IV; setting the identifier for the case where the IP is the same, the Host-Key is the same and the values of the Host-Key are different as Iv; setting the identifier for the case where the IPs are different, the Host-Key is the same and the values of the Host-Key are the same as iV; setting the identifier for the case where the IPs are different, the Host-Key is the same and the values of the Host-Key are different as iv;
calculating the score of each group of Host-Key from the IV, Iv, iV and iv counted for that group, according to the relationship that larger parameter values of IV and iv are better and smaller parameter values of Iv and iV are better;
and dynamically screening the user ID in the current data analysis scene according to the score of each group of Host-Key.
Preferably, calculating the score of each group of Host-Key from the IV, Iv, iV and iv counted for that group specifically includes:
calculating the score of each group of Host-Key according to the formula Score = (IV × iv) / (Iv × iV); or,
calculating the score of each group of Host-Key according to the formula Score = (IV - Iv) × (iv - iV); or,
calculating the score of each group of Host-Key according to the formula Score = (IV + iv) / (Iv + iV) × 100.
Preferably, dynamically screening out the user ID in the current data analysis scenario according to the score of each group of Host-Key specifically includes:
taking the Host-Keys whose calculated scores rank ahead of a preset first ranking value as the user IDs dynamically generated for the current data analysis scenario; wherein the preset first ranking value is 200 to 500, or the top 10% of the overall ranking is used as the preset first ranking value.
Preferably, when dynamically screening out the user ID in the current data analysis scenario according to the score of each group of Host-Key, if the user ID for a given IP address needs to be determined, the method includes:
for the same IP address, sorting the calculated scores of the corresponding Host-Keys, and taking the Host-Keys ranked ahead of a second preset rank as the user ID corresponding to that IP address.
Preferably, before calculating the score of each group of Host-Key from the IV, Iv, iV and iv counted for that group, the method further comprises:
determining, from the proportion of the parameter values of Iv and/or iV in the total statistical quantity, whether that proportion exceeds a first preset threshold;
and, if the proportion of Iv and/or iV in the total statistical quantity exceeds the first preset threshold, skipping the score calculation for the corresponding Host-Key combination.
Preferably, if the proportion of Iv and/or iV in the total statistical quantity exceeds the first preset threshold, the method further comprises:
where the proportion of Iv in the total statistical quantity exceeds the first preset threshold, further analyzing the name of each Key in the Iv case under the same IP; and, if the Key is determined to be a user name or a device MAC address, recording the corresponding IP in a log as a potential studio.
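The pre-screening condition above can be illustrated with a short Python sketch. It is only one possible reading: the function name, the default threshold of 0.5 (the text gives no value for the first preset threshold), and the choice of the sum of the four counters as the "total statistical quantity" are assumptions made for illustration.

```python
def should_skip_scoring(IV, Iv, iV, iv, threshold=0.5):
    """Return True when a Host-Key group should be skipped before scoring.

    Assumption: the "total statistical quantity" is the sum of the four
    counters, and the threshold value 0.5 is purely illustrative.
    """
    total = IV + Iv + iV + iv
    if total == 0:
        # Nothing was observed for this Host-Key, so there is nothing to score.
        return True
    # Skip when either "bad" counter dominates the observations.
    return Iv / total > threshold or iV / total > threshold


# A Key whose value keeps changing within one IP (large Iv) is skipped.
print(should_skip_scoring(IV=1, Iv=8, iV=0, iv=1))
# A Key with stable per-IP values that differ between IPs is kept.
print(should_skip_scoring(IV=40, Iv=2, iV=5, iv=100))
```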
In a second aspect, the present invention further provides a bandwidth bundling detection apparatus, configured to implement the bandwidth bundling detection method in the first aspect, where the apparatus includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor for performing the bandwidth bundling detection method of the first aspect.
In a third aspect, the present invention also provides a non-transitory computer storage medium storing computer-executable instructions for execution by one or more processors for performing the bandwidth bundling detection method of the first aspect.
The invention enables a telecom operator to effectively discover bound broadband accounts. It is timely: a newly bound broadband account can be detected the next day. It is also efficient: for millions of broadband accounts across an entire province, detection can be completed within hours.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a schematic flow chart of a bandwidth bundling detection method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a bandwidth bundling detection method according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating a user ID discovery method according to an embodiment of the present invention;
fig. 4 is a flowchart illustrating a user ID discovery method according to an embodiment of the present invention;
fig. 5 shows data contents in an example of a user ID discovery method according to an embodiment of the present invention;
fig. 6 shows data contents in an example of a user ID discovery method according to an embodiment of the present invention;
fig. 7 shows data contents in an example of a user ID discovery method according to an embodiment of the present invention;
fig. 8 illustrates data contents in an example of a user ID discovery method according to an embodiment of the present invention;
fig. 9 is data contents in an example of a user ID discovery method according to an embodiment of the present invention;
fig. 10 shows data contents in an example of a user ID discovery method according to an embodiment of the present invention;
fig. 11 shows data contents in an example of a user ID discovery method according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of a device for detecting broadband bundling according to an embodiment of the present invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the description of the present invention, the terms "inner", "outer", "longitudinal", "lateral", "upper", "lower", "top", "bottom", and the like indicate orientations or positional relationships based on those shown in the drawings, and are for convenience only to describe the present invention without requiring the present invention to be necessarily constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
To help operators discover bandwidth binding in time, a bandwidth binding detection method is provided. The method cross-analyzes the user's unique ID, the broadband account and the time of Internet access to detect how the broadband accounts a user goes online with change in time and in number, thereby effectively discovering bandwidth binding. The invention suits service scenarios of different user scales: from tens of thousands to tens of millions of users, bound accounts can be found in a short time.
In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example 1:
an embodiment 1 of the present invention provides a bandwidth bundling detection method, as shown in fig. 1, the method includes:
in step 101, broadband accounts are taken as the statistical dimension; if the counted number of times that different fixed user IDs appear in the same broadband account combination at different moments exceeds a first preset number, the broadband accounts in the combination are marked as a first suspected binding.
In step 102, for the same user ID, its occurrences in all detection periods in one day (for example, 48 detection periods per day) are counted and summarized by user ID, and it is judged whether the same user ID appears on different broadband accounts of the same broadband account combination in any specified number of consecutive detection periods; if so, the broadband account combination is marked as a second suspected binding.
The first preset number is specifically 5 to 10, and the specified number is specifically 3 to 5. In a specific application scenario, these parameter values may also be adjusted according to the actual effect of the method, so they are not limited to the given intervals.
In step 103, a broadband account combination that satisfies both the first suspected binding and the second suspected binding is confirmed as bound broadband accounts.
The embodiment of the invention enables a telecom operator to effectively discover bound broadband accounts. It is timely: a newly bound broadband account can be detected the next day. It is also efficient: for millions of broadband accounts across an entire province, detection can be completed within hours.
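Steps 101 to 103 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the row layout (user_id, combo, period), the function names, and the decision to count each (user ID, period) sighting once are all assumptions; the default thresholds are taken from the 5 to 10 and 3 to 5 ranges given in the text.

```python
from collections import defaultdict


def has_consecutive_run(periods, length):
    """True if the integer period indexes contain `length` consecutive values."""
    run, prev = 0, None
    for cur in sorted(set(periods)):
        run = run + 1 if prev is not None and cur == prev + 1 else 1
        if run >= length:
            return True
        prev = cur
    return False


def detect_bound_combos(rows, first_threshold=5, consecutive=3):
    """rows: iterable of (user_id, combo, period); combo is a tuple of accounts.

    Returns the account combinations flagged by both checks (step 103).
    """
    sightings = defaultdict(set)         # combo -> {(user_id, period)}
    periods_by_user = defaultdict(set)   # (user_id, combo) -> {period}
    for user_id, combo, period in rows:
        key = tuple(sorted(combo))       # account order inside a combo is irrelevant
        sightings[key].add((user_id, period))
        periods_by_user[(user_id, key)].add(period)

    # Step 101: combination seen more often than the first preset number.
    first = {c for c, seen in sightings.items() if len(seen) > first_threshold}
    # Step 102: one user ID on the combination in consecutive detection periods.
    second = {c for (_, c), p in periods_by_user.items()
              if has_consecutive_run(p, consecutive)}
    # Step 103: both suspicions must hold.
    return first & second


rows = [
    ("u1", ("A", "B"), 3), ("u1", ("B", "A"), 4), ("u1", ("A", "B"), 5),
    ("u2", ("A", "B"), 7), ("u3", ("A", "B"), 9), ("u4", ("A", "B"), 11),
    ("u9", ("C", "D"), 1), ("u9", ("C", "D"), 2), ("u9", ("C", "D"), 3),
]
print(detect_bound_combos(rows))
```

In this hypothetical input, the combination (C, D) satisfies only the consecutive-period check and is therefore not confirmed, while (A, B) satisfies both.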
Before steps 101 to 103 are carried out, the raw data is preferably screened first. As shown in fig. 2, before performing the statistics, the method further includes:
in step 201, a user ID-broadband account correspondence table is output for each detection period.
In step 202, the data items of the same user ID are merged according to the correspondence tables; during merging, the different broadband accounts associated with the same user ID are recorded into a transition table, and data items whose user ID is associated with only 1 broadband account are deleted (such items indicate normal use); the transition table serves as the analysis object in the statistical process.
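The merge in steps 201 and 202 amounts to grouping accounts by user ID and discarding users seen on a single account. A minimal Python sketch, assuming each detection period yields (user_id, account) pairs; the function name and the dictionary output shape are illustrative choices, not part of the source:

```python
from collections import defaultdict


def build_transition_table(pairs):
    """Merge (user_id, account) pairs of one detection period by user ID.

    User IDs associated with only one broadband account are treated as
    normal and dropped, as described in step 202.
    """
    accounts_by_user = defaultdict(set)
    for user_id, account in pairs:
        accounts_by_user[user_id].add(account)
    return {
        user_id: sorted(accounts)
        for user_id, accounts in accounts_by_user.items()
        if len(accounts) > 1
    }


pairs = [("u1", "a"), ("u1", "b"), ("u2", "c"), ("u1", "a")]
print(build_transition_table(pairs))
```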
Example 2:
This embodiment uses several data tables as examples to explain the specific process of screening the raw data in embodiment 1; compared with steps 201 and 202 above, the presentation is more detailed, and the tables are shown in simplified form.
First is the original table (i.e., the form taken by the transition table in embodiment 1):
the existing program obtains a user ID-broadband account correspondence table every 30 minutes. The 30-minute interval mainly allows for the time that may elapse when the same user ID legitimately switches accounts; for example, the time a user needs to return home from the company is assumed to be more than 30 minutes. Entries with the same user ID are merged, and data whose number of broadband accounts is 1 is deleted. The user ID in this table is unique.
The typical table format is: the first column is the user ID, the second column is the number of broadband accounts, and the third column is the broadband accounts (broadband account 1, broadband account 2, broadband account 3).
Table 1:
1410382011 3 0351001093759,035101603438,tyadsl03515551803
1364292122 3 035601101739,035601105428,035601105431
1356679180 3 035102079880,035102097096,tyadsl03513107606
Second, the intermediate table:
the 48 user ID-broadband account correspondence tables of one day are summarized into a single table, with a time column added.
The typical table format is: the first column is the user ID, the second column is the number of broadband accounts, the third column is the broadband account combination, and the fourth column is the time period, as shown in table 2 below.
Table 2:
[Table 2 appears only as an image (BDA0002321718670000071) in the original publication.]
The corresponding analysis procedure of embodiment 1 then follows:
based on table 2, the data is summarized by broadband account combination, and the number of occurrences of each identical broadband account combination is calculated.
The typical table format is: the first column is the user ID, the second column is the number of identical broadband account combinations, the third column is the broadband account combination, and the fourth column is the time, as shown in table 3 below.
Table 3:
1306060969 4 01077736,035901218335 2019101417
754094744 4 01077736,035901218335 2019101506
2186506407 4 01077736,035901218335 2019101506
3618795540 4 01077736,035901218335 2019101507
Based on table 2, the number of times the same user ID appears in one day is counted. The data is summarized by user ID, and it is judged whether the same user ID appears in any 3 consecutive time periods.
The typical table format is: the first column is the user ID, the second column is the number of occurrences of the user ID, the third column is the broadband account combination, and the fourth column is the time (at least 3 consecutive time periods), as shown in table 4 below.
Table 4:
[Table 4 appears only as images (BDA0002321718670000072, BDA0002321718670000081) in the original publication.]
5. Based on table 3, duplicates are removed by broadband account combination, and only account combinations that occur 10 or more times are retained, as shown in table 5 below.
Table 5:
035702241208,035702256990 1852
035702187789,035702262661 1484
035101170273,jst351040812 51
035801032209,035801032239 47
6. Based on table 4, duplicates are removed by broadband account combination, and only 1 row is kept for identical account combinations, as shown in table 6 below.
Table 6:
1538075479 3 035700799424,jst357052265
1538075479 3 jst357052264,jst357052265
1579288322 3 035103079172,035103079185
1579288322 3 035103079185,035103089068
The rows are then merged by user ID, with a broadband account that appears in more than one combination kept only once, and the merged table 6 is output.
Table 6 (merged by user ID):
1538075479 035700799424,jst357052264,jst357052265
1579288322 035103079172,035103079185,035103089068
The final result tables:
based on table 5, the broadband account combinations are split and deduplicated by broadband account; the region each account belongs to is then looked up in AAA, the detection reason "multiple times and other broadband accounts are bound for use" is added, and result table 7 is output:
Table 7:
035702241208 Linfen Multiple times and other broadband accounts are bound for use 2019-10-15
035702256990 Linfen Multiple times and other broadband accounts are bound for use 2019-10-15
035702187789 Taiyuan Multiple times and other broadband accounts are bound for use 2019-10-15
035702262661 Taiyuan Multiple times and other broadband accounts are bound for use 2019-10-15
035101170273 Taiyuan Multiple times and other broadband accounts are bound for use 2019-10-15
jst351040812 Taiyuan Multiple times and other broadband accounts are bound for use 2019-10-15
035801032209 Datong Multiple times and other broadband accounts are bound for use 2019-10-15
035801032239 Datong Multiple times and other broadband accounts are bound for use 2019-10-15
Based on table 6, the broadband account combinations are split and deduplicated by broadband account; the region each account belongs to is then looked up in AAA, the detection reason "at least 3 consecutive detection cycles are switched and used by the same user ID" is added, and result table 8 is output:
Table 8:
[Table 8 appears only as an image (BDA0002321718670000091) in the original publication.]
Binding details: the two result tables (table 7 and table 8) are merged, duplicate broadband accounts are deleted, and the detection reason is removed. Then, for each broadband account in the deduplicated result, the first column of table 5 and the second column of table 6 are searched to find all accounts directly associated with it and all accounts indirectly associated with it (associated accounts of its associated accounts). Each account takes 1 row, without duplication.
For example, querying broadband account 035001605355 returns the associated accounts 035001466027 and 035001466043, as shown in the table below:
Current account Associated accounts
035001605355 035001466027,035001466043
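Finding the directly and indirectly associated accounts described above is equivalent to collecting everything reachable from one account through shared combinations. The Python sketch below shows one way to do this with a breadth-first search, which follows associations to any depth; the function name and the two example combinations are hypothetical (the source does not say which of the two associated accounts is direct and which is indirect).

```python
from collections import defaultdict, deque


def associated_accounts(combos, account):
    """Return all accounts linked to `account`, directly or through other accounts."""
    graph = defaultdict(set)
    for combo in combos:
        for a in combo:
            graph[a].update(b for b in combo if b != a)

    seen = {account}
    queue = deque([account])
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(account)   # the queried account is not its own associate
    return sorted(seen)


# Hypothetical combinations chosen so that the lookup matches the example row.
combos = [
    ("035001605355", "035001466027"),
    ("035001466027", "035001466043"),
]
print(associated_accounts(combos, "035001605355"))
```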
Example 3:
The HTTP traffic logs that operators currently generate contain the necessary fields [traffic source IP address, traffic destination IP address, traffic source port number, traffic destination port number, Host, URI, User-Agent, log timestamp]. The inventors found that, in searching for an ID that uniquely identifies a user, the [traffic source IP address] must serve as the object being identified because the user ID must be unique, so that field is useful. Fields such as [traffic destination IP address], [traffic source port number], [traffic destination port number] and [log timestamp] take many values whether the traffic from a given IP comes from a single device or from several devices, so they are not distinctive and cannot be used. Similarly, the [User-Agent] field has the same value for devices of the same kind, provides no distinction, and is therefore insufficient as an ID that uniquely identifies a user.
In summary, the Host field represents the target server that the traffic accesses, and the URI contains the user's query parameters: multiple key-value pairs of the form key=value that carry information about devices, applications and users. Therefore, further analyzing the content of [Host: key=value] (the value of the parameter named key under the same Host) makes it possible to mine and analyze user ID information. The mechanism for doing so is explained next.
In the invention, the content of the Uniform Resource Identifier (URI) field separates parameter pairs with the '&' symbol and separates the parameter name (Key) from the parameter value (Value) with the '=' symbol. As the above analysis shows, a valid network message may contain several key-value pairs that carry information about devices, applications, users and so on. Some parameter names have different meanings in different applications; for example, the Key "u" represents "user" in application A but "url" in application B, so the Key alone is not unique enough to serve as an index. In the embodiments of the present invention, "Host-Key" is therefore used jointly as the index, and the source IP, the Value and the following 4 counters (the counters corresponding to IV, Iv, iV and iv; the initial value of the counter corresponding to IV is 1) that appear in each piece of traffic log data are added under the corresponding index, forming a dictionary set. The structure is shown in table 9 below.
Table 9:
[Table 9 appears only as an image (BDA0002321718670000111) in the original publication.]
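As a rough illustration of the "Host-Key" index described above, the following Python sketch parses the query part of a URI with the standard library and files each value under (Host, Key) and source IP. The nested-dictionary layout, the function name and the sample hosts are assumptions; the actual structure of table 9 is not recoverable here, and the four counters are left out of this sketch.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def index_log_record(index, source_ip, host, uri):
    """Add every key=value pair of the URI query to the Host-Key index."""
    query = urlsplit(uri).query
    # parse_qsl splits pairs on '&' and names from values on '='.
    for key, value in parse_qsl(query, keep_blank_values=True):
        index[(host, key)][source_ip].append(value)


# index[(host, key)][source_ip] -> list of observed values
index = defaultdict(lambda: defaultdict(list))
index_log_record(index, "10.0.0.1", "a.example", "/login?u=alice&t=1")
index_log_record(index, "10.0.0.2", "b.example", "/go?u=http%3A%2F%2Fx.example")

# The same Key "u" is kept apart because the Host differs.
print(index[("a.example", "u")]["10.0.0.1"])
print(index[("b.example", "u")]["10.0.0.2"])
```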
When statistics are gathered within each "Host-Key", the Value may be the same or different both within the same IP and between different IPs. For an ID that uniquely identifies a user, ideally, across a large number of different IPs, many different values appear under the same "Host-Key", which means that the Key appears often when many users use their devices and that its values differ, indicating uniqueness; at the same time, few values should be shared between different IPs, otherwise the Key does not distinguish between devices;
meanwhile, within the same IP, the same value should appear many times under the given "Host-Key", so that the Key can be collected within a limited sampling time; that is, an ID that uniquely identifies a user must be reproducible.
Description of IV, Iv, iV and iv:
(1) IV: the number of times the Value of the same Host-Key is the same within the same IP. The larger the value, the better.
(2) Iv: the number of different Values of the same Host-Key within the same IP. The smaller the value, the better. If the value of a Key changes all the time while a user uses the device, as with random-number or current-time Keys, it cannot serve as a user ID to distinguish users.
(3) iV: the number of times the same Value of the same Host-Key appears in different IPs. The smaller the value, the better, since the values generated by different users' devices should differ as much as possible to give a high degree of discrimination.
(4) iv: the number of different Values of the same Host-Key between different IPs. The larger the value, the more frequently the "Host-Key" appears on different devices, and the greater its reference value.
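One simple way to make the four counters concrete is to compare every pair of observations of a single Host-Key and classify the pair by whether the IPs match and whether the values match. The Python sketch below does exactly that. Pairwise counting is an assumption made for illustration: the text does not spell out the exact counting rule, and a production system would likely use an incremental scheme rather than this quadratic comparison.

```python
from collections import Counter
from itertools import combinations


def count_pairs(observations):
    """observations: list of (source_ip, value) for ONE Host-Key.

    Classifies each unordered pair of observations as
    IV (same IP, same value), Iv (same IP, different value),
    iV (different IP, same value) or iv (different IP, different value).
    """
    counts = Counter({"IV": 0, "Iv": 0, "iV": 0, "iv": 0})
    for (ip_a, val_a), (ip_b, val_b) in combinations(observations, 2):
        label = ("I" if ip_a == ip_b else "i") + ("V" if val_a == val_b else "v")
        counts[label] += 1
    return dict(counts)


observations = [("ip1", "x"), ("ip1", "x"), ("ip2", "y"), ("ip2", "z")]
print(count_pairs(observations))
```

In this hypothetical sample, the repeated value on ip1 contributes to IV, the changing value on ip2 contributes to Iv, and all cross-IP pairs differ, so they contribute to iv.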
The embodiment of the present invention provides a user ID discovery method, in which the identifier for the same-IP case is I, the identifier for the different-IP case is i, the identifier for the case of the same Host-Key with the same value is V, and the identifier for the case of the same Host-Key with different values is v. As shown in fig. 3, the method includes:
in step 301, the identifier for the case where the IP is the same, the Host-Key is the same and the values of the Host-Key are the same is set as IV; the identifier for the case where the IP is the same, the Host-Key is the same and the values of the Host-Key are different is set as Iv; the identifier for the case where the IPs are different, the Host-Key is the same and the values of the Host-Key are the same is set as iV; and the identifier for the case where the IPs are different, the Host-Key is the same and the values of the Host-Key are different is set as iv.
In step 302, the score of each group of Host-Key is calculated from the IV, Iv, iV and iv counted for that group, according to the relationship that larger parameter values of IV and iv are better and smaller parameter values of Iv and iV are better.
In step 303, the user ID in the current data analysis scenario is dynamically filtered out according to the score of each group of Host-keys.
The discovery of the user ID in the embodiment of the invention is learned through field data, is not preset, and has field adaptability. The ID of the user can be uniquely identified in full through field data, so that the number is more comprehensive, objective and accurate than artificial definition.
In this embodiment of the present invention, calculating the score of each group of Host-Key from the IV, Iv, iV and iv counted for that group specifically includes:
calculating the Score of each group of Host-Key according to formula one, Score = (IV × iv) / (Iv × iV); or,
calculating the Score of each group of Host-Key according to formula two, Score = (IV - Iv) × (iv - iV); or,
calculating the Score of each group of Host-Key according to formula three, Score = (IV + iv) / (Iv + iV) × 100.
Each of the three formulas has its own characteristics. Formula one produces the largest spread and suits scenes in which a large volume of data is analyzed, because it pulls apart the scores of different Host-Key combinations. Formula three is the most accurate, but its drawback is that the calculated scores are more likely to coincide. Which formula to use is determined by the data volume and the complexity of the data in the actual situation.
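The three scoring formulas can be sketched as follows. This is a minimal illustration, assuming the reading in which IV and iv are "larger is better" and Iv and iV are "smaller is better"; the class name `Counts`, the function names and the zero-division guard are hypothetical and not part of the described method. The sample counts are those of the "uin" row in table two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """The four counters of one Host-Key combination (hypothetical container)."""
    IV: int  # same IP, same Value
    Iv: int  # same IP, different Values
    iV: int  # different IPs, same Value
    iv: int  # different IPs, different Values


def score_formula_one(c: Counts) -> float:
    # Ratio of the "larger is better" product to the "smaller is better" product.
    # max(..., 1) only avoids division by zero; this guard is an assumption.
    return (c.IV * c.iv) / max(c.Iv * c.iV, 1)


def score_formula_two(c: Counts) -> int:
    # Product of the two differences.
    return (c.IV - c.Iv) * (c.iv - c.iV)


def score_formula_three(c: Counts) -> float:
    # Ratio of sums, scaled by 100.
    return (c.IV + c.iv) / max(c.Iv + c.iV, 1) * 100


uin = Counts(IV=50, Iv=3, iV=1188, iv=24332)
print(score_formula_one(uin), score_formula_two(uin), score_formula_three(uin))
```

Formula one and formula three differ only in whether the counters are multiplied or added, which is why formula one spreads the scores further apart on large data sets.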
In the embodiment of the present invention, the user ID in the current data analysis scenario is dynamically screened out, and at least the following two cases can be distinguished in the actual implementation process.
Case one:
The Host-Key combinations whose calculated scores rank before a preset first ranking value are taken as the user IDs dynamically generated for the current data analysis scene. The preset first ranking value is 200-500, or the position that marks the top 10% of all combinations. Preferably, coverage is also considered: the traffic or data volume corresponding to the user IDs ranked before the first ranking value should cover more than 80% of the data volume in the whole current scene. In a specific implementation, if the coverage is lower than 70%, the first ranking value may be increased so that more Host-Key combinations are screened out as user IDs.
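The selection in case one can be sketched as below. This is only an illustration under stated assumptions: the function name `select_global_ids`, the `step` by which the ranking value is raised, and the way coverage is measured (summing a per-combination volume and dividing by a total) are hypothetical simplifications; the text only says that the first ranking value is increased when coverage falls below 70%.

```python
from typing import Dict, List, Tuple

HostKey = Tuple[str, str]  # (Host, Key)


def select_global_ids(
    scores: Dict[HostKey, float],
    volume: Dict[HostKey, int],
    total_volume: int,
    first_rank: int = 200,
    step: int = 50,
    min_coverage: float = 0.70,
) -> List[HostKey]:
    """Take the top-ranked Host-Key combinations, widening the cut if coverage is low."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = min(first_rank, len(ranked))
    while True:
        chosen = ranked[:n]
        covered = sum(volume.get(hk, 0) for hk in chosen)
        enough = total_volume == 0 or covered / total_volume >= min_coverage
        if enough or n >= len(ranked):
            return chosen
        # Coverage too low: raise the first ranking value and try again.
        n = min(n + step, len(ranked))


scores = {("a", "k1"): 9.0, ("b", "k2"): 5.0, ("c", "k3"): 1.0}
volume = {("a", "k1"): 10, ("b", "k2"): 50, ("c", "k3"): 40}
print(select_global_ids(scores, volume, 100, first_rank=1, step=1, min_coverage=0.5))
```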
Case two:
specifically, when the user ID of the same IP address needs to be determined, the method includes:
for the same IP address, sorting the calculated scores of the corresponding Host-Key combinations and taking the Host-Key combinations ranked before a second preset rank as the user IDs corresponding to that IP address. The second preset rank includes, but is not limited to, any value between 10 and 50 and may be adjusted to the specific situation. It is generally preferable that the data tagged by the selected Host-Key combinations account for more than 80% of the total (measured by the number of messages in the internet field, or by data size in other fields). This coverage of the total data volume indicates that no necessary user ID has been omitted.
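A per-IP ranking of this kind might look like the following sketch. The function name `ids_per_ip`, the input shape (an iterable of `(ip, host, key)` observations plus a score table) and the tie-breaking by name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

HostKey = Tuple[str, str]  # (Host, Key)


def ids_per_ip(
    observations: Iterable[Tuple[str, str, str]],
    scores: Dict[HostKey, float],
    second_rank: int = 10,
) -> Dict[str, List[HostKey]]:
    """For each IP, keep the best-scoring Host-Key combinations seen under it."""
    seen: Dict[str, Set[HostKey]] = defaultdict(set)
    for ip, host, key in observations:
        seen[ip].add((host, key))
    return {
        # Sort by descending score; ties are broken by name so the result is stable.
        ip: sorted(hks, key=lambda hk: (-scores.get(hk, 0.0), hk))[:second_rank]
        for ip, hks in seen.items()
    }


obs = [
    ("10.0.0.1", "h1", "uid"),
    ("10.0.0.1", "h2", "ts"),
    ("10.0.0.2", "h1", "uid"),
]
table = {("h1", "uid"): 8.0, ("h2", "ts"): 0.5}
print(ids_per_ip(obs, table, second_rank=1))
```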
The two cases differ as follows. Case one applies when the IP is not used as the data identifier, that is, the IP is not carried in each data item; the user IDs are then obtained directly from the score of each Host-Key according to the method provided by the embodiment of the present invention. Case two applies when the IP is reliably carried in the data and a user ID is wanted in place of the IP, or when multiple terminals share one IP in a LAN environment; it specifies how to quickly determine the Host-Key combinations that can represent the ID of each user in that environment.
In the embodiment of the present invention, the Key in the Host-Key includes, but is not limited to, any one of an equipment MAC address, a mobile phone number, a user name, an application ID, an IMEI number, and location information.
In the embodiment of the present invention, to facilitate data statistics, the acquired data is organized for each counted Host(i)-Key(i) according to the formula:
Host(i)-Key(i): [IP(1), Value(1), IV(1), Iv(1), iV(1), iv(1)], [IP(2), Value(2), IV(2), Iv(2), iV(2), iv(2)], …, [IP(j), Value(j), IV(j), Iv(j), iV(j), iv(j)];
where i is the index of the Host-Key combination and j is the index of the specific IP under each Host-Key combination. The above formula is particularly suitable for finding, for an IP, a user ID that can replace the IP as a device or user identifier.
Example 3:
In this example, a problem was found when testing the method set forth in example 1, especially when further looking for a unique ID in case two of example 1; see table two below:
table two:
Host Key value IV Iv iV iv
dns.weixin.qq.com uin 1305491620 50 3 1188 24332
The Key "uin" in table two was verified to be a WeChat ID. When iV takes its maximum, the same WeChat ID appears under 1188 different IPs, which does not match reality. On inspection, the value shared across those 1188 IPs is "0", which is obviously a special value rather than a real WeChat ID. Therefore, to avoid similar statistical deviation in other Keys, the next largest count is taken as the final statistical result for all 4 counters, and this tested well.
(1) IV: counting the number of identical Values of the same Host-Key within the same IP, arranging the counts by size, and taking the next largest as the final result. The reason for taking the next largest: a larger count indicates that the value is generated while the user is actually using the device, but taking the maximum can introduce deviation caused by special values.
(2) Iv: counting the number of different values of Value of the same Host-Key in the same IP, arranging the count values according to the size, and taking the next largest Value as the final result.
(3) iV: counting the number of values with the same Value of the same Host-Key among different IPs, arranging the count values according to the size, and taking the next largest Value as the final result.
(4) iv: counting the number of different values of Value of the same Host-Key among different IPs, arranging the count values according to the size, and taking the next largest Value as the final result.
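Taking the next largest count instead of the maximum can be sketched as below. The helper name `next_largest` and its behaviour for fewer than two counts (fall back to the only count, or 0 for an empty input) are assumptions; the text does not say what to do in those cases.

```python
from typing import Iterable


def next_largest(counts: Iterable[int]) -> int:
    """Return the second entry of the counts sorted in descending order."""
    ordered = sorted(counts, reverse=True)
    if not ordered:
        return 0  # assumption: no observations means a count of 0
    if len(ordered) == 1:
        return ordered[0]  # assumption: nothing to skip, keep the only count
    return ordered[1]


# A special value such as "0" seen under 1188 IPs no longer dominates the counter.
print(next_largest([1188, 7, 3]))
```

Because only the single largest count is skipped, this guards against one special value per Host-Key; several distinct special values would still need separate filtering.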
The corresponding method process is shown in fig. 4, and comprises the following steps:
in step 401, the constructed data is input for comparison.
Specifically, as shown in embodiment 1, the data is organized for each counted Host(i)-Key(i) according to the formula:
Host(i)-Key(i): [IP(1), Value(1), IV(1), Iv(1), iV(1), iv(1)], [IP(2), Value(2), IV(2), Iv(2), iV(2), iv(2)], …, [IP(j), Value(j), IV(j), Iv(j), iV(j), iv(j)]; where i is the index of the Host-Key combination and j is the index of the specific IP under each Host-Key combination. The above formula is particularly suitable for finding, for an IP, an ID that can replace the IP as a device or user identifier.
In step 402, for the same IP and the same Host-Key, the number of occurrences with the same Value is counted and assigned to IV.
In step 403, for different IPs and the same Host-Key, the number of occurrences with the same Value is counted and assigned to iV.
In step 404, for the same IP and the same Host-Key, the number of different Values is counted and assigned to Iv.
In step 405, for different IPs and the same Host-Key, the number of different Values is counted and assigned to iv.
In step 406, after the counts IV, Iv, iV and iv have been calculated, the final IDs are screened out.
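Steps 402 to 405 can be illustrated with the sketch below for a single Host-Key combination. The exact counting rules are not fully specified in the text, so the interpretation here is an assumption: IV is the largest number of repeats of one Value under one IP, Iv is the largest number of distinct Values under one IP, iV is the largest number of distinct IPs sharing one Value, and iv is the number of distinct Values seen across all IPs. The function name `count_host_key` is hypothetical, and the maximum is used where Example 3 would take the next largest count.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple


def count_host_key(records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """records: (ip, value) pairs observed for one Host-Key combination."""
    per_ip_value: Counter = Counter()                      # (ip, value) -> occurrences
    values_by_ip: Dict[str, Set[str]] = defaultdict(set)   # ip -> distinct values
    ips_by_value: Dict[str, Set[str]] = defaultdict(set)   # value -> distinct ips
    for ip, value in records:
        per_ip_value[(ip, value)] += 1
        values_by_ip[ip].add(value)
        ips_by_value[value].add(ip)
    return {
        "IV": max(per_ip_value.values(), default=0),
        "Iv": max((len(v) for v in values_by_ip.values()), default=0),
        "iV": max((len(i) for i in ips_by_value.values()), default=0),
        "iv": len(ips_by_value),
    }


sample = [
    ("1.1.1.1", "a"), ("1.1.1.1", "a"), ("1.1.1.1", "b"),
    ("2.2.2.2", "a"), ("2.2.2.2", "c"),
]
print(count_host_key(sample))
```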
Example 4:
the embodiment of the present invention further performs example tests and verifications from different angles with respect to the ID discovery method proposed in embodiment 21. The method comprises the following specific steps:
After the scoring of each group of Host-Key in step 303 in embodiment 2, the results are further arranged in descending order of iv so that IDs with a high occurrence frequency can be selected; to a certain extent these reflect how well the network's users are covered, i.e. the coverage or effect in application. The traditional IDs (IMEI, MAC, IP, IDFA, IMSI and others whose meaning is known) are selected first, and the rest are then judged manually to obtain all IDs.
Incomplete Key list of traditional IDs:
'imei','imsi','uuid','idfa','idfv','deviceid','androidid','mac','ip','userid','user_id','uid','token','phone','phone_num','openudid','deviceId','device_id','android_id','_mac','_imei','username','userip','userId','user','uniqueId','uname','uin','udid','ucid','tk','suuid','suid','signature','sign','openid','openUDID','nickname','model','mobile','mid','macid','macaddress','mac_address','dviceid','device_type','device_type','device','dev_id','d_model','cuid','cpu','clientid','client_userid','_device','_androidId','UID'
(1) analysis of GET-type data
In the Uri ID list, for IDs with iv greater than 50, all conventional IDs such as MAC addresses, IMEI numbers, etc. are selected first (in the conventional IDs, the case of a false ID can also be determined according to the case of a counter, even if the value is not really as indicated by the Key name). The screening idea of the non-traditional ID is as follows: the Value of IV is larger, the smaller the Value of IV is, the better the Value of IV is, the difference between IvAll and IV is larger, and thus the non-traditional ID is comprehensively selected by combining the Host, the Key name, the Value form and the like.
The larger iv is, the higher the detection rate; the larger IV is, the higher the probability of acquisition within the sampling time. The IDs with the largest value of each in the resulting uri_get final table are therefore taken as representatives for analysis:
Mode one, ID with maximum iv:
FIG. 5 is a screenshot of the top-ranked IDs in the original table (which lists all IDs). FIG. 6 is a screenshot of the top 3 IDs in the uri_get final table (which lists the selected uri_get IDs).
The ID ranked 1 in the uri_get final table (FIG. 6) is ranked 8 in the original table (FIG. 5). To verify it, an iOS phone was used during testing to capture traffic from the iQiyi app for about 5 minutes. In the complete pcap capture, the value of the ID with domain name "t7z.cut.iqiyi.com" and Key "f" is always equal to the value of the ID with the same domain name and Key "idfa" (as shown in fig. 7 and 8), which indicates that the value of "f" generated by each device is unique. In that case, the Iv value is the number of devices sharing a given IP, and the IvAll value is the number of IPs, among all IPs, on which internet access is shared. If the value of "f" were not unique on each device, i.e. one device had multiple values, Iv could not represent the number of devices sharing the IP; in such cases the Iv value is observed to be relatively large, as the first 7 IDs in fig. 5 illustrate.
It was found that the domain name "t7z.cut.iqiyi.com" appeared 75 times, "idfa" appeared 20 times and "f" appeared 50 times, i.e. the detection rate of an ID such as "f" is greater than that of "idfa"; "idfa" is ranked 73 in the original table ("f" is ranked 8), as shown in fig. 9:
When Key is "f", the iv value is 10597, while the iv value of "idfa" is 1884, so the detection rate of "f" is about 6 times that of "idfa". When Key is "f" its IvAll value is 2988, and when Key is "idfa" its IvAll value is 99, so the shared-access detection rate of "f" is about 30 times that of "idfa". Overall, using only traditional IDs gives a narrow detection range and insufficient detection strength, and the detection rate improves greatly once non-traditional IDs are added. In the original table (FIG. 5), "kp" and "g" rank 10 and 11; their values were also found in the capture to be equal to the value of "idfa", and they rank 2 and 3 in the uri_get final table (FIG. 6), respectively. This again reflects the wider coverage of non-traditional IDs.
In addition, the reason why the first 7 IDs (fig. 5) of the original table are unavailable: the IV values of the two IDs at the top 2 are relatively too small, while the IV value is large; the Iv values of the first 3 and 6 and 7 are too large, i.e. the values of the IDs appearing in the same IP are said to change continuously; the iV values for the 3 rd and 6 th IDs are very large, i.e. too many identical values occur for different IPs (some keys such as device model numbers, although iV will also be large, can still be used to some extent as IDs). And it can be roughly derived from the names of keys that "ri" is a random number, "lgt" represents longitude, "ltt" represents latitude, and the like, and these keys have randomness and cannot correspond to a device.
Mode two, ID with maximum IV:
The ID ranked 14 has Host "cache.video.iqiyi.com" and Key "k_uid". In the same pcap capture, the domain name appears 13 times and "k_uid" also appears 13 times, that is, the Key appears whenever the Host appears. Its relatively large IV value is therefore reasonable, and the probability of collecting the Key within the limited sampling time is relatively high. Note that "k_uid" is also not a traditional ID, which again shows the advantage of selecting non-traditional IDs. The value of "k_uid" is also the same as that of "idfa", as shown in fig. 11:
The device used in our test runs iOS, so the Key "idfa" is present; the value in figure 10, however, has the form of an IMEI number, from which we conclude that on Android phones the value is the device's IMEI. In either case, the value is unique on both iOS and Android. An Iv of 11 indicates that 11 devices share one IP, and an IvAll of 1227 indicates that 1227 IPs have shared devices.
A total of 311 IDs (including legacy and non-legacy IDs) is eventually obtained. Similarly, the top 500 Cookie IDs with iv greater than 97 were analyzed to yield 367 IDs.
(2) Analysis of POST type data
The first 30 Uri IDs with iv greater than 19 were selected for analysis to yield 24 IDs, and the first 30 Cookie IDs with iv greater than 17 were selected for analysis to yield 30 IDs.
ID table screening formula:
Since the larger IV is the better, the smaller iV is the better, the larger iv is the better, and the smaller IvAll is the better, the TopN items can first be selected in descending order of iv. A Score is then calculated according to the formula (IV × iv) / (iV × IvAll) and the scores are sorted in descending order, so that the TopM optimal items (M < N) can be screened out.
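The two-stage screening can be sketched as follows, assuming the reading in which the first cut sorts by iv and the score is (IV × iv) / (iV × IvAll). The function name `screen_ids`, the dictionary row layout and the zero-division guard are hypothetical, and the sample rows are toy values rather than measured data.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, int]]


def screen_ids(rows: List[Row], top_n: int, top_m: int) -> List[Row]:
    """Keep the TopN rows by iv, then the TopM of those by score (M < N)."""
    if top_m >= top_n:
        raise ValueError("top_m must be smaller than top_n")
    candidates = sorted(rows, key=lambda r: r["iv"], reverse=True)[:top_n]

    def score(r: Row) -> float:
        # max(..., 1) only avoids division by zero; this guard is an assumption.
        return (r["IV"] * r["iv"]) / max(r["iV"] * r["IvAll"], 1)

    return sorted(candidates, key=score, reverse=True)[:top_m]


rows = [
    {"id": "a", "IV": 10, "iV": 1, "iv": 100, "IvAll": 2},
    {"id": "b", "IV": 5, "iV": 5, "iv": 200, "IvAll": 4},
    {"id": "c", "IV": 50, "iV": 1, "iv": 10, "IvAll": 1},
]
print([r["id"] for r in screen_ids(rows, top_n=2, top_m=1)])
```

Row "c" has a high score but is dropped by the first cut because its iv is the lowest, which shows how the TopN step limits the candidates to frequently observed IDs before scoring.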
The embodiment of the present invention shows that, when the method described in embodiment 1 is implemented in combination with a specific application scenario, the ordering policy can be appropriately adjusted, so as to obtain a better result.
Example 5:
fig. 12 is a schematic diagram of an architecture of a bandwidth bundling detection apparatus according to an embodiment of the present invention. The bandwidth bundling detection device of the present embodiment comprises one or more processors 21 and a memory 22. In fig. 12, one processor 21 is taken as an example.
The processor 21 and the memory 22 may be connected by a bus or other means, and fig. 12 illustrates the connection by a bus as an example.
The memory 22, which is a non-volatile computer-readable storage medium, may be used to store a non-volatile software program and a non-volatile computer-executable program, such as the bandwidth binding detection method in embodiment 1. The processor 21 executes the bandwidth binding detection method by executing non-volatile software programs and instructions stored in the memory 22.
The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, and these remote memories may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The program instructions/modules are stored in the memory 22 and, when executed by the one or more processors 21, perform the bandwidth binding detection method of embodiment 1 described above, for example, perform the steps shown in fig. 1-4 described above.
It should be noted that, because the information interaction and execution processes between the modules and units in the apparatus and system are based on the same concept as the method embodiments of the present invention, reference may be made to the description of the method embodiments for details, which are not repeated here.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the embodiments may be implemented by associated hardware as instructed by a program, which may be stored on a computer-readable storage medium, which may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, or the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A bandwidth binding detection method is characterized by comprising the following steps:
taking the broadband account as a statistical dimension, if the counted times that a plurality of different fixed user IDs appear in the same broadband account combination at different moments exceed a first preset time, marking the broadband account in the combination as a first suspected bundle;
counting by user ID: summarizing, per user ID, the occurrences of the same user ID in all detection periods within one day, and judging whether the same user ID appears, in a specified number of consecutive detection periods, in different broadband accounts within the same broadband account combination; if so, marking the broadband account combination as a second suspected binding;
and confirming the broadband account combination meeting the first suspected binding and the second suspected binding at the same time as a bound broadband account.
2. The method according to claim 1, wherein the first predetermined number of times is specifically 5 to 10 times; the specified number is specifically 3-5 times.
3. The method of claim 1, wherein prior to performing the statistics, the method further comprises:
outputting a corresponding relation table of user ID-broadband account numbers in each detection period according to the detection periods;
merging the data items of the same user ID according to the corresponding relation table; during the merging, recording the different broadband accounts associated with the same user ID into a transition table, and deleting the data items in which only 1 broadband account is associated with the user ID; the transition table serves as the analysis object in the statistical process.
4. The bandwidth binding detection method according to claim 1, wherein the identifier for the case where the IP is the same is I, the identifier for the case where the IP is different is i, the identifier for the case where the Host-Key is the same and its Value is the same is V, and the identifier for the case where the Host-Key is the same and its Value is different is v, the method comprising:
setting the identifiers corresponding to the conditions that the IP is the same, the Host-Key is the same and the value values of the Host-Key are the same as the IV; setting identifiers corresponding to the conditions that the IP is the same, the Host-Key is the same and the value values of the Host-Key are different as Iv; setting the identifiers corresponding to the situations that the IPs are different, the Host-Key is the same and the value values of the Host-Key are the same as iV; setting the identifiers corresponding to different IP, the same Host-Key and different values of the Host-Key as iv;
calculating the score of each group of Host-Key from the IV, Iv, iV and iv counted for that group, according to the relationship that the larger the values of IV and iv the better, and the smaller the values of Iv and iV the better;
and dynamically screening the user ID in the current data analysis scene according to the score of each group of Host-Key.
5. The method according to claim 4, wherein calculating the score of each group of Host-Key from the IV, Iv, iV and iv counted for that group specifically includes:
calculating the Score of each group of Host-Key according to the formula Score = (IV × iv) / (Iv × iV); or,
calculating the Score of each group of Host-Key according to the formula Score = (IV - Iv) × (iv - iV); or,
calculating the Score of each group of Host-Key according to the formula Score = (IV + iv) / (Iv + iV) × 100.
6. The method according to claim 5, wherein the dynamically filtering out the user ID in the current data analysis scenario according to the score of each Host-Key group specifically comprises:
taking a Host-Key with a computed score before a preset first ranking value as a user ID dynamically generated by a current data analysis scene; wherein, the preset first ranking value is 200-500 or the ranking is located at the top 10% of the total as the preset first ranking value.
7. The method according to claim 5, wherein the step of dynamically screening out user IDs in a current data analysis scenario according to the score of each group of Host-keys, specifically when the user IDs of corresponding IP addresses need to be determined for the same IP address, includes:
and aiming at the same IP address, sequencing the corresponding Host-Key calculation scores, and taking the Host-Key with the rank before the second preset rank as the user ID corresponding to the IP address.
8. The method of claim 4, wherein before calculating the score of each group of Host-Key from the IV, Iv, iV and iv counted for that group, the method further comprises:
determining, from the proportion of the Iv and/or iV parameter value in the total statistical quantity, whether that proportion exceeds a first preset threshold;
and if the proportion of Iv and/or iV in the total statistical quantity exceeds the first preset threshold, skipping the score calculation for the corresponding Host-Key combination.
9. The method of claim 8, wherein if the proportion of Iv and/or iV in the total statistical quantity exceeds the first preset threshold, the method further comprises:
for the fact that the ratio of Iv in the total statistical quantity exceeds the first preset threshold, further analyzing the name of each Key under the Iv condition under the same IP; and if the Key is determined to be the user name or the device MAC address, recording the corresponding IP in a log as a potential studio.
10. A bandwidth binding detection apparatus, the apparatus comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being configured to perform the bandwidth binding detection method of any one of claims 1-9.
CN201911300738.9A 2019-12-17 2019-12-17 Bandwidth binding detection method and device Active CN111106980B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911300738.9A CN111106980B (en) 2019-12-17 2019-12-17 Bandwidth binding detection method and device

Publications (2)

Publication Number Publication Date
CN111106980A true CN111106980A (en) 2020-05-05
CN111106980B CN111106980B (en) 2021-08-03

Family

ID=70423134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911300738.9A Active CN111106980B (en) 2019-12-17 2019-12-17 Bandwidth binding detection method and device

Country Status (1)

Country Link
CN (1) CN111106980B (en)

Also Published As

Publication number Publication date
CN111106980B (en) 2021-08-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant