CN110020234B - Method and device for determining broadband network access point information - Google Patents


Info

Publication number
CN110020234B
Authority
CN
China
Prior art keywords
access point
user
broadband network
network access
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710635652.6A
Other languages
Chinese (zh)
Other versions
CN110020234A (en)
Inventor
李明洋
孙静博
杨明川
刘杨
刘康
曹诗苑
全硕
左闯
卢毅
杜帅
贺群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN201710635652.6A
Publication of CN110020234A
Application granted
Publication of CN110020234B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Abstract

The invention discloses a method and a device for determining broadband network access point information, and relates to the field of network data classification. The method comprises: extracting user information from broadband network data; extracting user characteristics from the user information; and using the user characteristics as the input of a classifier to preliminarily predict the broadband network access point information corresponding to the user, wherein the classifier is trained with labeled broadband network access point information to construct a machine learning model. The invention improves the efficiency of confirming broadband network access point information and provides data for filling in missing information and correcting existing information.

Description

Method and device for determining broadband network access point information
Technical Field
The present invention relates to the field of network data classification, and in particular, to a method and an apparatus for determining broadband network access point information.
Background
Usually, when a user subscribes to a broadband internet service, some basic information is recorded, such as a contact and a broadband type (home, business, etc.). Knowing the actual use of a broadband connection plays an important role in analyzing network traffic data, for the following reasons:
First, understanding the actual use of a broadband connection helps with business recommendation. For example, precision marketing such as car purchase prediction can be carried out for a household, and packages with different traffic allowances can be offered to small and large enterprises. Second, it helps prevent enterprise broadband users from deliberately concealing the actual use in order to evade information review or reduce required charges. Third, unlike a mobile network, a broadband network does not identify individual users, so a user in the network is not unique and analyses centered on the user cannot be carried out smoothly; if home access points can be distinguished accurately, individual behavior within a household can be tracked and analyzed more easily.
Therefore, accurate broadband network access point information benefits both data analysis and the development of operator services. However, because users fill in forms carelessly, deliberately conceal the real use, and so on, the records of broadband network access points in operator systems have the following problems:
First, omissions by users are inevitable when information is collected, which leads to missing data, so the original home/enterprise classification labels are incomplete. According to statistics, access points with classification labels in one fixed network account for only about 29% of the total, and a large amount of information has to be filled in afterwards. Second, users may hide the actual use when filling in the form, or the use of the broadband connection may change, so some records do not match reality, for example the recorded access point type is inconsistent with the actual use. Third, once a residence is used as the office of a start-up company, a small entertainment venue, or the like, the actual use of the broadband network access point has changed, but the operator is not aware of it.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method and a device for determining broadband network access point information that improve the efficiency of confirming broadband network access point information.
According to an aspect of the present invention, a method for determining broadband network access point information is provided, including: extracting user information in broadband network data; extracting user characteristics according to the user information; the user characteristics are used as the input quantity of a classifier, and the classifier is used for preliminarily predicting the broadband network access point information corresponding to the user; and training a classifier by using the labeled broadband network access point information to construct a machine learning model.
Further, the method further comprises: re-labeling the labeled broadband network access point information according to a strong assumption condition and assigning training weights; and training the classifier with the re-labeled broadband network access point information, using the user characteristics as the input of the classifier, and preliminarily predicting the broadband network access point information corresponding to the user with a binary classification model.
Further, re-labeling the labeled broadband network access point information according to the strong assumption condition and assigning training weights comprises: if the number of devices under the broadband network access point is greater than or equal to a first threshold, labeling the type of the broadband network access point as enterprise and assigning a first training weight; if the number of devices is less than or equal to a second threshold, labeling the type as home and assigning a second training weight; if the number of devices is greater than the second threshold and less than the first threshold, keeping the labeled type and assigning a third training weight; wherein the first training weight and the second training weight are greater than the third training weight.
Further, the method further comprises: and training a classifier by taking preset time as a unit based on a voting mechanism of time lapse, and updating the prediction result of the broadband network access point information corresponding to the user by adopting a relative majority voting method.
Further, the user information comprises user account information and user behavior information; the extracting the user information in the broadband network data comprises the following steps: matching user account information from a Uniform Resource Locator (URL) of broadband network data through a regular expression of the user account; and acquiring user behavior information from the URL of the broadband network data based on the primary domain name.
Further, the user characteristics comprise user account characteristics and user behavior characteristics; the user account characteristics and the user behavior characteristics are determined based on the number of user accounts, the richness of user behavior and/or the active time period of the access point.
According to another aspect of the present invention, there is also provided an apparatus for determining broadband network access point information, including: the data collector is used for extracting user information in the broadband network data; a feature extractor for extracting user features according to the user information; the classifier is used for taking the user characteristics as the input quantity of the classifier and preliminarily predicting the broadband network access point information corresponding to the user by utilizing the classifier; the classifier utilizes the labeled broadband network access point information to train and construct a machine learning model.
Further, the apparatus further comprises: a sample resetter for re-labeling the labeled broadband network access point information according to a strong assumption condition and assigning training weights; the classifier is further trained with the re-labeled broadband network access point information, takes the user characteristics as its input, and preliminarily predicts the broadband network access point information corresponding to the user with a binary classification model.
Further, the sample resetter is also used for labeling the type of the broadband network access point as enterprise and assigning a first training weight if the number of devices under the broadband network access point is greater than or equal to a first threshold; labeling the type as home and assigning a second training weight if the number of devices is less than or equal to a second threshold; and keeping the labeled type and assigning a third training weight if the number of devices is greater than the second threshold and less than the first threshold; wherein the first training weight and the second training weight are greater than the third training weight.
Further, the apparatus further comprises: and the result voter is used for training the classifier by taking the preset time as a unit based on a voting mechanism of time lapse and updating the prediction result of the broadband network access point information corresponding to the user by adopting a relative majority voting method.
Further, the user information comprises user account information and user behavior information; the data acquisition unit is also used for matching user account information from a Uniform Resource Locator (URL) of the broadband network data through a regular expression of the user account; and acquiring user behavior information from the URL of the broadband network data based on the primary domain name.
Further, the user characteristics comprise user account characteristics and user behavior characteristics; the feature extractor is further configured to determine the user account characteristics and the user behavior characteristics based on the number of user accounts, the richness of user behavior and/or the active time period of the access point.
According to another aspect of the present invention, there is also provided an apparatus for determining broadband network access point information, including: a memory; and a processor coupled to the memory, the processor configured to perform the method as described above based on instructions stored in the memory.
According to another aspect of the present invention, a computer-readable storage medium is also proposed, on which computer program instructions are stored, which instructions, when executed by a processor, implement the steps of the above-described method.
Compared with the prior art, the method and the device have the advantages that the user information is extracted from the massive broadband network data, the user characteristics are extracted according to the user information, then the broadband network access point information corresponding to the user is preliminarily predicted by utilizing the constructed learning model, and the efficiency of confirming the broadband network access point information is improved.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
The invention will be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart illustrating an embodiment of a method for determining broadband network access point information according to the present invention.
Fig. 2 is a flowchart illustrating another embodiment of the method for determining broadband network access point information according to the present invention.
Fig. 3 is a flowchart illustrating a method for determining broadband network access point information according to still another embodiment of the present invention.
Fig. 4 is a schematic diagram of the present invention for re-labeling the labeled types of the broadband network access points.
Fig. 5 is a diagram illustrating the result voting operation mechanism of the present invention.
Fig. 6 is a schematic structural diagram of an embodiment of the apparatus for determining broadband network access point information according to the present invention.
Fig. 7 is a schematic structural diagram of another embodiment of the apparatus for determining broadband network access point information according to the present invention.
Fig. 8 is a schematic structural diagram of an apparatus for determining broadband network access point information according to still another embodiment of the present invention.
Fig. 9 is a schematic structural diagram of an apparatus for determining broadband network access point information according to another embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
Fig. 1 is a flowchart illustrating an embodiment of a method for determining broadband network access point information according to the present invention.
In step 110, the user information in the broadband network data is extracted. The user information may include user account information, user behavior information, and the like.
In step 120, user characteristics are extracted from the user information. For example, user account characteristics and user behavior characteristics are obtained from the user account information and the user behavior information. The user account characteristics may include the account login frequency of the access point, the account entropy, the day-night ratio of account logins, the weekday-weekend ratio of account logins, and the like. The user behavior characteristics may include the total PV (page view) volume of the access point, the browsing richness, the day-night ratio of PV volume, the weekday-weekend ratio of PV volume, and the like.
In step 130, the user characteristics are used as input quantity of a classifier, and the classifier is used to preliminarily predict broadband network access point information corresponding to the user, wherein the classifier is trained by using the labeled broadband network access point information to construct a machine learning model. For example, the broadband network access point with the type labeled is used as a sample, the classifier is trained, the user characteristics are used as the input quantity of the classifier, and the type of the broadband network access point corresponding to the user is preliminarily predicted.
In the embodiment, the user information is extracted from the massive broadband network data, the user characteristics are extracted according to the user information, and then the broadband network access point information corresponding to the user is preliminarily predicted by using the constructed learning model, so that the efficiency of confirming the broadband network access point information is improved.
Fig. 2 is a flowchart illustrating another embodiment of the method for determining broadband network access point information according to the present invention.
In step 210, user account information and user behavior information are extracted from the broadband network data. For example, accounts, login times, browsing behavior, internet access devices, and the like are collected from massive broadband network data. In one embodiment, user account information, such as a QQ number, a mobile phone number, or a microblog account, is matched from the URL (Uniform Resource Locator) of the broadband network data through a regular expression for the user account; or user behavior information is acquired from the URL based on the primary domain name, for example, the websites and apps the user visits, the browsing time, and the devices used are collected from the URLs using summarized website and app rules.
In step 220, the labeled broadband network access point information is re-labeled according to the strong assumption condition, and training weights are assigned. For example, if the number of devices under the broadband network access point is greater than or equal to a first threshold, the type of the broadband network access point is labeled as enterprise and a first training weight is assigned; if the number of devices is less than or equal to a second threshold, the type is labeled as home and a second training weight is assigned; if the number of devices is greater than the second threshold and less than the first threshold, the labeled type is kept and a third training weight is assigned; wherein the first training weight and the second training weight are greater than the third training weight. For example, the first threshold is 50 and the second threshold is 1. The corrected samples are given a higher weight, so that misclassifying one of them multiplies the resulting error, which reduces the influence of mislabeled samples on training.
In step 230, user account characteristics and user behavior characteristics are determined based on the user account information and the user behavior information. The characteristics are derived from person-based account login behavior and access-point-based group behavior, and mainly describe the number of people, the amount of behavior, and how behavior is distributed across different time periods. The characteristics are built on common sense and everyday routines, are concise and easy to understand, and their extraction is suited to distributed computation over massive network data.
In step 240, the newly labeled broadband network access point information is used to train a classifier, the user account number characteristics and the user behavior characteristics are used as the input quantity of the classifier, and the binary classification model is used to preliminarily predict the broadband network access point information corresponding to the user.
In this embodiment, statistics are gathered on the user information, part of the broadband types are corrected according to the strong assumption, and after the user network behavior characteristics are extracted a binary classification model is constructed on the weighted sample set, so that the types of broadband network access points can be identified more accurately and data are provided for filling in missing information and correcting existing information.
Fig. 3 is a flowchart illustrating a method for determining broadband network access point information according to still another embodiment of the present invention.
In step 310, user account information and user behavior information are extracted from the broadband network data. A broadband network has no definite identifier for an individual user, so accounts are generally extracted from URL strings by rules; the accounts include but are not limited to QQ numbers, mobile phone numbers, microblog accounts, and the like. In addition, the type of each account, the time at which it occurred, and so on should be preserved.
Accounts are matched from the URL using a regular expression for each account type, as shown in Table 1. The regular expression for a mobile phone number is ([1][3-8]{1}\d{9}($|[^0-9]{1})). Taking iQIYI (iqiyi) as an example, one rule is iqiyi.comname=([1][3-8]{1}\d{9}($|[^0-9]{1})), with which a mobile phone account can be found in the user's click records on iQIYI. The regular expression for QQ is simply a string of digits; when such a string appears in a QQ-related URL, it is very likely a QQ number, for example: qq.compt2gguin=o(\d+).
Account type           Example rule
QQ                     qq.compt2gguin=o(\d+)
Mobile phone number    iqiyi.comname=([1][3-8]{1}\d{9}($|[^0-9]{1}))
Microblog number       api.weibo.cn(.*)friends_timeline(.*)fromlog=(\d{10,})
Table 1
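As an illustration of the rule-based matching in Table 1, the following is a minimal Python sketch. The rule dictionary, the function name `extract_accounts`, the example hosts, and the `name=` and `pt2gguin=o` parameter positions are assumptions made for this example; only the mobile-number pattern itself follows the expression given above.

```python
import re

# Hypothetical rules modeled on Table 1; parameter names and hosts are
# illustrative assumptions, not the operator's actual rule set.
ACCOUNT_RULES = {
    "qq": re.compile(r"pt2gguin=o(\d+)"),
    # Mobile number: 1, then a digit 3-8, then nine digits,
    # followed by the end of the string or a non-digit.
    "mobile": re.compile(r"name=([1][3-8]\d{9})(?:$|[^0-9])"),
}


def extract_accounts(url):
    """Return (account_type, account) pairs matched in one URL."""
    found = []
    for account_type, pattern in ACCOUNT_RULES.items():
        match = pattern.search(url)
        if match:
            found.append((account_type, match.group(1)))
    return found


print(extract_accounts("http://video.example.com/login?name=13812345678&from=web"))
print(extract_accounts("http://ptlogin.example.com/check?pt2gguin=o0123456789;skey=x"))
```

The trailing `(?:$|[^0-9])` keeps a longer digit string, such as an order number, from being mistaken for an 11-digit mobile number.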
In addition, web access and browsing behavior is also matched by rules, such as rules on the website host. The access time, PV volume, specific host, device used, and so on should be recorded. However, manual clicks must be distinguished from automatically loaded requests (for example, large numbers of js, css, and image requests). The former truly represent the user's behavior, so the behavior time and accessed content obtained from them reflect real user behavior; the latter are machine behavior, huge in volume, and do not reflect the user's real behavior. The data source can therefore be cleaned by filtering out automatically loaded URLs, and on that basis user behavior under the main domain names is recorded by primary domain name. Because the specific content the user accessed does not need to be known, the URLs do not need to be parsed in detail.
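The cleaning step described above could be sketched as follows. The suffix list, the function names, and the "last two host labels" definition of a primary domain are assumptions for illustration; the source does not specify how automatically loaded URLs are recognized.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed list of resource types that a browser requests automatically.
AUTO_LOADED_SUFFIXES = (".js", ".css", ".png", ".jpg", ".jpeg", ".gif", ".ico")


def is_auto_loaded(url):
    """Treat requests for static resources as machine-generated."""
    return urlparse(url).path.lower().endswith(AUTO_LOADED_SUFFIXES)


def primary_domain(url):
    """Naive primary domain: the last two labels of the host name.

    This simplification does not handle multi-part suffixes such as .com.cn.
    """
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def count_user_clicks(urls):
    """Count manual clicks per primary domain after dropping auto-loaded URLs."""
    return Counter(primary_domain(u) for u in urls if not is_auto_loaded(u))


urls = [
    "http://www.example.com/news/1.html",
    "http://static.example.com/app.js?v=3",
    "http://m.example.org/video?id=7",
    "http://www.example.com/img/logo.png",
]
print(count_user_clicks(urls))
```

Only the host is inspected when recording behavior, which matches the remark that the URL does not need to be parsed in detail.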
In step 320, the labeled broadband network access point types are re-labeled according to the strong assumption condition, and the training weight is given. The specific implementation may be as shown in fig. 4:
in step 321, the labeled type of the broadband network access point is obtained through the CRM table.
In step 322, the number of devices under the broadband network access point is calculated.
In step 323, it is determined whether the number of devices is greater than or equal to 50, if so, step 324 is performed, otherwise, step 325 is performed.
At step 324, the broadband network access point type is determined to be enterprise and given a higher weight.
In step 325, it is determined whether the number of devices is less than or equal to 1, if yes, step 326 is performed, otherwise, step 327 is performed.
At step 326, the broadband network access point type is determined to be home and given a higher weight.
At step 327, the original broadband network access point type is maintained and given a lower weight.
That is, a broadband connection with an extremely large number of devices under its access point is taken to be a non-home network such as an enterprise, a connection with an extremely small number of devices is taken to be a home network, and the weights of the access point types are readjusted accordingly. By changing the weights, and with them the penalty for misclassification, the influence of human labeling errors on the model is reduced and the originally labeled data are corrected.
In step 330, user account characteristics and user behavior characteristics are determined based on the number of user accounts, the richness of user behavior and the active time period of the access point.
For example, the user account characteristics include the account login frequency of the access point. In general, a home access point has 1 to 4 members and a non-home access point has more, and the login frequency can reflect the number of members. For one access point, the count is incremented by one for every account login record.
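The decision flow of steps 321 to 327 can be sketched in a few lines of Python. The thresholds 50 and 1 come from the example above; the concrete weight values and the function name `relabel` are assumptions, since the source only requires that corrected samples receive the higher weight.

```python
ENTERPRISE_THRESHOLD = 50  # first threshold from the example above
HOME_THRESHOLD = 1         # second threshold from the example above
HIGH_WEIGHT = 2.0          # assumed weight for samples corrected by the strong assumption
LOW_WEIGHT = 1.0           # assumed weight for samples that keep the CRM label


def relabel(crm_label, device_count):
    """Return (label, training_weight) for one access point."""
    if device_count >= ENTERPRISE_THRESHOLD:
        return "enterprise", HIGH_WEIGHT
    if device_count <= HOME_THRESHOLD:
        return "home", HIGH_WEIGHT
    # Between the thresholds the original label is kept but trusted less.
    return crm_label, LOW_WEIGHT


print(relabel("home", 120))       # many devices: relabeled as enterprise
print(relabel("enterprise", 1))   # a single device: relabeled as home
print(relabel("enterprise", 12))  # in between: CRM label kept, lower weight
```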
The account characteristics also include the account entropy, which represents the activity of the access point. For one access point, the entropy is calculated over the accounts that appear under it and their frequencies, as follows:
$$F_2(x) = -\sum_{i} P(a_i) \log P(a_i)$$
where P(a_i) is the frequency with which account a_i occurs under access point x, and F_2(x) is the account entropy.
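To make the account entropy concrete, here is a small sketch that computes it from a list of login records for one access point. The function name, the input format (one account identifier per login record), and the use of the natural logarithm are assumptions; the source does not state the logarithm base.

```python
import math
from collections import Counter


def account_entropy(login_accounts):
    """Shannon entropy of the account frequencies under one access point."""
    counts = Counter(login_accounts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log(1 / p) equals -p * log(p) and avoids returning -0.0.
    return sum((n / total) * math.log(total / n) for n in counts.values())


print(account_entropy(["a", "a", "a", "a"]))  # one account only
print(account_entropy(["a", "b", "c", "d"]))  # four equally active accounts
```

A single dominant account gives an entropy near zero, while many equally active accounts give a high value, which is consistent with the idea that non-home access points have more members.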
The account characteristics also include the day-night ratio of account logins. Users mostly use an enterprise network while working during the day and use a home access point after returning home at night, so the daytime and nighttime account login records can reflect working time and rest time. For example, for one access point, the ratio of the sum of login records at the 7 hourly points of 10:00-16:00 to the sum of login records at the 6 hourly points of 19:00-24:00 is calculated, that is,
$$F_3(x) = \frac{\sum_{i} N_{\text{day}}(a_i)}{\sum_{i} N_{\text{night}}(a_i)}$$
where N(a_i) is the number of logins of account a_i under access point x, the subscripts day and night restrict the count to 10:00-16:00 and 19:00-24:00 respectively, and F_3(x) is the day-night ratio of account logins.
The account characteristics also include the weekday-weekend ratio of account logins. Account logins are more active under non-home access points on weekdays, while network access under home access points is greater on weekends, so for one access point the ratio of the sum of login records on the 5 weekdays to the sum of login records on the two weekend days is calculated, that is,
$$F_4(x) = \frac{\sum_{i} N_{\text{weekday}}(a_i)}{\sum_{i} N_{\text{weekend}}(a_i)}$$
where N(a_i) is the number of logins of account a_i under access point x, the subscripts weekday and weekend restrict the count to the 5 weekdays and the two weekend days respectively, and F_4(x) is the weekday-weekend ratio of account logins.
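The two time-based ratios can be sketched as below. The function names, the use of `datetime` timestamps as login records, the exact hour buckets, and returning `None` when the denominator is zero are all assumptions for illustration; the source gives only the time windows.

```python
from datetime import datetime

DAY_HOURS = range(10, 17)    # assumption: hourly buckets 10 through 16
NIGHT_HOURS = range(19, 24)  # assumption: 19:00 up to midnight


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def day_night_ratio(login_times):
    day = sum(1 for t in login_times if t.hour in DAY_HOURS)
    night = sum(1 for t in login_times if t.hour in NIGHT_HOURS)
    return ratio(day, night)


def weekday_weekend_ratio(login_times):
    weekday = sum(1 for t in login_times if t.weekday() < 5)   # Monday-Friday
    weekend = sum(1 for t in login_times if t.weekday() >= 5)  # Saturday-Sunday
    return ratio(weekday, weekend)


logins = [
    datetime(2017, 7, 24, 10, 30),  # Monday, daytime
    datetime(2017, 7, 24, 15, 0),   # Monday, daytime
    datetime(2017, 7, 25, 20, 0),   # Tuesday, night
    datetime(2017, 7, 29, 21, 0),   # Saturday, night
]
print(day_night_ratio(logins), weekday_weekend_ratio(logins))
```

In practice a smoothing constant might be preferred over `None` for access points with no night or weekend logins; that choice is left open here.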
The user behavior characteristics may include the total PV volume of the access point. The amount of group behavior in an enterprise network should generally be larger than that in a home network; for example, for one access point, the total number of clicks within a specified time can be counted.
The user behavior characteristics may also include the browsing richness. Group browsing behavior in an enterprise network should generally be richer than that in a home network; for example, for one access point, the number of distinct websites or apps involved within a specified time can be counted.
The user behavior characteristics may also include the day-night ratio of PV volume; for example, for one access point, the ratio of the PV volume at the 7 hourly points of 10:00-16:00 to the PV volume at the 6 hourly points of 19:00-24:00 can be calculated.
The user behavior characteristics may also include the weekday-weekend ratio of PV volume; for example, for one access point, the ratio of the PV volume on the 5 weekdays to the PV volume on the two weekend days is calculated.
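The count-based behavior characteristics are simple aggregations. A minimal sketch, assuming each cleaned click record is a `(site, hour)` tuple (a hypothetical format chosen for this example):

```python
def pv_total(clicks):
    """Total number of page views (clicks) recorded for one access point."""
    return len(clicks)


def browsing_richness(clicks):
    """Number of distinct websites or apps among the clicks."""
    return len({site for site, _hour in clicks})


clicks = [
    ("example.com", 10),
    ("example.com", 11),
    ("example.org", 20),
    ("example.net", 21),
]
print(pv_total(clicks), browsing_richness(clicks))
```

Both are per-access-point counts with no shared state, so they map naturally onto a group-by-access-point aggregation in a distributed job.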
In one embodiment, step 330 may be performed first, and then step 320 may be performed, or step 320 and step 330 may be performed simultaneously.
At step 340, the classifier is trained with the re-labeled broadband network access point type. Considering the performance requirement of rapid processing of mass data in practical application, a Logistic regression model can be selected, and the loss function can be set as:
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} w_i \cdot \text{cost}(h(x_i), y_i)$$
$$\text{cost}(h(x_i), y_i) = -y_i \log(h(x_i)) - (1 - y_i) \log(1 - h(x_i))$$
Here the re-labeled broadband network access point types serve as the samples and m is the number of samples; w_i is the weight of the i-th sample; y_i is the correct label of sample i; h(x_i) is the prediction for sample i; cost is the error between the prediction for the i-th sample and the correct result; and J(θ) is the loss value, which measures how inaccurate the predictions are: the smaller the loss, the closer the model is to the real situation.
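The effect of the sample weights on the logistic loss can be seen in a short pure-Python sketch. The function names, the sigmoid hypothesis without a bias term, and the 1/m normalization are assumptions; the per-sample cost follows the expression given above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x):
    """h(x): predicted probability that the access point is an enterprise."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))


def cost(h, y):
    """Per-sample cross-entropy between prediction h and label y (0 or 1)."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)


def weighted_loss(theta, samples, labels, weights):
    """Mean of w_i * cost(h(x_i), y_i); the 1/m normalization is an assumption."""
    total = sum(w * cost(predict(theta, x), y)
                for x, y, w in zip(samples, labels, weights))
    return total / len(samples)


samples = [[1.0, 0.2], [1.0, 3.5]]
labels = [0, 1]
theta = [0.0, 0.0]  # untrained model: every prediction is 0.5
print(weighted_loss(theta, samples, labels, [1.0, 1.0]))
print(weighted_loss(theta, samples, labels, [1.0, 2.0]))
```

With the second sample weighted 2.0, its error contributes twice as much to the loss, which is how a corrected sample is made more costly to misclassify.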
In step 350, the user account characteristics and the user behavior characteristics are used as input quantities of the classifier, and the binary classification model is used for preliminarily predicting the type of the broadband network access point corresponding to the user.
Steps 320-350 are subsequently repeated.
In step 360, based on a time-lapse voting mechanism, the classifier is trained in units of a predetermined time, and the prediction result for the type of the broadband network access point corresponding to the user is updated by a relative majority voting method. To adapt to changes in the data source and capture changes in broadband usage, a sub-classifier is trained each week, and relative majority voting is used, that is, the category receiving the most votes is selected as the final result. As shown in Fig. 5, the specific steps may be as follows: sample extraction, feature extraction, model training and prediction are performed on the broadband access point data with one week as the time window; several weeks form one time period, and one vote is held per period; each new voting result updates the old result, and the change history of the access point's usage is recorded.
Broadband network usage changes over time. For example, the numbers of accounts and behaviors change as the data acquisition rules are updated, and under the interference of holidays such as National Day, the features and the model also change. Training a sub-model every week to replace the old model and predict the new access point type adapts better to data changes, captures broadband network usage information in time, and obtains the latest access point state.
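The time-lapse voting described above can be sketched as follows. This is an illustrative outline under stated assumptions: the weekly predictions are represented as dictionaries keyed by a hypothetical access point ID, the function names are invented for the example, and ties are broken by whichever label was seen first, which the source does not specify.

```python
from collections import Counter


def vote_period(weekly_predictions):
    """Relative majority (plurality) vote over one time period.

    weekly_predictions: list of {access_point_id: label} dicts, one per week.
    """
    ballots = {}
    for week in weekly_predictions:
        for ap, label in week.items():
            ballots.setdefault(ap, Counter())[label] += 1
    # most_common(1) returns the label with the highest vote count.
    return {ap: votes.most_common(1)[0][0] for ap, votes in ballots.items()}


def apply_vote(current, history, new_result):
    """Overwrite the old result and record every change of label."""
    for ap, label in new_result.items():
        if current.get(ap) != label:
            history.setdefault(ap, []).append(label)
        current[ap] = label


weeks = [
    {"ap1": "enterprise", "ap2": "home"},
    {"ap1": "enterprise", "ap2": "home"},
    {"ap1": "home", "ap2": "home"},
]
current, history = {}, {}
apply_vote(current, history, vote_period(weeks))
```

Keeping `history` separate from `current` mirrors the two roles in the text: the latest vote replaces the old result, while the sequence of changes is retained for tracking.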
Because little related work currently exists on correcting broadband information based on network user behavior, the invention uses a machine learning method to automatically identify the recorded broadband type information, supplementing missing information and reducing human intervention. In addition, the invention does not depend entirely on the labeled data; it corrects and weights the samples to reduce the influence of the original erroneous data, thereby actually achieving error correction. The invention also provides a simple and effective feature extraction method that is suitable for distributed computation on big data. Finally, the invention adopts a voting mechanism based on time accumulation, which ensures the adaptability of the classifier to data changes, improves the reliability of the results, and tracks changes of the access points in time.
Fig. 6 is a schematic structural diagram of an embodiment of the apparatus for determining broadband network access point information according to the present invention. The apparatus comprises a data collector 610, a feature extractor 620 and a classifier 630, wherein:
the data collector 610 is used for extracting user information in the broadband network data. The user information may include user account information, user behavior information, and the like.
The feature extractor 620 is configured to extract user features according to the user information. For example, user account characteristics, user behavior characteristics and the like are obtained from the user account information and the user behavior information. The user account characteristics may include the access point account login frequency, account entropy, account login day-to-night ratio, account login weekday-to-weekend ratio and the like, and the user behavior characteristics may include the PV total of the access point, browsing richness, PV day-to-night ratio, PV weekday-to-weekend ratio and the like.
The classifier 630 is configured to use the user characteristics as an input quantity of the classifier, and preliminarily predict broadband network access point information corresponding to the user by using the classifier, where the classifier performs training by using the labeled broadband network access point information to construct a machine learning model. For example, the broadband network access point with the type labeled is used as a sample, the classifier is trained, the user characteristics are used as the input quantity of the classifier, and the type of the broadband network access point corresponding to the user is preliminarily predicted.
In the embodiment, the user information is extracted from the massive broadband network data, the user characteristics are extracted according to the user information, and then the broadband network access point information corresponding to the user is preliminarily predicted by using the constructed learning model, so that the efficiency of confirming the broadband network access point information is improved.
Fig. 7 is a schematic structural diagram of another embodiment of the apparatus for determining broadband network access point information according to the present invention. The apparatus comprises a data collector 710, a sample resetter 720, a feature extractor 730 and a classifier 740, wherein:
the data collector 710 is configured to extract user account information and user behavior information from the broadband network data. For example, account numbers, login times, browsing behavior, internet access devices, and the like are collected from massive broadband network data. In one embodiment, user account information, such as a QQ number, a mobile phone number, or a microblog account, is matched from the URLs of the broadband network data using regular expressions for user accounts; user behavior information may be acquired from the URLs of the broadband network data based on the primary domain name, for example, by using summarized website and APP rules to collect the websites and APPs the user visits, the browsing time, the device used, and the like.
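A minimal sketch of URL-based extraction is given below. The query-parameter names (`qq`, `uin`, `mobile`, `phone`), the digit-length patterns, and the function names are hypothetical, since the source does not list the actual regular expressions or rules. Taking the last two host labels as the primary domain is also a simplification that does not handle multi-part suffixes such as `com.cn`.

```python
import re
from urllib.parse import urlsplit

# Hypothetical account patterns; real rules would be summarized per website/APP.
ACCOUNT_PATTERNS = {
    "qq": re.compile(r"[?&](?:qq|uin)=(\d{5,11})(?=&|$)"),
    "mobile": re.compile(r"[?&](?:mobile|phone)=(1\d{10})(?=&|$)"),
}


def extract_accounts(url):
    """Return (account_kind, account_value) pairs matched in the URL."""
    found = []
    for kind, pattern in ACCOUNT_PATTERNS.items():
        found.extend((kind, value) for value in pattern.findall(url))
    return found


def primary_domain(url):
    """Naive primary domain: the last two labels of the host name."""
    host = urlsplit(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host


url = "http://news.example.com/login?uin=123456&mobile=13800138000"
accounts = extract_accounts(url)
domain = primary_domain(url)
```

Grouping behavior by `primary_domain` gives a per-site key that can feed the browsing richness count, while `extract_accounts` feeds the account-based features.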
The sample resetter 720 is used for re-labeling the labeled broadband network access point information according to the strong hypothesis condition and assigning training weights. For example, if the number of devices of the broadband network access point is greater than or equal to a first threshold, the type of the broadband network access point is labeled as enterprise and a first training weight is assigned; if the number of devices is less than or equal to a second threshold, the type is labeled as home and a second training weight is assigned; if the number of devices is greater than the second threshold and less than the first threshold, the existing label is kept and a third training weight is assigned. The first training weight and the second training weight are greater than the third training weight. For example, the first threshold may be 50 and the second threshold may be 1. The corrected samples are given full weight and therefore higher importance: once such a sample is misclassified, the resulting error is doubled, which reduces the influence of mislabeled samples on training.
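The re-labeling rule can be written as a short function. The thresholds 50 and 1 come from the example in the text; the concrete weight values (2.0 for corrected samples, 1.0 for kept samples) and the function and label names are assumptions chosen only to satisfy the stated requirement that the first and second weights exceed the third.

```python
def relabel(device_count, original_label,
            first_threshold=50, second_threshold=1,
            corrected_weight=2.0, kept_weight=1.0):
    """Return (label, training_weight) under the strong hypothesis condition.

    corrected_weight and kept_weight are hypothetical values; the source only
    requires the first and second weights to exceed the third.
    """
    if device_count >= first_threshold:
        return "enterprise", corrected_weight   # first training weight
    if device_count <= second_threshold:
        return "home", corrected_weight         # second training weight
    return original_label, kept_weight          # third training weight


samples = [(120, "home"), (1, "enterprise"), (8, "enterprise")]
relabeled = [relabel(count, label) for count, label in samples]
```

In this example the first two samples have their recorded labels overridden and receive the higher weight, while the third keeps its recorded label at the lower weight.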
The feature extractor 730 is configured to determine user account features and user behavior features based on the user account information and the user behavior information. For example, user account characteristics and user behavior characteristics are determined based on the number of user accounts, the user behavior richness, and the access point active time period.
The classifier 740 is configured to train using the re-labeled broadband network access point information, use the user account characteristics and the user behavior characteristics as input quantities of the classifier, and preliminarily predict the broadband network access point information corresponding to the user by using a binary classification model.
The apparatus may further include a result voter 750 for training the classifier in units of a predetermined time based on a voting mechanism of time lapse, and updating the prediction result of the type of the broadband network access point corresponding to the user by using a relative majority voting method. In order to adapt to the change of the data source and capture the change information of the broadband use condition, a voting mechanism based on time lapse is adopted, a sub-classifier is trained by taking the week as a unit, and a relative majority voting method is adopted, namely, the classification category of the highest vote is selected as a final result.
In this embodiment, the method for identifying the real use of a broadband network based on user network behavior achieves automatic information completion. In addition, the method weakens the influence of the existing labeled data on model training: the sample data is re-labeled using strong hypothesis conditions, and the influence of manual mislabeling on the model is reduced by changing the weights and penalties, so that errors in the original labeled data are corrected. The method also extracts network user account behavior and browsing behavior from the broadband network data, uses a simple feature extraction method, and is suitable for distributed computation on massive data. Finally, the method and the device adopt a voting mechanism based on time accumulation, which improves the accuracy of the classification results, enhances the adaptability of the model to data changes, and effectively tracks state changes of the broadband access point.
Fig. 8 is a schematic structural diagram of an apparatus for determining broadband network access point information according to still another embodiment of the present invention. The apparatus includes a memory 810 and a processor 820. Wherein:
the memory 810 may be a magnetic disk, flash memory, or any other non-volatile storage medium. The memory is used to store instructions in the embodiments corresponding to fig. 1-3.
Processor 820 is coupled to memory 810 and may be implemented as one or more integrated circuits, such as a microprocessor or microcontroller. The processor 820 is configured to execute instructions stored in the memory to improve the efficiency of broadband network access point information validation.
In one embodiment, as also shown in Fig. 9, the apparatus 900 includes a memory 910 and a processor 920. The processor 920 is coupled to the memory 910 by a bus 930. The apparatus 900 may also be coupled to an external storage device 950 via a storage interface 940 in order to retrieve external data, and may also be coupled to a network or another computer system (not shown) via a network interface 960, which will not be described in detail here.
In the embodiment, the data instruction is stored in the memory, and the processor processes the instruction, so that the type of the broadband network access point can be more accurately identified, and data is provided for filling missing information.
In another embodiment, a computer-readable storage medium has stored thereon computer program instructions which, when executed by a processor, implement the steps of the method in the corresponding embodiment of fig. 1-3. As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Thus far, the present invention has been described in detail. Some details well known in the art have not been described in order to avoid obscuring the concepts of the present invention. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
Although some specific embodiments of the present invention have been described in detail by way of illustration, it should be understood by those skilled in the art that the above illustration is only for the purpose of illustration and is not intended to limit the scope of the invention. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. A method of determining broadband network access point information, comprising:
extracting user information in broadband network data;
extracting user characteristics according to the user information;
training a classifier by using the labeled broadband network access point information to construct a machine learning model;
re-labeling the labeled broadband network access point information according to a strong hypothesis condition, and giving a training weight, wherein if the number of devices of the broadband network access point is greater than or equal to a first threshold value, the type of the broadband network access point is labeled as an enterprise, and the first training weight is given, if the number of devices of the broadband access point is less than or equal to a second threshold value, the type of the broadband network access point is labeled as a family, and the second training weight is given, if the number of devices of the broadband access point is greater than the second threshold value and less than the first threshold value, the labeled type of the broadband network access point is maintained, and a third training weight is given, wherein the first training weight and the second training weight are greater than the third training weight;
and training the classifier by using the re-labeled broadband network access point information, taking the user characteristics as the input quantity of the classifier, and preliminarily predicting the broadband network access point information corresponding to the user by using a binary classification model.
2. The method of claim 1, further comprising:
and training the classifier by taking the preset time as a unit based on a voting mechanism of time lapse, and updating the prediction result of the broadband network access point information corresponding to the user by adopting a relative majority voting method.
3. The method according to claim 1 or 2, wherein the user information comprises user account information and user behavior information;
the extracting the user information in the broadband network data comprises the following steps:
matching the user account information from a Uniform Resource Locator (URL) of the broadband network data through a regular expression of the user account;
and acquiring the user behavior information from the URL of the broadband network data based on the primary domain name.
4. The method of claim 3, wherein the user characteristics include user account characteristics and user behavior characteristics;
determining the user account characteristics and the user behavior characteristics based on the number of user accounts, the user behavior richness and/or the access point active time period.
5. An apparatus for determining broadband network access point information, comprising:
a data collector for extracting user information in the broadband network data;
a feature extractor for extracting user features according to the user information;
a sample resetter for re-labeling the labeled broadband network access point information according to a strong hypothesis condition and giving a training weight, wherein if the number of devices of the broadband network access point is greater than or equal to a first threshold value, the type of the broadband network access point is labeled as an enterprise and a first training weight is given, if the number of devices of the broadband access point is less than or equal to a second threshold value, the type of the broadband network access point is labeled as a family and a second training weight is given, and if the number of devices of the broadband access point is greater than the second threshold value and less than the first threshold value, the labeled type of the broadband network access point is maintained and a third training weight is given, wherein the first training weight and the second training weight are greater than the third training weight;
and a classifier for training by using the labeled broadband network access point information to construct a machine learning model, training by using the re-labeled broadband network access point information, taking the user characteristics as the input quantity of the classifier, and preliminarily predicting the broadband network access point information corresponding to the user by using a binary classification model.
6. The apparatus of claim 5, further comprising:
and the result voter is used for training the classifier by taking the preset time as a unit based on a voting mechanism of time lapse and updating the prediction result of the broadband network access point information corresponding to the user by adopting a relative majority voting method.
7. The apparatus according to claim 5 or 6, wherein the user information comprises user account information and user behavior information;
the data collector is also used for matching the user account information from the uniform resource locator URL of the broadband network data through the regular expression of the user account; and acquiring the user behavior information from the URL of the broadband network data based on the primary domain name.
8. The apparatus of claim 7, wherein the user characteristics comprise user account characteristics and user behavior characteristics;
the feature extractor is further configured to determine the user account features and the user behavior features based on a number of user accounts, a user behavior richness, and/or an access point active time period.
9. An apparatus for determining broadband network access point information, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the method of any of claims 1-4 based on instructions stored in the memory.
10. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 4.
CN201710635652.6A 2017-07-31 2017-07-31 Method and device for determining broadband network access point information Active CN110020234B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710635652.6A CN110020234B (en) 2017-07-31 2017-07-31 Method and device for determining broadband network access point information

Publications (2)

Publication Number Publication Date
CN110020234A CN110020234A (en) 2019-07-16
CN110020234B true CN110020234B (en) 2021-09-03

Family

ID=67186021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710635652.6A Active CN110020234B (en) 2017-07-31 2017-07-31 Method and device for determining broadband network access point information

Country Status (1)

Country Link
CN (1) CN110020234B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8205239B1 (en) * 2007-09-29 2012-06-19 Symantec Corporation Methods and systems for adaptively setting network security policies
CN106503015A (en) * 2015-09-07 2017-03-15 国家计算机网络与信息安全管理中心 A kind of method for building user's portrait

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
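A Characterization of Broadband User Behavior and Their E-Business Activities; Humberto T. Marques Neto; ACM; 2004-12-31; full text *
Research on Network User Classification Based on Multidimensional Features (根据多维特征的网络用户分类研究); 窦伊男; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2010-11-15; pp. 34-112 *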

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant