CN102833085B - Communication network message classification system and method based on massive user behavior data
- Publication number: CN102833085B
- Application number: CN201110162097.2A
- Authority: CN (China)
- Prior art keywords: message, data, classification model, classification algorithm, communication network
- Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention provides a communication network message classification system and method based on massive user behavior data. The system comprises a user data acquisition system that transfers the collected data to a data cleansing module. The data cleansing module cleans the data, extracts message features, generates a feature matrix, and passes it to a classification algorithm module. The classification algorithm module and the classification model exchange data with each other, and the classification model outputs, through a model output module, the final model used for comparison with messages. The system and method can accurately identify all kinds of messages, meet the fine-grained data requirements of message analysis, and, through message classification, enable detailed analysis of user behavior data, including users' access and search data.
Description
Technical field
The invention relates to the analysis of communication network messages produced when massive numbers of users access the network through various network devices and terminals. Message features are derived from user behavior, and data mining and machine learning techniques are used to classify and predict communication network messages correctly. In particular, the invention relates to a communication network message classification system and method based on massive user behavior data.
Background
Most traditional message classification relies on rule-based systems: the keywords appearing in different messages are counted to form a rule base, and when a new message arrives it is matched against the rule base to obtain its general category.
The shortcomings of this method are obvious: (1) because a large number of messages exist, a highly accurate rule base cannot be obtained; (2) rules may be duplicated across different rule bases, so the matching strategy may yield inaccurate message classifications; (3) when the number of messages is huge, the matching strategy cannot meet timeliness requirements.
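The rule-based baseline described above can be pictured with a minimal sketch. The rule base contents, the category names, and the function name `classify_by_rules` are illustrative assumptions; only the keywords `doubleclick` and `alimama` appear elsewhere in this document.

```python
# Hypothetical rule base: category -> keywords (illustrative values only).
RULE_BASE = {
    "search": ["search", "query="],
    "advertisement": ["doubleclick", "alimama"],
    "download": [".zip", ".exe", "download"],
}


def classify_by_rules(url: str) -> str:
    """Return the first category whose keyword occurs in the URL, else 'unknown'."""
    lowered = url.lower()
    for category, keywords in RULE_BASE.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "unknown"


# A URL that matches keywords of two categories is assigned to whichever
# rule happens to be checked first, which illustrates shortcoming (2).
ambiguous = classify_by_rules("http://example.com/download/search.zip")
```

Every message is compared against every keyword, so the cost grows with both the rule base and the message volume, which is the timeliness problem named in shortcoming (3).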
Summary of the invention
The object of the invention is to provide a communication network message classification system and method based on massive user behavior data. The system and method can accurately identify all kinds of messages, meet the fine-grained data requirements of message analysis, and, through message classification, enable detailed analysis of user behavior data, including users' access and search data.
The technical solution of the invention is as follows:
A communication network message classification system based on massive user behavior data comprises a user data acquisition system. The user data acquisition system transfers the collected data to a data cleansing module; the data cleansing module cleans the data, extracts message features, generates a feature matrix, and passes it to a classification algorithm module; the classification algorithm module and the classification model exchange data with each other; and the classification model outputs, through a model output module, the final model used for comparison with messages.
The user data acquisition module stores the data collected from the network in a user data storage system.
The classification algorithm module also receives data from the training dataset, and the classification model also receives verification data from the evaluation dataset.
A communication network message classification method based on massive user behavior data classifies messages as follows:
(1) Import the information in the user data acquisition module into the data cleansing module to clean the user data, extract the features of the users' communication network messages, generate a feature matrix, and import it into the classification algorithm module to generate a classification model;
(2) At the same time, manually label the category of each communication network message to build a training dataset and an evaluation dataset. The feature matrix generated from the training dataset is also input to the classification algorithm module, which learns a classification model of messages from the training dataset. The feature matrix generated from the evaluation dataset is input to the intermediate classification model, the model output is checked against the manual labels, and the accuracy of the model is judged from the resulting precision and recall;
(3) Feed the parameters obtained from verifying the classification model back to the classification algorithm module, and optimize the classification algorithm module continuously to improve the robustness of the system and the accuracy of the model under real, complex conditions;
(4) Build the final model and output it through the model output module for use with new messages, predicting the category of communication network messages.
The manually labeled network message categories include search engine messages, web browsing messages, resource download page messages, and advertising material messages.
The user data acquisition module collects user behavior data and stores the information in the user data storage system.
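Step (2) judges the model by comparing its output with the manual labels. A minimal sketch of that comparison follows, assuming the two measures named in the text are per-category precision and recall; the function name `precision_recall` and the sample labels are hypothetical.

```python
def precision_recall(predicted, labeled, category):
    """Per-category precision and recall of model output against manual labels."""
    pairs = list(zip(predicted, labeled))
    true_pos = sum(1 for p, y in pairs if p == category and y == category)
    false_pos = sum(1 for p, y in pairs if p == category and y != category)
    false_neg = sum(1 for p, y in pairs if p != category and y == category)
    # Guard against division by zero when a category never occurs.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical model output and manual labels for four messages.
model_output = ["ad", "ad", "search", "browse"]
manual_labels = ["ad", "search", "search", "ad"]
ad_precision, ad_recall = precision_recall(model_output, manual_labels, "ad")
```

Computing the pair separately for each of the four message categories shows which category the model confuses, which a single overall score would hide.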
The technical effects of the invention are:
Communication network messages contain a large number of diverse message types. To analyze and mine these messages in depth, each type of message must be identified correctly. Because the data volume is huge, completing this task within the target time and at the target accuracy is very difficult. Through detailed analysis of communication network messages, the invention extracts message features according to user behavior and then uses data mining and machine learning techniques to build a complete system for accurately identifying all kinds of messages, covering the entire flow from collection of the original messages to final online use, thereby ensuring accurate identification of messages within the target time.
Description of the drawings
Fig. 1 is a flow chart of the steps of the communication network message classification system and method based on massive user behavior data according to the invention.
Embodiment
The invention is further described below with reference to the accompanying drawing.
As shown in Fig. 1, a communication network message classification system based on massive user behavior data comprises a user data acquisition system. The user data acquisition system transfers the collected data to a data cleansing module; the data cleansing module cleans the data, extracts message features, generates a feature matrix, and passes it to a classification algorithm module; the classification algorithm module and the classification model exchange data with each other; and the classification model outputs, through a model output module, the final model used for comparison with messages.
The user data acquisition module stores the data collected from the network in a user data storage system.
The classification algorithm module also receives data from the training dataset, and the classification model also receives verification data from the evaluation dataset.
A communication network message classification method based on massive user behavior data classifies messages as follows:
(1) Import the information in the user data acquisition module into the data cleansing module to clean the user data, extract the features of the users' communication network messages, generate a feature matrix, and import it into the classification algorithm module to generate a classification model;
(2) At the same time, manually label the category of each communication network message to build a training dataset and an evaluation dataset. The feature matrix generated from the training dataset is also input to the classification algorithm module, which learns a classification model of messages from the training dataset. The feature matrix generated from the evaluation dataset is input to the intermediate classification model, the model output is checked against the manual labels, and the accuracy of the model is judged from the resulting precision and recall;
(3) Feed the parameters obtained from verifying the classification model back to the classification algorithm module, and optimize the classification algorithm module continuously to improve the robustness of the system and the accuracy of the model under real, complex conditions;
(4) Build the final model and output it through the model output module for use with new messages, predicting the category of communication network messages.
The manually labeled network message categories include search engine messages, web browsing messages, resource download page messages, and advertising material messages.
The user data acquisition module collects user behavior data and stores the information in the user data storage system.
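The four steps above can be sketched end to end. The patent does not name a specific learning algorithm, so a nearest-centroid classifier is used here purely as a stand-in; the three feature columns, the sample values, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def train_centroids(feature_matrix, labels):
    """Learn one mean feature vector (centroid) per manually labeled category."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(feature_matrix, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        for j, value in enumerate(row):
            sums[label][j] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(model, row):
    """Return the category whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], row))


# Hypothetical feature columns: [has_ad_keyword, is_leaf_node, direct_input_ratio]
train_x = [[1, 1, 0.05], [1, 1, 0.10], [0, 0, 0.80], [0, 0, 0.60]]
train_y = ["ad material", "ad material", "search engine", "search engine"]
eval_x = [[1, 1, 0.00], [0, 0, 0.70]]
eval_y = ["ad material", "search engine"]

model = train_centroids(train_x, train_y)            # steps (1)-(2): learn the model
eval_pred = [predict(model, row) for row in eval_x]  # step (2): check against manual labels
agreement = sum(p == y for p, y in zip(eval_pred, eval_y)) / len(eval_y)
new_message_category = predict(model, [1, 1, 0.02])  # step (4): classify a new message
```

Step (3), the feedback to the classification algorithm module, would wrap the training and evaluation calls in a loop that adjusts parameters until the agreement with the manual labels stops improving.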
Optimization process of the classification algorithm module: the classification algorithm module receives the message classification feature matrices generated by computer and by manual labeling and generates a classification model. The classification model receives the message classification feature matrix used for verification, which is generated from the manually input evaluation dataset. The classification model then feeds the verified data back to the classification algorithm module, so that the module is optimized and classifies more accurately afterwards.
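One way to picture this feedback is as parameter selection driven by the evaluation dataset. The sketch below is an assumption, not the patented procedure: it uses a k-nearest-neighbour vote over a single hypothetical feature (the direct-input ratio) and feeds the evaluation score of each candidate `k` back to choose the best one.

```python
from collections import Counter


def knn_predict(train_rows, k, x):
    """Classify x by majority vote among the k nearest training samples (1-D feature)."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def evaluate(train_rows, eval_rows, k):
    """Fraction of evaluation samples whose prediction matches the manual label."""
    correct = sum(1 for x, label in eval_rows if knn_predict(train_rows, k, x) == label)
    return correct / len(eval_rows)


def optimize_k(train_rows, eval_rows, candidates):
    """Feed evaluation results back: keep the candidate parameter that scores best."""
    scores = {k: evaluate(train_rows, eval_rows, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Hypothetical (direct_input_ratio, label) samples; (0.13, "browse") is a noisy label.
train_rows = [(0.02, "ad"), (0.06, "ad"), (0.10, "ad"), (0.13, "browse"),
              (0.50, "browse"), (0.70, "browse"), (0.90, "browse")]
eval_rows = [(0.125, "ad"), (0.08, "ad"), (0.60, "browse"), (0.80, "browse")]
best_k, scores = optimize_k(train_rows, eval_rows, [1, 3, 5])
```

With `k = 1` the noisy training sample misclassifies one evaluation message; larger `k` values outvote it, so the feedback step moves the parameter away from 1.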
The role of the cleansing module is to remove noise from the data. It has two parts: (1) removing unnecessary samples; (2) removing noise information from within individual samples.
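A minimal sketch of the two cleansing parts, assuming the samples are request URLs. Which samples count as unnecessary and which fields count as noise are not stated in the text, so the static-resource suffixes and the query parameter names below are hypothetical choices.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical examples of "unnecessary samples" and "noise information".
STATIC_SUFFIXES = (".css", ".js", ".png", ".gif", ".ico")
NOISE_PARAMS = {"sessionid", "timestamp", "rand"}


def is_unnecessary(url: str) -> bool:
    """Part (1): decide whether a whole sample should be dropped."""
    return urlsplit(url).path.lower().endswith(STATIC_SUFFIXES)


def strip_noise(url: str) -> str:
    """Part (2): drop noisy query parameters and the fragment from a kept sample."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in NOISE_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def clean(urls):
    """Apply both cleansing parts in order."""
    return [strip_noise(url) for url in urls if not is_unnecessary(url)]


cleaned = clean([
    "http://example.com/style.css",
    "http://example.com/page?q=cat&sessionid=123#top",
])
```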
The training dataset comprises two parts: one is the manually labeled network message category, and the other is the feature vector representing the network message, which is generally a sparse vector. To meet the requirements of a specific classification algorithm, the corresponding format conversion can be performed.
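As an illustration of such a conversion, the sketch below stores a feature vector sparsely as an index-to-value mapping and serializes it, together with its label, as a LIBSVM-style text line. The choice of that target format and the function names are assumptions; the text does not name a format.

```python
def dense_to_sparse(row):
    """Keep only the non-zero entries of a feature vector as {index: value}."""
    return {index: value for index, value in enumerate(row) if value != 0}


def to_libsvm_line(label_id: int, sparse: dict) -> str:
    """Serialize a labeled sparse vector as '<label> <index>:<value> ...'."""
    # LIBSVM-style text uses 1-based feature indices in ascending order.
    pairs = " ".join(f"{index + 1}:{value:g}" for index, value in sorted(sparse.items()))
    return f"{label_id} {pairs}".rstrip()


# Hypothetical sample: label id 2 with a four-dimensional feature vector.
sparse_vector = dense_to_sparse([0, 1, 0, 0.25])
line = to_libsvm_line(2, sparse_vector)
```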
Features are the information that can distinguish the various kinds of messages, and they are derived through manual analysis and statistics. For example, the features of an advertisement URL can consist of three parts: (1) it contains particular keywords such as alimama, doubleclick, or ad; (2) it is generally a leaf node of the user's access tree; (3) the proportion of visits in which the user enters it directly is generally small.
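The three advertisement URL features can be sketched as follows. The log layout (a list of referrer and URL pairs), the definition of a leaf node as a URL that never appears as a referrer, and the definition of the direct-input ratio as the share of visits without a referrer are all assumptions made for illustration.

```python
import re

AD_KEYWORDS = {"alimama", "doubleclick", "ad"}  # keywords named in the text


def has_ad_keyword(url: str) -> bool:
    """Feature (1): match keywords on whole tokens so 'download' does not match 'ad'."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    return any(token in AD_KEYWORDS for token in tokens)


def ad_url_features(url, visits):
    """visits: list of (referrer_or_None, url) pairs (hypothetical log layout)."""
    referrers = {referrer for referrer, _ in visits if referrer is not None}
    hits = [referrer for referrer, target in visits if target == url]
    is_leaf = url not in referrers  # feature (2): nothing was reached from this URL
    direct = sum(1 for referrer in hits if referrer is None)
    direct_ratio = direct / len(hits) if hits else 0.0  # feature (3)
    return [int(has_ad_keyword(url)), int(is_leaf), direct_ratio]


visits = [
    (None, "http://news.example.com/"),
    ("http://news.example.com/", "http://ad.doubleclick.net/banner"),
    ("http://news.example.com/", "http://news.example.com/story"),
    ("http://news.example.com/story", "http://ad.doubleclick.net/banner"),
    (None, "http://news.example.com/"),
]
ad_row = ad_url_features("http://ad.doubleclick.net/banner", visits)
home_row = ad_url_features("http://news.example.com/", visits)
```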
The feature matrix is the matrix formed by the feature values of each sample.
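In other words, each row holds one sample and each column one feature. A minimal sketch, assuming URL samples and three hypothetical feature functions:

```python
# Hypothetical feature functions; each maps one sample (a URL) to one value.
FEATURES = [
    lambda url: float("doubleclick" in url or "alimama" in url),  # ad keyword present
    lambda url: float("?" in url),                                # has a query string
    lambda url: float(url.count("/") - 2),                        # path depth
]


def build_feature_matrix(samples, feature_functions):
    """One row per sample, one column per feature function."""
    return [[feature(sample) for feature in feature_functions] for sample in samples]


matrix = build_feature_matrix(
    ["http://ad.doubleclick.net/banner?id=1", "http://example.com/a/b"],
    FEATURES,
)
```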
The performance of a classification system is evaluated in two respects: model accuracy and algorithm efficiency. The key factor affecting model accuracy is the adequacy of the features, including their discriminative power and their number. On the basis of in-depth analysis of massive communication network messages, the invention classifies messages in detail according to user behavior and carefully extracts the features of each kind of message, thereby ensuring the precision of the model and the accuracy of its predictions. In addition, extensive optimization of algorithm efficiency ensures the timeliness of massive data processing.
The above is only a preferred embodiment of the invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (6)
1. A communication network message classification system based on massive user behavior data, characterized in that: it comprises a user data acquisition system; the user data acquisition system transfers the collected data to a data cleansing module; the data cleansing module cleans the data, extracts message features, generates a feature matrix, and passes it to a classification algorithm module; the classification algorithm module and a classification model exchange data with each other; at the same time, the category of each communication network message is labeled manually to build a training dataset and an evaluation dataset; the feature matrix generated from the training dataset is also input to the classification algorithm module; the classification algorithm module learns the classification model of messages from the training dataset; the feature matrix generated from the evaluation dataset is input to the intermediate result of the classification model; the output of the classification model is checked against the manual labels, and the accuracy of the classification model is judged from the resulting precision and recall; the classification algorithm module receives the message classification feature matrices generated by computer and by manual labeling and generates the classification model; the classification model receives the verification message classification feature matrix generated from the manually input evaluation dataset, and then feeds the verified data back to the classification algorithm module; and the classification model outputs, through a model output module, the final model used for comparison with messages.
2. The communication network message classification system based on massive user behavior data according to claim 1, characterized in that: the user data acquisition system stores the data collected from the network in a user data storage system.
3. The communication network message classification system based on massive user behavior data according to claim 1, characterized in that: the classification algorithm module also receives data from the manually input training dataset, and the classification model also receives verification data from the evaluation dataset.
4. A communication network message classification method based on massive user behavior data, characterized in that messages are classified as follows:
(1) import the information in the user data acquisition system into a data cleansing module to clean the user data, extract the features of the users' communication network messages, generate a feature matrix, and import it into a classification algorithm module to generate a classification model;
(2) at the same time, manually label the category of each communication network message to build a training dataset and an evaluation dataset; the feature matrix generated from the training dataset is also input to the classification algorithm module; the classification algorithm module learns the classification model of messages from the training dataset; the feature matrix generated from the evaluation dataset is input to the intermediate result of the classification model; the output of the classification model is checked against the manual labels, and the accuracy of the classification model is judged from the resulting precision and recall;
(3) feed the parameters obtained from verifying the classification model back to the classification algorithm module, and optimize the classification algorithm module continuously to improve the robustness of the system and the accuracy of the model under real, complex conditions; the classification algorithm module is optimized as follows: the classification algorithm module receives the message classification feature matrices generated by computer and by manual labeling and generates the classification model; the classification model receives the verification message classification feature matrix generated from the manually input evaluation dataset, and then feeds the verified data back to the classification algorithm module;
(4) build the final model and output it through the classification model output module for use with new messages, predicting the category of communication network messages.
5. The communication network message classification method based on massive user behavior data according to claim 4, characterized in that: the manually labeled communication network message categories include search engine messages, web browsing messages, resource download page messages, and advertising material messages.
6. The communication network message classification method based on massive user behavior data according to claim 4, characterized in that: the user data acquisition system collects user behavior data and stores the information in a user data storage system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110162097.2A CN102833085B (en) | 2011-06-16 | 2011-06-16 | Communication network message classification system and method based on massive user behavior data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102833085A CN102833085A (en) | 2012-12-19 |
CN102833085B true CN102833085B (en) | 2015-09-16 |
Family
ID=47336064
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110162097.2A Expired - Fee Related CN102833085B (en) | 2011-06-16 | 2011-06-16 | Communication network message classification system and method based on massive user behavior data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102833085B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649455B (en) * | 2016-09-24 | 2021-01-12 | 孙燕群 | Standardized system classification and command set system for big data development |
CN107404398A (en) * | 2017-05-31 | 2017-11-28 | 中山大学 | A kind of networks congestion control judgement system |
CN112016617B (en) * | 2020-08-27 | 2023-12-01 | 中国平安财产保险股份有限公司 | Fine granularity classification method, apparatus and computer readable storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101540048A (en) * | 2009-04-21 | 2009-09-23 | 北京航空航天大学 | Image quality evaluating method based on support vector machine |
CN101853277A (en) * | 2010-05-14 | 2010-10-06 | 南京信息工程大学 | Vulnerability data mining method based on classification and association analysis |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8583416B2 (en) * | 2007-12-27 | 2013-11-12 | Fluential, Llc | Robust information extraction from utterances |
Non-Patent Citations (2)
Title |
---|
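Research on automatic classification technology for Internet web pages; Xie Hua; China Master's Theses Full-text Database, Information Science and Technology; 2007-06-30; page 9, paragraph 1 to page 11, paragraph 5 of the reference document, Fig. 2-1 * |
Liu Bo et al. An improved KNN method and its application in Chinese text classification. Journal of Xihua University (Natural Science Edition). 2008, Vol. 27 (No. 2), full text. * |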
Also Published As
Publication number | Publication date |
---|---|
CN102833085A (en) | 2012-12-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C56 | Change in the name or address of the patentee | ||
CP01 | Change in the name or title of a patent holder | Address after: 100081, Beijing, Zhongguancun, Haidian District South Avenue, No. 18, International Building, Beijing, block 18, B; Patentee after: Izp (China) Network Technology Co. Ltd.; Address before: 100081, Beijing, Zhongguancun, Haidian District South Avenue, No. 18, International Building, Beijing, block 18, B; Patentee before: Beijing IZP Technologies Co., Ltd. |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20150916; Termination date: 20160616 |