CN102833085B - Communication network message classification system and method based on massive user behavior data - Google Patents

Communication network message classification system and method based on massive user behavior data

Info

Publication number
CN102833085B
CN102833085B CN201110162097.2A CN201110162097A CN102833085B CN 102833085 B CN102833085 B CN 102833085B CN 201110162097 A CN201110162097 A CN 201110162097A CN 102833085 B CN102833085 B CN 102833085B
Authority
CN
China
Prior art keywords
message
data
classification model
classification algorithm
communication network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201110162097.2A
Other languages
Chinese (zh)
Other versions
CN102833085A (en)
Inventor
刘晓亮
罗峰
黄苏支
李娜
王琪
张玉波
阎飞飞
刘书良
刘生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Izp (China) Network Technology Co. Ltd.
Original Assignee
BEIJING IZP TECHNOLOGIES Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING IZP TECHNOLOGIES Co Ltd
Priority to CN201110162097.2A
Publication of CN102833085A
Application granted
Publication of CN102833085B
Legal status: Expired - Fee Related
Anticipated expiration

Abstract

The invention provides a communication network message classification system and method based on massive user behavior data. The system comprises a user data acquisition system that passes the collected data to a data cleaning module. The data cleaning module cleans the data, extracts message features, generates a feature matrix, and passes it to a classification algorithm module. The classification algorithm module and the classification model exchange data with each other, and the classification model outputs, through a model output module, the final model against which messages are compared. The system and method can accurately identify the various types of messages and meet the fine-grained data requirements of message analysis; by classifying messages, user behavior data, including users' access and search data, can be analyzed in detail.

Description

Communication network message classification system and method based on massive user behavior data
Technical field
The present invention relates to the analysis of communication network messages generated when massive numbers of users access the network through various network devices and terminals. Message features are derived from user behavior, and data mining and machine learning techniques are used to classify communication network messages correctly. In particular, the invention relates to a communication network message classification system and method based on massive user behavior data.
Background art
Most traditional message classification relies on rule-based systems: the keywords that occur in different messages are counted and compiled into a rule base, and each new message is matched against the rule base to obtain its approximate category.
The shortcomings of this approach are evident: (1) because the number of messages is very large, a highly accurate rule base cannot be obtained; (2) rules may be duplicated across different rule bases, so a matching strategy may yield inaccurate classifications; (3) when the number of messages is huge, a matching strategy cannot meet timeliness requirements.
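The rule-based approach described above, and shortcoming (2) in particular, can be illustrated with a minimal sketch. The rule base, its keywords, and the function name `match_rules` are hypothetical and chosen only to show how a keyword shared by two rule sets produces an ambiguous result.

```python
# Hypothetical keyword rule base: category -> keywords observed in that category.
RULE_BASE = {
    "search": ["search", "query"],
    "advertisement": ["ad", "doubleclick"],
    "download": ["download", "ad"],  # "ad" is duplicated across rule sets
}

def match_rules(message):
    """Return every category whose rule set has a keyword occurring in the message."""
    text = message.lower()
    return [cat for cat, keywords in RULE_BASE.items()
            if any(k in text for k in keywords)]

# A plain download URL also matches the advertisement rules,
# because the substring "ad" occurs inside "download".
print(match_rules("http://example.com/download?file=a.zip"))
```

Every message is also scanned against every rule, which is the source of the timeliness problem in shortcoming (3).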
Summary of the invention
The object of the invention is to provide a communication network message classification system and method based on massive user behavior data. The system and method can accurately identify the various types of messages and meet the fine-grained data requirements of message analysis; by classifying messages, user behavior data, including users' access and search data, can be analyzed in detail.
The technical solution of the present invention is as follows:
A communication network message classification system based on massive user behavior data comprises a user data acquisition system. The user data acquisition system passes the collected data to a data cleaning module. The data cleaning module cleans the data, extracts message features, generates a feature matrix, and passes it to a classification algorithm module. The classification algorithm module and the classification model exchange data with each other, and the classification model outputs, through a model output module, the final model against which messages are compared.
The user data acquisition module stores the data collected from the network in a user data storage system.
The classification algorithm module also receives data from the training data set, and the classification model also receives validation data from the evaluation data set.
A communication network message classification method based on massive user behavior data classifies messages as follows:
(1) import the information in the user data acquisition module into the data cleaning module to clean the user data, extract the features of the users' communication network messages, generate a feature matrix, and import it into the classification algorithm module to generate a classification model;
(2) at the same time, manually label the category of each communication network message to build a training data set and an evaluation data set; the feature matrix generated from the training data set is also input to the classification algorithm module, which learns a classification model of the messages from the training data set; the feature matrix generated from the evaluation data set is input to the intermediate result of the classification model, the model output is checked against the manual labels, and the accuracy of the model is judged from the resulting precision and recall;
(3) feed the parameters obtained after validating the classification model back to the classification algorithm module and continuously optimize the module, so as to improve the robustness of the system under real, complex conditions and the precision of the model;
(4) build the final model and output it through the model output module to be applied to new messages, predicting the category of each communication network message.
The manually assigned network message category labels comprise search engine messages, web page browsing messages, resource download page messages, and advertising material messages.
The user data acquisition module collects user behavior data and stores the information in the user data storage system.
The technical effects of the present invention are as follows:
Communication network messages include a large number of widely varying message types. To analyze and mine these messages in depth, each type of message must be identified correctly. Because the data volume is huge, completing this task within the target time and at the target accuracy is very difficult. The present invention analyzes communication network messages in detail and extracts message features according to user behavior. It then uses data mining and machine learning techniques to build a complete system for accurately identifying the various types of messages, covering the entire flow from the collection of raw messages to final online use, and thereby ensures that messages are accurately identified within the target time.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the communication network message classification system and method based on massive user behavior data of the present invention.
Detailed description
The present invention is further described below with reference to the accompanying drawing.
As shown in Fig. 1, a communication network message classification system based on massive user behavior data comprises a user data acquisition system. The user data acquisition system passes the collected data to a data cleaning module. The data cleaning module cleans the data, extracts message features, generates a feature matrix, and passes it to a classification algorithm module. The classification algorithm module and the classification model exchange data with each other, and the classification model outputs, through a model output module, the final model against which messages are compared.
The user data acquisition module stores the data collected from the network in a user data storage system.
The classification algorithm module also receives data from the training data set, and the classification model also receives validation data from the evaluation data set.
A communication network message classification method based on massive user behavior data classifies messages as follows:
(1) import the information in the user data acquisition module into the data cleaning module to clean the user data, extract the features of the users' communication network messages, generate a feature matrix, and import it into the classification algorithm module to generate a classification model;
(2) at the same time, manually label the category of each communication network message to build a training data set and an evaluation data set; the feature matrix generated from the training data set is also input to the classification algorithm module, which learns a classification model of the messages from the training data set; the feature matrix generated from the evaluation data set is input to the intermediate result of the classification model, the model output is checked against the manual labels, and the accuracy of the model is judged from the resulting precision and recall;
(3) feed the parameters obtained after validating the classification model back to the classification algorithm module and continuously optimize the module, so as to improve the robustness of the system under real, complex conditions and the precision of the model;
(4) build the final model and output it through the model output module to be applied to new messages, predicting the category of each communication network message.
The manually assigned network message category labels comprise search engine messages, web page browsing messages, resource download page messages, and advertising material messages.
The user data acquisition module collects user behavior data and stores the information in the user data storage system.
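The flow from labeled messages to a prediction for a new message can be sketched end to end as follows. This is a minimal illustration, not the patented implementation: the patent does not name a specific classification algorithm, so a nearest-centroid learner is swapped in as a stand-in, cleaning and evaluation are omitted for brevity, and the URLs, keyword features, and function names are all hypothetical.

```python
def extract_features(url):
    """Hypothetical keyword flags; the patent's real feature set is not fully listed."""
    u = url.lower()
    return [
        float("search" in u),
        float("download" in u),
        float("doubleclick" in u or "alimama" in u),
    ]

def train(rows, labels):
    """Stand-in learner (nearest centroid): average the feature vectors per class."""
    grouped = {}
    for x, y in zip(rows, labels):
        grouped.setdefault(y, []).append(x)
    return {y: [sum(col) / len(xs) for col in zip(*xs)] for y, xs in grouped.items()}

def predict(model, x):
    """Return the class whose centroid is closest to x (squared Euclidean distance)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))

# Manually labeled messages using the four categories named in the text.
labeled = [
    ("http://s.example/search?q=a", "search"),
    ("http://news.example/index.html", "browse"),
    ("http://dl.example/download/a.zip", "download"),
    ("http://ad.doubleclick.net/x.gif", "ad"),
]
feature_matrix = [extract_features(u) for u, _ in labeled]
model = train(feature_matrix, [y for _, y in labeled])

# Step (4): apply the final model to a new, unlabeled message.
print(predict(model, extract_features("http://find.example/search?q=b")))
```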
Optimization of the classification algorithm module: the classification algorithm module receives the message classification feature matrices generated by computer and by manual work and generates a classification model; the classification model receives the validation message classification feature matrix generated from the manually input evaluation data set; the classification model then feeds the validated data back to the classification algorithm module, so that the module is optimized and subsequently classifies more accurately.
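One way to read this feedback loop is as parameter selection driven by validation results. The sketch below assumes a deliberately simple classifier, a single score threshold that marks a message as an advertisement, and keeps the candidate threshold that validates best on the evaluation set. The scores, candidate values, and function names are hypothetical; the patent does not specify which parameters are fed back.

```python
def accuracy(threshold, samples):
    """Share of (score, is_ad) samples that the rule 'score >= threshold' labels correctly."""
    hits = sum((score >= threshold) == is_ad for score, is_ad in samples)
    return hits / len(samples)

def optimize(candidates, evaluation_set):
    """Feed evaluation results back: keep the candidate parameter that validates best."""
    best, best_acc = None, -1.0
    for threshold in candidates:
        acc = accuracy(threshold, evaluation_set)
        if acc > best_acc:
            best, best_acc = threshold, acc
    return best, best_acc

# Manually labeled evaluation samples: (model score, is the message an advertisement?)
evaluation_set = [(0.1, False), (0.4, False), (0.6, True), (0.9, True)]
print(optimize([0.2, 0.5, 0.8], evaluation_set))
```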
The role of the cleaning module is to remove noise from the data. Cleaning has two parts: (1) removing unnecessary samples; (2) removing noise information from within individual samples.
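The two parts of cleaning can be sketched as follows, assuming that each sample is a URL. Which samples count as unnecessary and which fields count as noise are not stated in the text, so the static-file suffixes and query parameter names below are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

NOISE_PARAMS = {"sessionid", "timestamp"}   # hypothetical noisy query fields
STATIC_SUFFIXES = (".css", ".js", ".ico")   # hypothetical "unnecessary" samples

def clean(urls):
    """Drop unusable samples, then strip noisy query fields from the rest."""
    cleaned = []
    for url in urls:
        parts = urlsplit(url.strip())
        # Part 1: remove whole samples (malformed URLs, static resources).
        if not parts.netloc or parts.path.lower().endswith(STATIC_SUFFIXES):
            continue
        # Part 2: remove noise inside a sample (session and timing parameters).
        query = [(k, v) for k, v in parse_qsl(parts.query)
                 if k.lower() not in NOISE_PARAMS]
        cleaned.append(urlunsplit(
            (parts.scheme, parts.netloc, parts.path, urlencode(query), "")))
    return cleaned

print(clean([
    "http://a.example/style.css",
    "http://a.example/p?id=3&sessionid=xyz",
    "garbage",
]))
```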
The training data set comprises two parts: one is the manually labeled category of each network message; the other is the feature vector representing the network message, generally expressed as a sparse vector. To meet the requirements of a specific classification algorithm, the corresponding format conversion can be performed.
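As an illustration of a sparse representation and a format conversion, the sketch below stores only the non-zero entries of a feature vector and serializes a labeled sample as a `label index:value` line with 1-based indices, a text layout used by some classification tools. The choice of target format is an assumption; the patent does not name one.

```python
def to_sparse(dense):
    """Keep only non-zero entries as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}

def to_sparse_line(label, sparse):
    """Serialize one labeled sparse vector as 'label index:value ...' (1-based indices)."""
    pairs = " ".join(f"{i + 1}:{v:g}" for i, v in sorted(sparse.items()))
    return f"{label} {pairs}"

sample = to_sparse([0, 0, 1.0, 0, 0.25])
print(sample)
print(to_sparse_line(3, sample))
```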
Features are the pieces of information that discriminate between the message types; they are derived through manual analysis and statistics. For example, the features of an advertisement URL can consist of three parts: (1) it contains particular keywords, such as alimama, doubleclick, or ad; (2) it generally sits at a leaf node of the user's access tree; (3) the proportion of visits in which the user enters it directly is generally small.
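The three advertisement URL features can be sketched as follows, under stated assumptions: the access log is a list of hypothetical `(url, referrer)` pairs, a URL is treated as a leaf of the access tree when it never appears as a referrer, and a visit counts as direct input when its referrer is `None`. Keywords are matched against whole URL tokens, a choice made here so that "ad" does not match inside "download".

```python
import re

AD_KEYWORDS = ("alimama", "doubleclick", "ad")  # keywords quoted in the text

def ad_features(url, visits):
    """visits: (url, referrer) pairs; referrer is None when the user typed the URL."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    has_keyword = any(k in tokens for k in AD_KEYWORDS)
    # Leaf node: no recorded visit was reached from this URL.
    is_leaf = all(ref != url for _, ref in visits)
    # Direct-input ratio: share of this URL's visits that have no referrer.
    own = [ref for u, ref in visits if u == url]
    direct_ratio = sum(ref is None for ref in own) / len(own) if own else 0.0
    return [float(has_keyword), float(is_leaf), direct_ratio]

visits = [
    ("http://news.example/", None),
    ("http://ad.doubleclick.net/banner", "http://news.example/"),
]
print(ad_features("http://ad.doubleclick.net/banner", visits))
print(ad_features("http://news.example/", visits))
```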
The feature matrix is the matrix formed by the feature values of each sample.
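In other words, each row corresponds to one sample and each column to one feature. A minimal sketch, with hypothetical feature functions:

```python
def build_feature_matrix(samples, extractors):
    """One row per sample, one column per feature function."""
    return [[extract(sample) for extract in extractors] for sample in samples]

# Hypothetical features: number of slashes, and whether a query string is present.
extractors = [lambda u: u.count("/"), lambda u: float("?" in u)]
samples = ["http://a.example/x", "http://b.example/s/r?q=1"]

matrix = build_feature_matrix(samples, extractors)
print(matrix)
```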
The performance of a classification system is evaluated in two respects: the precision of the model and the efficiency of the algorithm. The key factor affecting model precision is the adequacy of the features, including their discriminative strength and their number. On the basis of an in-depth analysis of massive communication network messages, the present invention classifies messages in detail according to user behavior and carefully extracts the features of each message type, thereby ensuring the precision of the model and the accuracy of its predictions. In addition, the algorithm has been extensively optimized for efficiency, which ensures the timeliness of massive data processing.
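The precision and recall used to judge the model against the manual labels can be computed per category as sketched below. The labels and predictions are made-up illustration data, and the function name is hypothetical.

```python
def precision_recall(predicted, actual, target):
    """Precision and recall of one category, comparing model output to manual labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == target and a == target for p, a in pairs)  # correctly flagged
    fp = sum(p == target and a != target for p, a in pairs)  # wrongly flagged
    fn = sum(p != target and a == target for p, a in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

predicted = ["ad", "ad", "search", "browse"]
actual = ["ad", "search", "search", "ad"]
print(precision_recall(predicted, actual, "ad"))
```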
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (6)

1. A communication network message classification system based on massive user behavior data, characterized in that: it comprises a user data acquisition system; the user data acquisition system passes the collected data to a data cleaning module; the data cleaning module cleans the data, extracts message features, generates a feature matrix, and passes it to a classification algorithm module; the classification algorithm module and a classification model exchange data with each other; at the same time, the category of each communication network message is labeled manually to build a training data set and an evaluation data set; the feature matrix generated from the training data set is also input to the classification algorithm module, which learns the classification model of the messages from the training data set; the feature matrix generated from the evaluation data set is input to the intermediate result of the classification model, the output of the classification model is checked against the manual labels, and the accuracy of the classification model is judged from the resulting precision and recall; the classification algorithm module receives the message classification feature matrices generated by computer and by manual work and generates the classification model; the classification model receives the validation message classification feature matrix generated from the manually input evaluation data set, and then feeds the validated data back to the classification algorithm module; and the classification model outputs, through a model output module, the final model against which messages are compared.
2. The communication network message classification system based on massive user behavior data according to claim 1, characterized in that: the user data acquisition system stores the data collected from the network in a user data storage system.
3. The communication network message classification system based on massive user behavior data according to claim 1, characterized in that: the classification algorithm module also receives data from the manually input training data set, and the classification model also receives validation data from the evaluation data set.
4. A communication network message classification method based on massive user behavior data, characterized in that messages are classified as follows:
(1) import the information in the user data acquisition system into a data cleaning module to clean the user data, extract the features of the users' communication network messages, generate a feature matrix, and import it into a classification algorithm module to generate a classification model;
(2) at the same time, manually label the category of each communication network message to build a training data set and an evaluation data set; the feature matrix generated from the training data set is also input to the classification algorithm module, which learns the classification model of the messages from the training data set; the feature matrix generated from the evaluation data set is input to the intermediate result of the classification model, the output of the classification model is checked against the manual labels, and the accuracy of the classification model is judged from the resulting precision and recall;
(3) feed the parameters obtained after validating the classification model back to the classification algorithm module and continuously optimize the module, so as to improve the robustness of the system under real, complex conditions and the precision of the model; the classification algorithm module is optimized as follows: the classification algorithm module receives the message classification feature matrices generated by computer and by manual work and generates the classification model; the classification model receives the validation message classification feature matrix generated from the manually input evaluation data set, and then feeds the validated data back to the classification algorithm module;
(4) build the final model and output it through the classification model output module to be applied to new messages, predicting the category of each communication network message.
5. The communication network message classification method based on massive user behavior data according to claim 4, characterized in that: the manually assigned communication network message category labels comprise search engine messages, web page browsing messages, resource download page messages, and advertising material messages.
6. The communication network message classification method based on massive user behavior data according to claim 4, characterized in that: the user data acquisition system collects user behavior data and stores the information in a user data storage system.
CN201110162097.2A 2011-06-16 2011-06-16 Communication network message classification system and method based on massive user behavior data Expired - Fee Related CN102833085B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110162097.2A CN102833085B (en) 2011-06-16 2011-06-16 Communication network message classification system and method based on massive user behavior data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110162097.2A CN102833085B (en) 2011-06-16 2011-06-16 Communication network message classification system and method based on massive user behavior data

Publications (2)

Publication Number Publication Date
CN102833085A CN102833085A (en) 2012-12-19
CN102833085B CN102833085B (en) 2015-09-16

Family

ID=47336064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110162097.2A Expired - Fee Related CN102833085B (en) 2011-06-16 2011-06-16 Communication network message classification system and method based on massive user behavior data

Country Status (1)

Country Link
CN (1) CN102833085B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649455B (en) * 2016-09-24 2021-01-12 孙燕群 Standardized system classification and command set system for big data development
CN107404398A (en) * 2017-05-31 2017-11-28 中山大学 A kind of networks congestion control judgement system
CN112016617B (en) * 2020-08-27 2023-12-01 中国平安财产保险股份有限公司 Fine granularity classification method, apparatus and computer readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101540048A (en) * 2009-04-21 2009-09-23 北京航空航天大学 Image quality evaluating method based on support vector machine
CN101853277A (en) * 2010-05-14 2010-10-06 南京信息工程大学 Vulnerability data mining method based on classification and association analysis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8583416B2 (en) * 2007-12-27 2013-11-12 Fluential, Llc Robust information extraction from utterances

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
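Xie Hua (谢华); Research on automatic classification technology for Internet web pages (Internet网页自动分类技术的研究); China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库信息科技辑》); 2007-06-30; reference document page 9, paragraph 1 to page 11, paragraph 5, Fig. 2-1 *
Liu Bo (刘博) et al.; Improved KNN method and its application in Chinese text classification (改进的KNN方法及其在中文文本分类中的应用); Journal of Xihua University (Natural Science Edition) (《西华大学学报(自然科学版)》); 2008, Vol. 27, No. 2; full text *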

Also Published As

Publication number Publication date
CN102833085A (en) 2012-12-19

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C56 Change in the name or address of the patentee
CP01 Change in the name or title of a patent holder

Address after: 100081, Beijing, Zhongguancun, Haidian District South Avenue, No. 18, International Building, Beijing, block 18, B

Patentee after: Izp (China) Network Technology Co. Ltd.

Address before: 100081, Beijing, Zhongguancun, Haidian District South Avenue, No. 18, International Building, Beijing, block 18, B

Patentee before: Beijing IZP Technologies Co., Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150916

Termination date: 20160616