CN103902703B - Text content classification method based on mobile Internet access - Google Patents
Text content classification method based on mobile Internet access
- Publication number
- CN103902703B (application CN201410126495.2A)
- Authority
- CN
- China
- Prior art keywords
- knowledge
- url
- reasoning
- page
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
Description

Index | "Full URL" cleaning rule in the hash list | Category | Confidence |
---|---|---|---|
0 | Entry=222.186.14.3/ | Search engine | 5.78% |
1 | Entry=mob.3g.cn/sorry/404/error.html | Error | 4.96% |
2 | Entry=222.186.14.5/ | Search engine | 4.52% |
3 | Entry=mob.3g.cn/sorry/404/404.wml | Error | 3.89% |
4 | Entry=www.umeng.com/check_config_update | Software upgrade | 3.57% |
…… |

Index | "First-level domain" cleaning rule in the hash list | Confidence |
---|---|---|
0 | Entry=qq.com | 9.25% |
1 | Entry=cnzz.net | 8.36% |
2 | Entry=baidu.com | 7.25% |
3 | Entry=taobao.com | 4.37% |
4 | Entry5=qlogo.cn | 3.58% |
…… |
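
The two cleaning-rule tables above list hash-list entries keyed by full URL and by first-level domain. As a rough illustration only, the following Python sketch shows one way such entries could be looked up for an access record: an exact full-URL match is tried first, then the host's domain suffixes. The lookup order, the function names (`split_url`, `cleaning_rule_for`), the omission of the confidence values, and the sample URL `pingjs.qq.com/ping.js` are assumptions made for illustration and do not come from the patent text. The `Entry5=` prefix in the last domain row is read here as a plain `qlogo.cn` entry.

```python
from urllib.parse import urlsplit

# Entries copied from the two cleaning-rule tables (scheme omitted, as in the source).
FULL_URL_CLEANING_RULES = {
    "222.186.14.3/": "search engine",
    "mob.3g.cn/sorry/404/error.html": "error",
    "222.186.14.5/": "search engine",
    "mob.3g.cn/sorry/404/404.wml": "error",
    "www.umeng.com/check_config_update": "software upgrade",
}
DOMAIN_CLEANING_RULES = {"qq.com", "cnzz.net", "baidu.com", "taobao.com", "qlogo.cn"}


def split_url(url):
    """Return (host, scheme-less URL key). Hypothetical helper, not from the source."""
    parts = urlsplit(url if "//" in url else "//" + url)
    host = (parts.hostname or "").lower()
    key = host + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return host, key


def cleaning_rule_for(url):
    """Return ("full_url", entry) or ("domain", entry) for a match, else None."""
    host, key = split_url(url)
    if key in FULL_URL_CLEANING_RULES:
        return ("full_url", key)
    # Try host suffixes from longest to shortest, stopping before the bare TLD.
    labels = host.split(".")
    for i in range(len(labels) - 1):
        suffix = ".".join(labels[i:])
        if suffix in DOMAIN_CLEANING_RULES:
            return ("domain", suffix)
    return None


if __name__ == "__main__":
    for record in ["http://222.186.14.3/", "http://pingjs.qq.com/ping.js", "http://example.org/"]:
        print(record, "->", cleaning_rule_for(record))
```

Keeping the entries in a `dict` and a `set` gives average constant-time lookups, which fits the hash-list framing of the tables.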

Index | "Full URL" content classification rule in the hash list | Category | Confidence |
---|---|---|---|
0 | launcher.warcraftchina.com/2.0/?locale=zh-CN | Online games | 3.15% |
1 | www.222tk.com/ | Lottery | 2.87% |
2 | street.yoka.com/clockbeauty/ | Fashion | 2.45% |
3 | 3g.eastmoney.com/Money.aspx | Finance | 1.67% |
4 | house.lsfc.net.cn/sell_info.asp?id=1097356 | Real estate | 1.54% |
…… |

Index | "First-level domain" content classification rule in the hash list | Confidence |
---|---|---|
0 | Entry=sina.com.cn | 4.32% |
1 | Entry=sohu.com | 3.98% |
2 | Entry=ifeng.com | 3.45% |
3 | Entry=sina.cn | 2.65% |
4 | Entry=qidian.cn | 2.14% |
…… |
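
The content classification tables above pair full URLs with a category and a confidence value, while the first-level domain table lists a confidence value only. The sketch below is one possible reading of how a visited URL might be resolved against them, assuming a full-URL hit takes precedence and a domain hit is reported without a category because the source table gives none. The function names (`url_key`, `lookup`), the returned dictionary layout, the English category labels, and the sample URL `news.sina.com.cn/c/1.html` are hypothetical.

```python
from urllib.parse import urlsplit

# (category, confidence in percent), copied from the full-URL classification table.
FULL_URL_CLASS_RULES = {
    "launcher.warcraftchina.com/2.0/?locale=zh-CN": ("online games", 3.15),
    "www.222tk.com/": ("lottery", 2.87),
    "street.yoka.com/clockbeauty/": ("fashion", 2.45),
    "3g.eastmoney.com/Money.aspx": ("finance", 1.67),
    "house.lsfc.net.cn/sell_info.asp?id=1097356": ("real estate", 1.54),
}
# The domain table lists confidence only, so no category is stored here.
DOMAIN_CLASS_RULES = {
    "sina.com.cn": 4.32,
    "sohu.com": 3.98,
    "ifeng.com": 3.45,
    "sina.cn": 2.65,
    "qidian.cn": 2.14,
}


def url_key(url):
    """Return (host, scheme-less URL including any query string)."""
    parts = urlsplit(url if "//" in url else "//" + url)
    host = (parts.hostname or "").lower()
    key = host + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return host, key


def lookup(url):
    """Return a dict describing the matched rule, or None when nothing matches."""
    host, key = url_key(url)
    if key in FULL_URL_CLASS_RULES:
        category, confidence = FULL_URL_CLASS_RULES[key]
        return {"level": "full_url", "entry": key,
                "category": category, "confidence": confidence}
    # Longest suffix first, so "sina.com.cn" is tried before any shorter suffix.
    labels = host.split(".")
    for i in range(len(labels) - 1):
        suffix = ".".join(labels[i:])
        if suffix in DOMAIN_CLASS_RULES:
            return {"level": "domain", "entry": suffix,
                    "category": None, "confidence": DOMAIN_CLASS_RULES[suffix]}
    return None


if __name__ == "__main__":
    print(lookup("http://3g.eastmoney.com/Money.aspx"))
    print(lookup("news.sina.com.cn/c/1.html"))
```

Walking the host's suffixes avoids hard-coding how many labels a first-level domain has, which matters because the table mixes two-label (`sohu.com`) and three-label (`sina.com.cn`) entries.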
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410126495.2A CN103902703B (zh) | 2014-03-31 | 2014-03-31 | Text content classification method based on mobile Internet access |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410126495.2A CN103902703B (zh) | 2014-03-31 | 2014-03-31 | Text content classification method based on mobile Internet access |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103902703A (zh) | 2014-07-02 |
CN103902703B (zh) | 2016-02-10 |
Family
ID=50994025
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410126495.2A (Active) CN103902703B (zh) | Text content classification method based on mobile Internet access | 2014-03-31 | 2014-03-31 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103902703B (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103838886A (zh) * | 2014-03-31 | 2014-06-04 | 辽宁四维科技发展有限公司 | Text content classification method based on a representative-word knowledge base |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
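CN105528351A (zh) * | 2014-09-29 | 2016-04-27 | 中国电信股份有限公司 | Content deduplication method and system for Internet information obtained by a mobile terminal |
CN106161352A (zh) * | 2015-03-31 | 2016-11-23 | 阿里巴巴集团控股有限公司 | Matching method, client, server, and matching device |
CN105117436B (zh) * | 2015-08-10 | 2018-03-30 | 上海晶赞科技发展有限公司 | Automatic website channel mining method |
CN105930444A (zh) * | 2016-04-20 | 2016-09-07 | 广州精点计算机科技有限公司 | Internet user grouping method and system |
CN105956002A (zh) * | 2016-04-20 | 2016-09-21 | 广州精点计算机科技有限公司 | Web page classification method and device based on URL analysis |
CN106294861B (zh) * | 2016-08-23 | 2019-08-09 | 武汉烽火普天信息技术有限公司 | Text aggregation and presentation method and system in an intelligence system for large-scale data |
CN109241274B (zh) * | 2017-07-04 | 2022-01-25 | 腾讯科技(深圳)有限公司 | Text clustering method and device |
CN111258969B (zh) * | 2018-11-30 | 2023-08-15 | 中国移动通信集团浙江有限公司 | Internet access log parsing method and device |
CN109739849B (zh) * | 2019-01-02 | 2021-06-29 | 山东省科学院情报研究所 | Data-driven platform for mining and early warning of sensitive network information |
CN110008340A (zh) * | 2019-03-27 | 2019-07-12 | 曲阜师范大学 | Multi-source text knowledge representation, acquisition, and fusion system |
CN110460592B (zh) * | 2019-07-26 | 2021-03-26 | 光通天下网络科技股份有限公司 | URL analysis method, apparatus, device, and medium |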
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
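CN101593200A (zh) * | 2009-06-19 | 2009-12-02 | 淮海工学院 | Chinese web page classification method based on keyword frequency analysis |
CN103136372A (zh) * | 2013-03-21 | 2013-06-05 | 陕西通信信息技术有限公司 | Method for fast URL locating, classification, and filtering in network trustworthiness behavior management |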
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100592293C (zh) * | 2007-04-28 | 2010-02-24 | 李树德 | Knowledge search engine based on intelligent ontology and implementation method thereof |

- 2014-03-31: CN application CN201410126495.2A, patent CN103902703B, status Active
Also Published As
Publication number | Publication date |
---|---|
CN103902703A (zh) | 2014-07-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
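CN103902703B (zh) | | Text content classification method based on mobile Internet access |
US11580104B2 (en) | | Method, apparatus, device, and storage medium for intention recommendation |
CN102831199B (zh) | | Method and device for building an interest model |
CN103914478B (zh) | | Web page training method and system, and web page prediction method and system |
CN107862022B (zh) | | Cultural resource recommendation system |
CN104850574B (zh) | | Sensitive word filtering method for text information |
CN103810162B (zh) | | Method and system for recommending network information |
CN103838886A (zh) | | Text content classification method based on a representative-word knowledge base |
CN110688553A (zh) | | Information push method and apparatus based on data analysis, computer device, and storage medium |
CN103546326B (zh) | | Method for website traffic statistics |
US20080104037A1 (en) | | Automated scheme for identifying user intent in real-time |
CN106202514A (zh) | | Agent-based method and system for retrieving cross-media information on emergency events |
CN102667761A (zh) | | Scalable cluster database |
CN103218431A (zh) | | System and method capable of identifying automatic collection of web page information |
CN106874292A (zh) | | Topic processing method and device |
CN112199508B (zh) | | Parameter-adaptive agricultural knowledge graph recommendation method based on distant supervision |
CN111767725A (zh) | | Data processing method and device based on a sentiment polarity analysis model |
CN109783619A (zh) | | Data filtering and mining method |
CN104809252A (zh) | | Internet data extraction system |
CN110134845A (zh) | | Project public opinion monitoring method and apparatus, computer device, and storage medium |
CN111767443A (zh) | | Efficient web crawler analysis platform |
CN103914534B (zh) | | Text content classification method based on an expert-system URL classification knowledge base |
CN108984514A (zh) | | Word acquisition method and device, storage medium, and processor |
CN105389328B (zh) | | Search ranking optimization method for large-scale open source software |
CN116775972A (zh) | | Remote resource organization service method and system based on information technology |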
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C41 | Transfer of patent application or patent right or utility model | ||
TA01 | Transfer of patent application right | Effective date of registration: 2015-12-28. Address after: 110020 Shenyang, Liaoning, Tiexi District, No. nine small road 12 3-7-1. Applicant after: Guo Lei. Address before: 110043, Dadong Road, Dadong District, Liaoning, 134, two gate, two floor, Shenyang. Applicant before: LIAONING SIWEI SCIENCE AND TECHNOLOGY DEVELOPMENT CO., Ltd. |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 2020-01-10. Address after: 100088 B601, floor 1, building 5, yard 13, Huayuan Road, Haidian District, Beijing. Patentee after: Beijing Dongfang Yixin Technology Co.,Ltd. Address before: 110020, No. 12, No. nine, Tiexi Road, Shenyang District, Liaoning, 3-7-1. Patentee before: Guo Lei. |
TR01 | Transfer of patent right | |
TR01 | Transfer of patent right | Effective date of registration: 2021-09-28. Address after: 1530, Lin 10, No. 84, Wenquan Road, Wenquan Town, Haidian District, Beijing 100095. Patentee after: Beijing yunqi lechuang Technology Co.,Ltd. Address before: 100088 B601, North 1st floor, building 5, yard 13, Huayuan Road, Haidian District, Beijing. Patentee before: Beijing Dongfang Yixin Technology Co.,Ltd. |
TR01 | Transfer of patent right | |
TR01 | Transfer of patent right | Effective date of registration: 2022-02-24. Address after: 101100 room 252, floor 2, building 7, courtyard 15, Tonghu street, Tongzhou District, Beijing. Patentee after: Beijing Zhongding Yixin Technology Co.,Ltd. Address before: 1530, Lin 10, No. 84, Wenquan Road, Wenquan Town, Haidian District, Beijing 100095. Patentee before: Beijing yunqi lechuang Technology Co.,Ltd. |
TR01 | Transfer of patent right | |
PP01 | Preservation of patent right | Effective date of registration: 2022-10-28. Granted publication date: 2016-02-10. |
PP01 | Preservation of patent right | |