CN109522562B - A web knowledge extraction method based on text image fusion recognition - Google Patents
- Publication number
- CN109522562B (application number CN201811449829.4A)
- Authority
- CN
- China
- Prior art keywords
- webpage
- data
- service
- knowledge
- cloud
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
ID | Concept | Category | Query match | Sections | Count |
---|---|---|---|---|---|
238000000605 | extraction | Methods | 0.000 | title, claims, abstract, description | 66 |
230000004927 | fusion | Effects | 0.000 | title, claims, abstract, description | 7 |
238000004458 | analytical method | Methods | 0.000 | claims, abstract, description | 46 |
230000009193 | crawling | Effects | 0.000 | claims, abstract, description | 26 |
238000013135 | deep learning | Methods | 0.000 | claims, abstract, description | 12 |
238000011156 | evaluation | Methods | 0.000 | claims, description | 18 |
238000000034 | method | Methods | 0.000 | claims, description | 13 |
238000006243 | chemical reaction | Methods | 0.000 | claims, description | 12 |
238000010191 | image analysis | Methods | 0.000 | claims, description | 11 |
239000000284 | extract | Substances | 0.000 | claims, description | 6 |
238000000354 | decomposition reaction | Methods | 0.000 | claims, description | 3 |
238000007781 | pre-processing | Methods | 0.000 | claims, description | 3 |
238000007670 | refining | Methods | 0.000 | claims, description | 3 |
238000004088 | simulation | Methods | 0.000 | claims, description | 3 |
238000005516 | engineering process | Methods | 0.000 | description | 5 |
238000013473 | artificial intelligence | Methods | 0.000 | description | 3 |
238000010586 | diagram | Methods | 0.000 | description | 2 |
239000000126 | substance | Substances | 0.000 | description | 2 |
238000013528 | artificial neural network | Methods | 0.000 | description | 1 |
230000009286 | beneficial effect | Effects | 0.000 | description | 1 |
210000004556 | brain | Anatomy | 0.000 | description | 1 |
238000013527 | convolutional neural network | Methods | 0.000 | description | 1 |
230000006870 | function | Effects | 0.000 | description | 1 |
230000007787 | long-term memory | Effects | 0.000 | description | 1 |
230000006403 | short-term memory | Effects | 0.000 | description | 1 |
238000006467 | substitution reaction | Methods | 0.000 | description | 1 |
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811449829.4A CN109522562B (zh) | 2018-11-30 | 2018-11-30 | A web knowledge extraction method based on text image fusion recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811449829.4A CN109522562B (zh) | 2018-11-30 | 2018-11-30 | A web knowledge extraction method based on text image fusion recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109522562A (zh) | 2019-03-26 |
CN109522562B (zh) | 2023-04-18 |
Family
ID=65793706
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811449829.4A Active CN109522562B (zh) | 2018-11-30 | 2018-11-30 | A web knowledge extraction method based on text image fusion recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109522562B (zh) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
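CN110135414B (zh) * | 2019-05-16 | 2021-07-09 | Northking Information Technology Co., Ltd. | Corpus updating method, device, storage medium and terminal |
CN110909531B (zh) * | 2019-10-18 | 2024-03-22 | Ping An Technology (Shenzhen) Co., Ltd. | Information security screening method, device, equipment and storage medium |
CN111858963B (zh) * | 2020-07-28 | 2024-02-23 | Bank of China Limited | Webpage customer service knowledge extraction method and device |
CN112131506B (zh) * | 2020-09-24 | 2022-04-29 | Xiamen Meiya Pico Information Co., Ltd. | Webpage classification method, terminal device and storage medium |
CN112328858A (zh) * | 2020-11-04 | 2021-02-05 | Ocean University of China | Deep-learning-based marine vessel data acquisition and management system and method |
CN112765340A (zh) * | 2021-01-26 | 2021-05-07 | The 6th Research Institute of China Electronics Corporation | Method, device, electronic equipment and storage medium for determining cloud service resources |
CN116049597B (zh) * | 2023-01-10 | 2024-04-19 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Pre-training method and device for a multi-task model of web pages, and electronic equipment |
CN117521602B (zh) * | 2024-01-04 | 2024-03-22 | Shenzhen Dashu Xinke Technology Co., Ltd. | RPA+NLP-based multimodal text conversion method, system and medium |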
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009061399A1 (en) * | 2007-11-05 | 2009-05-14 | Nagaraju Bandaru | Method for crawling, mapping and extracting information associated with a business using heuristic and semantic analysis |
CN105045838A (zh) * | 2015-07-01 | 2015-11-11 | East China Normal University | Web crawler system based on a distributed storage system |
- 2018-11-30: CN application CN201811449829.4A filed; patent CN109522562B, status Active
Also Published As
Publication number | Publication date |
---|---|
CN109522562A (zh) | 2019-03-26 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
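CN109522562B (zh) | A web knowledge extraction method based on text image fusion recognition | |
CN106874378B (zh) | Method for constructing a knowledge graph through rule-model-based entity extraction and relation mining | |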
Tang et al. | Big data in forecasting research: a literature review | |
Kathuria et al. | Classifying the user intent of web queries using k‐means clustering | |
CN103870973B (zh) | Information push and search method and device based on keyword extraction from electronic information | |
US11550856B2 (en) | Artificial intelligence for product data extraction | |
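TWI695277B (zh) | Automated website data collection method | |
CN105512687A (zh) | Method and system for training a sentiment classification model and analyzing text sentiment polarity | |
CN107885793A (zh) | Method and system for analyzing and predicting trending microblog topics | |
CN103914478A (zh) | Webpage training method and system, and webpage prediction method and system | |
CN111783394A (zh) | Training method for an event extraction model, and event extraction method, system and device | |
CN111192176B (zh) | Online data acquisition method and device supporting education informatization evaluation | |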
Patnaik et al. | Intelligent and adaptive web data extraction system using convolutional and long short-term memory deep learning networks | |
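CN103309862A (zh) | Webpage type identification method and system | |
CN114238573B (zh) | Information push method and device based on text adversarial examples | |
CN113918794B (zh) | Enterprise network public opinion benefit analysis method, system, electronic device and storage medium | |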
Assi et al. | FeatCompare: Feature comparison for competing mobile apps leveraging user reviews | |
US20170235835A1 (en) | Information identification and extraction | |
CN107330705A (zh) | Method and system for fraud prevention based on multiple data sources | |
Baranowski et al. | Social welfare in the light of topic modelling | |
Bu et al. | An FAR-SW based approach for webpage information extraction | |
US20130332440A1 (en) | Refinements in Document Analysis | |
US10990881B1 (en) | Predictive analytics using sentence data model | |
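CN117033654A (zh) | Method for constructing a science and technology event graph for technology fog identification | |
CN114238735B (zh) | Intelligent internet data acquisition method | |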
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 20230317. Address after: 250000 building S02, No. 1036, Langchao Road, high tech Zone, Jinan City, Shandong Province. Applicant after: Shandong Inspur Scientific Research Institute Co.,Ltd. Address before: 250100 First Floor of R&D Building 2877 Kehang Road, Sun Village Town, Jinan High-tech Zone, Shandong Province. Applicant before: JINAN INSPUR HIGH-TECH TECHNOLOGY DEVELOPMENT Co.,Ltd. |
GR01 | Patent grant | |
EE01 | Entry into force of recordation of patent licensing contract | Application publication date: 20190326. Assignee: Shandong Inspur Digital Business Technology Co.,Ltd. Assignor: Shandong Inspur Scientific Research Institute Co.,Ltd. Contract record no.: X2023980053547. Denomination of invention: A web knowledge extraction method based on text image fusion recognition. Granted publication date: 20230418. License type: Exclusive License. Record date: 20231226 |